Patch: Write Amplification Reduction Method (WARM)
Hi All,
As previously discussed [1], WARM is a technique to reduce write
amplification when an indexed column of a table is updated. HOT fails to
handle such updates and ends up inserting a new index entry in all indexes
of the table, irrespective of whether the index key has changed or not for
a specific index. The problem was highlighted by Uber's blog post [2], but
it was a well known problem and affects many workloads.
Simon brought up the idea originally within 2ndQuadrant and I developed it
further with inputs from my other colleagues and community members.
There were two important problems identified during the earlier discussion.
This patch addresses those issues in a simplified way. There are other,
more complex ideas to solve those issues, but as the results demonstrate,
even a simple approach goes a long way in improving the performance
characteristics of many workloads, while keeping the code complexity
relatively low.
Two problems have so far been identified with the WARM design.
“*Duplicate Scan*” - Claudio Freire brought up a design flaw which may lead
an IndexScan to return the same tuple twice or more, thus impacting the
correctness of the solution.
“*Root Pointer Search*” - Andres raised the point that it could be
inefficient to find the root line pointer for a tuple in the HOT or WARM
chain since it may require us to scan through the entire page.
The Duplicate Scan problem is a correctness issue and could therefore block
WARM completely, so we propose the following solution.
We discussed a few ideas to address the "Duplicate Scan" problem. For
example, we could teach index AMs to discard any duplicate (key, CTID)
insert requests, or we could guarantee uniqueness by only allowing updates
in one lexical order. While the former is a more complete solution to avoid
duplicate entries, searching through a large number of keys for non-unique
indexes could be a drag on performance. The latter approach may not be
sufficient for many workloads, and tracking increment/decrement for many
indexes would be non-trivial.
There is another problem with allowing many index entries pointing to the
same WARM chain: it is non-trivial to know how many index entries currently
point to the chain, and index/heap vacuum would throw up more challenges.
Instead, what I would like to propose, and what the patch currently
implements, is to restrict WARM updates to once per chain. So the first
non-HOT update to a tuple or a HOT chain can be a WARM update. The chain can
further be HOT updated any number of times, but it cannot be WARM updated
again. This might look too restrictive, but it can still bring down the
number of regular updates by almost 50%. Further, if we devise a strategy to
convert a WARM chain back to a HOT chain, it can again be WARM updated (this
part is currently not implemented). A good side effect of this simple
strategy is that we know there can be at most two index entries pointing to
any given WARM chain.
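To make the rule concrete, here is a rough sketch of the check that gates a
WARM update (the helper name is hypothetical and this is not the patch's
actual code; the patch presumably performs an equivalent test during
heap_update):

/*
 * Illustrative sketch only.  HEAP_WARM_TUPLE is the t_infomask2 bit
 * described under "Index Recheck" below; it marks every tuple of a chain
 * that has already been WARM updated once.
 */
static bool
warm_update_possible(HeapTupleHeader oldtup, bool new_version_fits_on_page)
{
    /* A chain that was WARM updated once cannot be WARM updated again */
    if (oldtup->t_infomask2 & HEAP_WARM_TUPLE)
        return false;

    /* As with HOT, the new version must fit on the same heap page */
    return new_version_fits_on_page;
}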
The other problem Andres brought up can be solved by storing the root line
pointer offset in the t_ctid field of the last tuple in the update chain.
Barring some aborted-update cases, it's usually the last tuple in the update
chain that gets updated, hence it seems logical and sufficient if we can
find the root line pointer while accessing that tuple. Note that the t_ctid
field in the latest tuple is usually useless and is made to point to the
tuple itself. Instead, I propose to use a bit from t_infomask2 to identify
the LATEST tuple in the chain and use the OffsetNumber field in t_ctid to
store the root line pointer offset. For the rare aborted-update case, we can
scan the heap page and find the root line pointer the hard way.
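For illustration, the lookup then becomes roughly the following, using the
macros and the heap_get_root_tuple_one() helper added by the attached 0001
patch (the wrapper function itself is only a sketch):

/*
 * Sketch: find the root line pointer of the chain containing the tuple at
 * 'offnum' on 'page'.  This mirrors what the patch does in heap_update().
 */
static OffsetNumber
get_root_line_pointer(Page page, OffsetNumber offnum)
{
    ItemId          lp = PageGetItemId(page, offnum);
    HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, lp);
    OffsetNumber    root_offnum;

    if (HeapTupleHeaderHasRootOffset(htup))
    {
        /* HEAP_LATEST_TUPLE is set; ip_posid of t_ctid is the root offset */
        return HeapTupleHeaderGetRootOffset(htup);
    }

    /* Rare case (aborted update, pg_upgraded data): scan the page */
    heap_get_root_tuple_one(page, offnum, &root_offnum);
    return root_offnum;
}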
Index Recheck
--------------------
As the original proposal explains, while doing an index scan we must recheck
whether the heap tuple matches the index keys. This has to be done only when
the chain is marked as a WARM chain. Currently we do that by setting the
last free bit in t_infomask2 (HEAP_WARM_TUPLE). The bit is set on the tuple
that gets WARM updated and on all subsequent tuples in the chain, but the
information can subsequently be copied to the root line pointer when it's
converted to an LP_REDIRECT line pointer.
Since each index AM has its own view of the index tuples, each AM must
implement its own "amrecheck" routine. This routine is used to confirm that
a tuple returned from a WARM chain indeed satisfies the index keys. If the
index AM does not implement an "amrecheck" routine, WARM updates are
disabled on any table which uses such an index. The patch currently
implements "amrecheck" routines for hash and btree indexes. Hence a table
with a GiST or GIN index will not honour WARM updates.
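For reference, the recheck routine implemented for hash indexes in the
attached 0002 patch has the following shape (btree is analogous), and each
AM advertises it through its handler; a NULL amrecheck disables WARM for
tables using that AM:

/* From the 0002 patch (hashutil.c): returns true if the heap tuple still
 * satisfies the keys stored in the given index tuple. */
bool
hashrecheck(Relation indexRel, IndexTuple indexTuple,
            Relation heapRel, HeapTuple heapTuple);

/* ...and in the AM handler (hashhandler): */
amroutine->amrecheck = hashrecheck;

Whether this is the right signature is an open question; see the Concerns
section below.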
Results
----------
We used a customised pgbench workload to test the feature. In particular,
the pgbench_accounts table was widened to include many more columns and
indexes. We also added an index on the "abalance" column, which gets updated
in every transaction. This replicates a workload where there are many
indexes on a table and an update changes just one index key.
CREATE TABLE pgbench_accounts (
aid bigint,
bid bigint,
abalance bigint,
filler1 text DEFAULT md5(random()::text),
filler2 text DEFAULT md5(random()::text),
filler3 text DEFAULT md5(random()::text),
filler4 text DEFAULT md5(random()::text),
filler5 text DEFAULT md5(random()::text),
filler6 text DEFAULT md5(random()::text),
filler7 text DEFAULT md5(random()::text),
filler8 text DEFAULT md5(random()::text),
filler9 text DEFAULT md5(random()::text),
filler10 text DEFAULT md5(random()::text),
filler11 text DEFAULT md5(random()::text),
filler12 text DEFAULT md5(random()::text)
);
CREATE UNIQUE INDEX pgb_a_aid ON pgbench_accounts(aid);
CREATE INDEX pgb_a_abalance ON pgbench_accounts(abalance);
CREATE INDEX pgb_a_filler1 ON pgbench_accounts(filler1);
CREATE INDEX pgb_a_filler2 ON pgbench_accounts(filler2);
CREATE INDEX pgb_a_filler3 ON pgbench_accounts(filler3);
CREATE INDEX pgb_a_filler4 ON pgbench_accounts(filler4);
These tests are run on c3.4xlarge AWS instances, with 30GB of RAM, 16 vCPU
and 2x160GB SSD. Data and WAL were mounted on a separate SSD.
A scale factor of 700 was chosen to ensure that the database does not fit in
memory and the implications of additional write activity are evident.
The actual transactional tests would just update the pgbench_accounts table:
\set aid random(1, 100000 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
END;
The tests were run for 16 hours each with 16 pgbench clients to ensure that
the effects of the patch are captured correctly.
Headline TPS numbers:
Master:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 65552986
latency average: 14.059 ms
*tps = 1138.072117 (including connections establishing)*
tps = 1138.072156 (excluding connections establishing)
WARM:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 116168454
latency average: 7.933 ms
*tps = 2016.812924 (including connections establishing)*
tps = 2016.812997 (excluding connections establishing)
So WARM shows about a *77% increase* in TPS. Note that these are fairly
long-running tests with around 100M transactions, and they show steady
performance.
We also measured the amount of WAL generated by Master and WARM per
transaction. While master generated 34967 bytes of WAL per transaction,
WARM generated 18421 bytes of WAL per transaction.
We plotted a moving average of TPS against time and also against the
percentage of WARM updates. Clearly, the higher the number of WARM updates,
the higher the TPS. A graph showing the percentage of WARM updates is also
plotted and shows a steady convergence to the 50% mark over time.
We repeated the same tests starting with a 90% heap fill factor so that
there are many more WARM updates. With a 90% fill factor, in combination
with HOT pruning, most initial updates are WARM updates, which impacts TPS
positively. WARM shows nearly a *150% increase* in TPS for that workload.
Master:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 78134617
latency average: 11.795 ms
*tps = 1356.503629 (including connections establishing)*
tps = 1356.503679 (excluding connections establishing)
WARM:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 196782770
latency average: 4.683 ms
*tps = 3416.364822 (including connections establishing)*
tps = 3416.364949 (excluding connections establishing)
In this case, master produced ~49000 bytes of WAL per transaction, whereas
WARM produced ~14000 bytes of WAL per transaction.
I concede that we haven't yet done many tests to measure the overhead of the
technique, especially in circumstances where WARM may not be very useful.
What I have in mind are a couple of tests:
- With many indexes and a good percentage of them requiring an update
- A mix of read-write workloads
Any other ideas to do that are welcome.
Concerns:
--------------
The additional heap recheck may have a negative impact on performance. We
tried to measure this by running a SELECT-only workload for 1 hour after the
16-hour test finished, but the TPS did not show any negative impact. The
impact could be larger if the update changes many index keys, something
these tests don't exercise.
The patch also changes things such that index tuples are always returned
from index scans, because they may be needed for recheck. It's not clear
whether this is something to be worried about, but we could try to fine-tune
this change further.
There seem to be some modularity violations, since the index AM needs to
access some of the executor machinery to form index datums. If that's a real
concern, we can look at improving the amrecheck signature so that it
receives the index datums from the caller.
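For instance, a variant where the caller forms the index datums might look
something like this (purely a sketch of the idea, not implemented anywhere):

/* Hypothetical alternative signature: the executor forms the datums from
 * the heap tuple and passes them in, so the index AM needs no executor
 * access. */
bool
amrecheck(Relation indexRel, IndexTuple indexTuple,
          Datum *heapValues, bool *heapIsnull);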
The patch uses the remaining two free bits in t_infomask2, thus closing off
any further improvements which may need to use heap tuple flags. During
patch development we tried several other approaches, such as reusing the 3
high-order bits of OffsetNumber, since the current maximum BLCKSZ limits
MaxOffsetNumber to 8192, which can be represented in 13 bits. We finally
reverted that change to keep the patch simple, but there is clearly a way to
free up more bits if required.
Converting WARM chains back to HOT chains (VACUUM ?)
---------------------------------------------------------------------------------
The current implementation of WARM allows only one WARM update per chain.
This simplifies the design and addresses certain issues around duplicate
scans. But it also implies that the benefit of WARM will be no more than
50%, which is still significant; if we could return WARM chains back to
normal status, we could do far more WARM updates.
A distinct property of a WARM chain is that at least one index has more than
one live index entry pointing to the root of the chain. In other words, if
we can remove the duplicate entry from every index, or conclusively prove
that there are no duplicate index entries for the root line pointer, the
chain can again be marked as HOT.
Here is one idea, but more thoughts/suggestions are most welcome.
A WARM chain has two parts, separated by the tuple that caused the WARM
update. All tuples in each part have matching index keys, but certain index
keys may not match between the two parts. Let's say we mark heap tuples in
each part with a special Red-Blue flag. The same flag is replicated in the
index tuples. For example, when new rows are inserted into a table, they are
marked with the Blue flag and the index entries associated with those rows
are also marked with the Blue flag. When a row is WARM updated, the new
version is marked with the Red flag and the new index entry created by the
update is also marked with the Red flag.
Heap chain: lp [1]              [2]              [3]              [4]
   [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
        (bbbb)R points to 1 (satisfies only tuples marked with R)
Index2: (1111)B points to 1 (satisfies both B and R tuples)
It's clear that for indexes with Red and Blue pointers, a heap tuple with
the Blue flag will be reachable from the Blue pointer and one with the Red
flag will be reachable from the Red pointer. But for indexes which did not
create a new entry, both Blue and Red tuples will be reachable from the Blue
pointer (there is no Red pointer in such indexes). So, as a side note,
matching Red and Blue flags is not enough from an index scan perspective.
During the first heap scan of VACUUM, we look for tuples with
HEAP_WARM_TUPLE set. If all live tuples in the chain are marked with either
the Blue flag or the Red flag (but no mix of Red and Blue), then the chain
is a candidate for HOT conversion. We remember the root line pointer and the
Red-Blue flag of the WARM chain in a separate array.
If we have a Red WARM chain, then our goal is to remove the Blue pointers,
and vice versa. But there is a catch. For Index2 above, there is only a Blue
pointer and that must not be removed. IOW we should remove a Blue pointer
iff a Red pointer exists. Since index vacuum may visit Red and Blue pointers
in any order, I think we will need another index pass to remove dead index
pointers. So in the first index pass we check which WARM candidates have two
index pointers. In the second pass, we remove the dead pointer and reset the
Red flag if the surviving index pointer is Red.
During the second heap scan, we fix the WARM chain by clearing the
HEAP_WARM_TUPLE flag and also resetting the Red flag to Blue.
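To make the bookkeeping concrete, here is a hypothetical sketch (nothing
like this exists in the patches yet) of what the first heap scan could
remember for each candidate chain:

/*
 * Hypothetical per-chain bookkeeping for the proposed WARM-to-HOT
 * conversion.  One entry would be remembered for every WARM chain whose
 * live tuples are all Red or all Blue.
 */
typedef struct WarmChainCandidate
{
    BlockNumber     blkno;          /* heap page containing the chain */
    OffsetNumber    root_offnum;    /* root line pointer of the chain */
    bool            chain_is_red;   /* true if all live tuples carry Red */
    int             nindex_ptrs;    /* live index pointers seen in pass 1 */
} WarmChainCandidate;

The first index pass would fill nindex_ptrs; the second index pass would
remove the mismatched-colour pointer (only when two were found) and reset
the survivor to Blue, and the second heap scan would then clear
HEAP_WARM_TUPLE and the Red flags as described above.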
There are some more problems around aborted vacuums. For example, if vacuum
aborts after changing a Red index flag to Blue but before removing the other
Blue pointer, we will end up with two Blue pointers to a Red WARM chain. But
since the HEAP_WARM_TUPLE flag on the heap tuple is still set, further WARM
updates to the chain will be blocked. I guess we will need some special
handling for the case with multiple Blue pointers. We can either leave these
WARM chains alone and let them die with a subsequent non-WARM update, or we
must apply the heap-recheck logic during index vacuum to find the dead
pointer. Given that vacuum aborts are not common, I am inclined to leave
this case unhandled. We must still check for the presence of multiple Blue
pointers and ensure that we don't accidentally remove either of the Blue
pointers, and not clear such WARM chains either.
Of course, the idea requires one bit each in the index and heap tuples.
There is already a free bit in the index tuple, and I have some ideas to
free up additional bits in the heap tuple (as mentioned above).
Further Work
------------------
1. The patch currently disables WARM updates on system relations. This is
mostly to keep the patch simple, but in theory we should be able to support
WARM updates on system tables too. It's not clear if it's worth the
complexity though.
2. AFAICS both CREATE INDEX and CIC should just work fine, but this needs
validation.
3. GiST and GIN indexes are currently disabled for WARM. I don't see a
fundamental reason why they won't work once we implement the "amrecheck"
method, but I don't understand those indexes well enough.
4. There are some modularity invasions I am worried about (is the amrecheck
signature OK?). There are also a couple of hacks to get access to index
tuples during scans, and I hope to get them right during the review process,
with some feedback.
5. The patch does not implement the machinery to convert WARM chains into
HOT chains. I would give it a go unless someone finds a problem with the
idea or has a better one.
Thanks,
Pavan
[1]: /messages/by-id/CABOikdMop5Rb_RnS2xF dAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw@mail.gmail.com
[2]: https://eng.uber.com/mysql-migration/
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v2.patch
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c63dfa0..ae5839a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
@@ -2250,13 +2251,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2415,7 +2416,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2713,7 +2715,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2721,7 +2724,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2993,6 +2997,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3044,7 +3049,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3174,7 +3180,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3251,7 +3257,7 @@ l1:
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
/* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3450,6 +3456,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3506,6 +3514,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3789,7 +3798,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3967,7 +3976,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4148,6 +4157,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4155,10 +4178,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4171,7 +4213,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4210,6 +4254,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4571,7 +4616,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4583,6 +4629,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4629,7 +4676,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5067,7 +5114,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5143,7 +5190,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5657,6 +5704,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5665,6 +5713,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5883,7 +5933,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5892,7 +5942,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -6009,7 +6059,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6135,7 +6186,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7484,6 +7537,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7603,6 +7657,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8258,7 +8313,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8348,7 +8403,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8483,8 +8540,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8620,7 +8678,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8754,12 +8813,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8887,9 +8951,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..8183920 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,17 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once
+ * it's known.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +74,13 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
- ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..f0cbf77 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..4656533 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -421,14 +421,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -441,7 +441,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -527,7 +529,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -733,7 +737,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..079a77f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..94b46b8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d7e5fad..23a330a 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,24 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -541,6 +565,44 @@ do { \
(((tup)->t_infomask & HEAP_HASEXTERNAL) != 0)
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
0002_warm_updates_v2.patch
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index debf4f4..d49d179 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b194d33..cefb071 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 9a417ca..8b83955 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 07496f8..0cc37c0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -84,6 +84,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -264,6 +265,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -301,8 +304,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..cf44214 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -263,6 +265,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..c11a7ac 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
/*
@@ -352,3 +356,107 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..5f81b7c
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,268 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature eliminated many redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT)
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but
+the regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the index
+changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index entry
+must be inserted for the changed index. But if the index key hasn't changed
+for other indexes, we don't really need to insert a new entry. Even though the
+existing index entry is pointing to the old tuple, the new tuple is reachable
+via the t_ctid chain. To keep things simple, a WARM update requires that the
+heap block must have enough space to store the new version of the tuple. This
+is the same as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index for the
+updated tuple, and we are doing a WARM update, the new entry is made to point to the
+root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on each
+of them. When a tuple is first inserted into the table, we have exactly
+one index entry pointing to the tuple from each index.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. To do so, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+-----------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM must
+implement its own method to recheck heap tuples. For example, a hash index
+stores the hash value of the column and hence recheck routine for hash AM must
+first compute the hash value of the heap attribute and then compare it against
+the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree indexes. If
+the table has an index which doesn't support recheck routine, WARM updates are
+disabled on such tables.
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys, both pointing to the same WARM chain. In that case, the same
+valid tuple will be reachable via multiple index keys, yet satisfying
+the index key checks. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements 1 i.e. do not do WARM updates to a
+tuple from a WARM chain. HOT updates are fine because they do not add a
+new index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular UPDATEs is cut roughly in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value only differs in the
+case sensitivity. So we can not solely rely on the heap column check to
+decide whether or not to insert a new index entry for expression indexes. Similarly, for
+partial indexes, the predicate expression must be evaluated to decide
+whether or not to cause a new index entry when columns referred in the
+predicate expressions change.
+
+(None of these things are currently implemented and we squarely disallow
+WARM update if a column from expression indexes or predicate has
+changed).
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the
+update chain. In such cases, the t_ctid field usually points to the tuple
+itself. So in theory, we could use the t_ctid to store additional
+information in the last tuple of the update chain, if the information
+about the tuple being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that this is
+the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If UPDATE operation is aborted, the last tuple in the update chain becomes
+dead. The root line pointer information stored in the tuple which remains the
+last valid tuple in the chain is also lost. In such rare cases, the root line
+pointer must be found the hard way by scanning the entire heap page.
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to store
+this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain,
+it must still be rechecked for the index key match (case when
+old tuple is returned by the new index key). So we must follow the
+update chain every time to the end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per chain. This
+simplifies the design and addresses certain issues around duplicate scans. But
+this also implies that the benefit of WARM will be no more than 50%, which is
+still significant, but if we could return WARM chains back to normal status, we
+could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more than
+one live index entry pointing to the root of the chain. In other words, if we
+can remove the duplicate entry from every index or conclusively prove that there
+are no duplicate index entries for the root line pointer, the chain can again
+be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but certain
+index keys may differ between the two parts. Let's say we mark the heap
+tuples in each part with a special Red-Blue flag, and replicate the same
+flag in the index tuples. For example, when new rows are inserted into a
+table, they are marked with the Blue flag and the index entries associated
+with those rows are also marked with the Blue flag. When a row is WARM
+updated, the new version is marked with the Red flag and the new index
+entry created by the update is also marked with the Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap tuple
+with the Blue flag will be reachable from the Blue pointer and one with
+the Red flag will be reachable from the Red pointer. But for an index that
+did not create a new entry, both Blue and Red tuples are reachable from
+the Blue pointer (there is no Red pointer in such an index). So, as a side
+note, simply matching Red and Blue flags is not enough from an index scan
+perspective.
+
+During the first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain carry the Blue flag
+or all carry the Red flag (but not a mix of the two), then the chain is a
+candidate for HOT conversion. We remember the root line pointer and the
+Red-Blue flag of each such WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove the Blue pointers,
+and vice versa. But there is a catch. For Index2 above, there is only a
+Blue pointer and that must not be removed. In other words, we should
+remove a Blue pointer only if a Red pointer exists. Since index vacuum may
+visit Red and Blue pointers in any order, I think we will need another
+index pass to remove the dead index pointers. So in the first index pass
+we check which WARM candidates have two index pointers. In the second
+pass, we remove the dead pointer and reset the Red flag if the surviving
+index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and resetting the Red flag to Blue; a rough sketch of
+the second-pass decision follows.
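+
+Since none of this is implemented yet, the sketch below is purely
+illustrative; the structure and helper names (WarmChainCandidate,
+warm_keep_index_entry) are hypothetical and only meant to make the
+two-pass removal decision concrete:
+
+    typedef struct WarmChainCandidate
+    {
+        ItemPointerData root_tid;      /* root of the WARM chain */
+        bool            chain_is_red;  /* colour of the surviving tuples */
+        int             npointers;     /* index pointers seen in pass 1 */
+    } WarmChainCandidate;
+
+    /* Pass 2: decide whether an index entry of a given colour survives */
+    static bool
+    warm_keep_index_entry(WarmChainCandidate *cand, bool entry_is_red)
+    {
+        /* Only chains that really have both pointers are touched */
+        if (cand->npointers != 2)
+            return true;
+        /* Keep the pointer whose colour matches the surviving heap tuples */
+        return entry_is_red == cand->chain_is_red;
+    }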
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing a Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is still
+set, further WARM updates to the chain will be blocked. I guess we will
+need some special handling for the case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not common,
+I am inclined to leave this case unhandled. We must still check for the
+presence of multiple Blue pointers and ensure that we neither accidentally
+remove one of the Blue pointers nor clear the WARM flag on such chains.
+
+
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ae5839a..eafedae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -99,7 +99,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot, bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
@@ -1960,6 +1963,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check whether the HOT chain originating at or continuing through tid ever
+ * became a WARM chain, even if the UPDATE that made it WARM later aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either a WARM tuple or a WARM-updated tuple signals that
+ * the chain may have extra index pointers, so the caller must recheck
+ * any tuple returned from this chain for index key satisfaction.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1979,11 +2052,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2025,6 +2101,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2039,7 +2125,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/*
* Shouldn't see a HEAP_ONLY tuple at chain start.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2051,6 +2138,20 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
HeapTupleHeaderGetXmin(heapTuple->t_data)))
break;
+ /*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+ if (recheck && *recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
/*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
@@ -2124,18 +2225,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and
+ * the caller has requested the "recheck" hint, keep the buffer locked
+ * and pinned. The caller must then release the lock and pin on the
+ * buffer in all such cases.
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3442,13 +3566,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3469,9 +3595,11 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool satisfies_hot;
+ bool satisfies_warm;
bool satisfies_key;
bool satisfies_id;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3496,6 +3624,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for HOT update. This is
* wasted effort if we fail to update or have to put the new tuple on a
@@ -3512,6 +3644,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3571,7 +3705,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* serendipitiously arrive at the same key values.
*/
HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
+ exprindx_attrs,
+ updated_attrs,
+ &satisfies_hot, &satisfies_warm,
+ &satisfies_key,
&satisfies_id, &oldtup, newtup);
if (satisfies_key)
{
@@ -4117,6 +4254,34 @@ l2:
*/
if (satisfies_hot)
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both WARM and WARM-updated tuples since, if the
+ * previous WARM update aborted, we may still have added another
+ * index entry for this HOT chain. In such situations, we must not
+ * attempt a WARM update until the duplicate (key, CTID) index
+ * entry issue is sorted out.
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can be
+ * further WARM updated. This is probably good enough for a first
+ * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (satisfies_warm &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4157,6 +4322,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited if we get index (key, CTID) duplicate
+ * detection mechanism in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4172,12 +4352,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4296,7 +4502,12 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Even with WARM we still count stats using use_hot_update, since we
+ * continue to use that term even though such updates are now more
+ * frequent than previously.
+ */
+ pgstat_count_heap_update(relation, use_hot_update || use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4403,6 +4614,13 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
* will be checking very similar sets of columns, and doing the same tests on
* them, it makes sense to optimize and do them together.
*
+ * The exprindx_attrs bitmap designates the set of attributes used in
+ * expression or predicate indexes. In this version, we don't allow WARM
+ * updates if an expression or predicate index column is updated.
+ *
+ * If updated_attrs is not NULL, the caller is interested in the complete
+ * list of changed attributes, so all attributes must be checked.
+ *
* We receive three bitmapsets comprising the three sets of columns we're
* interested in. Note these are destructively modified; that is OK since
* this is invoked at most once in heap_update.
@@ -4415,7 +4633,11 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
static void
HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot,
+ bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup)
{
@@ -4452,8 +4674,11 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* Since the HOT attributes are a superset of the key attributes and
* the key attributes are a superset of the id attributes, this logic
* is guaranteed to identify the next column that needs to be checked.
+ *
+ * If the caller also wants to know the list of updated index
+ * attributes, we must scan through all the attributes
*/
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
+ if ((hot_result || updated_attrs) && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_hot_attnum;
else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_key_attnum;
@@ -4474,8 +4699,12 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
if (check_now == next_id_attnum)
id_result = false;
+ if (updated_attrs)
+ *updated_attrs = bms_add_member(*updated_attrs, check_now -
+ FirstLowInvalidHeapAttributeNumber);
+
/* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
+ if (!hot_result && !key_result && !id_result && !updated_attrs)
break;
}
@@ -4486,7 +4715,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* bms_first_member() will return -1 and the attribute number will end
* up with a value less than FirstLowInvalidHeapAttributeNumber.
*/
- if (hot_result && check_now == next_hot_attnum)
+ if ((hot_result || updated_attrs) && check_now == next_hot_attnum)
{
next_hot_attnum = bms_first_member(hot_attrs);
next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
@@ -4503,6 +4732,13 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
}
}
+ if (updated_attrs && bms_overlap(*updated_attrs, exprindx_attrs))
+ *satisfies_warm = false;
+ else if (!relation->rd_supportswarm)
+ *satisfies_warm = false;
+ else
+ *satisfies_warm = true;
+
*satisfies_hot = hot_result;
*satisfies_key = key_result;
*satisfies_id = id_result;
@@ -4526,7 +4762,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7413,6 +7649,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7426,6 +7663,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7448,6 +7686,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7553,6 +7795,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7564,6 +7807,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7637,6 +7883,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8004,24 +8252,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8608,16 +8870,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8677,6 +8945,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8812,6 +9084,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f0cbf77..6a03f9d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item i is becoming a redirected lp that points
+ * to a WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..6d9dc68 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -409,7 +411,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +450,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +477,13 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +493,50 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+ if (scan->xs_tuple_recheck &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..1fb077e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -322,115 +330,156 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
/*
* We check the whole HOT-chain to see if there is any tuple
* that satisfies SnapshotDirty. This is necessary because we
- * have just a single index entry for the entire chain.
- */
- else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ * have just a single index entry for the entire chain.
+ */
+ else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may see our own
+ * tuple again. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer.
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4668c5e..7a59a7f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5d335c7..72b5750 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2067,3 +2071,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b0b43cf..36467b2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1674,6 +1675,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..392c102 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2495,6 +2495,7 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
NIL);
@@ -2610,6 +2611,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..ca40e1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index key, so mark
+ * the page as !all_visible.
+ *
+ * XXX Should we look at the root line pointer and check if
+ * the WARM flag is set there, or is checking the tuples in
+ * the chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0e2d834..da27cf6 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *updated_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If updated_attrs is set, we only insert index entries for those
+ * indexes whose column has changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (updated_attrs)
+ {
+ if (!bms_overlap(updated_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..0b04bb8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..7367e9a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -433,6 +433,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -479,6 +480,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -809,6 +811,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *updated_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -923,7 +928,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &updated_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1011,9 +1016,24 @@ lreplace:;
*
* If it's a HOT update, we mustn't insert new index entries.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(updated_attrs);
+ updated_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ updated_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8d2ad01..7706a37 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2038,6 +2038,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4381,12 +4382,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* true if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4399,6 +4403,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4437,6 +4443,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4482,19 +4489,32 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * Check if the index has an amrecheck method defined. If the method
+ * is not defined, the index does not support WARM updates, so
+ * completely disable WARM updates on such tables.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4510,7 +4530,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4522,6 +4543,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..aa7b265 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -111,6 +111,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1271,6 +1272,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..2031a76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..7950739 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -369,5 +369,7 @@ extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 94b46b8..4c05947 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 23a330a..441dfac 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,9 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bit 0x0800 is available */
+
+#define HEAP_WARM_TUPLE 0x0800 /* this tuple is part of a WARM chain */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +273,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +512,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -754,6 +771,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 49c2a6f..880e62e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -110,7 +110,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..60a5445 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *updated_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7fd7bd..3b2c012 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..166ef3b 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at this
+ * line pointer may contain WARM tuples. It must only be interpreted together
+ * with the LP_REDIRECT flag.
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in the lp_len field to indicate this state. This is required only
+ * for LP_REDIRECT line pointers, whose lp_len field is otherwise unused.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..dac32b5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm; /* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
We discussed a few ideas to address the "Duplicate Scan" problem. For
example, we can teach Index AMs to discard any duplicate (key, CTID) insert
requests. Or we could guarantee uniqueness by either only allowing updates
in one lexical order. While the former is a more complete solution to avoid
duplicate entries, searching through large number of keys for non-unique
indexes could be a drag on performance. The latter approach may not be
sufficient for many workloads. Also tracking increment/decrement for many
indexes will be non-trivial.
There is another problem with allowing many index entries pointing to the
same WARM chain. It will be non-trivial to know how many index entries are
currently pointing to the WARM chain and index/heap vacuum will throw up
more challenges.
Instead, what I would like to propose and the patch currently implements
is to restrict WARM update to once per chain. So the first non-HOT update
to a tuple or a HOT chain can be a WARM update. The chain can further be
HOT updated any number of times. But it can no further be WARM updated.
This might look too restrictive, but it can still bring down the number of
regular updates by almost 50%. Further, if we devise a strategy to convert
a WARM chain back to HOT chain, it can again be WARM updated. (This part is
currently not implemented). A good side effect of this simple strategy is
that we know there can maximum two index entries pointing to any given WARM
chain.
We should probably think about coordinating with my btree patch.
From the description above, the strategy is quite readily "upgradable" to
one in which the indexam discards duplicate (key,ctid) pairs and that would
remove the limitation of only one WARM update... right?
On Wed, Aug 31, 2016 at 10:38 PM, Claudio Freire <klaussfreire@gmail.com>
wrote:
On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
We discussed a few ideas to address the "Duplicate Scan" problem. For
example, we can teach Index AMs to discard any duplicate (key, CTID) insert
requests. Or we could guarantee uniqueness by either only allowing updates
in one lexical order. While the former is a more complete solution to avoid
duplicate entries, searching through large number of keys for non-unique
indexes could be a drag on performance. The latter approach may not be
sufficient for many workloads. Also tracking increment/decrement for many
indexes will be non-trivial.
There is another problem with allowing many index entries pointing to the
same WARM chain. It will be non-trivial to know how many index entries are
currently pointing to the WARM chain and index/heap vacuum will throw up
more challenges.
Instead, what I would like to propose and the patch currently implements
is to restrict WARM update to once per chain. So the first non-HOT update
to a tuple or a HOT chain can be a WARM update. The chain can further be
HOT updated any number of times. But it can no further be WARM updated.
This might look too restrictive, but it can still bring down the number of
regular updates by almost 50%. Further, if we devise a strategy to convert
a WARM chain back to HOT chain, it can again be WARM updated. (This part is
currently not implemented). A good side effect of this simple strategy is
that we know there can maximum two index entries pointing to any given WARM
chain.
We should probably think about coordinating with my btree patch.
From the description above, the strategy is quite readily "upgradable" to
one in which the indexam discards duplicate (key,ctid) pairs and that would
remove the limitation of only one WARM update... right?
Yes, we should be able to add further optimisations along the lines you're
working on, but what I like about the current approach is that a) it reduces
the complexity of the patch, and b) having thought about cleaning up WARM
chains, limiting the number of index entries per root chain to a small number
will simplify that aspect too.
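To make the once-per-chain rule concrete, here is a minimal sketch of the
eligibility test that heap_update() effectively performs with the patch
applied. The helper can_warm_update() is hypothetical (the patch does this
inline), but HeapTupleIsHeapWarmTuple(), IsSystemRelation() and the
satisfies_hot/satisfies_warm flags are the ones the patch uses:

static bool
can_warm_update(Relation rel, HeapTuple oldtup,
                bool satisfies_hot, bool satisfies_warm)
{
    /* A plain HOT update is always preferred when it is possible. */
    if (satisfies_hot)
        return false;

    /*
     * Only the first non-HOT update of a chain may be WARM. If the old
     * tuple is already marked as part of a WARM chain (possibly by an
     * aborted update), a second WARM update could create duplicate
     * (key, CTID) index entries, so it is not attempted.
     */
    if (HeapTupleIsHeapWarmTuple(oldtup))
        return false;

    /* WARM is disabled on system catalogs for now. */
    if (IsSystemRelation(rel))
        return false;

    /* Otherwise, WARM is possible whenever the changed columns allow it. */
    return satisfies_warm;
}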
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 31, 2016 at 10:15 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Hi All,
As previously discussed [1], WARM is a technique to reduce write
amplification when an indexed column of a table is updated. HOT fails to
handle such updates and ends up inserting a new index entry in all indexes
of the table, irrespective of whether the index key has changed or not for
a specific index. The problem was highlighted by Uber's blog post [2], but
it was a well known problem and affects many workloads.
I realised that the patches had bit-rotted because of commit 8e1e3f958fb. Rebased
patches on the current master are attached. I also took this opportunity to
correct some white space errors and improve formatting of the README.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
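As a quick orientation for reviewers, the index AM side of the attached
patches boils down to one new optional callback. The sketch below is only a
consolidation of the amapi.h, hash.c and nbtree.h hunks; it is illustrative,
not new code:

/*
 * New callback in IndexAmRoutine: does this heap tuple still match the
 * index tuple? AMs that cannot answer (bloom, brin, gist in the patch)
 * leave it NULL, which disables WARM for any table carrying such an index.
 */
typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
                                    Relation heapRel, HeapTuple heapTuple);

/* An AM opts in from its handler function, as the patch does for hash: */
Datum
hashhandler(PG_FUNCTION_ARGS)
{
    IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

    /* ... existing fields filled as before ... */
    amroutine->amrecheck = hashrecheck;  /* btree registers btrecheck */

    PG_RETURN_POINTER(amroutine);
}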
Attachments:
0001_track_root_lp_v3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..69cd066 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
@@ -2250,13 +2251,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2415,7 +2416,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2713,7 +2715,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2721,7 +2724,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2993,6 +2997,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3044,7 +3049,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3174,7 +3180,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3251,7 +3257,7 @@ l1:
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
/* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3450,6 +3456,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3506,6 +3514,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3789,7 +3798,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3968,7 +3977,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4149,6 +4158,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way.
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4156,10 +4179,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4172,7 +4214,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4211,6 +4255,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4573,7 +4618,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4585,6 +4631,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4631,7 +4678,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5069,7 +5116,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5145,7 +5192,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5659,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5667,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5885,7 +5935,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5894,7 +5944,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -6011,7 +6061,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6137,7 +6188,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7486,6 +7539,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7605,6 +7659,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8260,7 +8315,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8350,7 +8405,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8485,8 +8542,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8622,7 +8680,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8756,12 +8815,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8889,9 +8953,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..8183920 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,17 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once its
+ * known
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +74,13 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
- ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..079a77f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..94b46b8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d7e5fad..76328ff 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,24 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -541,6 +565,43 @@ do { \
(((tup)->t_infomask & HEAP_HASEXTERNAL) != 0)
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
0002_warm_updates_v3.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index debf4f4..d49d179 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f7f44b4..813c2c3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..d7c50c1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..cf44214 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -263,6 +265,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..ebb9d6c 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
/*
@@ -352,3 +356,107 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first compute the hash of the value fetched from the
+ * heap and then do the comparison.
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 69cd066..e84f041 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -99,7 +99,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot, bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
@@ -1960,6 +1963,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check whether the HOT chain originating or continuing at tid ever became a
+ * WARM chain, even if the UPDATE that made it so eventually aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * The presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage, so the caller must recheck any tuple returned from this chain
+ * against the index keys.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1979,11 +2052,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2025,6 +2101,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2039,7 +2125,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/*
* Shouldn't see a HEAP_ONLY tuple at chain start.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2052,6 +2139,20 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+ if (recheck && *recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2124,18 +2225,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested the "recheck" hint, keep the buffer locked and
+ * pinned. In that case the caller must release the lock and pin on the
+ * buffer.
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3442,13 +3566,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3469,9 +3595,11 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool satisfies_hot;
+ bool satisfies_warm;
bool satisfies_key;
bool satisfies_id;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3496,6 +3624,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume this is not a WARM update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for HOT update. This is
* wasted effort if we fail to update or have to put the new tuple on a
@@ -3512,6 +3644,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3571,7 +3705,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* serendipitously arrive at the same key values.
*/
HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
+ exprindx_attrs,
+ updated_attrs,
+ &satisfies_hot, &satisfies_warm,
+ &satisfies_key,
&satisfies_id, &oldtup, newtup);
if (satisfies_key)
{
@@ -4118,6 +4255,34 @@ l2:
*/
if (satisfies_hot)
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both WARM and WARM-updated tuples because, if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out.
+ *
+ * XXX Later we'll add more checks to allow WARM chains to be WARM
+ * updated again. This is probably good enough for the first round
+ * of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to
+ * the caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes not affected by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (satisfies_warm &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4158,6 +4323,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root, and a third entry could create duplicates.
+ *
+ * XXX This should be revisited if we get an index (key, CTID)
+ * duplicate detection mechanism in place.
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4173,12 +4353,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4297,7 +4503,12 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Even with WARM we still count the stats as HOT updates, since we
+ * continue to use that term even though such updates are now more
+ * frequent than before.
+ */
+ pgstat_count_heap_update(relation, use_hot_update || use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4405,6 +4616,13 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
* will be checking very similar sets of columns, and doing the same tests on
* them, it makes sense to optimize and do them together.
*
+ * exprindx_attrs designates the set of attributes used in expression or
+ * predicate indexes. In this version, we don't allow a WARM update if any
+ * column used by an expression or predicate index is updated.
+ *
+ * If updated_attrs is not NULL, the caller wants to know the full set of
+ * changed attributes, so we must compute it even after HOT is ruled out.
+ *
* We receive three bitmapsets comprising the three sets of columns we're
* interested in. Note these are destructively modified; that is OK since
* this is invoked at most once in heap_update.
@@ -4417,7 +4635,11 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
static void
HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot,
+ bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup)
{
@@ -4454,8 +4676,11 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* Since the HOT attributes are a superset of the key attributes and
* the key attributes are a superset of the id attributes, this logic
* is guaranteed to identify the next column that needs to be checked.
+ *
+ * If the caller also wants to know the list of updated index
+ * attributes, we must scan through all the attributes
*/
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
+ if ((hot_result || updated_attrs) && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_hot_attnum;
else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_key_attnum;
@@ -4476,8 +4701,12 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
if (check_now == next_id_attnum)
id_result = false;
+ if (updated_attrs)
+ *updated_attrs = bms_add_member(*updated_attrs, check_now -
+ FirstLowInvalidHeapAttributeNumber);
+
/* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
+ if (!hot_result && !key_result && !id_result && !updated_attrs)
break;
}
@@ -4488,7 +4717,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* bms_first_member() will return -1 and the attribute number will end
* up with a value less than FirstLowInvalidHeapAttributeNumber.
*/
- if (hot_result && check_now == next_hot_attnum)
+ if ((hot_result || updated_attrs) && check_now == next_hot_attnum)
{
next_hot_attnum = bms_first_member(hot_attrs);
next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
@@ -4505,6 +4734,13 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
}
}
+ if (updated_attrs && bms_overlap(*updated_attrs, exprindx_attrs))
+ *satisfies_warm = false;
+ else if (!relation->rd_supportswarm)
+ *satisfies_warm = false;
+ else
+ *satisfies_warm = true;
+
*satisfies_hot = hot_result;
*satisfies_key = key_result;
*satisfies_id = id_result;
@@ -4528,7 +4764,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7415,6 +7651,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7428,6 +7665,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7450,6 +7688,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7555,6 +7797,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7566,6 +7809,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7639,6 +7885,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8006,24 +8254,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8610,16 +8872,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8679,6 +8947,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8814,6 +9086,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
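To make the new branch easier to review, here is a tiny standalone model of
the decision heap_update() makes once it knows the new version fits on the
same page. The enum and function names are illustrative only; the conditions
mirror the patch (satisfies_warm already folds in the expression/predicate
index and AM-support checks):

#include <stdbool.h>
#include <stdio.h>

typedef enum { UPDATE_REGULAR, UPDATE_HOT, UPDATE_WARM } UpdateKind;

static UpdateKind
decide_update_kind(bool same_page, bool satisfies_hot, bool satisfies_warm,
                   bool old_tuple_is_warm, bool is_system_relation)
{
    if (!same_page)
        return UPDATE_REGULAR;      /* new version placed on another page */
    if (satisfies_hot)
        return UPDATE_HOT;          /* no indexed column changed */
    if (satisfies_warm && !old_tuple_is_warm && !is_system_relation)
        return UPDATE_WARM;         /* first WARM update on this chain */
    return UPDATE_REGULAR;          /* chain already WARM, or WARM not usable */
}

int
main(void)
{
    /* a second index-key update on the same chain falls back to a regular update */
    printf("%d\n", (int) decide_update_kind(true, false, true, true, false));
    return 0;
}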
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item i is becoming a redirected line pointer
+ * and points to a WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..6ca1d15 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -409,7 +411,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +450,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +477,13 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +493,50 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+ if (scan->xs_tuple_recheck &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may see our own tuple
+ * again. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer.
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck whether the tuple actually satisfies
+ * the index key. Otherwise, we might be
+ * following a wrong index pointer and must not
+ * consider this tuple.
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
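The comparison loop in btrecheck() boils down to the shape below; this is a
standalone model with plain ints standing in for Datums and datumIsEqual(),
meant only to make the NULL handling explicit:

#include <stdbool.h>

static bool
keys_match(const int *heap_vals, const bool *heap_nulls,
           const int *index_vals, const bool *index_nulls, int natts)
{
    for (int i = 0; i < natts; i++)
    {
        if (heap_nulls[i] && index_nulls[i])
            continue;               /* both NULL: treated as equal */
        if (heap_nulls[i] || index_nulls[i])
            return false;           /* exactly one NULL: not equal */
        if (heap_vals[i] != index_vals[i])
            return false;           /* raw value comparison, like datumIsEqual() */
    }
    return true;
}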
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b0b43cf..36467b2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1674,6 +1675,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5947e72..75af34c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2491,6 +2491,7 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
NIL);
@@ -2606,6 +2607,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..ca40e1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index keys, so mark
+ * the page as !all_visible.
+ *
+ * XXX Should we look at the root line pointer and check
+ * whether the WARM flag is set there, or is checking the
+ * tuples in the chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0e2d834..da27cf6 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *updated_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If updated_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can reach the
+ * new tuple via their existing index pointers.
+ */
+ if (updated_attrs)
+ {
+ if (!bms_overlap(updated_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..0b04bb8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..7367e9a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -433,6 +433,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -479,6 +480,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -809,6 +811,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *updated_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -923,7 +928,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &updated_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1011,9 +1016,24 @@ lreplace:;
*
* If it's a HOT update, we mustn't insert new index entries.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(updated_attrs);
+ updated_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ updated_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
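For clarity, the executor-side contract that the execIndexing.c and
nodeModifyTable.c changes implement can be modelled standalone as follows
(types and names here are illustrative, not from the patch):

#include <stdbool.h>

typedef struct { unsigned block; unsigned offset; } Tid;

/* Which TID should any new index entries point at? */
static Tid
index_insert_target(bool warm_update, Tid new_tuple_tid, unsigned root_offset)
{
    if (warm_update)
        new_tuple_tid.offset = root_offset; /* aim at the chain's root line pointer */
    return new_tuple_tid;                   /* otherwise aim at the new tuple itself */
}

/* Does a given index need a new entry for this update? */
static bool
index_needs_new_entry(bool warm_update, bool index_columns_changed)
{
    /* WARM: only indexes whose columns changed get a new entry */
    return warm_update ? index_columns_changed : true;
}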
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..37874ca 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* columns used in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* true if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,32 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * Check if the index AM provides an amrecheck method. If it does not,
+ * the index cannot support WARM, so completely disable WARM updates
+ * on such tables.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4522,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4535,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..aa7b265 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -111,6 +111,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1271,6 +1272,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..a25ce5a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -364,4 +364,8 @@ extern bool _hash_convert_tuple(Relation index,
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 94b46b8..4c05947 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
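Since the clean record now carries four groups of offsets, the redo side has
to slice them back out of the single registered array. Here is a standalone
model of that arithmetic, using the same ordering as log_heap_clean() and
heap_xlog_clean() above (redirected pairs, dead, warm, unused); nunused is
implied by the record length only after skipping the warm group:

#include <stdint.h>

typedef uint16_t MockOffsetNumber;

typedef struct
{
    const MockOffsetNumber *redirected; /* 2 * nredirected entries */
    const MockOffsetNumber *nowdead;    /* ndead entries */
    const MockOffsetNumber *warm;       /* nwarm entries */
    const MockOffsetNumber *nowunused;  /* whatever remains */
    int         nunused;
} CleanOffsets;

static CleanOffsets
slice_clean_offsets(const MockOffsetNumber *data, const MockOffsetNumber *end,
                    int nredirected, int ndead, int nwarm)
{
    CleanOffsets o;

    o.redirected = data;
    o.nowdead = o.redirected + nredirected * 2;
    o.warm = o.nowdead + ndead;
    o.nowunused = o.warm + nwarm;
    o.nunused = (int) (end - o.nowunused);
    return o;
}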
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 76328ff..b139bb2 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1000 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* tuple is part of a WARM chain */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -753,6 +769,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 49c2a6f..880e62e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -110,7 +110,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..60a5445 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *updated_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a4ea1b9..42f8ecf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at line
+ * pointer may contain WARM tuples. This must only be interpreted along with
+ * LP_REDIRECT flag
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in lp_len field to indicate this state. This is required only for
+ * LP_REDIRECT tuple and lp_len field is unused for such line pointers.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+	AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+	(itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..dac32b5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature eliminated many redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is same as HOT
+updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted into an index
+for the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, consider a table with two columns and one index on each
+column. When a tuple is first inserted into the table, each index has
+exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any
+new entry, and Index2's new entry still points to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without any corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for an index key match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, and hence the recheck
+routine for the hash AM must first compute the hash value of the heap
+attribute and then compare it against the value stored in the index
+tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on such tables.
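+
+As a rough, hedged illustration only (this is not the submitted code),
+the core of a recheck for a single-column hash index could look like the
+following; 'hashproc' is assumed to be the index column's hash support
+function and 'heapattno' the corresponding heap column:
+
+    static bool
+    hash_recheck_sketch(Relation indexRel, IndexTuple indexTuple,
+                        Relation heapRel, HeapTuple heapTuple,
+                        FmgrInfo *hashproc, AttrNumber heapattno)
+    {
+        bool    heapnull, indexnull;
+        Datum   heapdatum = heap_getattr(heapTuple, heapattno,
+                                         RelationGetDescr(heapRel), &heapnull);
+        Datum   indexdatum = index_getattr(indexTuple, 1,
+                                           RelationGetDescr(indexRel), &indexnull);
+
+        /* a NULL matches only a NULL */
+        if (heapnull || indexnull)
+            return heapnull == indexnull;
+
+        /* the index stores the 32-bit hash of the column value */
+        return DatumGetUInt32(FunctionCall1(hashproc, heapdatum)) ==
+               DatumGetUInt32(indexdatum);
+    }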
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys pointing to the same WARM chain. With duplicates, the same
+valid tuple becomes reachable via multiple index entries, each of which
+satisfies the index key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements 1 i.e. do not do WARM updates to a tuple
+from a WARM chain. HOT updates are fine because they do not add a new
+index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular updates is cut roughly in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)", which
+will return the same value if the new heap value differs only in case.
+So we cannot rely solely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not a new index entry is needed when
+columns referenced in the predicate expression change.
+
+(Neither of these is currently implemented; we simply disallow a
+WARM update if a column used in an expression index or an index
+predicate has changed.)
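+
+As a hedged sketch of this guard (the surrounding variable names are
+illustrative, not from the patch), using the new relcache bitmap:
+
+    /* attrs_changed is the set of columns modified by this UPDATE */
+    Bitmapset  *exprattrs = RelationGetIndexAttrBitmap(relation,
+                                        INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+
+    if (bms_overlap(attrs_changed, exprattrs))
+        use_warm_update = false;    /* fall back to a regular update */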
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of the
+tuple being updated. The t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain, and in that case the t_ctid field usually points to the tuple itself.
+So in theory, we could use t_ctid to store additional information in
+the last tuple of the update chain, provided the fact that the tuple
+is the last tuple is recorded elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+
+If an UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which remains the last valid tuple in the
+chain no longer carries the root line pointer information. In such
+rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
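+
+As a hedged sketch (mirroring what the patch does in heap_update), the
+root line pointer can be located like this:
+
+    OffsetNumber    root_offnum;
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+        /* aborted update or pg_upgraded data: fall back to a page scan */
+        heap_get_root_tuple_one(page,
+                                ItemPointerGetOffsetNumber(&oldtup.t_self),
+                                &root_offnum);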
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for an index key match (the case when an old tuple is
+returned via the new index key). So we must always follow the update
+chain to the end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This handles the most
+common case, where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
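+
+For illustration only (the actual pruning code differs), carrying the
+WARM marker over when the root becomes a redirect could look like:
+
+    ItemIdSetRedirect(rootlp, firstoffnum);
+    if (chain_has_warm_tuples)      /* hypothetical flag computed earlier */
+        ItemIdSetHeapWarm(rootlp);  /* stores SpecHeapWarmLen in lp_len */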
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But it also implies that the benefit of WARM will be
+no more than 50%. That is still significant, but if we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between these two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with the Blue flag and the index entries
+associated with those rows are also marked with the Blue flag. When a row is
+WARM updated, the new version is marked with the Red flag and the new index
+entry created by the update is also marked with the Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from index scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only
+Blue pointer and that must not be removed. IOW we should remove Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have 2 index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for the case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for the presence of multiple Blue pointers and ensure that we neither
+accidentally remove either of the Blue pointers nor clear the WARM
+marking on such chains.
On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:
Instead, what I would like to propose and the patch currently implements is to
restrict WARM update to once per chain. So the first non-HOT update to a tuple
or a HOT chain can be a WARM update. The chain can further be HOT updated any
number of times. But it can no further be WARM updated. This might look too
restrictive, but it can still bring down the number of regular updates by
almost 50%. Further, if we devise a strategy to convert a WARM chain back to
HOT chain, it can again be WARM updated. (This part is currently not
implemented). A good side effect of this simple strategy is that we know there
can maximum two index entries pointing to any given WARM chain.
I like the simplified approach, as long as it doesn't block further
improvements.
Headline TPS numbers:
Master:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 65552986
latency average: 14.059 ms
tps = 1138.072117 (including connections establishing)
tps = 1138.072156 (excluding connections establishing)

WARM:
transaction type: update.sql
scaling factor: 700
query mode: simple
number of clients: 16
number of threads: 8
duration: 57600 s
number of transactions actually processed: 116168454
latency average: 7.933 ms
tps = 2016.812924 (including connections establishing)
tps = 2016.812997 (excluding connections establishing)
These are very impressive results.
Converting WARM chains back to HOT chains (VACUUM ?)
----------------------------------------------------

The current implementation of WARM allows only one WARM update per chain. This
simplifies the design and addresses certain issues around duplicate scans. But
this also implies that the benefit of WARM will be no more than 50%, which is
still significant, but if we could return WARM chains back to normal status, we
could do far more WARM updates.

A distinct property of a WARM chain is that at least one index has more than
one live index entries pointing to the root of the chain. In other words, if we
can remove duplicate entry from every index or conclusively prove that there
are no duplicate index entries for the root line pointer, the chain can again
be marked as HOT.
I had not thought of how to convert from WARM to HOT yet.
Here is one idea, but more thoughts/suggestions are most welcome.
A WARM chain has two parts, separated by the tuple that caused WARM update. All
tuples in each part has matching index keys, but certain index keys may not
match between these two parts. Lets say we mark heap tuples in each part with a
special Red-Blue flag. The same flag is replicated in the index tuples. For
example, when new rows are inserted in a table, they are marked with Blue flag
and the index entries associated with those rows are also marked with Blue
flag. When a row is WARM updated, the new version is marked with Red flag and
the new index entry created by the update is also marked with Red flag.Heap chain: lp �[1] [2] [3] [4]
� [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]RIndex1: (aaaa)B points to 1 (satisfies only tuples marked with B)
(bbbb)R points to 1 (satisfies only tuples marked with R)Index2: (1111)B points to 1 (satisfies both B and R tuples)
It's clear that for indexes with Red and Blue pointers, a heap tuple with Blue
flag will be reachable from Blue pointer and that with Red flag will be
reachable from Red pointer. But for indexes which did not create a new entry,
both Blue and Red tuples will be reachable from Blue pointer (there is no Red
pointer in such indexes). So, as a side note, matching Red and Blue flags is
not enough from index scan perspective.

During first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE set.
If all live tuples in the chain are either marked with Blue flag or Red flag
(but no mix of Red and Blue), then the chain is a candidate for HOT conversion.
Uh, if the chain is all blue, then there are no WARM entries so it is
already a HOT chain, so there is nothing to do, right?
We remember the root line pointer and Red-Blue flag of the WARM chain in a
separate array.

If we have a Red WARM chain, then our goal is to remove Blue pointers and vice
versa. But there is a catch. For Index2 above, there is only Blue pointer
and that must not be removed. IOW we should remove Blue pointer iff a Red
pointer exists. Since index vacuum may visit Red and Blue pointers in any
order, I think we will need another index pass to remove dead
index pointers. So in the first index pass we check which WARM candidates have
2 index pointers. In the second pass, we remove the dead pointer and reset the Red
flag if the surviving index pointer is Red.
Why not just remember the tid of chains converted from WARM to HOT, then
use "amrecheck" on an index entry matching that tid to see if the index
matches one of the entries in the chain. (It will match all of them or
none of them, because they are all red.) I don't see a point in
coloring the index entries as reds as later you would have to convert to
blue in the WARM-to-HOT conversion, and a vacuum crash could lead to
inconsistencies. Consider that you can just call "amrecheck" on the few
chains that have converted from WARM to HOT. I believe this is more
crash-safe too. However, if you have converted WARM to HOT in the heap,
but crash during the index entry removal, you could potentially have
duplicates in the index later, which is bad.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Wed, Aug 31, 2016 at 04:03:29PM -0400, Bruce Momjian wrote:
Why not just remember the tid of chains converted from WARM to HOT, then
use "amrecheck" on an index entry matching that tid to see if the index
matches one of the entries in the chain. (It will match all of them or
none of them, because they are all red.) I don't see a point in
coloring the index entries as reds as later you would have to convert to
blue in the WARM-to-HOT conversion, and a vacuum crash could lead to
inconsistencies. Consider that you can just call "amrecheck" on the few
chains that have converted from WARM to HOT. I believe this is more
crash-safe too. However, if you have converted WARM to HOT in the heap,
but crash during the index entry removal, you could potentially have
duplicates in the index later, which is bad.
I think Pavan had the "crash durin the index entry removal" fixed via:
During the second heap scan, we fix WARM chain by clearing HEAP_WARM_TUPLE flag
and also reset Red flag to Blue.
so the marking from WARM to HOT only happens after the index has been cleaned.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Sep 1, 2016 at 1:33 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:
Instead, what I would like to propose and the patch currently implements is to
restrict WARM update to once per chain. So the first non-HOT update to a tuple
or a HOT chain can be a WARM update. The chain can further be HOT updated any
number of times. But it can no further be WARM updated. This might look too
restrictive, but it can still bring down the number of regular updates by
almost 50%. Further, if we devise a strategy to convert a WARM chain back to
HOT chain, it can again be WARM updated. (This part is currently not
implemented). A good side effect of this simple strategy is that we know there
can maximum two index entries pointing to any given WARM chain.
I like the simplified approach, as long as it doesn't block further
improvements.
Yes, the proposed approach is simple yet does not stop us from improving
things further. Moreover it has shown good performance characteristics and
I believe it's a good first step.
Master:
tps = 1138.072117 (including connections establishing)

WARM:
tps = 2016.812924 (including connections establishing)

These are very impressive results.
Thanks. What's also interesting and something that headline numbers don't
show is that WARM TPS is as much as 3 times the master TPS when the
percentage of WARM updates is very high. Notice the spike in TPS in the
comparison graph.
Results with non-default heap fill factor are even better. In both cases,
the improvement in TPS stays constant over long periods.
During first heap scan of VACUUM, we look for tuples with
HEAP_WARM_TUPLE set.
If all live tuples in the chain are either marked with Blue flag or Red
flag
(but no mix of Red and Blue), then the chain is a candidate for HOT
conversion.
Uh, if the chain is all blue, then there are no WARM entries so it is
already a HOT chain, so there is nothing to do, right?
For aborted WARM updates, the heap chain may be all blue, but there may
still be a red index pointer which must be cleared before we allow further
WARM updates to the chain.
We remember the root line pointer and Red-Blue flag of the WARM chain in
a
separate array.
If we have a Red WARM chain, then our goal is to remove Blue pointers
and vice
versa. But there is a catch. For Index2 above, there is only Blue pointer
and that must not be removed. IOW we should remove Blue pointer iff a Red
pointer exists. Since index vacuum may visit Red and Blue pointers in any
order, I think we will need another index pass to remove dead
index pointers. So in the first index pass we check which WARM candidates have
2 index pointers. In the second pass, we remove the dead pointer and
reset the Red
flag if the surviving index pointer is Red.
Why not just remember the tid of chains converted from WARM to HOT, then
use "amrecheck" on an index entry matching that tid to see if the index
matches one of the entries in the chain.
That will require random access to heap during index vacuum phase,
something I would like to avoid. But we can have that as a fall back
solution for handling aborted vacuums.
(It will match all of them or
none of them, because they are all red.) I don't see a point in
coloring the index entries as reds as later you would have to convert to
blue in the WARM-to-HOT conversion, and a vacuum crash could lead to
inconsistencies.
Yes, that's a concern since the conversion of red to blue will also need to be
WAL logged to ensure that a crash doesn't leave us in an inconsistent state. I
still think that this will be an overall improvement as compared to
allowing one WARM update per chain.
Consider that you can just call "amrecheck" on the few
chains that have converted from WARM to HOT. I believe this is more
crash-safe too. However, if you have converted WARM to HOT in the heap,
but crash durin the index entry removal, you could potentially have
duplicates in the index later, which is bad.
As you probably already noted, we clear heap flags only after all indexes
are cleared of duplicate entries and hence a crash in between should not
cause any correctness issue. As long as heap tuples are marked as warm,
amrecheck will ensure that only valid tuples are returned to the caller.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:
I like the simplified approach, as long as it doesn't block further
improvements.

Yes, the proposed approach is simple yet does not stop us from improving things
further. Moreover it has shown good performance characteristics and I believe
it's a good first step.
Agreed. This is BIG. Do you think it can be done for PG 10?
Thanks. What's also interesting and something that headline numbers don't show
is that WARM TPS is as much as 3 times the master TPS when the percentage of
WARM updates is very high. Notice the spike in TPS in the comparison graph.

Results with non-default heap fill factor are even better. In both cases, the
improvement in TPS stays constant over long periods.
Yes, I expect the benefits of this to show up in better long-term
performance.
During first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE
set.
If all live tuples in the chain are either marked with Blue flag or Red
flag
(but no mix of Red and Blue), then the chain is a candidate for HOT
conversion.
Uh, if the chain is all blue, then there are no WARM entries so it is
already a HOT chain, so there is nothing to do, right?

For aborted WARM updates, the heap chain may be all blue, but there may still
be a red index pointer which must be cleared before we allow further WARM
updates to the chain.
Ah, understood now. Thanks.
Why not just remember the tid of chains converted from WARM to HOT, then
use "amrecheck" on an index entry matching that tid to see if the index
matches one of the entries in the chain.

That will require random access to heap during index vacuum phase, something I
would like to avoid. But we can have that as a fall back solution for handling
aborted vacuums.
Yes, that is true. So the challenge is figuring out which of the index
entries pointing to the same tid is valid, and coloring helps with that?
(It will match all of them or
none of them, because they are all red.) I don't see a point in
coloring the index entries as reds as later you would have to convert to
blue in the WARM-to-HOT conversion, and a vacuum crash could lead to
inconsistencies.

Yes, that's a concern since the conversion of red to blue will also need to be
WAL logged to ensure that a crash doesn't leave us in an inconsistent state. I still
think that this will be an overall improvement as compared to allowing one WARM
update per chain.
OK. I will think some more on this to see if I can come with another
approach.
Consider that you can just call "amrecheck" on the few
chains that have converted from WARM to HOT. I believe this is more
crash-safe too. However, if you have converted WARM to HOT in the heap,
but crash during the index entry removal, you could potentially have
duplicates in the index later, which is bad.

As you probably already noted, we clear heap flags only after all indexes are
cleared of duplicate entries and hence a crash in between should not cause any
correctness issue. As long as heap tuples are marked as warm, amrecheck will
ensure that only valid tuples are returned to the caller.
OK, got it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Sep 1, 2016 at 9:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:
I like the simplified approach, as long as it doesn't block further
improvements.

Yes, the proposed approach is simple yet does not stop us from improving things
further. Moreover it has shown good performance characteristics and I believe
it's a good first step.
Agreed. This is BIG. Do you think it can be done for PG 10?
I definitely think so. The patches as submitted are fully functional and
sufficient. Of course, there are XXX and TODOs that I hope to sort out
during the review process. There are also further tests needed to ensure
that the feature does not cause significant regression in the worst cases.
Again something I'm willing to do once I get some feedback on the broader
design and test cases. What I am looking for at this stage is to know if I've
missed something important in terms of design or if there is some show
stopper that I overlooked.
Latest patches rebased with current master are attached. I also added a few
more comments to the code. I forgot to give a brief about the patches, so
including that as well.
0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to
track the latest tuple in an update chain. The t_ctid.ip_posid field is used to
track the root line pointer of the update chain. We do this only in the latest
tuple in the chain because most often that tuple will be updated, and we
need to quickly find the root only during update.
0002_warm_updates_v4.patch: This patch implements the core WARM logic.
During a WARM update, we only insert new entries into the indexes whose key has
changed. But instead of indexing the real TID of the new tuple, we index
the root line pointer and then use additional recheck logic to ensure that only
correct tuples are returned from such potentially broken HOT chains. Each
index AM must implement an amrecheck method to support WARM. The patch
currently implements this for hash and btree indexes.
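
To give a feel for what an amrecheck routine involves, here is a hedged
sketch of a btree recheck (illustrative only, not the submitted code; it
assumes plain column indexes and uses datumIsEqual() rather than the
opclass comparison function):

    bool
    btrecheck(Relation indexRel, IndexTuple indexTuple,
              Relation heapRel, HeapTuple heapTuple)
    {
        TupleDesc   indexDesc = RelationGetDescr(indexRel);
        TupleDesc   heapDesc = RelationGetDescr(heapRel);
        int         natts = indexRel->rd_index->indnatts;
        int         i;

        for (i = 1; i <= natts; i++)
        {
            /* map the index column back to its heap column */
            AttrNumber  heapattno = indexRel->rd_index->indkey.values[i - 1];
            Form_pg_attribute att = indexDesc->attrs[i - 1];
            bool        indexnull, heapnull;
            Datum       indexdatum = index_getattr(indexTuple, i, indexDesc,
                                                   &indexnull);
            Datum       heapdatum = heap_getattr(heapTuple, heapattno, heapDesc,
                                                 &heapnull);

            if (indexnull != heapnull)
                return false;
            if (!indexnull &&
                !datumIsEqual(indexdatum, heapdatum,
                              att->attbyval, att->attlen))
                return false;
        }
        return true;
    }
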
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v4.patch (application/octet-stream)
commit f33ee503463137aa1a2ae4c3ab04d1468ae1941c
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sat Sep 3 14:51:00 2016 +0530
Use HEAP_LATEST_TUPLE to mark a tuple as the latest tuple in an update chain
and use OffsetNumber in t_ctid to store the root line pointer of the chain.
t_ctid field in the tuple header is usually used to store TID of the next tuple
in an update chain. But for the last tuple in the chain, t_ctid is made to
point to itself. When t_ctid points to itself, that signals the end of the
chain. With this patch, the information about a tuple being the last tuple in the
chain is stored in a separate HEAP_LATEST_TUPLE flag. This uses another free bit
in t_infomask2. When HEAP_LATEST_TUPLE is set, the OffsetNumber field in t_ctid
stores the root line pointer of the chain. This will help us quickly find the root
of an update chain.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..ccf84be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
@@ -2250,13 +2251,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2415,7 +2416,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2713,7 +2715,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2721,7 +2724,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2993,6 +2997,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3044,7 +3049,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3174,7 +3180,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3250,8 +3256,8 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ /* Mark this tuple as the latest tuple in the update chain */
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3450,6 +3456,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3506,6 +3514,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3789,7 +3798,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3968,7 +3977,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4149,6 +4158,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4156,10 +4179,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4172,7 +4214,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4211,6 +4255,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4573,7 +4618,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4585,6 +4631,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4631,7 +4678,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5069,7 +5116,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5145,7 +5192,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5659,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5667,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5885,7 +5935,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5894,7 +5944,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -6011,7 +6061,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6137,7 +6188,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7486,6 +7539,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7605,6 +7659,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8260,7 +8315,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8350,7 +8405,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8485,8 +8542,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8622,7 +8680,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8756,12 +8815,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8889,9 +8953,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..e32deb1 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once its
+ * known. The former is used while updating an existing tuple while latter is
+ * used during insertion of a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +75,13 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
- ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..079a77f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..94b46b8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d7e5fad..d01e0d8 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in the update
+ * chain and ip_posid of t_ctid points
+ * to the root line pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,30 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set on the last tuple in the update chain. But for
+ * clusters which were upgraded from a pre-10.0 release the flag may be
+ * missing, so we also check whether t_ctid points to the tuple itself and,
+ * if so, declare the tuple to be the latest tuple in the chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +572,55 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get the TID of the next tuple in the update chain. Traditionally, we have
+ * stored the tuple's own TID in t_ctid when it is the last tuple in the chain.
+ * We preserve that behaviour by returning the tuple's own TID, built from the
+ * caller-supplied offset, when the HEAP_LATEST_TUPLE flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
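+/*
+ * The root line pointer offset is stored in the ip_posid field of t_ctid,
+ * but only while HEAP_LATEST_TUPLE is set; callers should test
+ * HeapTupleHeaderHasRootOffset() before relying on
+ * HeapTupleHeaderGetRootOffset().
+ */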
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
Attachment: 0002_warm_updates_v4.patch (application/octet-stream)
commit b0fa1d2aeadecbcba10ef90cf467c835aef693b1
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Mon Sep 5 09:26:11 2016 +0530
Add support for WARM (Write Amplification Reduction Method)
We have used HOT updates to handle the cases when the UPDATE does not
change any index column. In such cases, we could avoid inserting duplicate
index entries, which not only reduces index bloat, but also makes it easier to
clean up dead tuples without having to visit the indexes. But when an UPDATE
changes an index column, we insert duplicate entries in all indexes. This
results in write amplification, especially for tables with a large number of
indexes.
WARM takes it further by avoiding duplicate index entries for indexes
whose columns are not being updated. We insert new entries only into those
indexes whose keys have changed, but instead of indexing the actual TID of the
new tuple, we index the root line pointer of the HOT chain. As a side effect,
for correctness we now must verify that the index is pointing to a tuple which
really satisfies the index key.
Each index AM must implement an amrecheck method which returns true iff the
index key constructed from the given heap tuple matches the given index key.
The patch currently works with several restrictions:
1. WARM updates on system tables are disabled. While we disabled them for
ease of development, there could be some issues with system tables because they
apparently do not support lossy indexes.
2. Only one WARM update per HOT chain is allowed. That seems very
restrictive, but even that should reduce index bloat by 50%. Subsequently, we
will optimise this by either allowing multiple WARM updates or by turning WARM
chains to HOT chains as and when tuples retire.
3. Expression and partial indexes don't work with WARM updates. For
expression indexes, we will need to find a way to determine if the old and new
tuples compute to the same index expression and avoid adding a duplicate index
entry in such cases. This is not only required to avoid unnecessary index
bloat, but also for correctness purposes. Similarly, for partial indexes, we
must insert a new index entry if the old tuple did not satisfy the predicate but
the new one does.
4. If a table has an index which does not provide the amrecheck method, WARM
is disabled on that table.
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index debf4f4..d49d179 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f7f44b4..813c2c3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..d7c50c1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..cf44214 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -263,6 +265,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..71377ab 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
/*
@@ -352,3 +356,110 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block have enough space to
+store the new version of the tuple. This is the same requirement as for
+HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted into an index
+for the updated tuple as part of a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on
+each of them. When a tuple is first inserted into the table, each index
+has exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and there is room on the
+page, we perform a WARM update. In that case Index1 does not get any new
+entry, and Index2's new entry still points to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all index entries always point to the root of the WARM chain, even
+when there is more than one of them, WARM chains can be pruned and
+dead tuples can be removed without any need for corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, and hence the recheck
+routine for the hash AM must first compute the hash value of the heap
+attribute and then compare it against the value stored in the index
+tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If a table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
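+
+For reference, the recheck callback wired into IndexAmRoutine by this
+patch takes the index relation, the index tuple through which the heap
+tuple was reached, the heap relation and the candidate heap tuple, and
+returns true only if the heap tuple still matches the index key. A
+sketch of its shape (see the hashrecheck and btrecheck implementations
+in this patch):
+
+    bool
+    amrecheck(Relation indexRel, IndexTuple indexTuple,
+              Relation heapRel, HeapTuple heapTuple);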
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works only as long as there are no duplicate
+index entries with the same key pointing to the same WARM chain.
+Otherwise, the same valid tuple would be reachable via multiple index
+entries, each of them satisfying the index key check. In the above
+example, if the tuple [1111, bbbb] is again updated to [1111, aaaa] and
+we insert a new index entry (aaaa) pointing to the root line pointer, we
+end up with the following structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. we do not allow a WARM
+update to a tuple that is already part of a WARM chain. HOT updates are
+fine because they do not add a new index entry.
+
+Even with this restriction, this is a significant improvement because
+the number of regular updates is cut roughly in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)", which
+returns the same value if the new heap value differs only in case. So we
+cannot rely solely on the heap column check to decide whether or not to
+insert a new index entry for an expression index. Similarly, for partial
+indexes the predicate expression must be evaluated to decide whether or
+not a new index entry is needed when columns referenced in the predicate
+change.
+
+(None of this is currently implemented; we simply disallow a WARM update
+if a column used by an expression index or an index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of
+the tuple being updated. Normally the t_ctid field in the heap tuple
+header is used to find the next tuple in the update chain. But the tuple
+that we are updating must be the last tuple in the chain, and in that
+case the t_ctid field usually just points to the tuple itself. So in
+theory we could use t_ctid to store additional information in the last
+tuple of the update chain, as long as the fact that it is the last tuple
+is recorded elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which then remains the last valid tuple in
+the chain no longer carries the root line pointer information. In such
+rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
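+
+A minimal sketch of how a caller locates the root line pointer under
+this scheme, mirroring what heap_update does in this patch:
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+    {
+        /* rare fallback, e.g. after an aborted update: scan the page */
+        heap_get_root_tuple_one(page,
+                                ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+                                &root_offnum);
+    }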
+
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in
+t_infomask2 to store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still be
+rechecked for an index key match (the case where the old tuple is
+reached via the new index key). So every time we must follow the update
+chain to the end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
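+
+A short sketch of how a chain walk consumes this hint, as done in
+heap_hot_search_buffer in this patch:
+
+    if (ItemIdIsRedirected(lp))
+    {
+        /* a WARM flag on the redirect line pointer forces a key recheck */
+        if (ItemIdIsHeapWarm(lp))
+            *recheck = true;
+        /* follow the redirect */
+        offnum = ItemIdGetRedirect(lp);
+        continue;
+    }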
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But it also implies that the benefit of WARM is capped
+at roughly 50%, which is still significant. If we could return WARM
+chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but
+certain index keys may not match between the two parts. Let's say we
+mark heap tuples in each part with a special Red-Blue flag. The same
+flag is replicated in the index tuples. For example, when new rows are
+inserted into a table, they are marked with the Blue flag and the index
+entries associated with those rows are also marked with the Blue flag.
+When a row is WARM updated, the new version is marked with the Red flag
+and the new index entry created by the update is also marked with the
+Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap
+tuple with the Blue flag will be reachable from the Blue pointer and one
+with the Red flag from the Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from the
+Blue pointer (there is no Red pointer in such indexes). So, as a side
+note, matching Red and Blue flags alone is not enough from an index scan
+perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only a
+Blue pointer and it must not be removed. IOW we should remove a Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have 2 index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is
+Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
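+
+A very rough sketch of the first heap scan decision described above;
+none of this is implemented yet and the helper names below are made up
+purely for illustration:
+
+    /* for each chain whose tuples carry HEAP_WARM_TUPLE */
+    if (all_live_tuples_are(chain, BLUE) ||
+        all_live_tuples_are(chain, RED))
+        remember_candidate(root_offnum, chain_colour); /* convert later */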
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing a Red index flag to Blue but before
+removing the other Blue pointer, we will end up with two Blue pointers
+to a Red WARM chain. But since the HEAP_WARM_TUPLE flag on the heap
+tuple is still set, further WARM updates to the chain will be blocked. I
+guess we will need some special handling for the case with multiple Blue
+pointers. We can either leave these WARM chains alone and let them die
+with a subsequent non-WARM update, or apply the heap-recheck logic
+during index vacuum to find the dead pointer. Given that vacuum aborts
+are not common, I am inclined to leave this case unhandled. We must
+still check for the presence of multiple Blue pointers and ensure that
+we neither accidentally remove either of the Blue pointers nor clear
+such WARM chains.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ccf84be..800a7c0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -99,7 +99,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot, bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
@@ -1960,6 +1963,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain originating at or continuing through tid ever became
+ * a WARM chain, even if the UPDATE operation that made it so later aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * The presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage, and the caller must recheck any tuple returned from this
+ * chain for index key satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1979,11 +2052,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set to false on entry by the caller, and will be set to
+ * true on exit if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2025,6 +2101,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2037,9 +2123,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM-updated tuple, in which case deferred triggers
+ * may request a fetch of a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2052,6 +2141,22 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+
+ /* WARM is not supported on system tables yet */
+ if (*recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2124,18 +2229,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested the "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3442,13 +3570,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3469,9 +3599,11 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool satisfies_hot;
+ bool satisfies_warm;
bool satisfies_key;
bool satisfies_id;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3496,6 +3628,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for HOT update. This is
* wasted effort if we fail to update or have to put the new tuple on a
@@ -3512,6 +3648,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3571,7 +3709,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* serendipitiously arrive at the same key values.
*/
HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
+ exprindx_attrs,
+ updated_attrs,
+ &satisfies_hot, &satisfies_warm,
+ &satisfies_key,
&satisfies_id, &oldtup, newtup);
if (satisfies_key)
{
@@ -4118,6 +4259,34 @@ l2:
*/
if (satisfies_hot)
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both WARM and WARM-updated tuples since, if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good enough for a first
+ * round of tests of the remaining functionality
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (satisfies_warm &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4158,6 +4327,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited if we get index (key, CTID) duplicate
+ * detection mechanism in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4173,12 +4357,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4297,7 +4507,12 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Even with WARM we still count the update as a HOT update for stats
+ * purposes, since we continue to use that term even though such
+ * updates are now more frequent than previously.
+ */
+ pgstat_count_heap_update(relation, use_hot_update || use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4405,6 +4620,13 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
* will be checking very similar sets of columns, and doing the same tests on
* them, it makes sense to optimize and do them together.
*
+ * The exprindx_attrs set designates the attributes used in index expressions
+ * or index predicates. Currently, we don't allow WARM updates if an
+ * expression or predicate index column is updated
+ *
+ * If updated_attrs is not NULL, then the caller is always interested in
+ * knowing the list of changed attributes
+ *
* We receive three bitmapsets comprising the three sets of columns we're
* interested in. Note these are destructively modified; that is OK since
* this is invoked at most once in heap_update.
@@ -4417,7 +4639,11 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
static void
HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot,
+ bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup)
{
@@ -4454,8 +4680,11 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* Since the HOT attributes are a superset of the key attributes and
* the key attributes are a superset of the id attributes, this logic
* is guaranteed to identify the next column that needs to be checked.
+ *
+ * If the caller also wants to know the list of updated index
+ * attributes, we must scan through all the attributes
*/
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
+ if ((hot_result || updated_attrs) && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_hot_attnum;
else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_key_attnum;
@@ -4476,8 +4705,16 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
if (check_now == next_id_attnum)
id_result = false;
+ /*
+ * Add the changed attribute to updated_attrs if the caller has
+ * asked for it
+ */
+ if (updated_attrs)
+ *updated_attrs = bms_add_member(*updated_attrs, check_now -
+ FirstLowInvalidHeapAttributeNumber);
+
/* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
+ if (!hot_result && !key_result && !id_result && !updated_attrs)
break;
}
@@ -4488,7 +4725,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* bms_first_member() will return -1 and the attribute number will end
* up with a value less than FirstLowInvalidHeapAttributeNumber.
*/
- if (hot_result && check_now == next_hot_attnum)
+ if ((hot_result || updated_attrs) && check_now == next_hot_attnum)
{
next_hot_attnum = bms_first_member(hot_attrs);
next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
@@ -4505,6 +4742,23 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
}
}
+ /*
+ * If an attribute used in the expression of an expression index or in the
+ * predicate of a partial index has changed, we don't yet support a WARM
+ * update
+ */
+ if (updated_attrs && bms_overlap(*updated_attrs, exprindx_attrs))
+ *satisfies_warm = false;
+ /* If the table does not support WARM update, honour that */
+ else if (!relation->rd_supportswarm)
+ *satisfies_warm = false;
+ /*
+ * XXX Should we handle some more cases? For example, if an update will
+ * result in new entries in many or most indexes, should we fall back to a
+ * regular update?
+ */
+ else
+ *satisfies_warm = true;
+
*satisfies_hot = hot_result;
*satisfies_key = key_result;
*satisfies_id = id_result;
@@ -4528,7 +4782,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7415,6 +7669,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7428,6 +7683,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7450,6 +7706,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7555,6 +7815,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7566,6 +7827,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7639,6 +7903,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8006,24 +8272,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8610,16 +8890,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8679,6 +8965,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8814,6 +9105,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item is becoming redirected lp and points a WARM
+ * chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..149a02d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -409,7 +411,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +450,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +477,13 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +493,63 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional
+ * checks, since WARM must have been disabled on such tables
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or is CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b0b43cf..36467b2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1674,6 +1675,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5947e72..75af34c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2491,6 +2491,7 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
NIL);
@@ -2606,6 +2607,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..ca40e1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index keys. So mark
+ * the page as !all_visible
+ *
+ * XXX Should we look at the root line pointer and check if
+ * WARM flag is set there or checking for tuples in the
+ * chain is good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0e2d834..da27cf6 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *updated_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If updated_attrs is set, we only insert index entries for those
+ * indexes whose column has changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (updated_attrs)
+ {
+ if (!bms_overlap(updated_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..0b04bb8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..11bd3c0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -433,6 +433,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -479,6 +480,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -809,6 +811,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *updated_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -923,7 +928,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &updated_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1010,10 +1015,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(updated_attrs);
+ updated_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ updated_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..37874ca 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true;/* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,32 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4522,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4535,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..aa7b265 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -111,6 +111,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1271,6 +1272,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..a25ce5a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -364,4 +364,8 @@ extern bool _hash_convert_tuple(Relation index,
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 94b46b8..4c05947 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d01e0d8..3a51681 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -771,6 +787,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 49c2a6f..880e62e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -110,7 +110,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..60a5445 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *updated_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a4ea1b9..42f8ecf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at line
+ * pointer may contain WARM tuples. This must only be interpreted along with
+ * LP_REDIRECT flag
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in lp_len field to indicate this state. This is required only for
+ * LP_REDIRECT tuple and lp_len field is unused for such line pointers.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..dac32b5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
On Mon, Sep 5, 2016 at 1:53 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to track
latest tuple in an update chain. The t_ctid.ip_posid is used to track the
root line pointer of the update chain. We do this only in the latest tuple
in the chain because most often that tuple will be updated and we need to
quickly find the root only during update.

0002_warm_updates_v4.patch: This patch implements the core of WARM logic.
During WARM update, we only insert new entries in the indexes whose key has
changed. But instead of indexing the real TID of the new tuple, we index the
root line pointer and then use additional recheck logic to ensure only
correct tuples are returned from such potentially broken HOT chains. Each
index AM must implement a amrecheck method to support WARM. The patch
currently implements this for hash and btree indexes.
Moved to next CF, I was surprised to see that it is not *that* large:
43 files changed, 1539 insertions(+), 199 deletions(-)
--
Michael
On 09/05/2016 06:53 AM, Pavan Deolasee wrote:
On Thu, Sep 1, 2016 at 9:44 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:

I like the simplified approach, as long as it doesn't block further
improvements.

Yes, the proposed approach is simple yet does not stop us from improving things
further. Moreover it has shown good performance characteristics and I believe
it's a good first step.

Agreed. This is BIG. Do you think it can be done for PG 10?
I definitely think so. The patches as submitted are fully functional and
sufficient. Of course, there are XXX and TODOs that I hope to sort out
during the review process. There are also further tests needed to ensure
that the feature does not cause significant regression in the worst
cases. Again something I'm willing to do once I get some feedback on the
broader design and test cases. What I am looking at this stage is to
know if I've missed something important in terms of design or if there
is some show stopper that I overlooked.

Latest patches rebased with current master are attached. I also added a
few more comments to the code. I forgot to give a brief about the
patches, so including that as well.

0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to
track latest tuple in an update chain. The t_ctid.ip_posid is used to
track the root line pointer of the update chain. We do this only in the
latest tuple in the chain because most often that tuple will be updated
and we need to quickly find the root only during update.

0002_warm_updates_v4.patch: This patch implements the core of WARM
logic. During WARM update, we only insert new entries in the indexes
whose key has changed. But instead of indexing the real TID of the new
tuple, we index the root line pointer and then use additional recheck
logic to ensure only correct tuples are returned from such potentially
broken HOT chains. Each index AM must implement a amrecheck method to
support WARM. The patch currently implements this for hash and btree
indexes.
Hi,
I've been looking at the patch over the past few days, running a bunch
of benchmarks etc. I can confirm the significant speedup, often by more
than 75% (depending on number of indexes, whether the data set fits into
RAM, etc.). Similarly for the amount of WAL generated, although that's a
bit more difficult to evaluate due to full_page_writes.
I'm not going to send detailed results, as that probably does not make
much sense at this stage of the development - I can repeat the tests
once the open questions get resolved.
There's a lot of useful and important feedback in the thread(s) so far,
particularly the descriptions of various failure cases. I think it'd be
very useful to collect those examples and turn them into regression
tests - that's something the patch should include anyway.
I don't really have many comments regarding the code, but during the
testing I noticed a bit strange behavior when updating statistics.
Consider a table like this:
create table t (a int, b int, c int) with (fillfactor = 10);
insert into t select i, i from generate_series(1,1000) s(i);
create index on t(a);
create index on t(b);
and update:
update t set a = a+1, b=b+1;
which has to update all indexes on the table, but:
select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables
n_tup_upd | n_tup_hot_upd
-----------+---------------
1000 | 1000
So it's still counted as "WARM" - does it make sense? I mean, we're
creating a WARM chain on the page, yet we have to add pointers into all
indexes (so not really saving anything). Doesn't this waste the one WARM
update per HOT chain without actually getting anything in return?
The way this is piggy-backed on the current HOT statistics seems a bit
strange for another reason, although WARM is a relaxed version of HOT.
Until now, HOT was "all or nothing" - we've either added index entries
to all indexes or none of them. So the n_tup_hot_upd was fine.
But WARM changes that - it allows adding index entries only to a subset
of indexes, which means the "per row" n_tup_hot_upd counter is not
sufficient. When you have a table with 10 indexes, and the counter
increases by 1, does that mean the update added index tuple to 1 index
or 9 of them?
So I think we'll need two counters to track WARM - number of index
tuples we've added, and number of index tuples we've skipped. So
something like blks_hit and blks_read. I'm not sure whether we should
replace the n_tup_hot_upd entirely, or keep it for backwards
compatibility (and to track perfectly HOT updates).
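To illustrate what I have in mind (the column names below are entirely made up,
nothing like this exists in the patch today), the pair of counters could then be
queried much like blks_hit/blks_read:

select n_tup_upd, n_tup_hot_upd,
       n_idx_tup_inserted,   -- hypothetical: index entries actually added by updates
       n_idx_tup_skipped,    -- hypothetical: index entries avoided thanks to WARM
       round(n_idx_tup_skipped::numeric /
             nullif(n_idx_tup_inserted + n_idx_tup_skipped, 0), 2) as warm_skip_ratio
from pg_stat_user_tables
where relname = 't';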
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I've been looking at the patch over the past few days, running a bunch of
benchmarks etc.
Thanks for doing that.
I can confirm the significant speedup, often by more than 75% (depending
on number of indexes, whether the data set fits into RAM, etc.). Similarly
for the amount of WAL generated, although that's a bit more difficult to
evaluate due to full_page_writes.

I'm not going to send detailed results, as that probably does not make
much sense at this stage of the development - I can repeat the tests once
the open questions get resolved.
Sure. Anything that stands out? Any regression that you see? I'm not sure
if your benchmarks exercise the paths which might show overheads without
any tangible benefits. For example, I wonder if a test with many indexes
where most of them get updated and then querying the table via those
updated indexes could be one such test case.
There's a lot of useful and important feedback in the thread(s) so far,
particularly the descriptions of various failure cases. I think it'd be
very useful to collect those examples and turn them into regression tests -
that's something the patch should include anyway.
Sure. I added only a handful of test cases which I knew regression isn't
covering. But I'll write more of them. One good thing is that the code gets
heavily exercised even during regression. I caught and fixed multiple bugs
running regression. I'm not saying that's enough, but it certainly gives
some confidence.
and update:
update t set a = a+1, b=b+1;
which has to update all indexes on the table, but:
select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables
n_tup_upd | n_tup_hot_upd
-----------+---------------
1000 | 1000

So it's still counted as "WARM" - does it make sense?
No, it does not. The code currently just marks any update as a WARM update
if the table supports it and there is enough free space in the page. And
yes, you're right. It's worth fixing that because of the
one-WARM-update-per-chain limitation. Will fix.
The way this is piggy-backed on the current HOT statistics seems a bit
strange for another reason,
Agree. We could add a similar n_tup_warm_upd counter.
But WARM changes that - it allows adding index entries only to a subset of
indexes, which means the "per row" n_tup_hot_upd counter is not sufficient.
When you have a table with 10 indexes, and the counter increases by 1, does
that mean the update added index tuple to 1 index or 9 of them?
How about having counters similar to n_tup_ins/n_tup_del for indexes as
well? Today it does not make sense because every index gets the same number
of inserts, but WARM will change that.
For example, we could have idx_tup_insert and idx_tup_delete that show up
in pg_stat_user_indexes. I don't know if idx_tup_delete adds any value, but
one can then look at idx_tup_insert for various indexes to get a sense of
which indexes receive more inserts than others. The indexes which receive
more inserts are the ones being frequently updated as compared to other
indexes.
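As a sketch (idx_tup_insert is not a real column of pg_stat_user_indexes today,
so the query below is purely illustrative), one could then compare the indexes of
a table like this:

select indexrelname, idx_tup_insert   -- hypothetical counter
from pg_stat_user_indexes
where relname = 't'
order by idx_tup_insert desc;

The indexes at the top of such a list are the ones whose keys change most often
and hence benefit the least from WARM.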
This also relates to vacuuming strategies. Today HOT updates do not count
for triggering vacuum (or to be more precise, HOT pruned tuples are
discounted while counting dead tuples). WARM tuples get the same treatment
as far as pruning is concerned, but since they cause fresh index inserts, I
wonder if we need some mechanism to cleanup the dead line pointers and dead
index entries. This will become more important if we do something to
convert WARM chains into HOT chains, something that only VACUUM can do in
the design I've proposed so far.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/06/2016 07:36 AM, Pavan Deolasee wrote:
On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
I can confirm the significant speedup, often by more than 75%
(depending on number of indexes, whether the data set fits into RAM,
etc.). Similarly for the amount of WAL generated, although that's a
bit more difficult to evaluate due to full_page_writes.

I'm not going to send detailed results, as that probably does not
make much sense at this stage of the development - I can repeat the
tests once the open questions get resolved.

Sure. Anything that stands out? Any regression that you see? I'm not
sure if your benchmarks exercise the paths which might show overheads
without any tangible benefits. For example, I wonder if a test with many
indexes where most of them get updated and then querying the table via
those updated indexes could be one such test case.
No, nothing that would stand out. Let me explain what benchmark(s) I've
done. I've made some minor mistakes when running the benchmarks, so I
plan to rerun them and post the results after that. So let's take the
data with a grain of salt.
My goal was to compare current non-HOT behavior (updating all indexes)
with the WARM (updating only indexes on modified columns), and I've
taken two approaches:
1) fixed number of indexes, update variable number of columns
Create a table with 8 secondary indexes and then run a bunch of
benchmarks updating increasing number of columns. So the first run did
UPDATE t SET c1 = c1+1 WHERE id = :id;
while the second did
UPDATE t SET c1 = c1+1, c2 = c2+1 WHERE id = :id;
and so on, up to updating all the columns in the last run. I've used
multiple scripts to update all the columns / indexes uniformly
(essentially using multiple "-f" flags with pgbench). The runs were
fairly long (2h, enough to get stable behavior).
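To give an idea, the per-column scripts for the single-column runs would look
roughly like this (the file names and the :id range here are invented, but the
mechanism is simply one script per column, each passed to pgbench with its own
-f flag):

-- update_c1.sql
\set id random(1, 1000000)
UPDATE t SET c1 = c1+1 WHERE id = :id;

-- update_c2.sql
\set id random(1, 1000000)
UPDATE t SET c2 = c2+1 WHERE id = :id;

...and so on for the remaining columns, so that all indexed columns receive
roughly the same share of updates.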
For a small data set (fits into RAM), the results look like this:
master patched diff
1 5994 8490 +42%
2 4347 7903 +81%
3 4340 7400 +70%
4 4324 6929 +60%
5 4256 6495 +52%
6 4253 5059 +19%
7 4235 4534 +7%
8 4194 4237 +1%
and the amount of WAL generated (after correction for tps difference)
looks like this (numbers are MBs)
master patched diff
1 27257 18508 -32%
2 21753 14599 -33%
3 21912 15864 -28%
4 22021 17135 -22%
5 21819 18258 -16%
6 21929 20659 -6%
7 21994 22234 +1%
8 21851 23267 +6%
So this is quite significant difference. I'm pretty sure the minor WAL
increase for the last two runs is due to full page writes (which also
affects the preceding runs, making the WAL reduction smaller than the
tps increase).
I do have results for larger data sets (>RAM), the results are very
similar although the speedup seems a bit smaller. But I need to rerun those.
2) single-row update, adding indexes between runs
This is kinda the opposite of the previous approach, i.e. transactions
always update a single column (multiple scripts to update the columns
uniformly), but there are new indexes added between runs. The results
(for a large data set, exceeding RAM) look like this:
master patched diff
0 954 1404 +47%
1 701 1045 +49%
2 484 816 +70%
3 346 683 +97%
4 248 608 +145%
5 190 525 +176%
6 152 397 +161%
7 123 315 +156%
8 123 270 +119%
So this looks really interesting.
There's a lot of useful and important feedback in the thread(s) so
far, particularly the descriptions of various failure cases. I think
it'd be very useful to collect those examples and turn them into
regression tests - that's something the patch should include anyway.

Sure. I added only a handful of test cases which I knew regression isn't
covering. But I'll write more of them. One good thing is that the code
gets heavily exercised even during regression. I caught and fixed
multiple bugs running regression. I'm not saying that's enough, but it
certainly gives some confidence.
I don't see any changes to src/test in the patch, so I'm not sure what
you mean when you say you added a handful of test cases?
and update:
update t set a = a+1, b=b+1;
which has to update all indexes on the table, but:
select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables
n_tup_upd | n_tup_hot_upd
-----------+---------------
1000 | 1000

So it's still counted as "WARM" - does it make sense?
No, it does not. The code currently just marks any update as a WARM
update if the table supports it and there is enough free space in the
page. And yes, you're right. It's worth fixing that because of the
one-WARM-update-per-chain limitation. Will fix.
Hmmm, so this makes monitoring of %WARM during benchmarks less reliable
than I hoped for :-(
The way this is piggy-backed on the current HOT statistics seems a
bit strange for another reason,

Agree. We could add a similar n_tup_warm_upd counter.
Yes, although HOT is a special case of WARM. But it probably makes sense
to differentiate them, I guess.
But WARM changes that - it allows adding index entries only to a
subset of indexes, which means the "per row" n_tup_hot_upd counter
is not sufficient. When you have a table with 10 indexes, and the
counter increases by 1, does that mean the update added index tuple
to 1 index or 9 of them?

How about having counters similar to n_tup_ins/n_tup_del for indexes
as well? Today it does not make sense because every index gets the
same number of inserts, but WARM will change that.

For example, we could have idx_tup_insert and idx_tup_delete that show
up in pg_stat_user_indexes. I don't know if idx_tup_delete adds any
value, but one can then look at idx_tup_insert for various indexes to
get a sense which indexes receives more inserts than others. The indexes
which receive more inserts are the ones being frequently updated as
compared to other indexes.
Hmmm, I'm not sure that'll work. I mean, those metrics would be useful
(although I can't think of a use case for idx_tup_delete), but I'm not
sure it's enough to measure WARM. We need to compute
index_tuples_inserted / index_tuples_total
where (index_tuples_total - index_tuples_inserted) is the number of
index tuples we've been able to skip thanks to WARM. So we'd also need
to track the number of index tuples that we skipped for the index, and
I'm not sure that's a good idea.
Also, we really don't care about inserted tuples - what matters for WARM
are updates, so idx_tup_insert is either useless (because it also
includes non-UPDATE entries) or the naming is misleading.
This also relates to vacuuming strategies. Today HOT updates do not
count for triggering vacuum (or to be more precise, HOT pruned tuples
are discounted while counting dead tuples). WARM tuples get the same
treatment as far as pruning is concerned, but since they cause fresh
index inserts, I wonder if we need some mechanism to cleanup the dead
line pointers and dead index entries. This will become more important if
we do something to convert WARM chains into HOT chains, something that
only VACUUM can do in the design I've proposed so far.
True.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for the patch. This shows a very good performance improvement.
I started reviewing the patch, and during this process I ran the regression
test on the WARM patch. I observed a failure in the create_index test.
This may be a bug in the code, or expected behaviour whose test output needs to be corrected.
Regards,
Hari Babu
Fujitsu Australia
On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Thanks for the patch. This shows a very good performance improvement.
Thank you. Can you please share the benchmark you ran, results and
observations?
I started reviewing the patch, during this process and I ran the regression
test on the WARM patch. I observed a failure in create_index test.
This may be a bug in code or expected that needs to be corrected.
Can you please share the diff? I ran the regression tests after applying the patch on
current master and did not see any failure. Does it happen consistently?
I'm also attaching a fresh set of patches. The first patch hasn't changed at
all (though I changed the name to v5 to keep it consistent with the other
patch). The second patch has the following changes:
1. WARM updates are now tracked separately. We still don't count the number of
index inserts separately as suggested by Tomas.
2. We don't do a WARM update if all columns referenced by all indexes have
changed. Ideally, we should check if all indexes will require an update and
avoid WARM. So there is still some room for improvement here.
3. I added a very minimal regression test case. But really, it just
contains one test case which I specifically wanted to test.
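For anyone who wants to exercise the basic scenario by hand, something like the
following (table and index names are made up here; this is not the actual
regression test) triggers a WARM update and the recheck path:

create table warm_t (a int, b int);
insert into warm_t values (1, 100);
create index warm_t_a_idx on warm_t (a);
create index warm_t_b_idx on warm_t (b);

-- only the index on b needs a new entry; the index on a keeps pointing
-- at the root of the (now WARM) chain
update warm_t set b = 200;

-- fetching through the unchanged index must still return the right row,
-- relying on the amrecheck logic
set enable_seqscan = off;
select * from warm_t where a = 1;
select * from warm_t where b = 200;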
So not a whole lot of changes since the last version. I'm still waiting for
some serious review of the design/code before I spend a lot more time on
the patch. I hope the patch receives some attention in this CF.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v5.patch
commit f33ee503463137aa1a2ae4c3ab04d1468ae1941c
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sat Sep 3 14:51:00 2016 +0530
Use HEAP_TUPLE_LATEST to mark a tuple as the latest tuple in an update chain
and use OffsetNumber in t_ctid to store the root line pointer of the chain.
t_ctid field in the tuple header is usually used to store TID of the next tuple
in an update chain. But for the last tuple in the chain, t_ctid is made to
point to itself. When t_ctid points to itself, that signals the end of the
chain. With this patch, information about a tuple being the last tuple in the
chain is stored a separate HEAP_TUPLE_LATEST flag. This uses another free bit
in t_infomask2. When HEAP_TUPLE_LATEST is set, OffsetNumber field in the t_ctid
stores the root line pointer of the chain. This will help us quickly find root
of a update chain.
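For illustration, with the macros and helpers this patch adds, locating the
root line pointer from the tuple at the end of a chain roughly becomes (a
sketch of the logic used in heap_update() below):

    /* find the root line pointer of the chain containing 'htup' at 'offnum' */
    if (HeapTupleHeaderHasRootOffset(htup))
        root_offnum = HeapTupleHeaderGetRootOffset(htup);    /* from t_ctid's ip_posid */
    else
        heap_get_root_tuple_one(page, offnum, &root_offnum); /* old-style chain: scan the page */
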
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..ccf84be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
@@ -2250,13 +2251,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2415,7 +2416,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2713,7 +2715,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2721,7 +2724,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2993,6 +2997,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3044,7 +3049,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3174,7 +3180,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3250,8 +3256,8 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ /* Mark this tuple as the latest tuple in the update chain */
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3450,6 +3456,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3506,6 +3514,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3789,7 +3798,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3968,7 +3977,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4149,6 +4158,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4156,10 +4179,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4172,7 +4214,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4211,6 +4255,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4573,7 +4618,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4585,6 +4631,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4631,7 +4678,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5069,7 +5116,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5145,7 +5192,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5659,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5667,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5885,7 +5935,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5894,7 +5944,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -6011,7 +6061,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6137,7 +6188,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7486,6 +7539,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7605,6 +7659,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8260,7 +8315,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8350,7 +8405,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8485,8 +8542,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8622,7 +8680,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8756,12 +8815,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8889,9 +8953,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..e32deb1 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it's
+ * known. The former is used while updating an existing tuple, while the latter is
+ * used during insertion of a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +75,13 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
- ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..079a77f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..94b46b8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d7e5fad..d01e0d8 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,30 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set on the last tuple in the update chain. But for
+ * clusters which are upgraded from a pre-10.0 release, we still check if t_ctid
+ * is pointing to itself and declare such a tuple as the latest tuple in the
+ * chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +572,55 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
0002_warm_updates_v5.patch
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index b68a0d1..b95275f 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..491e411 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..d7c50c1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..cf44214 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -263,6 +265,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..71377ab 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
/*
@@ -352,3 +356,110 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature largely eliminated redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT)
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is same as HOT
+updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted into an index for
+the updated tuple, and we are doing a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on each
+of them. When a tuple is first inserted into the table, we have exactly
+one index entry pointing to the tuple from each of the two indexes.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. To do so, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column and hence recheck routine
+for hash AM must first compute the hash value of the heap attribute and
+then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't support the recheck
+routine, WARM updates are disabled on such tables.
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys both pointing to the same WARM chain. If there were, the same
+valid tuple would be reachable via multiple index keys, each of which
+satisfies the key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. we do not WARM update a
+tuple from a WARM chain. HOT updates are fine because they do not add a new
+index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular UPDATEs is curtailed down to half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value only differs in the
+case sensitivity. So we cannot rely solely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not a new index entry is needed when
+columns referred to in the predicate expression change.
+
+(None of this is currently implemented; we simply disallow a WARM
+update if a column used by an expression index or an index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain. In such cases, the t_ctid field usually points to the tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead. The root line pointer information stored in the tuple
+which remains the last valid tuple in the chain is also lost. In such
+rare cases, the root line pointer must be found the hard way by
+scanning the entire heap page.
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for an index key match (the case where an old tuple is
+returned via the new index key). So we must follow the update chain to
+the end every time to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But this also implies that the benefit of WARM will be
+no more than 50%, which is still significant, but if we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove duplicate entry from every index or conclusively
+prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but certain
+index keys may not match between the two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from index scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only
+Blue pointer and that must not be removed. IOW we should remove Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have 2 index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update or must apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum-aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for the presence of multiple Blue pointers and ensure that we neither
+accidentally remove either of the Blue pointers nor clear such WARM
+chains.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bef9c84..b3de79c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -99,7 +99,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot, bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
@@ -1960,6 +1963,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check whether the HOT chain originating or continuing at tid ever became a
+ * WARM chain, even if the UPDATE operation that made it so finally aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either WARM or WARM updated tuple signals possible
+ * breakage and the caller must recheck tuple returned from this chain
+ * for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1979,11 +2052,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2025,6 +2101,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2037,9 +2123,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2052,6 +2141,22 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+
+ /* WARM is not supported on system tables yet */
+ if (*recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2124,18 +2229,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3442,13 +3570,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3469,9 +3599,11 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool satisfies_hot;
+ bool satisfies_warm;
bool satisfies_key;
bool satisfies_id;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3496,6 +3628,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for HOT update. This is
* wasted effort if we fail to update or have to put the new tuple on a
@@ -3512,6 +3648,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3571,7 +3709,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* serendipitiously arrive at the same key values.
*/
HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
+ exprindx_attrs,
+ updated_attrs,
+ &satisfies_hot, &satisfies_warm,
+ &satisfies_key,
&satisfies_id, &oldtup, newtup);
if (satisfies_key)
{
@@ -4118,6 +4259,34 @@ l2:
*/
if (satisfies_hot)
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good enough for a first round
+ * of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by update. This will be
+ * fixed once basic patch is tested. !!FIXME
+ */
+ if (satisfies_warm &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4158,6 +4327,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited if we get index (key, CTID) duplicate
+ * detection mechanism in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4173,12 +4357,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4297,7 +4507,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4405,6 +4618,13 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
* will be checking very similar sets of columns, and doing the same tests on
* them, it makes sense to optimize and do them together.
*
+ * exprindx_attrs designates the set of attributes used in expression or
+ * predicate indexes. Currently, we don't allow a WARM update if an
+ * expression or predicate index column is updated.
+ *
+ * If updated_attrs is not NULL, then the caller wants to know the list of
+ * changed attributes.
+ *
* We receive three bitmapsets comprising the three sets of columns we're
* interested in. Note these are destructively modified; that is OK since
* this is invoked at most once in heap_update.
@@ -4417,7 +4637,11 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
static void
HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot,
+ bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup)
{
@@ -4427,6 +4651,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
bool hot_result = true;
bool key_result = true;
bool id_result = true;
+ Bitmapset *hot_attrs_copy = bms_copy(hot_attrs);
/* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
Assert(bms_is_subset(id_attrs, key_attrs));
@@ -4454,8 +4679,11 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* Since the HOT attributes are a superset of the key attributes and
* the key attributes are a superset of the id attributes, this logic
* is guaranteed to identify the next column that needs to be checked.
+ *
+ * If the caller also wants to know the list of updated index
+ * attributes, we must scan through all the attributes
*/
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
+ if ((hot_result || updated_attrs) && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_hot_attnum;
else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_key_attnum;
@@ -4476,8 +4704,16 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
if (check_now == next_id_attnum)
id_result = false;
+ /*
+ * Add the changed attribute to updated_attrs if the caller has
+ * asked for it
+ */
+ if (updated_attrs)
+ *updated_attrs = bms_add_member(*updated_attrs, check_now -
+ FirstLowInvalidHeapAttributeNumber);
+
/* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
+ if (!hot_result && !key_result && !id_result && !updated_attrs)
break;
}
@@ -4488,7 +4724,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* bms_first_member() will return -1 and the attribute number will end
* up with a value less than FirstLowInvalidHeapAttributeNumber.
*/
- if (hot_result && check_now == next_hot_attnum)
+ if ((hot_result || updated_attrs) && check_now == next_hot_attnum)
{
next_hot_attnum = bms_first_member(hot_attrs);
next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
@@ -4505,6 +4741,29 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
}
}
+ /*
+ * If an attribute used in the expression of an expression index or in the
+ * predicate of a predicate index has changed, we don't yet support a WARM
+ * update.
+ */
+ if (updated_attrs && bms_overlap(*updated_attrs, exprindx_attrs))
+ *satisfies_warm = false;
+ /* If the table does not support WARM update, honour that */
+ else if (!relation->rd_supportswarm)
+ *satisfies_warm = false;
+ /*
+ * If all index keys are being updated, there is hardly any point in doing
+ * a WARM update.
+ */
+ else if (updated_attrs && bms_is_subset(hot_attrs_copy, *updated_attrs))
+ *satisfies_warm = false;
+ /*
+ * XXX Should we handle more cases? For example, when an update changes the
+ * keys of many or most indexes, should we fall back to a regular update?
+ */
+ else
+ *satisfies_warm = true;
+
*satisfies_hot = hot_result;
*satisfies_key = key_result;
*satisfies_id = id_result;
@@ -4528,7 +4787,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7426,6 +7685,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7439,6 +7699,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7461,6 +7722,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7566,6 +7831,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7577,6 +7843,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7650,6 +7919,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8017,24 +8288,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
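+	/*
+	 * The offset arrays appear in the order in which log_heap_clean()
+	 * registered them: redirected pairs, now-dead items, WARM chain roots,
+	 * and finally now-unused items.
+	 */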
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
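+	/* Convert the list of WARM chain roots into a per-offset lookup array */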
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8621,16 +8906,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8690,6 +8981,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8825,6 +9121,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item i is becoming a redirected line pointer
+ * and points to a WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
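+		/*
+		 * If any tuple in the chain is a WARM tuple, record the chain's root
+		 * so that heap_page_prune_execute() can carry the WARM flag over to
+		 * the redirecting line pointer.
+		 */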
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..149a02d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -409,7 +411,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +450,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +477,13 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +493,63 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks,
+ * since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case, or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and must not consider
+ * this tuple.
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b646d..e76e928 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ada2142..b3db673 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -455,6 +455,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -485,7 +486,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b4140eb..2126c61 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,6 +2534,7 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
NIL);
@@ -2649,6 +2650,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..cd9b9a7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index keys, so mark
+ * the page as not all-visible.
+ *
+ * XXX Should we look at the root line pointer and check if
+ * the WARM flag is set there, or is checking the tuples in the
+ * chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..03c6b62 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *updated_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If updated_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (updated_attrs)
+ {
+ if (!bms_overlap(updated_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..0b04bb8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index efb0c5e..0ba71a3 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -448,6 +448,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -494,6 +495,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -824,6 +826,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *updated_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update = false;
/*
* abort the operation if not running transactions
@@ -938,7 +943,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &updated_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1025,10 +1030,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
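+ /*
+  * Not a WARM update: index entries must point at the new tuple
+  * itself and every index needs a new entry, so forget the set of
+  * changed attributes.
+  */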
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(updated_attrs);
+ updated_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ updated_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..86f803a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4080,6 +4082,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5189,6 +5192,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5216,6 +5220,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2d3cf9e..25752b0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -115,6 +115,7 @@ extern Datum pg_stat_get_xact_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_hit(PG_FUNCTION_ARGS);
@@ -245,6 +246,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1744,6 +1761,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..37874ca 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* true if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,32 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * Check if the index has an amrecheck method defined. If the method is
+ * not defined, the index does not support WARM updates; completely
+ * disable WARM updates on such tables.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4522,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4535,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..e7bf734 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1288,6 +1289,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 725e2f2..2f5ef36 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -363,4 +363,8 @@ extern bool _hash_convert_tuple(Relation index,
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 81f7982..78e16a9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d01e0d8..3a51681 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* tuple is part of a WARM chain */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -771,6 +787,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..da7ec84 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -111,7 +111,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 17ec71d..0c4b160 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2732,6 +2732,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3344 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2882,6 +2884,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3343 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 136276b..f463014 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *updated_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b0bdc46 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -61,6 +61,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e8dac6..8e18c16 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1175,7 +1177,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at this
+ * line pointer may contain WARM tuples. This must only be interpreted together
+ * with the LP_REDIRECT flag.
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ *		Set the item identifier to mark it as the start of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in the lp_len field to indicate this state. This is done only for
+ * LP_REDIRECT line pointers, whose lp_len field is otherwise unused.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+	AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+	(itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c867ebb..af25f44 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa1b83
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,51 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8641769..a610039 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..166ea37
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,15 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
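As a side note for anyone trying the patches: the new statistics functions added here
(pg_stat_get_tuples_warm_updated and its xact variant) give a quick way to check whether
an UPDATE actually took the WARM path. A minimal, illustrative session could look like the
following; the warm_demo table and index names are made up for this sketch, and whether the
update really ends up WARM still depends on there being enough room on the heap page.

BEGIN;
CREATE TABLE warm_demo (a int, b text);
CREATE INDEX warm_demo_a_idx ON warm_demo (a);
CREATE INDEX warm_demo_b_idx ON warm_demo (b);
INSERT INTO warm_demo VALUES (1, 'aaaa');
-- Only warm_demo_b_idx's key changes; warm_demo_a_idx should not need a new
-- entry, so this update is a candidate for WARM.
UPDATE warm_demo SET b = 'bbbb';
-- New function added by this patch: WARM updates counted in the current
-- transaction (analogous to pg_stat_get_xact_tuples_hot_updated).
SELECT pg_stat_get_xact_tuples_warm_updated('warm_demo'::regclass);
COMMIT;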
On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Thanks for the patch. This shows a very good performance improvement.
Thank you. Can you please share the benchmark you ran, results and
observations?
I just ran a performance test on my laptop with a minimal configuration; it
didn't show much improvement. Currently I don't have access to a big machine
to test the performance.
I started reviewing the patch, and during this process I ran the regression
test on the WARM patch. I observed a failure in the create_index test.
This may be a bug in the code, or expected output that needs to be corrected.
Can you please share the diff? I ran the regression after applying the patch on
the current master and did not find any change. Does it happen consistently?
Yes, it is happening consistently. I ran the make installcheck. Attached
the regression.diffs file with the failed test.
I applied the previous warm patch on this commit -
e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15
Regards,
Hari Babu
Fujitsu Australia
Attachments:
regression.diffs (application/octet-stream)
*** /media/sf_code/fast/fujitsu-oss-postgres/src/test/regress/expected/create_index.out 2016-11-09 12:50:55.017043300 +1100
--- /media/sf_code/fast/fujitsu-oss-postgres/src/test/regress/results/create_index.out 2016-11-15 11:16:39.341650900 +1100
***************
*** 473,479 ****
f1
---------------------
((2,0),(2,4),(0,0))
! (1 row)
EXPLAIN (COSTS OFF)
SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
--- 473,480 ----
f1
---------------------
((2,0),(2,4),(0,0))
! ((3,1),(3,3),(1,0))
! (2 rows)
EXPLAIN (COSTS OFF)
SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
***************
*** 508,514 ****
SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
count
-------
! 2
(1 row)
EXPLAIN (COSTS OFF)
--- 509,515 ----
SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
count
-------
! 4
(1 row)
EXPLAIN (COSTS OFF)
======================================================================
On Tue, Nov 15, 2016 at 5:58 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Thanks for the patch. This shows a very good performance improvement.
Thank you. Can you please share the benchmark you ran, results and
observations?
I just ran a performance test on my laptop with a minimal configuration; it
didn't show much improvement. Currently I don't have access to a big machine
to test the performance.
I started reviewing the patch, and during this process I ran the regression
test on the WARM patch. I observed a failure in the create_index test.
This may be a bug in the code, or expected output that needs to be corrected.
Can you please share the diff? I ran the regression after applying the patch on
the current master and did not find any change. Does it happen consistently?
Yes, it is happening consistently. I ran the make installcheck. Attached
the regression.diffs file with the failed test.
I applied the previous warm patch on this commit
- e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15
Are you able to reproduce the issue?
Currently the patch is moved to next CF with "needs review" state.
Regards,
Hari Babu
Fujitsu Australia
On Fri, Dec 2, 2016 at 8:34 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
On Tue, Nov 15, 2016 at 5:58 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Yes, it is happening consistently. I ran the make installcheck. Attached
the regression.diffs file with the failed test.
I applied the previous warm patch on this commit
- e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15
Are you able to reproduce the issue?
Apologies for the delay. I could reproduce this on a different environment.
It was a case of an uninitialised variable, hence the inconsistent results.
I've updated the patches after fixing the issue. Multiple rounds of the
regression tests pass for me without any issue. Please let me know if it works
for you.
Currently the patch is moved to next CF with "needs review" state.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v6.patch (application/octet-stream)
commit f33ee503463137aa1a2ae4c3ab04d1468ae1941c
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sat Sep 3 14:51:00 2016 +0530
Use HEAP_LATEST_TUPLE to mark a tuple as the latest tuple in an update chain
and use the OffsetNumber in t_ctid to store the root line pointer of the chain.
t_ctid field in the tuple header is usually used to store TID of the next tuple
in an update chain. But for the last tuple in the chain, t_ctid is made to
point to itself. When t_ctid points to itself, that signals the end of the
chain. With this patch, the information that a tuple is the last tuple in the
chain is stored in a separate HEAP_LATEST_TUPLE flag. This uses another free bit
in t_infomask2. When HEAP_LATEST_TUPLE is set, the OffsetNumber field in t_ctid
stores the root line pointer of the chain. This will help us quickly find the
root of an update chain.
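To make the new t_ctid convention concrete, here is a small illustrative sketch (not code
from the patch) of how a reader could resolve the root line pointer under this scheme. It
relies only on the HEAP_LATEST_TUPLE flag and the heap_get_root_tuple_one() helper
introduced by the patch below; the function name root_offset_of is made up for this sketch.

#include "postgres.h"
#include "access/heapam.h"
#include "access/htup_details.h"

/* Sketch only: return the root line pointer offset for a tuple that is the
 * last member of its update chain. */
static OffsetNumber
root_offset_of(Page page, HeapTupleHeader htup, OffsetNumber self_offnum)
{
	OffsetNumber	root;

	if (htup->t_infomask2 & HEAP_LATEST_TUPLE)
	{
		/* The OffsetNumber part of t_ctid is not a forward link here; it
		 * holds the root line pointer offset directly. */
		return ItemPointerGetOffsetNumber(&htup->t_ctid);
	}

	/* Aborted update or pre-10.0 style chain: fall back to scanning the
	 * page, as heap_get_root_tuple_one() does in the patch. */
	heap_get_root_tuple_one(page, self_offnum, &root);
	return root;
}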
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..ccf84be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
@@ -2250,13 +2251,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2415,7 +2416,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2713,7 +2715,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2721,7 +2724,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2993,6 +2997,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3044,7 +3049,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3174,7 +3180,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3250,8 +3256,8 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ /* Mark this tuple as the latest tuple in the update chain */
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3450,6 +3456,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3506,6 +3514,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3789,7 +3798,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3968,7 +3977,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4149,6 +4158,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way.
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4156,10 +4179,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4172,7 +4214,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4211,6 +4255,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4573,7 +4618,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4585,6 +4631,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4631,7 +4678,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5069,7 +5116,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5145,7 +5192,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5659,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5667,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5885,7 +5935,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5894,7 +5944,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -6011,7 +6061,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6137,7 +6188,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7486,6 +7539,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7605,6 +7659,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8260,7 +8315,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8350,7 +8405,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8485,8 +8542,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8622,7 +8680,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8756,12 +8815,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8889,9 +8953,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..e32deb1 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it's
+ * known. The former is used while updating an existing tuple, while the latter is
+ * used during insertion of a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +75,13 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
- ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..079a77f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..94b46b8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index d7e5fad..d01e0d8 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,30 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set in the last tuple of an update chain. But for
+ * clusters which are upgraded from a pre-10.0 release, we also check if t_ctid
+ * points to itself and declare such a tuple as the latest tuple in the
+ * chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +572,55 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
0002_warm_updates_v6.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index b68a0d1..b95275f 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..491e411 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6806e32..2026004 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 8d43b38..05b078f 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -407,6 +409,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index fa9cbdc..6897985 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature largely eliminated redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT)
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is same as HOT
+updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index for
+the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, consider a table with two columns and an index on each of
+them. When a tuple is first inserted into the table, we have exactly
+one index entry pointing to the tuple from each index.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without needing corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column and hence recheck routine
+for hash AM must first compute the hash value of the heap attribute and
+then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If a table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
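To show where such a recheck routine gets consulted, the following is a rough,
illustrative sketch of the scan-side flow, not the patch's actual code. It assumes only
the amrecheck callback and the xs_tuple_recheck flag that this patch introduces, plus
existing IndexScanDesc fields; the helper name warm_tuple_passes_recheck is made up.

#include "postgres.h"
#include "access/amapi.h"
#include "access/relscan.h"
#include "utils/rel.h"

/* Sketch only: decide whether a heap tuple fetched through a possibly-WARM
 * chain may be returned for the current index tuple.  Assumes xs_want_itup
 * was set so that xs_itup is valid. */
static bool
warm_tuple_passes_recheck(IndexScanDesc scan)
{
	if (!scan->xs_tuple_recheck)
		return true;			/* chain is not WARM; nothing to verify */

	/* amrecheck is the per-AM callback added by this patch (btrecheck,
	 * hashrecheck); AMs without it cannot support WARM, so the update
	 * path never creates WARM chains for them. */
	return scan->indexRelation->rd_amroutine->amrecheck(scan->indexRelation,
														 scan->xs_itup,
														 scan->heapRelation,
														 &scan->xs_ctup);
}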
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys pointing to the same WARM chain. Otherwise, the same
+valid tuple will be reachable via multiple index keys, each satisfying
+the index key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not WARM update a
+tuple that already belongs to a WARM chain. HOT updates are fine because
+they do not add a new index entry.
+
+Even with this restriction, this is a significant improvement because the
+number of regular UPDATEs is roughly cut in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value differs only in
+case. So we cannot solely rely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not a new index entry is needed when
+columns referred to in the predicate expression change.
+
+(None of these things are currently implemented and we squarely disallow
+WARM update if a column from expression indexes or predicate has
+changed).
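As a small SQL illustration of why comparing heap columns alone is insufficient here
(the table and index names are made up; this simply mirrors the lower() example above
and the warm regression test added earlier in this thread):

CREATE TABLE warm_expr (col text);
CREATE INDEX warm_expr_lower_idx ON warm_expr (lower(col));
INSERT INTO warm_expr VALUES ('test');
-- Column col changes, but lower(col) still evaluates to 'test', so looking at
-- the heap column alone would wrongly conclude that the index key has changed.
UPDATE warm_expr SET col = 'TEST';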
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain, and in that case the t_ctid field usually points to the tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation aborts, the last tuple in the update chain
+becomes dead. The root line pointer information that was stored in the
+tuple which now remains the last valid tuple in the chain is also lost,
+because its t_ctid was overwritten to point to the (aborted) new tuple.
+In such rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
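+
+A minimal sketch of how an update obtains the root offset (mirroring the
+heap_update() change in the patch; heap_get_root_tuple_one() is the
+page-scan fallback):
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+        heap_get_root_tuple_one(page,
+                        ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+                        &root_offnum);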
+
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in
+t_infomask2 to store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple actually matches the index
+key. Even if the tuple precedes the WARM update in the chain, it must
+still be rechecked for an index key match (this covers the case where
+the old tuple is returned via the new index key). So we must always
+follow the update chain to its end to determine whether it is a WARM
+chain.
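+
+In the index fetch path this becomes an extra call into the index AM,
+roughly as follows (a sketch of the indexam.c change; the buffer stays
+locked across the recheck):
+
+    if (scan->xs_tuple_recheck &&
+        scan->indexRelation->rd_amroutine->amrecheck)
+        res = scan->indexRelation->rd_amroutine->amrecheck(
+                    scan->indexRelation, scan->xs_itup,
+                    scan->heapRelation, &scan->xs_ctup);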
+
+When the old updated tuple is pruned away and the root line pointer is
+converted into a redirect line pointer, we can copy the information
+about the WARM chain to the redirect line pointer by storing a special
+value in its lp_len field. This handles the most common case, where a
+WARM chain is reduced to a redirect line pointer plus a single remaining
+tuple.
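+
+A sketch of how pruning carries the flag over to the redirect line
+pointer (following the heap_page_prune_execute() change in the patch):
+
+    ItemIdSetRedirect(fromlp, tooff);
+    if (warmchain[fromoff])
+        ItemIdSetHeapWarm(fromlp);  /* remember: this was a WARM chain */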
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But it also implies that the benefit of WARM is capped
+at roughly 50% of updates. That is still significant, but if we could
+return WARM chains back to normal status, we could do far more WARM
+updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but
+certain index keys may differ between the two parts. Let's say we mark
+the heap tuples in each part with a special Red/Blue flag, and replicate
+the same flag in the index tuples. For example, when new rows are
+inserted into a table, they are marked with the Blue flag and the index
+entries associated with those rows are also marked Blue. When a row is
+WARM updated, the new version is marked with the Red flag and the new
+index entry created by the update is also marked Red.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap
+tuple with the Blue flag will be reachable from the Blue pointer and one
+with the Red flag from the Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from the
+Blue pointer (there is no Red pointer in such indexes). So, as a side
+note, simply matching Red and Blue flags is not enough from an index
+scan perspective.
+
+During VACUUM's first heap scan, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are marked either
+Blue or Red (but not a mix of the two), the chain is a candidate for HOT
+conversion. We remember the root line pointer and the Red/Blue flag of
+each such WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove the Blue
+pointers, and vice versa. But there is a catch: for Index2 above there
+is only a Blue pointer, and it must not be removed. In other words, we
+should remove a Blue pointer only if a Red pointer exists. Since index
+vacuum may visit Red and Blue pointers in any order, I think we will
+need another index pass to remove dead index pointers. So in the first
+index pass we check which WARM candidates have two index pointers. In
+the second pass, we remove the dead pointer and reset the Red flag if
+the surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chains by clearing the
+HEAP_WARM_TUPLE flag and resetting the Red flag back to Blue.
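+
+In pseudocode, the proposed (and as yet unimplemented) conversion would
+look roughly like this:
+
+    heap pass 1:  for each chain with HEAP_WARM_TUPLE set,
+                  if all live tuples are Blue, or all are Red,
+                  remember (root line pointer, colour)
+    index pass 1: count index pointers for each remembered root
+    index pass 2: for roots with two pointers, remove the losing
+                  pointer and clear the Red flag on the survivor
+    heap pass 2:  clear HEAP_WARM_TUPLE and reset Red tuples to Blue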
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing a Red index flag to Blue but before
+removing the other Blue pointer, we will end up with two Blue pointers
+to a Red WARM chain. But since the HEAP_WARM_TUPLE flag on the heap
+tuple is still set, further WARM updates to the chain will be blocked.
+I guess we will need some special handling for the case of multiple
+Blue pointers. We can either leave these WARM chains alone and let them
+die with a subsequent non-WARM update, or apply the heap-recheck logic
+during index vacuum to find the dead pointer. Given that vacuum aborts
+are not common, I am inclined to leave this case unhandled. We must
+still check for the presence of multiple Blue pointers and ensure that
+we neither accidentally remove either of the Blue pointers nor clear
+the WARM flag on such chains.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bef9c84..b3de79c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -99,7 +99,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot, bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
@@ -1960,6 +1963,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check whether the HOT chain originating at or continuing through tid ever
+ * became a WARM chain, even if the actual UPDATE operation finally aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either WARM or WARM updated tuple signals possible
+ * breakage and the caller must recheck tuple returned from this chain
+ * for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1979,11 +2052,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2025,6 +2101,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2037,9 +2123,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2052,6 +2141,22 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+
+ /* WARM is not supported on system tables yet */
+ if (*recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2124,18 +2229,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3442,13 +3570,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3469,9 +3599,11 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool satisfies_hot;
+ bool satisfies_warm;
bool satisfies_key;
bool satisfies_id;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3496,6 +3628,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for HOT update. This is
* wasted effort if we fail to update or have to put the new tuple on a
@@ -3512,6 +3648,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3571,7 +3709,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* serendipitiously arrive at the same key values.
*/
HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
+ exprindx_attrs,
+ updated_attrs,
+ &satisfies_hot, &satisfies_warm,
+ &satisfies_key,
&satisfies_id, &oldtup, newtup);
if (satisfies_key)
{
@@ -4118,6 +4259,34 @@ l2:
*/
if (satisfies_hot)
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to allow WARM chains to be WARM
+ * updated again. This is probably good enough for a first round of
+ * tests of the remaining functionality
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (satisfies_warm &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4158,6 +4327,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited if we get index (key, CTID) duplicate
+ * detection mechanism in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4173,12 +4357,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4297,7 +4507,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4405,6 +4618,13 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
* will be checking very similar sets of columns, and doing the same tests on
* them, it makes sense to optimize and do them together.
*
+ * The exprindx_attrs designates the set of attributes used in expression or
+ * predicate indexes. Currently, we don't allow WARM updates if expression or
+ * predicate index column is updated
+ *
+ * If updated_attrs is not NULL, then the caller is always interested in
+ * knowing the list of changed attributes
+ *
* We receive three bitmapsets comprising the three sets of columns we're
* interested in. Note these are destructively modified; that is OK since
* this is invoked at most once in heap_update.
@@ -4417,7 +4637,11 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
static void
HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
+ Bitmapset *exprindx_attrs,
+ Bitmapset **updated_attrs,
+ bool *satisfies_hot,
+ bool *satisfies_warm,
+ bool *satisfies_key,
bool *satisfies_id,
HeapTuple oldtup, HeapTuple newtup)
{
@@ -4427,6 +4651,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
bool hot_result = true;
bool key_result = true;
bool id_result = true;
+ Bitmapset *hot_attrs_copy = bms_copy(hot_attrs);
/* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
Assert(bms_is_subset(id_attrs, key_attrs));
@@ -4454,8 +4679,11 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* Since the HOT attributes are a superset of the key attributes and
* the key attributes are a superset of the id attributes, this logic
* is guaranteed to identify the next column that needs to be checked.
+ *
+ * If the caller also wants to know the list of updated index
+ * attributes, we must scan through all the attributes
*/
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
+ if ((hot_result || updated_attrs) && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_hot_attnum;
else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
check_now = next_key_attnum;
@@ -4476,8 +4704,16 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
if (check_now == next_id_attnum)
id_result = false;
+ /*
+ * Add the changed attribute to updated_attrs if the caller has
+ * asked for it
+ */
+ if (updated_attrs)
+ *updated_attrs = bms_add_member(*updated_attrs, check_now -
+ FirstLowInvalidHeapAttributeNumber);
+
/* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
+ if (!hot_result && !key_result && !id_result && !updated_attrs)
break;
}
@@ -4488,7 +4724,7 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
* bms_first_member() will return -1 and the attribute number will end
* up with a value less than FirstLowInvalidHeapAttributeNumber.
*/
- if (hot_result && check_now == next_hot_attnum)
+ if ((hot_result || updated_attrs) && check_now == next_hot_attnum)
{
next_hot_attnum = bms_first_member(hot_attrs);
next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
@@ -4505,6 +4741,29 @@ HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
}
}
+ /*
+ * If an attribute used in the expression of an expression index or the
+ * predicate of a partial index has changed, we don't yet support WARM
+ * update
+ */
+ if (updated_attrs && bms_overlap(*updated_attrs, exprindx_attrs))
+ *satisfies_warm = false;
+ /* If the table does not support WARM update, honour that */
+ else if (!relation->rd_supportswarm)
+ *satisfies_warm = false;
+ /*
+ * If all index keys are being updated, there is hardly any point in doing
+ * a WARM update.
+ */
+ else if (updated_attrs && bms_is_subset(hot_attrs_copy, *updated_attrs))
+ *satisfies_warm = false;
+ /*
+ * XXX Should we handle some more cases? For example, when an update will
+ * insert new entries into many or most indexes, should we fall back to a
+ * regular update?
+ */
+ else
+ *satisfies_warm = true;
+
*satisfies_hot = hot_result;
*satisfies_key = key_result;
*satisfies_id = id_result;
@@ -4528,7 +4787,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7426,6 +7685,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7439,6 +7699,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7461,6 +7722,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7566,6 +7831,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7577,6 +7843,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7650,6 +7919,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8017,24 +8288,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8621,16 +8906,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8690,6 +8981,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8825,6 +9121,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item i is becoming a redirect lp and points to a
+ * WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..7632573 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -409,7 +411,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +450,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +477,15 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+ else
+ scan->xs_tuple_recheck = true;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +495,63 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks
+ * since WARM must have been disabled on such tables
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or is CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b646d..e76e928 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e011af1..97672a9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -472,6 +472,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ec5d6f1..5e57cc9 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2551,6 +2551,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2669,6 +2671,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..cd9b9a7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without rechecking the index key, so mark
+ * the page as !all_visible
+ *
+ * XXX Should we look at the root line pointer and check if the
+ * WARM flag is set there, or is checking the tuples in the
+ * chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..03c6b62 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *updated_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If updated_attrs is set, we only insert index entries for those
+ * indexes whose key columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (updated_attrs)
+ {
+ if (!bms_overlap(updated_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..daa0826 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index efb0c5e..0ba71a3 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -448,6 +448,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -494,6 +495,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -824,6 +826,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *updated_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -938,7 +943,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &updated_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1025,10 +1030,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(updated_attrs);
+ updated_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ updated_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7584cb..d89d37b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4083,6 +4085,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5192,6 +5195,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5219,6 +5223,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2d3cf9e..25752b0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -115,6 +115,7 @@ extern Datum pg_stat_get_xact_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_hit(PG_FUNCTION_ARGS);
@@ -245,6 +246,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1744,6 +1761,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..37874ca 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true;/* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,32 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4522,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4535,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 28ebcb6..2241ffb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1288,6 +1289,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..f1c73a0 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -389,4 +389,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 81f7982..78e16a9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **updated_attrs, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 4313eb9..09246b2 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -771,6 +787,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..da7ec84 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -111,7 +111,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 047a1ce..31f295f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2734,6 +2734,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3344 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2884,6 +2886,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3343 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 136276b..f463014 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *updated_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8004d85..3bf4b5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -61,6 +61,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 152ff06..e0c8a90 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1176,7 +1178,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at line
+ * pointer may contain WARM tuples. This must only be interpreted along with
+ * LP_REDIRECT flag
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in lp_len field to indicate this state. This is required only for
+ * LP_REDIRECT tuple and lp_len field is unused for such line pointers.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index fa15f28..982bf4c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 031e8c2..c416fe6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1705,6 +1705,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1838,6 +1839,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1881,6 +1883,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,7 +1921,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1934,7 +1938,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1956,7 +1961,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa1b83
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,51 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8641769..a610039 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..166ea37
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,15 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
I noticed that this patch changes HeapSatisfiesHOTAndKeyUpdate() by
adding one more set of attributes to check, and one more output boolean
flag. My patch to add indirect indexes also modifies that routine to
add the same set of things. I think after committing both these
patches, the API is going to be fairly ridiculous. I propose to use a
different approach.
With your WARM and my indirect indexes, plus the additions for for-key
locks, plus identity columns, there is no longer a real expectation that
we can exit early from the function. In your patch, as well as mine,
there is a semblance of optimization that tries to avoid computing the
updated_attrs output bitmapset if the pointer is not passed in, but it's
effectively pointless because the only interesting use case is from
ExecUpdate() which always activates the feature. Can we just agree to
drop that?
If we do drop that, then the function can become much simpler: compare
all columns in new vs. old, return output bitmapset of changed columns.
Then "satisfies_hot" and all the other boolean output flags can be
computed simply in the caller by doing bms_overlap().
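For what it's worth, the per-feature flags then become one-liners in
heap_update(); a rough sketch (illustrative only, with modified_attrs being the
bitmapset of changed columns returned by the single comparison pass):

    satisfies_key  = !bms_overlap(modified_attrs, key_attrs);
    satisfies_id   = !bms_overlap(modified_attrs, id_attrs);
    use_hot_update = !bms_overlap(modified_attrs, hot_attrs);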
Thoughts?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
With your WARM and my indirect indexes, plus the additions for for-key
locks, plus identity columns, there is no longer a real expectation that
we can exit early from the function. In your patch, as well as mine,
there is a semblance of optimization that tries to avoid computing the
updated_attrs output bitmapset if the pointer is not passed in, but it's
effectively pointless because the only interesting use case is from
ExecUpdate() which always activates the feature. Can we just agree to
drop that?
I think the only case that gets worse is the path that does
simple_heap_update, which is used for DDL. I would be very surprised if
a change there is noticeable, when compared to the rest of the stuff
that goes on for DDL commands.
Now, after saying that, I think that a table with a very large number of
columns is going to be affected by this. But we don't really need to
compute the output bits for every single column -- we only care about
those that are covered by some index. So we should pass an input
bitmapset comprising all such columns, and the output bitmapset only
considers those columns, and ignores columns not indexed. My patch for
indirect indexes already does something similar (though it passes a
bitmapset of columns indexed by indirect indexes only, so it needs a
tweak there.)
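Concretely, heap_update() would union the per-feature column sets up front and
pass only that to the comparison, along these lines (a sketch; this is
essentially what the interesting-attrs refactoring patch attached further down
in the thread ends up doing):

    interesting_attrs = bms_add_members(NULL, hot_attrs);
    interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
    interesting_attrs = bms_add_members(interesting_attrs, id_attrs);

    /* columns not covered by any of these sets are never even compared */
    modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
                                                  &oldtup, newtup);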
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2 December 2016 at 07:36, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I've updated the patches after fixing the issue. Multiple rounds of
regression passes for me without any issue. Please let me know if it works
for you.
Hi Pavan,
Today I was playing with your patch and running some tests, and found
some problems I wanted to report before I forget them ;)
* You need to add a prototype in src/backend/utils/adt/pgstatfuncs.c:
extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
* The isolation test for partial_index fails (attached the regression.diffs)
* Running a home-made test I have at hand, I got this assertion:
"""
TRAP: FailedAssertion("!(buf_state & (1U << 24))", File: "bufmgr.c", Line: 837)
LOG: server process (PID 18986) was terminated by signal 6: Aborted
"""
To reproduce:
1) run prepare_test.sql
2) then run the following pgbench command (sql scripts attached):
pgbench -c 24 -j 24 -T 600 -n -f inserts.sql@15 -f updates_1.sql@20 -f
updates_2.sql@20 -f deletes.sql@45 db_test
* Sometimes when I have made the server crash, the recovery attempt
fails with this assertion:
"""
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/157F970
TRAP: FailedAssertion("!(!warm_update)", File: "heapam.c", Line: 8924)
LOG: startup process (PID 14031) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
"""
I still cannot reproduce this one consistently, but it happens often enough.
I will continue playing with it...
--
Jaime Casanova www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
regression.diffs
*** /home/jcasanov/Documentos/postgres/postgresql/src/test/isolation/expected/partial-index.out 2016-11-19 11:25:53.839629689 -0500
--- /home/jcasanov/Documentos/postgres/postgresql/src/test/isolation/results/partial-index.out 2016-12-26 01:05:09.594369943 -0500
***************
*** 30,35 ****
--- 30,37 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
+ 10 a 2
step c2: COMMIT;
starting permutation: rxy1 wx1 wy2 c1 rxy2 c2
***************
*** 83,88 ****
--- 85,91 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c1: COMMIT;
step c2: COMMIT;
***************
*** 117,122 ****
--- 120,126 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step c1: COMMIT;
***************
*** 173,178 ****
--- 177,183 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c1: COMMIT;
step c2: COMMIT;
***************
*** 207,212 ****
--- 212,218 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step c1: COMMIT;
***************
*** 240,245 ****
--- 246,252 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
step c1: COMMIT;
***************
*** 274,279 ****
--- 281,287 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
step c2: COMMIT;
***************
*** 308,313 ****
--- 316,322 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
***************
*** 365,370 ****
--- 374,380 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c1: COMMIT;
step c2: COMMIT;
***************
*** 399,404 ****
--- 409,415 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step c1: COMMIT;
***************
*** 432,437 ****
--- 443,449 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
step c1: COMMIT;
***************
*** 466,471 ****
--- 478,484 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
step c2: COMMIT;
***************
*** 500,505 ****
--- 513,519 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
***************
*** 520,525 ****
--- 534,540 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step rxy1: select * from test_t where val2 = 1;
id val1 val2
***************
*** 554,559 ****
--- 569,575 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step rxy1: select * from test_t where val2 = 1;
id val1 val2
***************
*** 588,593 ****
--- 604,610 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step rxy1: select * from test_t where val2 = 1;
id val1 val2
***************
*** 622,627 ****
--- 639,645 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step c2: COMMIT;
step rxy1: select * from test_t where val2 = 1;
***************
*** 636,641 ****
--- 654,660 ----
6 a 1
7 a 1
8 a 1
+ 9 a 2
10 a 1
step wx1: update test_t set val2 = 2 where val2 = 1 and id = 10;
step c1: COMMIT;
======================================================================
Jaime Casanova wrote:
* The isolation test for partial_index fails (attached the regression.diffs)
Hmm, I had a very similar (if not identical) failure with indirect
indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I
was missing to have HOT considerate the columns in index predicate, that
is, the second pull_varattnos() call.
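For context, the relevant pair of calls in RelationGetIndexAttrBitmap() looks
roughly like this (as in the WARM patch earlier in this thread; the second
pull_varattnos() call is the one that covers predicate columns):

    /* Collect all attributes used in index expressions */
    pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
    /* Collect all attributes used in the index predicate, too */
    pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);

    /* both must feed into the attribute set used for the HOT check */
    indexattrs = bms_add_members(indexattrs, exprindexattrs);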
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
Jaime Casanova wrote:
* The isolation test for partial_index fails (attached the regression.diffs)
Hmm, I had a very similar (if not identical) failure with indirect
indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I
was missing to have HOT considerate the columns in index predicate, that
is, the second pull_varattnos() call.
Sorry, I meant:
Hmm, I had a very similar (if not identical) failure with indirect
indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I
was missing to have HOT [take into account] the columns in index predicate, that
is, the second pull_varattnos() call.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 24, 2016 at 1:18 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Alvaro Herrera wrote:
With your WARM and my indirect indexes, plus the additions for for-key
locks, plus identity columns, there is no longer a real expectation that
we can exit early from the function. In your patch, as well as mine,
there is a semblance of optimization that tries to avoid computing the
updated_attrs output bitmapset if the pointer is not passed in, but it's
effectively pointless because the only interesting use case is from
ExecUpdate() which always activates the feature. Can we just agree to
drop that?
Yes, I agree. As you noted below, the only case that may be affected is
simple_heap_update(), which does a lot more work, and hence this function will be
the least of the worries.
I think the only case that gets worse is the path that does
simple_heap_update, which is used for DDL. I would be very surprised if
a change there is noticeable, when compared to the rest of the stuff
that goes on for DDL commands.
Now, after saying that, I think that a table with a very large number of
columns is going to be affected by this. But we don't really need to
compute the output bits for every single column -- we only care about
those that are covered by some index. So we should pass an input
bitmapset comprising all such columns, and the output bitmapset only
considers those columns, and ignores columns not indexed. My patch for
indirect indexes already does something similar (though it passes a
bitmapset of columns indexed by indirect indexes only, so it needs a
tweak there.)
Yes, that looks like a good compromise. This would require us to compare
only those columns that any caller of the function might be interested in.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 26, 2016 at 11:49 AM, Jaime Casanova <jaime.casanova@2ndquadrant.com> wrote:
On 2 December 2016 at 07:36, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I've updated the patches after fixing the issue. Multiple rounds of
regression passes for me without any issue. Please let me know if it works
for you.
Hi Pavan,
Today I was playing with your patch and running some tests, and found
some problems I wanted to report before I forget them ;)
Thanks Jaime for the tests and bug reports. I'm attaching an add-on patch
which fixes these issues for me. I'm deliberately not sending a fresh
revision because the changes are still minor.
* You need to add a prototype in src/backend/utils/adt/pgstatfuncs.c:
extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
Added.
* The isolation test for partial_index fails (attached the
regression.diffs)
Fixed. Looks like I forgot to include attributes from predicates and
expressions in the list of index attributes (as pointed out by Alvaro).
* Running a home-made test I have at hand, I got this assertion:
"""
TRAP: FailedAssertion("!(buf_state & (1U << 24))", File: "bufmgr.c", Line:
837)
LOG: server process (PID 18986) was terminated by signal 6: Aborted
"""
To reproduce:
1) run prepare_test.sql
2) then run the following pgbench command (sql scripts attached):
pgbench -c 24 -j 24 -T 600 -n -f inserts.sql@15 -f updates_1.sql@20 -f
updates_2.sql@20 -f deletes.sql@45 db_test
Looks like the patch was failing to set the block number correctly in the
t_ctid field, leading to these strange failures. There were also a couple of
instances where the t_ctid field was being accessed directly, instead of through
the newly added macro. I think we need some better mechanism to ensure that
we don't miss out on such things. But I don't have a very good idea about
doing that right now.
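For illustration, the change the add-on patch makes in those places looks
roughly like this (variable names here are hypothetical; the macro itself is
introduced by the WARM patch):

    /*
     * Instead of "next_tid = tup->t_data->t_ctid;", follow the chain through
     * the accessor, which knows that the last tuple in a chain stores the
     * root line pointer offset rather than a forward link.
     */
    ItemPointerData next_tid;

    HeapTupleHeaderGetNextCtid(tup->t_data, &next_tid,
                               ItemPointerGetOffsetNumber(&tup->t_self));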
* Sometimes when I have made the server crash, the recovery attempt
fails with this assertion:
"""
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/157F970
TRAP: FailedAssertion("!(!warm_update)", File: "heapam.c", Line: 8924)
LOG: startup process (PID 14031) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
"""
I still cannot reproduce this one consistently, but it happens often enough.
This could be a case of an uninitialised variable in log_heap_update(). What
surprises me, though, is that none of the compilers I tried so far could catch
that. In the following code snippet, if the condition evaluates to false,
"warm_update" may remain uninitialised, leading to a wrong xlog entry, which
may later result in an assertion failure during redo recovery.
    if (HeapTupleIsHeapWarmTuple(newtup))
        warm_update = true;
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0003_warm_fixes_v6.patch
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b3de79c..9353175 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7831,7 +7831,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
- bool warm_update;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e32deb1..39ee6ac 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -75,6 +75,9 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
+ ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
if (OffsetNumberIsValid(root_offnum))
HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 03c6b62..c24e486 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -801,7 +801,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tup->t_data, &ctid_wait,
+ ItemPointerGetOffsetNumber(&tup->t_self));
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 079a77f..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2451,7 +2451,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple.t_data, &tuple.t_self,
+ ItemPointerGetOffsetNumber(&tuple.t_self));
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 25752b0..ef4f5b4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -37,6 +37,7 @@ extern Datum pg_stat_get_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_live_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_dead_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 37874ca..c6ef4e2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4487,6 +4487,12 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
/*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ /*
* Check if the index has amrecheck method defined. If the method is
* not defined, the index does not support WARM update. Completely
* disable WARM updates on such tables
On Tue, Dec 27, 2016 at 6:51 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Thanks Jaime for the tests and bug reports. I'm attaching an add-on patch
which fixes these issues for me. I'm deliberately not sending a fresh
revision because the changes are still minor.
Per Alvaro's request in another thread, I've rebased these patches on his
patch to refactor HeapSatisfiesHOTandKeyUpdate(). I've also attached that
patch here for easy reference.
The fixes based on bug reports by Jaime are also included in this patch
set. Other than that, there are no significant changes. The patch still
disables WARM on system tables, something I would like to fix. But I've
been delaying that because it will require changes at several places since
indexes on system tables are managed separately. In addition to that, the
patch only works with btree and hash indexes. We must implement the recheck
method for other index types so as to support them.
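For reference, teaching another index AM about WARM is mostly a matter of
implementing the recheck callback and wiring it into the AM's handler; a
minimal sketch of the wiring (the amrecheck field is what this patch adds, the
rest follows the usual IndexAmRoutine pattern):

    Datum
    bthandler(PG_FUNCTION_ARGS)
    {
        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

        /* ... all the existing callbacks ... */
        amroutine->amrecheck = btrecheck;  /* verify index tuple matches heap tuple */

        PG_RETURN_POINTER(amroutine);
    }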
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
interesting-attrs-2.patch
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ea579a0..19edbdf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -95,11 +95,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3443,6 +3440,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3460,9 +3459,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3489,21 +3485,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3524,7 +3529,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3550,6 +3555,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3561,10 +3570,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3803,6 +3809,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4107,7 +4115,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4122,7 +4130,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4270,13 +4280,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4310,7 +4322,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4355,114 +4367,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
- *
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
+ int attnum;
+ Bitmapset *modified = NULL;
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
-
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
-
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
Attachment: 0001_track_root_lp_v7.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f1b4602..a22aae7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2412,7 +2413,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2710,7 +2712,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2718,7 +2721,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2990,6 +2994,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3041,7 +3046,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3171,7 +3177,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3247,8 +3253,8 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ /* Mark this tuple as the latest tuple in the update chain */
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3449,6 +3455,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3511,6 +3519,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3795,7 +3804,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3976,7 +3985,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4159,6 +4168,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained the hard way.
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4166,10 +4189,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4182,7 +4224,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4221,6 +4265,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4501,7 +4546,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4513,6 +4559,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4559,7 +4606,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -4997,7 +5044,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5073,7 +5120,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5587,6 +5634,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5595,6 +5643,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5824,7 +5874,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5833,7 +5883,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5950,7 +6000,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6076,7 +6127,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7425,6 +7478,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7544,6 +7598,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8199,7 +8254,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8289,7 +8344,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8424,8 +8481,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8561,7 +8619,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8695,12 +8754,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8828,9 +8892,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..39ee6ac 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it's
+ * known. The former is used while updating an existing tuple, while the latter is
+ * used during insertion of a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +75,16 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..882ce18 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tup->t_data, &ctid_wait,
+ ItemPointerGetOffsetNumber(&tup->t_self));
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2451,7 +2451,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple.t_data, &tuple.t_self,
+ ItemPointerGetOffsetNumber(&tuple.t_self));
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0d12bbb..81f7982 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 8fb1f6d..4313eb9 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,30 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set on the last tuple in the update chain. But for
+ * clusters which are upgraded from a pre-10.0 release, we still check if
+ * t_ctid points to itself and declare such a tuple to be the latest tuple
+ * in the chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +572,55 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
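
To summarize the new encoding, a minimal usage sketch of the macros above
(only the macros come from the patch; htup, tid, next_tid and root_offnum
are illustrative names, with tid being the TID at which htup was fetched):

    ItemPointerData next_tid;
    OffsetNumber    root_offnum;

    if (HeapTupleHeaderIsHeapLatest(htup, tid))
    {
        /* last tuple in the chain: ip_posid carries the root offset */
        if (HeapTupleHeaderHasRootOffset(htup))
            root_offnum = HeapTupleHeaderGetRootOffset(htup);
    }
    else
    {
        /* not the last tuple: t_ctid is an ordinary forward link */
        HeapTupleHeaderGetNextCtid(htup, &next_tid,
                                   ItemPointerGetOffsetNumber(&tid));
    }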
Attachment: 0002_warm_updates_v7.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index b68a0d1..b95275f 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..491e411 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6806e32..2026004 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 8d43b38..05b078f 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -407,6 +409,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index fa9cbdc..6897985 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature eliminates many redundant index
+entries and allows re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the update
+must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block have enough space to
+store the new version of the tuple. This is the same requirement as for
+HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index
+for the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on
+each column. When a tuple is first inserted into the table, each index
+has exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. In that case Index1 does not get any new
+entry, and Index2's new entry still points to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and dead
+tuples can be removed without any corresponding index cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple that does not match the index key may
+be returned via a stale index entry. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) and (bbbb). For this
+reason, tuples returned from a WARM chain must always be rechecked for
+an index key match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, so the recheck routine
+for the hash AM must first compute the hash value of the heap attribute
+and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
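+
+For illustration, a caller that obtained recheck = true from the heap
+fetch could apply the AM's recheck callback roughly as follows (a sketch
+only; variable names such as indexRel, heapRel and heapTuple are
+illustrative and not the actual call sites):
+
+    if (recheck &&
+        !indexRel->rd_amroutine->amrecheck(indexRel, scan->xs_itup,
+                                           heapRel, heapTuple))
+        continue;           /* index key does not match; skip this tuple */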
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as no two index entries with
+the same key point to the same WARM chain. Otherwise, the same valid
+tuple would be reachable via multiple index entries, yet still satisfy
+the index key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not do WARM
+updates to a tuple from a WARM chain. HOT updates are fine because they
+do not add a new index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular UPDATEs is cut roughly in half.
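+
+As a simplified sketch, the decision in heap_update() then looks roughly
+like this (the real condition carries a few more checks, for example for
+expression and predicate indexes):
+
+    if (!bms_overlap(modified_attrs, hot_attrs))
+        use_hot_update = true;
+    else if (relation->rd_supportswarm &&
+             !HeapTupleIsHeapWarmTuple(&oldtup))
+        use_warm_update = true;     /* first, and only, WARM update on this chain */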
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value differs only in case.
+So we cannot solely rely on the heap column check to decide whether or
+not to insert a new index entry for expression indexes. Similarly, for
+partial indexes, the predicate expression must be evaluated to decide
+whether or not to create a new index entry when columns referred to in
+the predicate expression change.
+
+(None of this is currently implemented; we simply disallow a WARM update
+if a column used in an expression index or in an index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of
+the tuple being updated. It must be noted that the t_ctid field in the
+heap tuple header is usually used to find the next tuple in the update
+chain. But the tuple that we are updating must be the last tuple in the
+update chain. In such cases, the t_ctid field usually points to the
+tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead. The root line pointer information that was stored in the
+tuple which now remains the last valid tuple in the chain has been
+overwritten by the forward link and is thus lost. In such rare cases,
+the root line pointer must be found the hard way, by scanning the entire
+heap page.
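+
+For illustration, heap_update() (in the 0001 patch) locates the root line
+pointer roughly like this:
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+        heap_get_root_tuple_one(page,
+                                ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+                                &root_offnum);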
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still be
+rechecked for an index key match (the case where the old tuple is reached
+via the new index key). So we must follow the update chain to the end
+every time to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But this also implies that the benefit of WARM will be
+no more than 50%. That is still significant, but if we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index or conclusively
+prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between these two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from index scan perspective.
+
+During the first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only
+a Blue pointer and that must not be removed. IOW we should remove a Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have two index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and also resetting the Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing the Red index flag to Blue but before
+removing the other Blue pointer, we will end up with two Blue pointers to
+a Red WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for the case with multiple Blue pointers.
+We can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for the presence of multiple Blue pointers and ensure that we neither
+accidentally remove either of the Blue pointers nor clear such WARM
+chains.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a22aae7..082bd1f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1957,6 +1957,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain originating or continuing at tid ever became a
+ * WARM chain, even if the actual UPDATE operation finally aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else the chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage, and the caller must recheck any tuple returned from this
+ * chain for index satisfaction.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1976,11 +2046,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2022,6 +2095,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2034,9 +2117,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2049,6 +2135,22 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+
+ /* WARM is not supported on system tables yet */
+ if (*recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2121,18 +2223,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested the "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases.
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3439,13 +3564,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
ItemId lp;
@@ -3468,6 +3595,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3492,6 +3620,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3513,10 +3645,13 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3568,6 +3703,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
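+
+ /*
+ * Pass the set of modified columns back to the caller, if requested; the
+ * executor uses it to decide which indexes need new entries after a WARM
+ * update.
+ */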
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3818,6 +3956,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4126,6 +4265,36 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both WARM and WARM-updated tuples since, if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out.
+ *
+ * XXX Later we'll add more checks so that WARM chains can be
+ * WARM updated again. This is probably good enough for a first
+ * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this, but it would
+ * require an API change to propagate the changed columns back to
+ * the caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will
+ * be fixed once the basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4168,6 +4337,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited once an index (key, CTID) duplicate
+ * detection mechanism is in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4183,12 +4367,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4307,7 +4517,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4456,7 +4669,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7354,6 +7567,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7367,6 +7581,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7389,6 +7604,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7494,6 +7713,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7505,6 +7725,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7578,6 +7801,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -7945,24 +8170,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - nowunused);
Assert(nunused >= 0);
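+
+ /*
+ * Convert the list of WARM root offsets into a per-offset boolean map,
+ * which is the representation heap_page_prune_execute() expects.
+ */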
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8549,16 +8788,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8618,6 +8863,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8753,6 +9003,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+ * warmchain[i] is TRUE if item i is becoming a redirected line pointer
+ * and points to a WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+ memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
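+ /*
+ * If this chain member carries the WARM flag, record the chain's root
+ * offset so the root line pointer can be marked as pointing to a WARM
+ * chain when the page is pruned.
+ */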
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..3f9a0cf 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -228,6 +230,20 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that the index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes, but we
+ * can't be sure whether that function has been called at this point, and
+ * we can't call it now for risk of deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -409,7 +425,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +464,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +491,15 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * Unless rechecks are forced for the entire scan, reset the recheck flag
+ * for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+ else
+ scan->xs_tuple_recheck = true;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +509,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks
+ * since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and must not consider
+ * this tuple.
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except when a recheck is forced for WARM tuples or by the indexscan_recheck developer GUC */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b646d..e76e928 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e011af1..97672a9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -472,6 +472,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ec5d6f1..5e57cc9 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2551,6 +2551,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2669,6 +2671,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..cd9b9a7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without rechecking the index keys, so mark
+ * the page as !all_visible.
+ *
+ * XXX Should we look at the root line pointer and check if
+ * the WARM flag is set there, or is checking the tuples in
+ * the chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 882ce18..5fe6182 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
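+ /*
+ * Apply any heap scan keys before accepting the tuple, and if a WARM
+ * chain was encountered, force the bitmap heap scan to recheck the
+ * quals for this page.
+ */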
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..49bda34 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -141,6 +141,26 @@ IndexOnlyNext(IndexOnlyScanState *node)
* but it's not clear whether it's a win to do so. The next index
* entry might require a visit to the same heap page.
*/
+
+ /*
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
+ */
+ if (scandesc->xs_tuple_recheck)
+ {
+ ExecStoreTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->xs_cbuf, /* buffer containing tuple */
+ false); /* don't pfree */
+ econtext->ecxt_scantuple = slot;
+ ResetExprContext(econtext);
+ if (!ExecQual(node->indexqual, econtext, false))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ continue;
+ }
+ }
}
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..daa0826 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index efb0c5e..3183db4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -448,6 +448,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -494,6 +495,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -824,6 +826,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -938,7 +943,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1025,10 +1030,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, we must insert new entries only for those
+ * indexes whose keys have changed, with their TIDs pointing to the
+ * root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
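+ /*
+ * For a WARM update, the new index entries must point at the root
+ * of the chain; heap_update stored the root line pointer offset in
+ * the new tuple's header.
+ */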
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
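+ /*
+ * For a regular (non-WARM) update, every index gets a new entry
+ * pointing at the new tuple itself, so discard the modified
+ * attribute set and update all indexes.
+ */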
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7584cb..d89d37b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4083,6 +4085,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5192,6 +5195,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5219,6 +5223,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2d3cf9e..ef4f5b4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -37,6 +37,7 @@ extern Datum pg_stat_get_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_live_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_dead_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS);
@@ -115,6 +116,7 @@ extern Datum pg_stat_get_xact_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_hit(PG_FUNCTION_ARGS);
@@ -245,6 +247,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1744,6 +1762,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..c6ef4e2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* true if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,38 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ /*
+ * Check if the index AM defines the amrecheck method. If it does
+ * not, the index does not support WARM, so disable WARM updates
+ * on such tables altogether.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4528,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4541,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 28ebcb6..2241ffb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1288,6 +1289,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..f1c73a0 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -389,4 +389,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 81f7982..04ffd67 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 4313eb9..09246b2 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -771,6 +787,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..83af072 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..da7ec84 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -111,7 +111,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 047a1ce..31f295f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2734,6 +2734,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3344 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2884,6 +2886,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3343 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 136276b..e324deb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8004d85..3bf4b5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -61,6 +61,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 152ff06..e0c8a90 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1176,7 +1178,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/itemid.h b/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- a/src/include/storage/itemid.h
+++ b/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at this
+ * line pointer may contain WARM tuples. This must only be interpreted along
+ * with the LP_REDIRECT flag.
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in the lp_len field to indicate this state. This is needed only for
+ * LP_REDIRECT line pointers, whose lp_len field is otherwise unused.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index fa15f28..982bf4c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 031e8c2..c416fe6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1705,6 +1705,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1838,6 +1839,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1881,6 +1883,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,7 +1921,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1934,7 +1938,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1956,7 +1961,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa1b83
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,51 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8641769..a610039 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..166ea37
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,15 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
On Tue, Jan 3, 2017 at 9:43 AM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
The patch still disables WARM on system tables, something I would like to
fix. But I've been delaying that because it will require changes at several
places since indexes on system tables are managed separately.
Here is another version which fixes a bug that I discovered while adding
support for system tables. The patch set now also includes a patch to
enable WARM on system tables. I'm attaching that as a separate patch
because while the changes to support WARM on system tables are many, almost
all of them are purely mechanical. We need to pass additional information
to CatalogUpdateIndexes()/CatalogIndexInsert(). We need to tell these
routines whether the update leading to them was a WARM update and which
columns were modified so that they can correctly avoid adding new index
tuples for indexes whose keys haven't changed.
I wish I could find another way of passing this information instead of
making changes in so many places, but the only other way I could think of
was tracking that information as part of the HeapTuple itself, which
doesn't seem nice and may also require changes at many call sites where
tuples are constructed. One minor improvement would be to pass just
"modified_attrs" instead of two arguments, with a NULL value implying a
non-WARM update. Other suggestions are welcome.
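To illustrate that convention (purely a sketch; the helper name and its
placement are invented here and are not part of the attached patches), the
per-index decision could look roughly like this:

#include "postgres.h"
#include "nodes/bitmapset.h"

/*
 * Hypothetical helper, only to illustrate the "pass just modified_attrs"
 * convention suggested above: a NULL bitmap means a regular (non-WARM)
 * update, so every index gets a new entry; a non-NULL bitmap means a WARM
 * update, and an index needs a new entry only if one of its columns was
 * actually modified.
 */
static bool
index_needs_new_entry(Bitmapset *modified_attrs, Bitmapset *index_attrs)
{
    /* NULL bitmap: regular update, insert into every index as today */
    if (modified_attrs == NULL)
        return true;

    /* WARM update: insert only if some column of this index changed */
    return bms_overlap(modified_attrs, index_attrs);
}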
I'm quite happy that all tests pass even after adding support for system
tables. One reason for adding support for system tables was to ensure that
some more code paths get exercised. As before, I've included Alvaro's
refactoring patch too.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v8.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f1b4602..a22aae7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &ctid, offnum);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2412,7 +2413,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2710,7 +2712,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2718,7 +2721,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2990,6 +2994,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3041,7 +3046,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3171,7 +3177,7 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tp.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3247,8 +3253,8 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+ /* Mark this tuple as the latest tuple in the update chain */
+ HeapTupleHeaderSetHeapLatest(tp.t_data);
MarkBufferDirty(buffer);
@@ -3449,6 +3455,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3511,6 +3519,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3795,7 +3804,7 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(oldtup.t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3976,7 +3985,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4159,6 +4168,20 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing old-style CTID chains, in which case
+ * the information must be obtained the hard way.
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
}
else
{
@@ -4166,10 +4189,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data);
+ /*
+ * Also update the in-memory copy with the root line pointer information
+ */
+ if (OffsetNumberIsValid(root_offnum))
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+ }
+ else
+ {
+ HeapTupleHeaderSetRootOffset(heaptup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ HeapTupleHeaderSetRootOffset(newtup->t_data,
+ ItemPointerGetOffsetNumber(&heaptup->t_self));
+ }
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4182,7 +4224,9 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextCtid(oldtup.t_data,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ ItemPointerGetOffsetNumber(&(heaptup->t_self)));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4221,6 +4265,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4501,7 +4546,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4513,6 +4559,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4559,7 +4606,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &t_ctid, offnum);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -4997,7 +5044,7 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple->t_data, &hufd->ctid, offnum);
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5073,7 +5120,7 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ HeapTupleHeaderSetHeapLatest(tuple->t_data);
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5587,6 +5634,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5595,6 +5643,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5824,7 +5874,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5833,7 +5883,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextCtid(mytup.t_data, &tupid, offnum);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5950,7 +6000,8 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6076,7 +6127,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
+ HeapTupleHeaderSetNextCtid(tp.t_data,
+ ItemPointerGetBlockNumber(&tp.t_self),
+ ItemPointerGetOffsetNumber(&tp.t_self));
MarkBufferDirty(buffer);
@@ -7425,6 +7478,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7544,6 +7598,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
/* Prepare WAL data for the new page */
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ xlrec.root_offnum = root_offnum;
bufflags = REGBUF_STANDARD;
if (init)
@@ -8199,7 +8254,7 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ HeapTupleHeaderSetHeapLatest(htup);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8289,7 +8344,9 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8424,8 +8481,9 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup);
+ HeapTupleHeaderSetRootOffset(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8561,7 +8619,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
+ ItemPointerGetOffsetNumber(&newtid));
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8695,12 +8754,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetHeapLatest(htup);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ if (OffsetNumberIsValid(xlrec->root_offnum))
+ HeapTupleHeaderSetRootOffset(htup, xlrec->root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset(htup, offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8828,9 +8892,7 @@ heap_xlog_lock(XLogReaderState *record)
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ HeapTupleHeaderSetHeapLatest(htup);
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c90fb71..39ee6ac 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once
+ * it's known. The former is used while updating an existing tuple, while the
+ * latter is used when inserting a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -69,7 +75,16 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+ if (OffsetNumberIsValid(root_offnum))
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ root_offnum);
+ else
+ HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+ offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..7c2231a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
@@ -820,6 +823,14 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/* Remember the root line pointer for this item */
root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If the caller is interested in just one offset and we found
+ * that, just return
+ */
+ if (OffsetNumberIsValid(target_offnum) &&
+ (nextoffnum == target_offnum))
+ return;
+
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
@@ -829,3 +840,25 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ heap_get_root_tuples_internal(page, target_offnum, offsets);
+ *root_offnum = offsets[target_offnum - 1];
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..09a164c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,14 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(old_tuple->t_data, &hashkey.tid,
+ ItemPointerGetOffsetNumber(&old_tuple->t_self));
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&mapping->new_tid),
+ ItemPointerGetOffsetNumber(&mapping->new_tid));
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+ ItemPointerGetBlockNumber(&new_tid),
+ ItemPointerGetOffsetNumber(&new_tid));
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ HeapTupleHeaderSetNextCtid(onpage_tup,
+ ItemPointerGetBlockNumber(&tup->t_self),
+ ItemPointerGetOffsetNumber(&tup->t_self));
+ HeapTupleHeaderSetHeapLatest(onpage_tup);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..882ce18 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tup->t_data, &ctid_wait,
+ ItemPointerGetOffsetNumber(&tup->t_self));
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2451,7 +2451,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple.t_data, &tuple.t_self,
+ ItemPointerGetOffsetNumber(&tuple.t_self));
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0d12bbb..81f7982 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5a04561 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..82e5b5f 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 8fb1f6d..4313eb9 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,30 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
+)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set on the last tuple in the update chain. But for
+ * clusters upgraded from a pre-10.0 release, we also check whether t_ctid
+ * points to the tuple itself and, if so, treat that tuple as the latest
+ * tuple in the chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +572,55 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextCtid(tup, block, offset) \
+do { \
+ ItemPointerSetBlockNumber(&((tup)->t_ctid), (block)); \
+ ItemPointerSetOffsetNumber(&((tup)->t_ctid), (offset)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+ if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ (offnum)); \
+ } \
+ else \
+ { \
+ ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+ } \
+} while (0)
+
+#define HeapTupleHeaderSetRootOffset(tup, offset) \
+do { \
+ AssertMacro(!HeapTupleHeaderIsHotUpdated(tup)); \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE); \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offset)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro((tup)->t_infomask2 & HEAP_LATEST_TUPLE), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ (tup)->t_infomask2 & HEAP_LATEST_TUPLE \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
0002_warm_updates_v8.patch (application/octet-stream)
diff --git b/contrib/bloom/blutils.c a/contrib/bloom/blutils.c
index b68a0d1..b95275f 100644
--- b/contrib/bloom/blutils.c
+++ a/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/brin/brin.c a/src/backend/access/brin/brin.c
index 1b45a4c..ba3fffb 100644
--- b/src/backend/access/brin/brin.c
+++ a/src/backend/access/brin/brin.c
@@ -111,6 +111,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/gist/gist.c a/src/backend/access/gist/gist.c
index b8aa9bc..491e411 100644
--- b/src/backend/access/gist/gist.c
+++ a/src/backend/access/gist/gist.c
@@ -88,6 +88,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/hash/hash.c a/src/backend/access/hash/hash.c
index 6806e32..2026004 100644
--- b/src/backend/access/hash/hash.c
+++ a/src/backend/access/hash/hash.c
@@ -85,6 +85,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -265,6 +266,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -302,8 +305,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git b/src/backend/access/hash/hashsearch.c a/src/backend/access/hash/hashsearch.c
index 8d43b38..05b078f 100644
--- b/src/backend/access/hash/hashsearch.c
+++ a/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -407,6 +409,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git b/src/backend/access/hash/hashutil.c a/src/backend/access/hash/hashutil.c
index fa9cbdc..6897985 100644
--- b/src/backend/access/hash/hashutil.c
+++ a/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first compute the hash of the value fetched from the
+ * heap and then do the comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git b/src/backend/access/heap/README.WARM a/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ a/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced the number of redundant
+index entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the update
+must not change any column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block has enough space to
+store the new version of the tuple, the same requirement as for HOT
+updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index
+for the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, consider a table with two columns and an index on each of
+them. When a tuple is first inserted into the table, each index has
+exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any
+new entry, and Index2's new entry still points to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"An update chain that has more than one index entry pointing to its
+root line pointer is called a WARM chain, and the action that creates a
+WARM chain is called a WARM update."
+
+Since all index entries always point to the root of the WARM chain,
+even when there is more than one of them, WARM chains can be pruned and
+dead tuples can be removed without any need for corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple may now be reached through an index
+entry whose key no longer matches it. In the above example, the tuple
+[1111, bbbb] is reachable from key (aaaa) as well as (bbbb). For this
+reason, tuples returned from a WARM chain must always be rechecked
+against the index key.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every index AM has its own notion of index tuples, each index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, so the recheck routine
+for the hash AM must first compute the hash value of the heap attribute
+and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
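+
+As a rough illustration, a btree-style recheck recomputes the index
+values from the heap tuple and compares them, datum by datum, with the
+values stored in the index tuple. The sketch below is abridged from the
+btrecheck() routine added elsewhere in this patch; expression-index
+handling and error paths are omitted for brevity:
+
+    bool
+    btrecheck(Relation indexRel, IndexTuple itup, Relation heapRel,
+              HeapTuple htup)
+    {
+        IndexInfo      *indexInfo = BuildIndexInfo(indexRel);
+        TupleTableSlot *slot =
+            MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+        Datum           values[INDEX_MAX_KEYS];
+        bool            isnull[INDEX_MAX_KEYS];
+        bool            equal = true;
+        int             i;
+
+        /* recompute the index values from the heap tuple ... */
+        ExecStoreTuple(htup, slot, InvalidBuffer, false);
+        FormIndexDatum(indexInfo, slot, NULL, values, isnull);
+
+        /* ... and compare them with what the index tuple stores */
+        for (i = 1; i <= indexRel->rd_rel->relnatts; i++)
+        {
+            bool              indxisnull;
+            Datum             indxvalue = index_getattr(itup, i,
+                                    indexRel->rd_att, &indxisnull);
+            Form_pg_attribute att = indexRel->rd_att->attrs[i - 1];
+
+            if (isnull[i - 1] != indxisnull ||
+                (!indxisnull &&
+                 !datumIsEqual(values[i - 1], indxvalue,
+                               att->attbyval, att->attlen)))
+            {
+                equal = false;
+                break;
+            }
+        }
+        ExecDropSingleTupleTableSlot(slot);
+        return equal;
+    }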
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index entries with the same key pointing to the same WARM chain. If such
+duplicates exist, the same valid tuple becomes reachable via multiple
+index entries, each of which passes the index key check. In the above
+example, if the tuple [1111, bbbb] is again updated to [1111, aaaa] and
+we insert a new index entry (aaaa) pointing to the root line pointer, we
+end up with the following structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow a WARM update to a tuple that is already part of a WARM
+chain. This guarantees that there can never be duplicate index entries
+pointing to the same root line pointer, because we must have compared
+the old and new index keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not allow a WARM
+update to a tuple that already belongs to a WARM chain. HOT updates are
+still fine because they do not add new index entries.
+
+Even with this restriction, WARM is a significant improvement because
+the number of regular (non-HOT, non-WARM) updates is cut roughly in
+half.
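+
+For reference, the gating condition in the patched heap_update() looks
+roughly like this (abridged from the patch); the HeapTupleIsHeapWarmTuple()
+test is what enforces the one-WARM-update-per-chain rule:
+
+    if (!bms_overlap(modified_attrs, hot_attrs))
+        use_hot_update = true;          /* no index column changed at all */
+    else if (relation->rd_supportswarm &&
+             !bms_overlap(modified_attrs, exprindx_attrs) &&
+             !bms_is_subset(hot_attrs, modified_attrs) &&
+             !HeapTupleIsHeapWarmTuple(&oldtup) &&
+             !IsSystemRelation(relation))
+        use_warm_update = true;         /* first WARM update on this chain */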
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)", which
+returns the same value if the new heap value differs only in case. So
+for expression indexes we cannot rely solely on the heap column check to
+decide whether or not to insert a new index entry. Similarly, for
+partial indexes the predicate expression must be evaluated to decide
+whether or not a new index entry is needed when columns referred to in
+the predicate change.
+
+(None of this is currently implemented; we simply disallow a WARM update
+if any column used in an index expression or a partial-index predicate
+has changed.)
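+
+To make that check cheap, BuildIndexInfo() in this patch collects, for
+each index, the full set of heap attributes it depends on, including
+columns referenced from index expressions and partial-index predicates;
+ExecInsertIndexTuples() can then skip indexes whose attribute set does
+not overlap the columns modified by a WARM update. Roughly (abridged
+from the patch):
+
+    /* BuildIndexInfo(): remember every heap attribute this index uses */
+    for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+        ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs,
+                                          ii->ii_KeyAttrNumbers[i] -
+                                          FirstLowInvalidHeapAttributeNumber);
+    pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+    pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
+    /* ExecInsertIndexTuples(): skip indexes untouched by a WARM update */
+    if (modified_attrs &&
+        !bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+        continue;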
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of
+the tuple being updated. Normally, the t_ctid field in the heap tuple
+header is used to find the next tuple in the update chain. But the tuple
+that we are updating must be the last tuple in the update chain, and in
+that case the t_ctid field usually just points to the tuple itself. So
+in theory, we could use t_ctid to store additional information in the
+last tuple of the update chain, as long as the fact that it is the last
+tuple is recorded elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which then again becomes the last valid
+tuple in the chain no longer carries the root line pointer information.
+In such rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
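+
+In code terms, an updater finds the root offset roughly as follows
+(using the helpers introduced by this patch):
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+        /* aborted-update case: scan the page to locate the root */
+        heap_get_root_tuple_one(page,
+                                ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+                                &root_offnum);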
+
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in
+t_infomask2 to store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still be
+rechecked for an index key match (this covers the case where an old
+tuple is reached via the new index key). So we must follow the update
+chain all the way to the end every time, to check whether it is a WARM
+chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This handles the most
+common case, where a WARM chain is reduced to a redirect line pointer
+and a single remaining tuple.
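+
+For example, heap_page_prune_execute() in this patch carries the WARM
+marking over when it turns the root into a redirect line pointer:
+
+    ItemId  fromlp = PageGetItemId(page, fromoff);
+
+    ItemIdSetRedirect(fromlp, tooff);
+    if (warmchain[fromoff])
+        ItemIdSetHeapWarm(fromlp);  /* this redirect now heads a WARM chain */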
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But it also implies that the benefit of WARM can be no
+more than 50%. That is still significant, but if we could return WARM
+chains back to normal HOT status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but
+certain index keys may not match between the two parts. Let's say we
+mark heap tuples in each part with a special Red-Blue flag, and the same
+flag is replicated in the index tuples. For example, when new rows are
+inserted in a table, they are marked with the Blue flag and the index
+entries associated with those rows are also marked Blue. When a row is
+WARM updated, the new version is marked with the Red flag and the new
+index entry created by the update is also marked Red.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap
+tuple with the Blue flag will be reachable from the Blue pointer, and
+one with the Red flag will be reachable from the Red pointer. But for
+indexes which did not create a new entry, both Blue and Red tuples will
+be reachable from the Blue pointer (there is no Red pointer in such
+indexes). So, as a side note, merely matching Red and Blue flags is not
+enough from an index scan perspective.
+
+During the first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are marked with
+either the Blue flag or the Red flag (but not a mix of Red and Blue),
+then the chain is a candidate for HOT conversion. We remember the root
+line pointer and the Red-Blue flag of the WARM chain in a separate
+array.
+
+If we have a Red WARM chain, then our goal is to remove the Blue
+pointers, and vice versa. But there is a catch. For Index2 above, there
+is only a Blue pointer and that must not be removed. In other words, we
+should remove a Blue pointer only if a Red pointer exists. Since index
+vacuum may visit Red and Blue pointers in any order, I think we will
+need another index pass to remove the dead index pointers. So in the
+first index pass we check which WARM candidates have two index pointers.
+In the second pass, we remove the dead pointer and reset the Red flag if
+the surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and also resetting the Red flag back to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing the Red index flag to Blue but before
+removing the other Blue pointer, we will end up with two Blue pointers
+to a Red WARM chain. But since the HEAP_WARM_TUPLE flag on the heap
+tuple is still set, further WARM updates to the chain will be blocked. I
+guess we will need some special handling for the case of multiple Blue
+pointers. We can either leave such WARM chains alone and let them die
+with a subsequent non-WARM update, or we must apply the heap-recheck
+logic during index vacuum to find the dead pointer. Given that vacuum
+aborts are not common, I am inclined to leave this case unhandled. We
+must still detect the presence of multiple Blue pointers and ensure that
+we neither accidentally remove either of the Blue pointers nor clear the
+WARM chain flag.
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index a22aae7..082bd1f 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -1957,6 +1957,76 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain originating at or continuing through tid ever
+ * became a WARM chain, even if the actual UPDATE operation finally aborted.
+ */
+static void
+hot_check_warm_chain(Page dp, ItemPointer tid, bool *recheck)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ if (*recheck == true)
+ return;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either WARM or WARM updated tuple signals possible
+ * breakage and the caller must recheck tuple returned from this chain
+ * for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ *recheck = true;
+ break;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (HeapTupleIsHotUpdated(&heapTuple))
+ {
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+ else
+ break; /* end of chain */
+ }
+
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1976,11 +2046,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2022,6 +2095,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* Follow the redirect */
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
+
+ /* Check if it's a WARM chain */
+ if (recheck && *recheck == false)
+ {
+ if (ItemIdIsHeapWarm(lp))
+ {
+ *recheck = true;
+ Assert(!IsSystemRelation(relation));
+ }
+ }
continue;
}
/* else must be end of chain */
@@ -2034,9 +2117,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2049,6 +2135,22 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ {
+ hot_check_warm_chain(dp, &heapTuple->t_self, recheck);
+
+ /* WARM is not supported on system tables yet */
+ if (*recheck == true)
+ Assert(!IsSystemRelation(relation));
+ }
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2121,18 +2223,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3439,13 +3564,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
ItemId lp;
@@ -3468,6 +3595,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3492,6 +3620,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3513,10 +3645,13 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3568,6 +3703,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3818,6 +3956,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4126,6 +4265,36 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+			 * XXX Later we'll add more checks to allow WARM chains to be
+			 * further WARM updated. This is probably good enough for a first
+			 * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+			 * require API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by update. This will be
+ * fixed once basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup) &&
+ !IsSystemRelation(relation))
+ use_warm_update = true;
+ }
}
else
{
@@ -4168,6 +4337,21 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * XXX This should be revisited if we get index (key, CTID) duplicate
+ * detection mechanism in place
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4183,12 +4367,38 @@ l2:
ItemPointerGetOffsetNumber(&(oldtup.t_self)),
&root_offnum);
}
+ else if (use_warm_update)
+ {
+ Assert(!IsSystemRelation(relation));
+
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4307,7 +4517,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4456,7 +4669,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, NULL, NULL);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7354,6 +7567,7 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid)
@@ -7367,6 +7581,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
+ xlrec.nwarm = nwarm;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
@@ -7389,6 +7604,10 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRegisterBufData(0, (char *) nowdead,
ndead * sizeof(OffsetNumber));
+ if (nwarm > 0)
+ XLogRegisterBufData(0, (char *) warm,
+ nwarm * sizeof(OffsetNumber));
+
if (nunused > 0)
XLogRegisterBufData(0, (char *) nowunused,
nunused * sizeof(OffsetNumber));
@@ -7494,6 +7713,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7505,6 +7725,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7578,6 +7801,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -7945,24 +8170,38 @@ heap_xlog_clean(XLogReaderState *record)
OffsetNumber *redirected;
OffsetNumber *nowdead;
OffsetNumber *nowunused;
+ OffsetNumber *warm;
int nredirected;
int ndead;
int nunused;
+ int nwarm;
+ int i;
Size datalen;
+ bool warmchain[MaxHeapTuplesPerPage + 1];
redirected = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
nredirected = xlrec->nredirected;
ndead = xlrec->ndead;
+ nwarm = xlrec->nwarm;
+
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
- nowunused = nowdead + ndead;
- nunused = (end - nowunused);
+ warm = nowdead + ndead;
+ nowunused = warm + nwarm;
+
+ nunused = (end - warm);
Assert(nunused >= 0);
+ memset(warmchain, 0, sizeof (warmchain));
+ for (i = 0; i < nwarm; i++)
+ warmchain[warm[i]] = true;
+
+
/* Update all item pointers per the record, and repair fragmentation */
heap_page_prune_execute(buffer,
redirected, nredirected,
+ warmchain,
nowdead, ndead,
nowunused, nunused);
@@ -8549,16 +8788,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8618,6 +8863,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+		/* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextCtid(htup, ItemPointerGetBlockNumber(&newtid),
ItemPointerGetOffsetNumber(&newtid));
@@ -8753,6 +9003,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+
+		/* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Make sure there is no forward chain link in t_ctid */
HeapTupleHeaderSetHeapLatest(htup);
diff --git b/src/backend/access/heap/pruneheap.c a/src/backend/access/heap/pruneheap.c
index 7c2231a..d71a297 100644
--- b/src/backend/access/heap/pruneheap.c
+++ a/src/backend/access/heap/pruneheap.c
@@ -36,12 +36,19 @@ typedef struct
int nredirected; /* numbers of entries in arrays below */
int ndead;
int nunused;
+ int nwarm;
/* arrays that accumulate indexes of items to be changed */
OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage];
+ OffsetNumber warm[MaxHeapTuplesPerPage];
/* marked[i] is TRUE if item i is entered in one of the above arrays */
bool marked[MaxHeapTuplesPerPage + 1];
+ /*
+	 * warmchain[i] is TRUE if item i is becoming a redirected lp and points
+	 * to a WARM chain
+ */
+ bool warmchain[MaxHeapTuplesPerPage + 1];
} PruneState;
/* Local functions */
@@ -54,6 +61,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum);
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_warmupdate(PruneState *prstate,
+ OffsetNumber offnum);
static void heap_get_root_tuples_internal(Page page,
OffsetNumber target_offnum, OffsetNumber *root_offsets);
@@ -203,8 +212,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
prstate.new_prune_xid = InvalidTransactionId;
prstate.latestRemovedXid = *latestRemovedXid;
- prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+ prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nwarm = 0;
memset(prstate.marked, 0, sizeof(prstate.marked));
+	memset(prstate.warmchain, 0, sizeof(prstate.warmchain));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
@@ -241,6 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warmchain,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
@@ -268,6 +279,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
recptr = log_heap_clean(relation, buffer,
prstate.redirected, prstate.nredirected,
+ prstate.warm, prstate.nwarm,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
prstate.latestRemovedXid);
@@ -479,6 +491,12 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(htup))
+ {
+ Assert(!IsSystemRelation(relation));
+ heap_prune_record_warmupdate(prstate, rootoffnum);
+ }
+
/*
* OK, this tuple is indeed a member of the chain.
*/
@@ -668,6 +686,18 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
prstate->marked[offnum] = true;
}
+/* Record item pointer which is a root of a WARM chain */
+static void
+heap_prune_record_warmupdate(PruneState *prstate, OffsetNumber offnum)
+{
+ Assert(prstate->nwarm < MaxHeapTuplesPerPage);
+ if (prstate->warmchain[offnum])
+ return;
+ prstate->warm[prstate->nwarm] = offnum;
+ prstate->nwarm++;
+ prstate->warmchain[offnum] = true;
+}
+
/*
* Perform the actual page changes needed by heap_page_prune.
@@ -681,6 +711,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
@@ -697,6 +728,12 @@ heap_page_prune_execute(Buffer buffer,
ItemId fromlp = PageGetItemId(page, fromoff);
ItemIdSetRedirect(fromlp, tooff);
+
+ /*
+ * Save information about WARM chains in the item itself
+ */
+ if (warmchain[fromoff])
+ ItemIdSetHeapWarm(fromlp);
}
/* Update all now-dead line pointers */
diff --git b/src/backend/access/index/genam.c a/src/backend/access/index/genam.c
index 65c941d..4f9fb12 100644
--- b/src/backend/access/index/genam.c
+++ a/src/backend/access/index/genam.c
@@ -99,7 +99,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
else
scan->orderByData = NULL;
- scan->xs_want_itup = false; /* may be set later */
+ scan->xs_want_itup = true; /* hack for now to always get index tuple */
/*
* During recovery we ignore killed tuples and don't bother to kill them
diff --git b/src/backend/access/index/indexam.c a/src/backend/access/index/indexam.c
index 54b71cb..3f9a0cf 100644
--- b/src/backend/access/index/indexam.c
+++ a/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -228,6 +230,20 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes, but we
+ * can't be sure if the function was called at this point and we can't call
+	 * it now because of the risk of deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -409,7 +425,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +464,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +491,15 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+ else
+ scan->xs_tuple_recheck = true;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +509,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+		 * Ok, we got a tuple which satisfies the snapshot, but if it's part
+		 * of a WARM chain, we must do additional checks to ensure that we
+		 * are indeed returning a correct tuple. Note that if the index AM
+		 * does not implement the amrecheck method, then we don't do any
+		 * additional checks since WARM must have been disabled on such tables.
+ *
+		 * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or is CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git b/src/backend/access/nbtree/nbtinsert.c a/src/backend/access/nbtree/nbtinsert.c
index ef69290..e0afffd 100644
--- b/src/backend/access/nbtree/nbtinsert.c
+++ a/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+						 * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git b/src/backend/access/nbtree/nbtree.c a/src/backend/access/nbtree/nbtree.c
index 128744c..6b1236a 100644
--- b/src/backend/access/nbtree/nbtree.c
+++ a/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -292,8 +294,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
- scan->xs_recheck = false;
+ /* btree indexes are never lossy, except for WARM tuples */
+ scan->xs_recheck = indexscan_recheck;
+ scan->xs_tuple_recheck = indexscan_recheck;
/*
* If we have any array keys, initialize them during first call for a
diff --git b/src/backend/access/nbtree/nbtutils.c a/src/backend/access/nbtree/nbtutils.c
index 063c988..c9c0501 100644
--- b/src/backend/access/nbtree/nbtutils.c
+++ a/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attribute
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git b/src/backend/access/spgist/spgutils.c a/src/backend/access/spgist/spgutils.c
index d570ae5..813b5c3 100644
--- b/src/backend/access/spgist/spgutils.c
+++ a/src/backend/access/spgist/spgutils.c
@@ -67,6 +67,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/catalog/index.c a/src/backend/catalog/index.c
index 08b646d..e76e928 100644
--- b/src/backend/catalog/index.c
+++ a/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git b/src/backend/catalog/system_views.sql a/src/backend/catalog/system_views.sql
index e011af1..97672a9 100644
--- b/src/backend/catalog/system_views.sql
+++ a/src/backend/catalog/system_views.sql
@@ -472,6 +472,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git b/src/backend/commands/constraint.c a/src/backend/commands/constraint.c
index 26f9114..997c8f5 100644
--- b/src/backend/commands/constraint.c
+++ a/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git b/src/backend/commands/copy.c a/src/backend/commands/copy.c
index ec5d6f1..5e57cc9 100644
--- b/src/backend/commands/copy.c
+++ a/src/backend/commands/copy.c
@@ -2551,6 +2551,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2669,6 +2671,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git b/src/backend/commands/vacuumlazy.c a/src/backend/commands/vacuumlazy.c
index b5fb325..cd9b9a7 100644
--- b/src/backend/commands/vacuumlazy.c
+++ a/src/backend/commands/vacuumlazy.c
@@ -1468,6 +1468,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
+ NULL, 0,
unused, uncnt,
vacrelstats->latestRemovedXid);
PageSetLSN(page, recptr);
@@ -2128,6 +2129,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+					 * root of this chain. We can't allow index-only scans of
+					 * such tuples without verifying the index key, so mark
+					 * the page as !all_visible.
+ *
+					 * XXX Should we look at the root line pointer and check if
+					 * the WARM flag is set there, or is checking the tuples in
+					 * the chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git b/src/backend/executor/execIndexing.c a/src/backend/executor/execIndexing.c
index 882ce18..5fe6182 100644
--- b/src/backend/executor/execIndexing.c
+++ a/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose column has changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
diff --git b/src/backend/executor/nodeBitmapHeapscan.c a/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..ff77349 100644
--- b/src/backend/executor/nodeBitmapHeapscan.c
+++ a/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,23 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git b/src/backend/executor/nodeIndexscan.c a/src/backend/executor/nodeIndexscan.c
index 3143bd9..daa0826 100644
--- b/src/backend/executor/nodeIndexscan.c
+++ a/src/backend/executor/nodeIndexscan.c
@@ -39,6 +39,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+bool indexscan_recheck = false;
+
/*
* When an ordering operator is used, tuples fetched from the index that
* need to be reordered are queued in a pairing heap, as ReorderTuples.
@@ -115,10 +117,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git b/src/backend/executor/nodeModifyTable.c a/src/backend/executor/nodeModifyTable.c
index efb0c5e..3183db4 100644
--- b/src/backend/executor/nodeModifyTable.c
+++ a/src/backend/executor/nodeModifyTable.c
@@ -448,6 +448,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -494,6 +495,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -824,6 +826,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -938,7 +943,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1025,10 +1030,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git b/src/backend/postmaster/pgstat.c a/src/backend/postmaster/pgstat.c
index c7584cb..d89d37b 100644
--- b/src/backend/postmaster/pgstat.c
+++ a/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4083,6 +4085,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5192,6 +5195,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5219,6 +5223,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git b/src/backend/utils/adt/pgstatfuncs.c a/src/backend/utils/adt/pgstatfuncs.c
index 2d3cf9e..ef4f5b4 100644
--- b/src/backend/utils/adt/pgstatfuncs.c
+++ a/src/backend/utils/adt/pgstatfuncs.c
@@ -37,6 +37,7 @@ extern Datum pg_stat_get_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_live_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_dead_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS);
@@ -115,6 +116,7 @@ extern Datum pg_stat_get_xact_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_xact_blocks_hit(PG_FUNCTION_ARGS);
@@ -245,6 +247,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1744,6 +1762,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git b/src/backend/utils/cache/relcache.c a/src/backend/utils/cache/relcache.c
index 79e0b1f..c6ef4e2 100644
--- b/src/backend/utils/cache/relcache.c
+++ a/src/backend/utils/cache/relcache.c
@@ -2030,6 +2030,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_idattr);
if (relation->rd_options)
@@ -4373,12 +4374,15 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true;/* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4391,6 +4395,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_keyattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4429,6 +4435,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
idindexattrs = NULL;
foreach(l, indexoidlist)
@@ -4474,19 +4481,38 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_idattr);
@@ -4502,7 +4528,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4514,6 +4541,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return uindexattrs;
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git b/src/backend/utils/misc/guc.c a/src/backend/utils/misc/guc.c
index 28ebcb6..2241ffb 100644
--- b/src/backend/utils/misc/guc.c
+++ a/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern bool indexscan_recheck;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -1288,6 +1289,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"indexscan_recheck", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Recheck heap rows returned from an index scan."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &indexscan_recheck,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"debug_deadlocks", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Dumps information about all current locks when a deadlock timeout occurs."),
NULL,
diff --git b/src/include/access/amapi.h a/src/include/access/amapi.h
index 1036cca..37eaf76 100644
--- b/src/include/access/amapi.h
+++ a/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git b/src/include/access/hash.h a/src/include/access/hash.h
index 6dfc41f..f1c73a0 100644
--- b/src/include/access/hash.h
+++ a/src/include/access/hash.h
@@ -389,4 +389,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git b/src/include/access/heapam.h a/src/include/access/heapam.h
index 81f7982..04ffd67 100644
--- b/src/include/access/heapam.h
+++ a/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -186,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
bool report_stats, TransactionId *latestRemovedXid);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ bool *warmchain,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
diff --git b/src/include/access/heapam_xlog.h a/src/include/access/heapam_xlog.h
index 5a04561..ddc3a7a 100644
--- b/src/include/access/heapam_xlog.h
+++ a/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -211,7 +212,9 @@ typedef struct xl_heap_update
* * for each redirected item: the item offset, then the offset redirected to
* * for each now-dead item: the item offset
* * for each now-unused item: the item offset
- * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
+ * * for each now-warm item: the item offset
+ * The total number of OffsetNumbers is therefore
+ * 2*nredirected+ndead+nunused+nwarm.
* Note that nunused is not explicitly stored, but may be found by reference
* to the total record length.
*/
@@ -220,10 +223,11 @@ typedef struct xl_heap_clean
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
+ uint16 nwarm;
/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_clean;
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapClean (offsetof(xl_heap_clean, nwarm) + sizeof(uint16))
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
@@ -384,6 +388,7 @@ extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
+ OffsetNumber *warm, int nwarm,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
diff --git b/src/include/access/htup_details.h a/src/include/access/htup_details.h
index 4313eb9..09246b2 100644
--- b/src/include/access/htup_details.h
+++ a/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) \
+)
+
#define HeapTupleHeaderSetHeapLatest(tup) \
( \
(tup)->t_infomask2 |= HEAP_LATEST_TUPLE \
@@ -771,6 +787,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git b/src/include/access/nbtree.h a/src/include/access/nbtree.h
index c580f51..83af072 100644
--- b/src/include/access/nbtree.h
+++ a/src/include/access/nbtree.h
@@ -751,6 +751,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git b/src/include/access/relscan.h a/src/include/access/relscan.h
index de98dd6..da7ec84 100644
--- b/src/include/access/relscan.h
+++ a/src/include/access/relscan.h
@@ -111,7 +111,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git b/src/include/catalog/pg_proc.h a/src/include/catalog/pg_proc.h
index 047a1ce..31f295f 100644
--- b/src/include/catalog/pg_proc.h
+++ a/src/include/catalog/pg_proc.h
@@ -2734,6 +2734,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3344 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2884,6 +2886,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3343 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git b/src/include/executor/executor.h a/src/include/executor/executor.h
index 136276b..e324deb 100644
--- b/src/include/executor/executor.h
+++ a/src/include/executor/executor.h
@@ -366,6 +366,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git b/src/include/executor/nodeIndexscan.h a/src/include/executor/nodeIndexscan.h
index 194fadb..fe9c78e 100644
--- b/src/include/executor/nodeIndexscan.h
+++ a/src/include/executor/nodeIndexscan.h
@@ -38,4 +38,5 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool indexscan_recheck;
#endif /* NODEINDEXSCAN_H */
diff --git b/src/include/nodes/execnodes.h a/src/include/nodes/execnodes.h
index 8004d85..3bf4b5f 100644
--- b/src/include/nodes/execnodes.h
+++ a/src/include/nodes/execnodes.h
@@ -61,6 +61,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git b/src/include/pgstat.h a/src/include/pgstat.h
index 152ff06..e0c8a90 100644
--- b/src/include/pgstat.h
+++ a/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1176,7 +1178,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git b/src/include/storage/itemid.h a/src/include/storage/itemid.h
index 509c577..8c9cc99 100644
--- b/src/include/storage/itemid.h
+++ a/src/include/storage/itemid.h
@@ -46,6 +46,12 @@ typedef ItemIdData *ItemId;
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
+/*
+ * Special value used in lp_len to indicate that the chain starting at line
+ * pointer may contain WARM tuples. This must only be interpreted along with
+ * LP_REDIRECT flag
+ */
+#define SpecHeapWarmLen 0x1ffb
/* ----------------
* support macros
@@ -112,12 +118,15 @@ typedef uint16 ItemLength;
#define ItemIdIsDead(itemId) \
((itemId)->lp_flags == LP_DEAD)
+#define ItemIdIsHeapWarm(itemId) \
+ (((itemId)->lp_flags == LP_REDIRECT) && \
+ ((itemId)->lp_len == SpecHeapWarmLen))
/*
* ItemIdHasStorage
* True iff item identifier has associated storage.
*/
#define ItemIdHasStorage(itemId) \
- ((itemId)->lp_len != 0)
+ (!ItemIdIsRedirected(itemId) && (itemId)->lp_len != 0)
/*
* ItemIdSetUnused
@@ -168,6 +177,26 @@ typedef uint16 ItemLength;
)
/*
+ * ItemIdSetHeapWarm
+ * Set the item identifier to identify as starting of a WARM chain
+ *
+ * Note: Since all bits in lp_flags are currently used, we store a special
+ * value in lp_len field to indicate this state. This is required only for
+ * LP_REDIRECT tuple and lp_len field is unused for such line pointers.
+ */
+#define ItemIdSetHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = SpecHeapWarmLen; \
+} while (0)
+
+#define ItemIdClearHeapWarm(itemId) \
+do { \
+ AssertMacro((itemId)->lp_flags == LP_REDIRECT); \
+ (itemId)->lp_len = 0; \
+} while (0)
+
+/*
* ItemIdMarkDead
* Set the item identifier to be DEAD, keeping its existing storage.
*
diff --git b/src/include/utils/rel.h a/src/include/utils/rel.h
index fa15f28..982bf4c 100644
--- b/src/include/utils/rel.h
+++ a/src/include/utils/rel.h
@@ -101,8 +101,11 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
/*
* rd_options is set whenever rd_rel is loaded into the relcache entry.
diff --git b/src/include/utils/relcache.h a/src/include/utils/relcache.h
index 6ea7dd2..290e9b7 100644
--- b/src/include/utils/relcache.h
+++ a/src/include/utils/relcache.h
@@ -48,7 +48,8 @@ typedef enum IndexAttrBitmapKind
{
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git b/src/test/regress/expected/rules.out a/src/test/regress/expected/rules.out
index 031e8c2..c416fe6 100644
--- b/src/test/regress/expected/rules.out
+++ a/src/test/regress/expected/rules.out
@@ -1705,6 +1705,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1838,6 +1839,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1881,6 +1883,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,7 +1921,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1934,7 +1938,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1956,7 +1961,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git b/src/test/regress/expected/warm.out a/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa1b83
--- /dev/null
+++ a/src/test/regress/expected/warm.out
@@ -0,0 +1,51 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git b/src/test/regress/parallel_schedule a/src/test/regress/parallel_schedule
index 8641769..a610039 100644
--- b/src/test/regress/parallel_schedule
+++ a/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git b/src/test/regress/sql/warm.sql a/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..166ea37
--- /dev/null
+++ a/src/test/regress/sql/warm.sql
@@ -0,0 +1,15 @@
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
Attachment: 0003_warm_fixes_v6.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b3de79c..9353175 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7831,7 +7831,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
- bool warm_update;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e32deb1..39ee6ac 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -75,6 +75,9 @@ RelationPutHeapTuple(Relation relation,
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
+ ((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
if (OffsetNumberIsValid(root_offnum))
HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 03c6b62..c24e486 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -801,7 +801,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tup->t_data, &ctid_wait,
+ ItemPointerGetOffsetNumber(&tup->t_self));
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 079a77f..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2451,7 +2451,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextCtid(tuple.t_data, &tuple.t_self,
+ ItemPointerGetOffsetNumber(&tuple.t_self));
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 25752b0..ef4f5b4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -37,6 +37,7 @@ extern Datum pg_stat_get_tuples_inserted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_deleted(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_live_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_dead_tuples(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 37874ca..c6ef4e2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4487,6 +4487,12 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
/*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ /*
* Check if the index has amrecheck method defined. If the method is
* not defined, the index does not support WARM update. Completely
* disable WARM updates on such tables
Attachment: interesting-attrs-2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ea579a0..19edbdf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -95,11 +95,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3443,6 +3440,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3460,9 +3459,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3489,21 +3485,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3524,7 +3529,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3550,6 +3555,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3561,10 +3570,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3803,6 +3809,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4107,7 +4115,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4122,7 +4130,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4270,13 +4280,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4310,7 +4322,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4355,114 +4367,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
- *
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
+ int attnum;
+ Bitmapset *modified = NULL;
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
-
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
-
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
Reading through the track_root_lp patch now.
+	/*
+	 * For HOT (or WARM) updated tuples, we store the offset of the root
+	 * line pointer of this chain in the ip_posid field of the new tuple.
+	 * Usually this information will be available in the corresponding
+	 * field of the old tuple. But for aborted updates or pg_upgraded
+	 * databases, we might be seeing the old-style CTID chains and hence
+	 * the information must be obtained by hard way
+	 */
+	if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+		root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+	else
+		heap_get_root_tuple_one(page,
+				ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+				&root_offnum);
Hmm. So the HasRootOffset tests the HEAP_LATEST_TUPLE bit, which is
reset temporarily during an update. So that case shouldn't occur often.
Oh, I just noticed that HeapTupleHeaderSetNextCtid also clears the flag.
@@ -4166,10 +4189,29 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
	}

-	RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+	/* insert new tuple */
+	RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+	HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+	HeapTupleHeaderSetHeapLatest(newtup->t_data);
+
+	/*
+	 * Also update the in-memory copy with the root line pointer information
+	 */
+	if (OffsetNumberIsValid(root_offnum))
+	{
+		HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+		HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+	}
+	else
+	{
+		HeapTupleHeaderSetRootOffset(heaptup->t_data,
+				ItemPointerGetOffsetNumber(&heaptup->t_self));
+		HeapTupleHeaderSetRootOffset(newtup->t_data,
+				ItemPointerGetOffsetNumber(&heaptup->t_self));
+	}
This is repetitive. I think after RelationPutHeapTuple it'd be better
to assign root_offnum = &heaptup->t_self, so that we can just call
SetRootOffset() on each tuple without the if().
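Something along these lines, perhaps (just a sketch of the shape I mean, reusing
the names from the quoted hunk; the exact fallback value is an assumption):

	/* insert new tuple */
	RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);

	/* if we had no root to begin with, the new tuple is its own root */
	if (!OffsetNumberIsValid(root_offnum))
		root_offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);

	HeapTupleHeaderSetHeapLatest(heaptup->t_data);
	HeapTupleHeaderSetHeapLatest(newtup->t_data);
	HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
	HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);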
+	HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+	if (OffsetNumberIsValid(root_offnum))
+		HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+									 root_offnum);
+	else
+		HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+									 offnum);
Just a matter of style, but this reads nicer IMO:
HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
OffsetNumberIsValid(root_offnum) ? root_offnum : offnum);
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
  * holds a pin on the buffer. Once pin is released, a tuple might be pruned
  * and reused by a completely unrelated tuple.
  */
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+							  OffsetNumber *root_offsets)
 {
 	OffsetNumber offnum,
I think this function deserves more/better/updated commentary.
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
 			 * set the ctid of this tuple to point to the new location, and
 			 * insert it right away.
 			 */
-			new_tuple->t_data->t_ctid = mapping->new_tid;
+			HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+					ItemPointerGetBlockNumber(&mapping->new_tid),
+					ItemPointerGetOffsetNumber(&mapping->new_tid));
I think this would be nicer:
HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
AFAICS all the callers are doing ItemPointerGetFoo for a TID, so this is
overly verbose for no reason. Also, the "c" in Ctid stands for
"current"; I think we can omit that.
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
 		new_tuple = unresolved->tuple;
 		free_new = true;
 		old_tid = unresolved->old_tid;
-		new_tuple->t_data->t_ctid = new_tid;
+		HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+				ItemPointerGetBlockNumber(&new_tid),
+				ItemPointerGetOffsetNumber(&new_tid));
Did you forget to SetHeapLatest here, or ..? (If not, a comment is
warranted).
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 		 * As above, it should be safe to examine xmax and t_ctid without the
 		 * buffer content lock, because they can't be changing.
 		 */
-		if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+		if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
 		{
 			/* deleted, so forget about it */
 			ReleaseBuffer(buffer);
This is the place where this patch would have an effect. To test this
bit I think we're going to need an ad-hoc stress-test harness.
+/*
+ * If HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for
+ * clusters which are upgraded from pre-10.0 release, we still check if c_tid
+ * is pointing to itself and declare such tuple as the latest tuple in the
+ * chain
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+	((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+	((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+	 (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
Please add a "!= 0" to the first arm of the ||, so that we return a boolean.
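In other words, something like this (a sketch; only the != 0 differs from the
quoted macro):

#define HeapTupleHeaderIsHeapLatest(tup, tid) \
( \
	(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
	((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
	 (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
)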
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+	if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+	{ \
+		ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+					   (offnum)); \
+	} \
+	else \
+	{ \
+		ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+					   ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+	} \
+} while (0)
This is a really odd macro, I think. Is any of the callers really
depending on the traditional behavior? If so, can we change them to
avoid that? (I think the "else" can be more easily written with
ItemPointerCopy). In any case, I think the documentation of the macro
leaves a bit to be desired -- I don't think we really care all that much
what we used to do, except perhaps as a secondary comment, but we do
care very much about what it actually does, which the current comment
doesn't really explain.
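For instance, a simpler formulation might be (a rough sketch of the suggestion,
not the patch's final form):

#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
do { \
	if (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) \
		ItemPointerSet((next_ctid), \
					   ItemPointerGetBlockNumber(&(tup)->t_ctid), (offnum)); \
	else \
		ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
} while (0)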
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro,
On Tue, Jan 17, 2017 at 8:41 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Reading through the track_root_lp patch now.
Thanks for the review.
+	/*
+	 * For HOT (or WARM) updated tuples, we store the offset of the root
+	 * line pointer of this chain in the ip_posid field of the new tuple.
+	 * Usually this information will be available in the corresponding
+	 * field of the old tuple. But for aborted updates or pg_upgraded
+	 * databases, we might be seeing the old-style CTID chains and hence
+	 * the information must be obtained by hard way
+	 */
+	if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+		root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+	else
+		heap_get_root_tuple_one(page,
+				ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+				&root_offnum);
Hmm. So the HasRootOffset tests the HEAP_LATEST_TUPLE bit, which is
reset temporarily during an update. So that case shouldn't occur often.
Right. The root offset is stored only in those tuples where
HEAP_LATEST_TUPLE is set. This flag should generally be set on the tuples
that are being updated, except for the case when the last update failed and
the flag was cleared. The other common case is a pg-upgraded cluster where
none of the existing tuples will have this flag set. In those cases, we must
find the root line pointer the hard way.
Oh, I just noticed that HeapTupleHeaderSetNextCtid also clears the flag.
Yes, but this should happen only during updates and unless the update
fails, the next-to-be-updated tuple should have the flag set.
@@ -4166,10 +4189,29 @@ l2:
 		HeapTupleClearHotUpdated(&oldtup);
 		HeapTupleClearHeapOnly(heaptup);
 		HeapTupleClearHeapOnly(newtup);
+		root_offnum = InvalidOffsetNumber;
 	}

-	RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
+	/* insert new tuple */
+	RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);
+	HeapTupleHeaderSetHeapLatest(heaptup->t_data);
+	HeapTupleHeaderSetHeapLatest(newtup->t_data);
+
+	/*
+	 * Also update the in-memory copy with the root line pointer information
+	 */
+	if (OffsetNumberIsValid(root_offnum))
+	{
+		HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);
+		HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);
+	}
+	else
+	{
+		HeapTupleHeaderSetRootOffset(heaptup->t_data,
+				ItemPointerGetOffsetNumber(&heaptup->t_self));
+		HeapTupleHeaderSetRootOffset(newtup->t_data,
+				ItemPointerGetOffsetNumber(&heaptup->t_self));
+	}
This is repetitive. I think after RelationPutHeapTuple it'd be better
to assign root_offnum = &heaptup->t_self, so that we can just call
SetRootOffset() on each tuple without the if().
Fixed. I actually ripped out HeapTupleHeaderSetRootOffset() completely and
pushed setting of the root line pointer into HeapTupleHeaderSetHeapLatest().
That seems much cleaner because the system expects to find the root line
pointer whenever the HEAP_LATEST_TUPLE flag is set. Hence it makes sense to set
them together.
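Roughly like this (a sketch of the merged macro; the exact shape in the
attached patch may differ, but the offset still lands in the ip_posid field as
described above):

#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
do { \
	(tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
	ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
} while (0)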
+	HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);
+	if (OffsetNumberIsValid(root_offnum))
+		HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+									 root_offnum);
+	else
+		HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
+									 offnum);
Just a matter of style, but this reads nicer IMO:
HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
OffsetNumberIsValid(root_offnum) ? root_offnum : offnum);
Understood. This code no longer exists in the new patch since
HeapTupleHeaderSetRootOffset is merged with HeapTupleHeaderSetHeapLatest.
@@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,
  * holds a pin on the buffer. Once pin is released, a tuple might be pruned
  * and reused by a completely unrelated tuple.
  */
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+				OffsetNumber *root_offsets)
 {
 	OffsetNumber offnum,

I think this function deserves more/better/updated commentary.
Sure. I added more commentary. I also reworked the function so that the
caller can pass just one item array when it's interested in finding root
line pointer for just one item. Hopefully that will save a few bytes on the
stack.
@@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,
 		 * set the ctid of this tuple to point to the new location, and
 		 * insert it right away.
 		 */
-		new_tuple->t_data->t_ctid = mapping->new_tid;
+		HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+				ItemPointerGetBlockNumber(&mapping->new_tid),
+				ItemPointerGetOffsetNumber(&mapping->new_tid));
I think this would be nicer:
HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
AFAICS all the callers are doing ItemPointerGetFoo for a TID, so this is
overly verbose for no reason. Also, the "c" in Ctid stands for
"current"; I think we can omit that.
Yes, fixed. I realised that all callers were anyway calling the macro with
the block/offset of the same TID, so it makes sense to just pass the TID to
the macro.
@@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,
 			new_tuple = unresolved->tuple;
 			free_new = true;
 			old_tid = unresolved->old_tid;
-			new_tuple->t_data->t_ctid = new_tid;
+			HeapTupleHeaderSetNextCtid(new_tuple->t_data,
+					ItemPointerGetBlockNumber(&new_tid),
+					ItemPointerGetOffsetNumber(&new_tid));
Did you forget to SetHeapLatest here, or ..? (If not, a comment is
warranted).
Umm probably not. The way I see it, new_tuple is not actually the new tuple
when this is called, but it's changed to the unresolved tuple (see the
start of the hunk). So what we're doing is setting next CTID in the
previous tuple in the chain. SetHeapLatest is called on the new tuple
inside raw_heap_insert(). I did not add any more comments, but please let
me know if you think it's still confusing or if I'm missing something.
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..466609c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 			 * As above, it should be safe to examine xmax and t_ctid without the
 			 * buffer content lock, because they can't be changing.
 			 */
-			if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+			if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))
 			{
 				/* deleted, so forget about it */
 				ReleaseBuffer(buffer);

This is the place where this patch would have an effect. To test this
bit I think we're going to need an ad-hoc stress-test harness.
Sure. I did some pgbench tests and ran consistency checks during and at the
end of the tests. I chose a small scale factor and many clients so that the
same tuple is often updated concurrently. That should exercise the new
chain-following code rigorously. But I'll do more of those on a bigger box.
Do you have other suggestions for ad-hoc tests?
+/*
+ * If HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for
+ * clusters which are upgraded from pre-10.0 release, we still check if c_tid
+ * is pointing to itself and declare such tuple as the latest tuple in the
+ * chain
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+	((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \
+	((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
+	 (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
+)
Please add a "!= 0" to the first arm of the ||, so that we return a
boolean.
Done. Also rebased with new master where similar changes have been done.
+/*
+ * Get TID of next tuple in the update chain. Traditionally, we have stored
+ * self TID in the t_ctid field if the tuple is the last tuple in the chain. We
+ * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE
+ * flag is set.
+ */
+#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \
+do { \
+	if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \
+	{ \
+		ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+				(offnum)); \
+	} \
+	else \
+	{ \
+		ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \
+				ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \
+	} \
+} while (0)

This is a really odd macro, I think. Is any of the callers really
depending on the traditional behavior? If so, can we change them to
avoid that? (I think the "else" can be more easily written with
ItemPointerCopy). In any case, I think the documentation of the macro
leaves a bit to be desired -- I don't think we really care all that much
what we used to do, except perhaps as a secondary comment, but we do
care very much about what it actually does, which the current comment
doesn't really explain.
I reworked this quite a bit and I believe the new code does what you
suggested. The HeapTupleHeaderGetNextTid macro is now much simpler (it
just copies the TID) and we leave it to the caller to ensure they don't
call this on a tuple which is already at the end of the chain (i.e. it has
HEAP_LATEST_TUPLE set; we don't look for old-style end-of-the-chain
markers). The callers can choose to return the same TID back if their
callers rely on that behaviour. But inside this macro, we now assert that
HEAP_LATEST_TUPLE is not set.
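So a typical caller that still wants the old "self TID at the end of the
chain" behaviour now looks like this (the same pattern as the heap_delete
hunk in the attached patch):

	if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
		HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
	else
		ItemPointerCopy(&tp.t_self, &hufd->ctid);	/* end of chain: return self */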
One thing that worried me is if there exists a path which sets the
t_infomask (and hence HEAP_LATEST_TUPLE) during redo recovery and if we
will fail to set the root line pointer correctly along with that. But
AFAICS the interesting cases of insert, multi-insert and update are being
handled ok. The only other places where I saw t_infomask being copied as-is
from the WAL record are DecodeXLogTuple() and DecodeMultiInsert(), but those
should not cause any problem AFAICS.
Revised patch is attached. All regression tests, isolation tests and
pgbench test with -c40 -j10 pass on my laptop.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001_track_root_lp_v9.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91c13d4..8e57bae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2373,6 +2374,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2411,8 +2413,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
+ root_offnum = InvalidOffsetNumber;
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ &root_offnum);
+
+ /* We must not overwrite the speculative insertion token */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2640,6 +2648,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2710,7 +2719,13 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = InvalidOffsetNumber;
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ &root_offnum);
+
+ /* Mark this tuple as the latest and also set root offset */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2718,7 +2733,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = InvalidOffsetNumber;
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ &root_offnum);
+ /* Mark each tuple as the latest and also set root offset */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2990,6 +3009,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3000,6 +3020,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3041,7 +3062,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3171,7 +3193,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3220,6 +3252,23 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+ * heap_get_root_tuple_one() may call palloc, which is disallowed once we
+ * enter the critical section. So check if the root offset is cached in the
+ * tuple and if not, fetch that information hard way before entering the
+ * critical section
+ *
+ * Most often and unless we are dealing with a pg-upgraded cluster, the
+ * root offset information should be cached. So there should not be too
+ * much overhead of fetching this information. Also, once a tuple is
+ * updated, the information will be copied to the new version. So it's not
+ * as if we're going to pay this price forever
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&tp.t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/*
@@ -3247,8 +3296,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3449,6 +3500,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3511,6 +3564,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3795,7 +3849,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3935,6 +3994,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3962,6 +4022,15 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3976,7 +4045,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4134,6 +4204,11 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4159,6 +4234,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained by hard way (we should have done
+ * that before entering the critical section above)
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4166,10 +4252,21 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, &root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, then we just copy the information from
+ * old tuple, if available or computed above. For regular updates,
+ * RelationPutHeapTuple must have returned us the actual offset number
+ * where the new version was inserted and we store the same value since the
+ * update resulted in a new HOT-chain
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4182,7 +4279,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4221,6 +4318,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4501,7 +4599,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4510,9 +4609,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4532,6 +4633,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4559,7 +4661,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -4997,7 +5103,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5045,6 +5156,11 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/*
@@ -5073,7 +5189,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5587,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5595,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5824,7 +5946,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5833,7 +5955,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5950,7 +6072,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6076,8 +6198,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7425,6 +7546,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7545,6 +7667,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8199,7 +8324,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ heap_get_root_tuple_one(page, xlrec->offnum, &root_offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8289,7 +8420,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8424,8 +8556,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8561,7 +8693,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8694,13 +8826,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8763,6 +8899,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8826,11 +8965,18 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ heap_get_root_tuple_one(page,
+ offnum, &root_offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..14ed263 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once its
+ * known. The former is used while updating an existing tuple while latter is
+ * used during insertion of a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber *root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,16 +66,21 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead)
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(*root_offnum))
+ *root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, *root_offnum);
}
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..2406e77 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple to be the last tuple
+ * in the chain, without clearing the HOT-updated flag. So we must
+ * check if this is the last tuple in the chain and stop following the
+ * CTID, else we risk getting into an infinite recursion (though
+ * prstate->marked[] currently protects against that)
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,48 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is a InvalidOffsetNumber then the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line
+ * pointers and tuples on the page in order to find the root line pointers. To
+ * minimize the cost, we break early if target_offnum is specified and root line
+ * pointer to target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +808,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +870,64 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple to be the last tuple in the chain
+ * and store root offset in CTID, without clearing the HOT-updated
+ * flag. So we must check if CTID is actually root offset and break
+ * to avoid infinite recursion
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ return heap_get_root_tuples_internal(page, target_offnum, root_offnum);
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 90ab6f2..5f64ca6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 8d119f6..9920f48 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ff277d3..9182fa7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2563,7 +2563,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2571,7 +2571,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ee7e05a..22507dc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 52f28b8..a4a1fe1 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 2824f23..8752f69 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber *root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index fae955e..11bd1c8 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,32 @@ do { \
(tup)->t_infomask2 & HEAP_ONLY_TUPLE \
)
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for
+ * clusters which are upgraded from a pre-10.0 release, we still check if t_ctid
+ * is pointing to itself and declare such tuple as the latest tuple in the
+ * chain
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +574,45 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller should have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain whose member this
+ * tuple is.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's t_ctid field
+ * contains the root line pointer. We can't use the same
+ * HeapTupleHeaderIsHeapLatest macro because that also checks for TID-equality
+ * to decide whether a tuple is at the end of the chain
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
On Thu, Jan 19, 2017 at 6:35 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Revised patch is attached.
I've now also rebased the main WARM patch against the current master
(3eaf03b5d331b7a06d79 to be precise). I'm attaching Alvaro's patch to get
interesting attributes (prefixed with 0000 since the other two patches are
based on that). The changes to support system tables are now merged with
the main patch. I could separate them if it helps in review.
I am also including a stress test workload that I am currently running to
test WARM's correctness since Robert raised a valid concern about that. The
idea is to include a few more columns in the pgbench_accounts table and
have a few more indexes. The additional columns with indexes kind of share
a relationship with the "aid" column. But instead of a fixed value, values
for these columns can vary within a fixed, non-overlapping range. For
example, for aid = 1, aid1's original value will be 10 and it can vary
between 8 to 12. Similarly, aid2's original value will be 20 and it can
vary between 16 to 24. This setup allows us to update these additional
columns (thus force WARM), but still ensure that we can do some sanity
checks on the results.
The test contains a bunch of UPDATE, FOR UPDATE, FOR SHARE transactions.
Some of these transactions commit and some roll back. Checks are in place
to ensure that we always find exactly one tuple irrespective of
which column we use to fetch the row. Of course, when the aid[1-4] columns
are used to fetch tuples, we need to scan with a range instead of an
equality. Then we do a bunch of operations like CREATE INDEX, DROP INDEX,
CIC, run long transactions, VACUUM FULL etc while the tests are running and
ensure that the sanity checks always pass. We could do a few other things
like, may be marking these indexes as UNIQUE or keeping a long transaction
open while doing updates and other operations. I'll add some of those to
the test, but suggestions are welcome.
I do see a problem with CREATE INDEX CONCURRENTLY with these tests, though
everything else has run ok so far (I am yet to do very long-running tests;
probably just a few hours of tests today).
I'm trying to understand why CIC fails to build a consistent index. I think
I now have some clue about why it could be happening. With HOT, we don't need to
worry about broken chains since at the very beginning we add the index
tuple and all subsequent updates will honour the new index while deciding
on HOT updates i.e. we won't create any new broken HOT chains once we start
building the index. Later during validation phase, we only need to insert
tuples that are not already in the index. But with WARM, I think the check
needs to be more elaborate. So even if the TID (we always look at its root
line pointer etc) exists in the index, we will need to ensure that the
index key matches the heap tuple we are dealing with. That looks a bit
tricky. Maybe we can look up the index using the key from the current heap
tuple and then see if we get a tuple with the same TID back. Of course, we
need to do this only if the tuple is a WARM tuple. The other option is that
we collect not only TIDs but also keys while scanning the index. That might
increase the size of the state information for wildly wide indexes. Or maybe
just turn WARM off if there exists a build-in-progress index.
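To show the shape of the recheck I have in mind for the validation phase,
here is a rough, hypothetical sketch. None of this is in the attached
patches; HeapTupleIsWarmTuple() and index_probe_key_for_tid() are made-up
names for the heap-tuple test and the index probe respectively:

	/*
	 * Hypothetical sketch only. If the root TID of a WARM tuple is already
	 * present in the index, we still have to confirm that an index entry
	 * with *this* tuple's key points at that TID; otherwise the validation
	 * phase must insert a new entry for it.
	 */
	if (HeapTupleIsWarmTuple(heapTuple) &&		/* made-up helper */
		tid_already_in_index)					/* from the collected TID set */
	{
		FormIndexDatum(indexInfo, slot, estate, values, isnull);

		/* index_probe_key_for_tid() is a made-up name for the probe */
		if (!index_probe_key_for_tid(indexRelation, values, isnull, &rootTid))
			index_insert(indexRelation, values, isnull, &rootTid,
						 heapRelation, UNIQUE_CHECK_NO);
	}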
Suggestions/reviews/tests welcome.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0000_interesting_attrs.patch (application/octet-stream)
commit 4e8623eadc6adbc31143ba1a774ef2db533fc7d2
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sun Jan 1 16:29:10 2017 +0530
Alvaro's patch on interesting attrs
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1ce42ea..91c13d4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -95,11 +95,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3443,6 +3440,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3460,9 +3459,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3489,21 +3485,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3524,7 +3529,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3550,6 +3555,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3561,10 +3570,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3803,6 +3809,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4107,7 +4115,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4122,7 +4130,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4270,13 +4280,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4310,7 +4322,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4355,114 +4367,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
- *
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
-
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
+ int attnum;
+ Bitmapset *modified = NULL;
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
-
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
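To make the new interface concrete, here is a rough sketch (not verbatim from
the patch; variable names are illustrative) of how heap_update can consume the
bitmapset returned by HeapDetermineModifiedColumns:

    /* Sketch only: illustrative use of HeapDetermineModifiedColumns */
    Bitmapset  *interesting_cols = NULL;
    Bitmapset  *modified_attrs;
    bool        use_hot_update;
    bool        key_intact;
    bool        id_has_changed;

    /* Union of all columns used by indexes, the key and the replica identity */
    interesting_cols = bms_add_members(interesting_cols, hot_attrs);
    interesting_cols = bms_add_members(interesting_cols, key_attrs);
    interesting_cols = bms_add_members(interesting_cols, id_attrs);

    /* One pass over the old/new tuple pair; interesting_cols is consumed here */
    modified_attrs = HeapDetermineModifiedColumns(relation, interesting_cols,
                                                  &oldtup, newtup);

    /* HOT is possible only if no indexed column changed (and the page has room) */
    use_hot_update = !bms_overlap(modified_attrs, hot_attrs);
    /* The key and replica-identity decisions follow the same pattern */
    key_intact = !bms_overlap(modified_attrs, key_attrs);
    id_has_changed = bms_overlap(modified_attrs, id_attrs);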
Attachment: 0001_track_root_lp_v9.patch (application/octet-stream)
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index 91c13d4..8e57bae 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2373,6 +2374,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2411,8 +2413,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
+ root_offnum = InvalidOffsetNumber;
RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ &root_offnum);
+
+ /* We must not overwrite the speculative insertion token */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2640,6 +2648,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2710,7 +2719,13 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = InvalidOffsetNumber;
+ RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ &root_offnum);
+
+ /* Mark this tuple as the latest and also set root offset */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2718,7 +2733,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = InvalidOffsetNumber;
+ RelationPutHeapTuple(relation, buffer, heaptup, false,
+ &root_offnum);
+ /* Mark each tuple as the latest and also set root offset */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -2990,6 +3009,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3000,6 +3020,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3041,7 +3062,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3171,7 +3193,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3220,6 +3252,23 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+ * heap_get_root_tuple_one() may call palloc, which is disallowed once we
+ * enter the critical section. So check if the root offset is cached in the
+ * tuple and, if not, fetch that information the hard way before entering
+ * the critical section.
+ *
+ * Unless we are dealing with a pg_upgraded cluster, the root offset
+ * information should usually be cached, so fetching it adds little
+ * overhead. Also, once a tuple is updated, the information is copied to
+ * the new version, so we don't pay this price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&tp.t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/*
@@ -3247,8 +3296,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3449,6 +3500,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3511,6 +3564,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3795,7 +3849,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3935,6 +3994,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3962,6 +4022,15 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3976,7 +4045,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4134,6 +4204,11 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4159,6 +4234,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing old-style CTID chains, and hence
+ * the information must be obtained the hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4166,10 +4252,21 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ RelationPutHeapTuple(relation, newbuf, heaptup, false, &root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, we just copy the information from the
+ * old tuple (if available) or use the value computed above. For regular
+ * updates, RelationPutHeapTuple must have returned the actual offset
+ * number where the new version was inserted, and we store that value
+ * since the update started a new HOT chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4182,7 +4279,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4221,6 +4318,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4501,7 +4599,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4510,9 +4609,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4532,6 +4633,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4559,7 +4661,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -4997,7 +5103,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5045,6 +5156,11 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self),
+ &root_offnum);
+
START_CRIT_SECTION();
/*
@@ -5073,7 +5189,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5587,6 +5706,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5595,6 +5715,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5824,7 +5946,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5833,7 +5955,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5950,7 +6072,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6076,8 +6198,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7425,6 +7546,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7545,6 +7667,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8199,7 +8324,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ heap_get_root_tuple_one(page, xlrec->offnum, &root_offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8289,7 +8420,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8424,8 +8556,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8561,7 +8693,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8694,13 +8826,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8763,6 +8899,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8826,11 +8965,18 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ heap_get_root_tuple_one(page,
+ offnum, &root_offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git b/src/backend/access/heap/hio.c a/src/backend/access/heap/hio.c
index 6529fe3..14ed263 100644
--- b/src/backend/access/heap/hio.c
+++ a/src/backend/access/heap/hio.c
@@ -31,12 +31,18 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it
+ * is known. The former is used while updating an existing tuple, the latter
+ * while inserting a new row.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber *root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,16 +66,21 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead)
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(*root_offnum))
+ *root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, *root_offnum);
}
}
diff --git b/src/backend/access/heap/pruneheap.c a/src/backend/access/heap/pruneheap.c
index d69a266..2406e77 100644
--- b/src/backend/access/heap/pruneheap.c
+++ a/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple as the last tuple
+ * in the chain without clearing the HOT-updated flag. So we must
+ * check whether this is the last tuple in the chain and stop following
+ * the CTID, else we risk getting into an infinite recursion (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,48 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is InvalidOffsetNumber, the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line
+ * pointers and tuples on the page in order to find the root line pointers.
+ * To minimize the cost, we break early if target_offnum is specified and the
+ * root line pointer for target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +808,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember the mapping for any other item. The
+ * root_offsets array may not even have room for them, so be
+ * careful not to write past the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +870,64 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember the mapping for any other item. The
+ * root_offsets array may not even have room for them, so be
+ * careful not to write past the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple as the last tuple in the chain
+ * and store the root offset in its CTID, without clearing the
+ * HOT-updated flag. So we must check whether the CTID actually holds
+ * the root offset and break to avoid infinite recursion.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple
+ */
+void
+heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum)
+{
+ return heap_get_root_tuples_internal(page, target_offnum, root_offnum);
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ return heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git b/src/backend/access/heap/rewriteheap.c a/src/backend/access/heap/rewriteheap.c
index 90ab6f2..5f64ca6 100644
--- b/src/backend/access/heap/rewriteheap.c
+++ a/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that the block number is copied correctly, but
+ * then immediately mark the tuple as the latest
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git b/src/backend/executor/execIndexing.c a/src/backend/executor/execIndexing.c
index 8d119f6..9920f48 100644
--- b/src/backend/executor/execIndexing.c
+++ a/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git b/src/backend/executor/execMain.c a/src/backend/executor/execMain.c
index 0bc146c..c38e290 100644
--- b/src/backend/executor/execMain.c
+++ a/src/backend/executor/execMain.c
@@ -2589,7 +2589,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2597,7 +2597,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git b/src/include/access/heapam.h a/src/include/access/heapam.h
index ee7e05a..22507dc 100644
--- b/src/include/access/heapam.h
+++ a/src/include/access/heapam.h
@@ -188,6 +188,8 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern void heap_get_root_tuple_one(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git b/src/include/access/heapam_xlog.h a/src/include/access/heapam_xlog.h
index 52f28b8..a4a1fe1 100644
--- b/src/include/access/heapam_xlog.h
+++ a/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git b/src/include/access/hio.h a/src/include/access/hio.h
index 2824f23..8752f69 100644
--- b/src/include/access/hio.h
+++ a/src/include/access/hio.h
@@ -36,7 +36,7 @@ typedef struct BulkInsertStateData
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+ HeapTuple tuple, bool token, OffsetNumber *root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git b/src/include/access/htup_details.h a/src/include/access/htup_details.h
index a6c7e31..fff1832 100644
--- b/src/include/access/htup_details.h
+++ a/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bits 0x0800 are available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in the chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,32 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for
+ * clusters which were upgraded from a pre-10.0 release, we also check if
+ * t_ctid points to the tuple itself and declare such a tuple to be the latest
+ * tuple in the chain.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +574,45 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * probably have a new tuple in the chain
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller should have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain whose member this
+ * tuple is.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's t_ctid field
+ * contains the root line pointer. We can't use the same
+ * HeapTupleHeaderIsHeapLatest macro because that also checks for TID-equality
+ * to decide whether a tuple is at the end of the chain.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
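To illustrate how these macros fit together, here is a minimal chain-walking
sketch (not part of the patch; htup, tuple, next_tid and root_offnum are
placeholders for whatever the caller already has at hand):

    /* Sketch: walking an update chain with the new t_ctid conventions */
    for (;;)
    {
        if (HeapTupleHeaderIsHeapLatest(htup, &tuple.t_self))
        {
            /*
             * End of the chain.  If the HEAP_LATEST_TUPLE flag (rather than
             * the old-style TID self-reference) told us so, the offset part
             * of t_ctid holds the root line pointer.
             */
            if (HeapTupleHeaderHasRootOffset(htup))
                root_offnum = HeapTupleHeaderGetRootOffset(htup);
            break;
        }

        /* Not the last tuple, so t_ctid is a real forward link */
        HeapTupleHeaderGetNextTid(htup, &next_tid);

        /* ... fetch the tuple at next_tid, refresh htup/tuple, and loop ... */
    }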
Attachment: 0002_warm_updates_v9.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 06077af..4ab30d6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = blendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d60ddd2..3785045 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = brinendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 597056a..d4634af 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -89,6 +89,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = gistendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a64a9b9..40fede5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,6 +86,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amendscan = hashendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -266,6 +267,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -303,8 +306,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index a59ad6f..46a334c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -408,6 +410,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
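For readers tracing the new amrecheck hook: a hypothetical call site, not shown
in this excerpt, would look roughly like the following (field names are from
IndexScanDesc; where exactly the recheck happens is up to the rest of the
patch):

    /* Hypothetical consumer of the amrecheck callback (sketch only) */
    if (recheck && scan->xs_itup != NULL)
    {
        IndexAmRoutine *amroutine = scan->indexRelation->rd_amroutine;

        if (amroutine->amrecheck != NULL &&
            !amroutine->amrecheck(scan->indexRelation, scan->xs_itup,
                                  scan->heapRelation, &scan->xs_ctup))
            return false;   /* stale WARM pointer; skip this heap tuple */
    }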
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..f793570
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,271 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block have enough
+space to store the new version of the tuple. This is the same
+requirement as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index
+for the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, consider a table with two columns and an index on each
+column. When a tuple is first inserted into the table, exactly one
+index entry points to the tuple from each index.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple that does not match the index key may
+be returned via a stale index entry. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for an index key match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, and hence the recheck
+routine for the hash AM must first compute the hash value of the heap
+attribute and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't support a recheck
+routine, WARM updates are disabled on that table.
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index entries with the same key pointing to the same WARM chain.
+Otherwise, the same valid tuple will be reachable via multiple index
+entries, each of which satisfies
+the index key checks. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not allow WARM
+updates to a tuple from a WARM chain. HOT updates are fine because they
+do not add a new index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular updates is roughly cut in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value differs only in case.
+So we cannot rely solely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not a new index entry is needed when
+columns referred to in the predicate expression change.
+
+(None of this is currently implemented; we simply disallow a WARM
+update if a column used in an expression index or index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of
+the tuple being updated. The t_ctid field in the heap tuple header is
+usually used to find the next tuple in the update chain. But the tuple
+that we are updating must be the last tuple in the update chain, and in
+that case the t_ctid field usually points to the tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple that remains the last valid tuple in the
+chain no longer carries the root line pointer information. In such
+rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for the index key match (the case where the old tuple is
+reached via the new index key). So we must follow the update chain every
+time to the end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But this also implies that the benefit of WARM will be
+no more than 50%, which is still significant. If we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index or conclusively
+prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between the two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with the Blue flag will be reachable from the Blue pointer and one with
+the Red flag from the Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from the
+Blue pointer (there is no Red pointer in such indexes). So, as a side
+note, matching Red and Blue flags alone is not enough from an index scan
+perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only a
+Blue pointer and that must not be removed. In other words, we should
+remove a Blue pointer only if a Red pointer exists. Since index vacuum
+may visit Red and Blue pointers in any order, I think we will need
+another index pass to remove dead index pointers. So in the first index
+pass we check which WARM candidates have two index pointers. In the
+second pass, we remove the dead pointer and reset the Red flag if the
+surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and also resetting the Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for the case with multiple Blue
+pointers. We can either leave these WARM chains alone and let them die
+with a subsequent non-WARM update, or apply the heap-recheck logic
+during index vacuum to find the dead pointer. Given that vacuum aborts
+are not common, I am inclined to leave this case unhandled. We must
+still check for the presence of multiple Blue pointers and ensure that
+we neither accidentally remove either of the Blue pointers nor clear
+such WARM chains.
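Putting the rules above together, the decision made at update time boils down
to roughly the following (a sketch of the intended control flow only;
page_has_room and has_unsupported_index are illustrative placeholders, not
symbols from the patch):

    /* Sketch of the update-time decision; not lifted from the patch */
    if (!bms_overlap(modified_attrs, hot_attrs))
    {
        /* No indexed column changed: plain HOT update, if the page has room */
        use_hot_update = page_has_room;
    }
    else if (page_has_room &&
             !HeapTupleIsHeapWarmTuple(&oldtup) &&   /* one WARM update per chain */
             !has_unsupported_index)                 /* expression/partial/no-recheck */
    {
        /* Insert new entries only for the indexes whose key actually changed */
        use_warm_update = true;
    }
    else
    {
        /* Regular non-HOT update: new index entries in every index */
    }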
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8e57bae..015aef1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1957,6 +1957,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update was ultimately aborted, we must still do
+ * a recheck because the failing UPDATE may have created index entries
+ * which are now stale, but still reference this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage, and the caller must recheck any tuple returned from this
+ * chain for index key satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * The chain can't continue past this tuple if its t_ctid stores the
+ * root line pointer offset instead of a link to the next tuple
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1976,11 +2048,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2034,9 +2109,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple, in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2049,6 +2127,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2097,7 +2185,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2121,18 +2210,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has asked for the "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases.
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller-supplied tid to the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3484,13 +3596,15 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
ItemId lp;
@@ -3513,6 +3627,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3537,6 +3652,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3558,10 +3677,13 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3613,6 +3735,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3868,6 +3993,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4187,6 +4313,35 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm-updated tuples since, if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good enough for a first
+ * round of tests of the remaining functionality
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4234,6 +4389,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID> in
+ * indexes, we could look at relaxing this restriction and allow even
+ * more WARM updates
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4246,12 +4417,36 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ heap_get_root_tuple_one(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)),
+ &root_offnum);
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4360,7 +4555,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4500,7 +4698,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ bool *warm_update, Bitmapset **modified_attrs)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4509,7 +4708,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs,
+ warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7562,6 +7762,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7573,6 +7774,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7646,6 +7850,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8623,16 +8829,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8692,6 +8904,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8827,6 +9044,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..de559fd 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -71,10 +71,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -228,6 +230,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that the index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported,
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if that function was called earlier in the session by the
+ * time we get here, and we can't call it now because of the risk of
+ * causing a deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -409,7 +426,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -448,7 +465,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -475,6 +492,15 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple
+ */
+ if (!scan->xs_recheck)
+ scan->xs_tuple_recheck = false;
+ else
+ scan->xs_tuple_recheck = true;
+
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -484,32 +510,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks
+ * since WARM must have been disabled on such tables
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 883d70d..6efccf7 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acf..cb5a796 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -118,6 +119,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
amroutine->amrestrpos = btrestrpos;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -298,8 +300,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..9becaeb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index ca4b0bd..b0d2952 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amendscan = spgendscan;
amroutine->ammarkpos = NULL;
amroutine->amrestrpos = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index a96bf69..eed5b0b 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1230,6 +1230,9 @@ SetDefaultACL(InternalDefaultACL *iacls)
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Prepare to insert or update pg_default_acl entry */
MemSet(values, 0, sizeof(values));
MemSet(nulls, false, sizeof(nulls));
@@ -1245,6 +1248,8 @@ SetDefaultACL(InternalDefaultACL *iacls)
newtuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
simple_heap_insert(rel, newtuple);
+ warm_update = false;
+ modified_attrs = NULL;
}
else
{
@@ -1254,11 +1259,12 @@ SetDefaultACL(InternalDefaultACL *iacls)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
values, nulls, replaces);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
+ simple_heap_update(rel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
}
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
/* these dependencies don't change in an update */
if (isNew)
@@ -1686,13 +1692,17 @@ ExecGrant_Attribute(InternalGrant *istmt, Oid relOid, const char *relname,
if (need_update)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
newtuple = heap_modify_tuple(attr_tuple, RelationGetDescr(attRelation),
values, nulls, replaces);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
+ simple_heap_update(attRelation, &newtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(attRelation, newtuple);
+ CatalogUpdateIndexes(attRelation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, attnum,
@@ -1899,6 +1909,8 @@ ExecGrant_Relation(InternalGrant *istmt)
int nnewmembers;
Oid *newmembers;
AclObjectKind aclkind;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Determine ID to do the grant as, and available grant options */
select_best_grantor(GetUserId(), this_privileges,
@@ -1955,10 +1967,12 @@ ExecGrant_Relation(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update,
+ modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, 0, new_acl);
@@ -2079,6 +2093,8 @@ ExecGrant_Database(InternalGrant *istmt)
Oid *oldmembers;
Oid *newmembers;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(datId));
if (!HeapTupleIsValid(tuple))
@@ -2148,10 +2164,11 @@ ExecGrant_Database(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update the shared dependency ACL info */
updateAclDependencies(DatabaseRelationId, HeapTupleGetOid(tuple), 0,
@@ -2202,6 +2219,8 @@ ExecGrant_Fdw(InternalGrant *istmt)
int nnewmembers;
Oid *oldmembers;
Oid *newmembers;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(FOREIGNDATAWRAPPEROID,
ObjectIdGetDatum(fdwid));
@@ -2273,10 +2292,11 @@ ExecGrant_Fdw(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(fdwid, ForeignDataWrapperRelationId, 0,
@@ -2332,6 +2352,8 @@ ExecGrant_ForeignServer(InternalGrant *istmt)
int nnewmembers;
Oid *oldmembers;
Oid *newmembers;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvid));
if (!HeapTupleIsValid(tuple))
@@ -2402,10 +2424,11 @@ ExecGrant_ForeignServer(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(srvid, ForeignServerRelationId, 0, new_acl);
@@ -2460,6 +2483,8 @@ ExecGrant_Function(InternalGrant *istmt)
int nnewmembers;
Oid *oldmembers;
Oid *newmembers;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcId));
if (!HeapTupleIsValid(tuple))
@@ -2529,10 +2554,11 @@ ExecGrant_Function(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(funcId, ProcedureRelationId, 0, new_acl);
@@ -2586,6 +2612,8 @@ ExecGrant_Language(InternalGrant *istmt)
int nnewmembers;
Oid *oldmembers;
Oid *newmembers;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(LANGOID, ObjectIdGetDatum(langId));
if (!HeapTupleIsValid(tuple))
@@ -2663,10 +2691,11 @@ ExecGrant_Language(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(langId, LanguageRelationId, 0, new_acl);
@@ -2724,6 +2753,8 @@ ExecGrant_Largeobject(InternalGrant *istmt)
ScanKeyData entry[1];
SysScanDesc scan;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* There's no syscache for pg_largeobject_metadata */
ScanKeyInit(&entry[0],
@@ -2805,10 +2836,11 @@ ExecGrant_Largeobject(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(loid, LargeObjectRelationId, 0, new_acl);
@@ -2863,6 +2895,8 @@ ExecGrant_Namespace(InternalGrant *istmt)
int nnewmembers;
Oid *oldmembers;
Oid *newmembers;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tuple = SearchSysCache1(NAMESPACEOID, ObjectIdGetDatum(nspid));
if (!HeapTupleIsValid(tuple))
@@ -2933,10 +2967,11 @@ ExecGrant_Namespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(nspid, NamespaceRelationId, 0, new_acl);
@@ -2990,6 +3025,9 @@ ExecGrant_Tablespace(InternalGrant *istmt)
Oid *oldmembers;
Oid *newmembers;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Search syscache for pg_tablespace */
tuple = SearchSysCache1(TABLESPACEOID, ObjectIdGetDatum(tblId));
@@ -3060,10 +3098,11 @@ ExecGrant_Tablespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update the shared dependency ACL info */
updateAclDependencies(TableSpaceRelationId, tblId, 0,
@@ -3113,6 +3152,8 @@ ExecGrant_Type(InternalGrant *istmt)
Oid *oldmembers;
Oid *newmembers;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Search syscache for pg_type */
tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typId));
@@ -3197,10 +3238,11 @@ ExecGrant_Type(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
/* Update initial privileges for extensions */
recordExtensionInitPriv(typId, TypeRelationId, 0, new_acl);
@@ -5354,6 +5396,9 @@ recordExtensionInitPriv(Oid objoid, Oid classoid, int objsubid, Acl *new_acl)
/* If we have a new ACL to set, then update the row with it. */
if (new_acl)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
MemSet(values, 0, sizeof(values));
MemSet(nulls, false, sizeof(nulls));
MemSet(replace, false, sizeof(replace));
@@ -5364,10 +5409,12 @@ recordExtensionInitPriv(Oid objoid, Oid classoid, int objsubid, Acl *new_acl)
oldtuple = heap_modify_tuple(oldtuple, RelationGetDescr(relation),
values, nulls, replace);
- simple_heap_update(relation, &oldtuple->t_self, oldtuple);
+ simple_heap_update(relation, &oldtuple->t_self, oldtuple,
+ &warm_update, &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, oldtuple);
+ CatalogUpdateIndexes(relation, oldtuple, warm_update,
+ modified_attrs);
}
else
/* new_acl is NULL, so delete the entry we found. */
@@ -5396,7 +5443,7 @@ recordExtensionInitPriv(Oid objoid, Oid classoid, int objsubid, Acl *new_acl)
simple_heap_insert(relation, tuple);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, tuple);
+ CatalogUpdateIndexes(relation, tuple, false, NULL);
}
systable_endscan(scan);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index bfc54a8..8a8cdee 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -644,9 +644,9 @@ InsertPgAttributeTuple(Relation pg_attribute_rel,
simple_heap_insert(pg_attribute_rel, tup);
if (indstate != NULL)
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, false, NULL);
else
- CatalogUpdateIndexes(pg_attribute_rel, tup);
+ CatalogUpdateIndexes(pg_attribute_rel, tup, false, NULL);
heap_freetuple(tup);
}
@@ -837,7 +837,7 @@ InsertPgClassTuple(Relation pg_class_desc,
/* finally insert the new tuple, update the indexes, and clean up */
simple_heap_insert(pg_class_desc, tup);
- CatalogUpdateIndexes(pg_class_desc, tup);
+ CatalogUpdateIndexes(pg_class_desc, tup, false, NULL);
heap_freetuple(tup);
}
@@ -1581,6 +1581,9 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Dropping user attributes is lots harder */
/* Mark the attribute as dropped */
@@ -1610,10 +1613,11 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
"........pg.dropped.%d........", attnum);
namestrcpy(&(attStruct->attname), newattname);
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update, modified_attrs);
}
/*
@@ -1701,6 +1705,8 @@ RemoveAttrDefaultById(Oid attrdefId)
HeapTuple tuple;
Oid myrelid;
AttrNumber myattnum;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Grab an appropriate lock on the pg_attrdef relation */
attrdef_rel = heap_open(AttrDefaultRelationId, RowExclusiveLock);
@@ -1742,10 +1748,11 @@ RemoveAttrDefaultById(Oid attrdefId)
((Form_pg_attribute) GETSTRUCT(tuple))->atthasdef = false;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update, modified_attrs);
/*
* Our update of the pg_attribute row will force a relcache rebuild, so
@@ -1945,7 +1952,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
tuple = heap_form_tuple(adrel->rd_att, values, nulls);
attrdefOid = simple_heap_insert(adrel, tuple);
- CatalogUpdateIndexes(adrel, tuple);
+ CatalogUpdateIndexes(adrel, tuple, false, NULL);
defobject.classId = AttrDefaultRelationId;
defobject.objectId = attrdefOid;
@@ -1974,10 +1981,14 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
if (!attStruct->atthasdef)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
attStruct->atthasdef = true;
- simple_heap_update(attrrel, &atttup->t_self, atttup);
+ simple_heap_update(attrrel, &atttup->t_self, atttup, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(attrrel, atttup);
+ CatalogUpdateIndexes(attrrel, atttup, warm_update, modified_attrs);
}
heap_close(attrrel, RowExclusiveLock);
heap_freetuple(atttup);
@@ -2479,6 +2490,9 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
if (con->conrelid == RelationGetRelid(rel))
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Found it. Conflicts if not identical check constraint */
if (con->contype == CONSTRAINT_CHECK)
{
@@ -2572,8 +2586,9 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
Assert(is_local);
con->connoinherit = true;
}
- simple_heap_update(conDesc, &tup->t_self, tup);
- CatalogUpdateIndexes(conDesc, tup);
+ simple_heap_update(conDesc, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(conDesc, tup, warm_update, modified_attrs);
break;
}
}
@@ -2611,12 +2626,16 @@ SetRelationNumChecks(Relation rel, int numchecks)
if (relStruct->relchecks != numchecks)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
relStruct->relchecks = numchecks;
- simple_heap_update(relrel, &reltup->t_self, reltup);
+ simple_heap_update(relrel, &reltup->t_self, reltup, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(relrel, reltup);
+ CatalogUpdateIndexes(relrel, reltup, warm_update, modified_attrs);
}
else
{
@@ -3159,7 +3178,7 @@ StorePartitionKey(Relation rel,
simple_heap_insert(pg_partitioned_table, tuple);
/* Update the indexes on pg_partitioned_table */
- CatalogUpdateIndexes(pg_partitioned_table, tuple);
+ CatalogUpdateIndexes(pg_partitioned_table, tuple, false, NULL);
heap_close(pg_partitioned_table, RowExclusiveLock);
/* Mark this relation as dependent on a few things as follows */
@@ -3243,6 +3262,8 @@ StorePartitionBound(Relation rel, Relation parent, Node *bound)
Datum new_val[Natts_pg_class];
bool new_null[Natts_pg_class],
new_repl[Natts_pg_class];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Update pg_class tuple */
classRel = heap_open(RelationRelationId, RowExclusiveLock);
@@ -3276,8 +3297,9 @@ StorePartitionBound(Relation rel, Relation parent, Node *bound)
new_val, new_null, new_repl);
/* Also set the flag */
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = true;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ simple_heap_update(classRel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(classRel, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 26cbc0e..04ea34f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -652,7 +653,7 @@ UpdateIndexRelation(Oid indexoid,
simple_heap_insert(pg_index, tuple);
/* update the indexes on pg_index */
- CatalogUpdateIndexes(pg_index, tuple);
+ CatalogUpdateIndexes(pg_index, tuple, false, NULL);
/*
* close the relation and free the tuple
@@ -1324,8 +1325,13 @@ index_constraint_create(Relation heapRelation,
if (dirty)
{
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
+ simple_heap_update(pg_index, &indexTuple->t_self, indexTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, indexTuple, warm_update,
+ modified_attrs);
InvokeObjectPostAlterHookArg(IndexRelationId, indexRelationId, 0,
InvalidOid, is_internal);
@@ -1691,6 +1697,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
@@ -2090,6 +2110,8 @@ index_build(Relation heapRelation,
Relation pg_index;
HeapTuple indexTuple;
Form_pg_index indexForm;
+ bool warm_update;
+ Bitmapset *modified_attrs;
pg_index = heap_open(IndexRelationId, RowExclusiveLock);
@@ -2103,8 +2125,9 @@ index_build(Relation heapRelation,
Assert(!indexForm->indcheckxmin);
indexForm->indcheckxmin = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ simple_heap_update(pg_index, &indexTuple->t_self, indexTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, indexTuple, warm_update, modified_attrs);
heap_freetuple(indexTuple);
heap_close(pg_index, RowExclusiveLock);
@@ -3441,6 +3464,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
(indexForm->indcheckxmin && !indexInfo->ii_BrokenHotChain) ||
early_pruning_enabled)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
if (!indexInfo->ii_BrokenHotChain && !early_pruning_enabled)
indexForm->indcheckxmin = false;
else if (index_bad || early_pruning_enabled)
@@ -3448,8 +3474,10 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
indexForm->indisvalid = true;
indexForm->indisready = true;
indexForm->indislive = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ simple_heap_update(pg_index, &indexTuple->t_self, indexTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, indexTuple, warm_update,
+ modified_attrs);
/*
* Invalidate the relcache for the table, so that after we commit
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 1915ca3..5046fd1 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -67,9 +67,13 @@ CatalogCloseIndexes(CatalogIndexState indstate)
* This should be called for each inserted or updated catalog tuple.
*
* This is effectively a cut-down version of ExecInsertIndexTuples.
+ *
+ * See comments at CatalogUpdateIndexes for details about warm_update and
+ * modified_attrs
*/
void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ bool warm_update, Bitmapset *modified_attrs)
{
int i;
int numIndexes;
@@ -79,12 +83,27 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * still need them for some indexes
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +131,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +166,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
@@ -152,13 +182,21 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
* to insert or update a single tuple in a system catalog. Avoid using it for
* multiple tuples, since opening the indexes and building the index info
* structures is moderately expensive.
+ *
+ * warm_update is passed as true if the indexes are being updated as a result
+ * of an update operation on the underlying system table and that update was a
+ * WARM update.
+ *
+ * modified_attrs contains a bitmap of attributes modified by the update
+ * operation.
*/
void
-CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple)
+CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple,
+ bool warm_update, Bitmapset *modified_attrs)
{
CatalogIndexState indstate;
indstate = CatalogOpenIndexes(heapRel);
- CatalogIndexInsert(indstate, heapTuple);
+ CatalogIndexInsert(indstate, heapTuple, warm_update, modified_attrs);
CatalogCloseIndexes(indstate);
}
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 3a4e22f..4642614 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -676,7 +676,7 @@ AggregateCreate(const char *aggName,
tup = heap_form_tuple(tupDesc, values, nulls);
simple_heap_insert(aggdesc, tup);
- CatalogUpdateIndexes(aggdesc, tup);
+ CatalogUpdateIndexes(aggdesc, tup, false, NULL);
heap_close(aggdesc, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index 694c0f6..d143a4a 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -138,7 +138,7 @@ CollationCreate(const char *collname, Oid collnamespace,
Assert(OidIsValid(oid));
/* update the index if any */
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
/* set up dependencies for the new collation */
myself.classId = CollationRelationId;
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index b5a0ce9..6757d9c 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -229,7 +229,7 @@ CreateConstraintEntry(const char *constraintName,
conOid = simple_heap_insert(conDesc, tup);
/* update catalog indexes */
- CatalogUpdateIndexes(conDesc, tup);
+ CatalogUpdateIndexes(conDesc, tup, false, NULL);
conobject.classId = ConstraintRelationId;
conobject.objectId = conOid;
@@ -570,6 +570,8 @@ RemoveConstraintById(Oid conId)
Relation pgrel;
HeapTuple relTup;
Form_pg_class classForm;
+ bool warm_update;
+ Bitmapset *modified_attrs;
pgrel = heap_open(RelationRelationId, RowExclusiveLock);
relTup = SearchSysCacheCopy1(RELOID,
@@ -584,9 +586,10 @@ RemoveConstraintById(Oid conId)
RelationGetRelationName(rel));
classForm->relchecks--;
- simple_heap_update(pgrel, &relTup->t_self, relTup);
+ simple_heap_update(pgrel, &relTup->t_self, relTup, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pgrel, relTup);
+ CatalogUpdateIndexes(pgrel, relTup, warm_update, modified_attrs);
heap_freetuple(relTup);
@@ -632,6 +635,8 @@ RenameConstraintById(Oid conId, const char *newname)
Relation conDesc;
HeapTuple tuple;
Form_pg_constraint con;
+ bool warm_update;
+ Bitmapset *modified_attrs;
conDesc = heap_open(ConstraintRelationId, RowExclusiveLock);
@@ -666,10 +671,11 @@ RenameConstraintById(Oid conId, const char *newname)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(con->conname), newname);
- simple_heap_update(conDesc, &tuple->t_self, tuple);
+ simple_heap_update(conDesc, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* update the system catalog indexes */
- CatalogUpdateIndexes(conDesc, tuple);
+ CatalogUpdateIndexes(conDesc, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ConstraintRelationId, conId, 0);
@@ -731,13 +737,17 @@ AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
/* Don't update if the object is already part of the namespace */
if (conform->connamespace == oldNspId && oldNspId != newNspId)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
tup = heap_copytuple(tup);
conform = (Form_pg_constraint) GETSTRUCT(tup);
conform->connamespace = newNspId;
- simple_heap_update(conRel, &tup->t_self, tup);
- CatalogUpdateIndexes(conRel, tup);
+ simple_heap_update(conRel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(conRel, tup, warm_update, modified_attrs);
/*
* Note: currently, the constraint will not have its own
diff --git a/src/backend/catalog/pg_conversion.c b/src/backend/catalog/pg_conversion.c
index adaf7b8..99da1f3 100644
--- a/src/backend/catalog/pg_conversion.c
+++ b/src/backend/catalog/pg_conversion.c
@@ -108,7 +108,7 @@ ConversionCreate(const char *conname, Oid connamespace,
simple_heap_insert(rel, tup);
/* update the index if any */
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
myself.classId = ConversionRelationId;
myself.objectId = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_db_role_setting.c b/src/backend/catalog/pg_db_role_setting.c
index 117cc8d..f592419 100644
--- a/src/backend/catalog/pg_db_role_setting.c
+++ b/src/backend/catalog/pg_db_role_setting.c
@@ -78,6 +78,8 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
bool repl_null[Natts_pg_db_role_setting];
bool repl_repl[Natts_pg_db_role_setting];
HeapTuple newtuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
memset(repl_repl, false, sizeof(repl_repl));
@@ -88,10 +90,11 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
+ simple_heap_update(rel, &tuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -106,6 +109,8 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
Datum datum;
bool isnull;
ArrayType *a;
+ bool warm_update;
+ Bitmapset *modified_attrs;
memset(repl_repl, false, sizeof(repl_repl));
repl_repl[Anum_pg_db_role_setting_setconfig - 1] = true;
@@ -129,10 +134,11 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
+ simple_heap_update(rel, &tuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -158,7 +164,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
simple_heap_insert(rel, newtuple);
/* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, false, NULL);
}
InvokeObjectPostAlterHookArg(DbRoleSettingRelationId,
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index b71fa1b..cae00ad 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -113,7 +113,7 @@ recordMultipleDependencies(const ObjectAddress *depender,
if (indstate == NULL)
indstate = CatalogOpenIndexes(dependDesc);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, false, NULL);
heap_freetuple(tup);
}
@@ -356,14 +356,18 @@ changeDependencyFor(Oid classId, Oid objectId,
simple_heap_delete(depRel, &tup->t_self);
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* make a modifiable copy */
tup = heap_copytuple(tup);
depform = (Form_pg_depend) GETSTRUCT(tup);
depform->refobjid = newRefObjectId;
- simple_heap_update(depRel, &tup->t_self, tup);
- CatalogUpdateIndexes(depRel, tup);
+ simple_heap_update(depRel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(depRel, tup, warm_update, modified_attrs);
heap_freetuple(tup);
}
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 089a9a0..4bf90a4 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -126,7 +126,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals)
HeapTupleSetOid(tup, oids[elemno]);
simple_heap_insert(pg_enum, tup);
- CatalogUpdateIndexes(pg_enum, tup);
+ CatalogUpdateIndexes(pg_enum, tup, false, NULL);
heap_freetuple(tup);
elemno++;
@@ -459,7 +459,7 @@ restart:
enum_tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls);
HeapTupleSetOid(enum_tup, newOid);
simple_heap_insert(pg_enum, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ CatalogUpdateIndexes(pg_enum, enum_tup, false, NULL);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -483,6 +483,8 @@ RenameEnumLabel(Oid enumTypeOid,
HeapTuple old_tup;
bool found_new;
int i;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* check length of new label is ok */
if (strlen(newVal) > (NAMEDATALEN - 1))
@@ -543,8 +545,9 @@ RenameEnumLabel(Oid enumTypeOid,
/* Update the pg_enum entry */
namestrcpy(&en->enumlabel, newVal);
- simple_heap_update(pg_enum, &enum_tup->t_self, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ simple_heap_update(pg_enum, &enum_tup->t_self, enum_tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(pg_enum, enum_tup, warm_update, modified_attrs);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -588,6 +591,8 @@ RenumberEnumType(Relation pg_enum, HeapTuple *existing, int nelems)
HeapTuple newtup;
Form_pg_enum en;
float4 newsortorder;
+ bool warm_update;
+ Bitmapset *modified_attrs;
newtup = heap_copytuple(existing[i]);
en = (Form_pg_enum) GETSTRUCT(newtup);
@@ -597,9 +602,10 @@ RenumberEnumType(Relation pg_enum, HeapTuple *existing, int nelems)
{
en->enumsortorder = newsortorder;
- simple_heap_update(pg_enum, &newtup->t_self, newtup);
+ simple_heap_update(pg_enum, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pg_enum, newtup);
+ CatalogUpdateIndexes(pg_enum, newtup, warm_update, modified_attrs);
}
heap_freetuple(newtup);
diff --git a/src/backend/catalog/pg_largeobject.c b/src/backend/catalog/pg_largeobject.c
index 24edf6a..7ece246 100644
--- a/src/backend/catalog/pg_largeobject.c
+++ b/src/backend/catalog/pg_largeobject.c
@@ -66,7 +66,7 @@ LargeObjectCreate(Oid loid)
loid_new = simple_heap_insert(pg_lo_meta, ntup);
Assert(!OidIsValid(loid) || loid == loid_new);
- CatalogUpdateIndexes(pg_lo_meta, ntup);
+ CatalogUpdateIndexes(pg_lo_meta, ntup, false, NULL);
heap_freetuple(ntup);
diff --git a/src/backend/catalog/pg_namespace.c b/src/backend/catalog/pg_namespace.c
index f048ad4..6b31e7e 100644
--- a/src/backend/catalog/pg_namespace.c
+++ b/src/backend/catalog/pg_namespace.c
@@ -79,7 +79,7 @@ NamespaceCreate(const char *nspName, Oid ownerId, bool isTemp)
nspoid = simple_heap_insert(nspdesc, tup);
Assert(OidIsValid(nspoid));
- CatalogUpdateIndexes(nspdesc, tup);
+ CatalogUpdateIndexes(nspdesc, tup, false, NULL);
heap_close(nspdesc, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_operator.c b/src/backend/catalog/pg_operator.c
index 556f9fe..77bbd97 100644
--- a/src/backend/catalog/pg_operator.c
+++ b/src/backend/catalog/pg_operator.c
@@ -264,7 +264,7 @@ OperatorShellMake(const char *operatorName,
*/
operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateIndexes(pg_operator_desc, tup, false, NULL);
/* Add dependencies for the entry */
makeOperatorDependencies(tup, false);
@@ -350,6 +350,8 @@ OperatorCreate(const char *operatorName,
NameData oname;
int i;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Sanity checks
@@ -526,7 +528,8 @@ OperatorCreate(const char *operatorName,
nulls,
replaces);
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
+ simple_heap_update(pg_operator_desc, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
}
else
{
@@ -536,10 +539,12 @@ OperatorCreate(const char *operatorName,
values, nulls);
operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
+ warm_update = false;
+ modified_attrs = NULL;
}
/* Must update the indexes in either case */
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateIndexes(pg_operator_desc, tup, warm_update, modified_attrs);
/* Add dependencies for the entry */
address = makeOperatorDependencies(tup, isUpdate);
@@ -695,8 +700,12 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_commutator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ bool warm_update;
+ Bitmapset *modified_attrs;
+ simple_heap_update(pg_operator_desc, &tup->t_self, tup,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_operator_desc, tup, warm_update,
+ modified_attrs);
/*
* Do CCI to make the updated tuple visible. We must do this in
@@ -741,8 +750,13 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_negator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
+ simple_heap_update(pg_operator_desc, &tup->t_self, tup,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_operator_desc, tup, warm_update,
+ modified_attrs);
/*
* In the deletion case, do CCI to make the updated tuple visible.
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 7ae192a..0f7027a 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -118,6 +118,8 @@ ProcedureCreate(const char *procedureName,
referenced;
int i;
Oid trfid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* sanity checks
@@ -573,7 +575,8 @@ ProcedureCreate(const char *procedureName,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
ReleaseSysCache(oldtup);
is_update = true;
@@ -593,10 +596,12 @@ ProcedureCreate(const char *procedureName,
tup = heap_form_tuple(tupDesc, values, nulls);
simple_heap_insert(rel, tup);
is_update = false;
+ warm_update = false;
+ modified_attrs = NULL;
}
/* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
retval = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 576b7fa..b93f7c3 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -149,7 +149,7 @@ publication_add_relation(Oid pubid, Relation targetrel,
/* Insert tuple into catalog. */
prrelid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
ObjectAddressSet(myself, PublicationRelRelationId, prrelid);
diff --git a/src/backend/catalog/pg_range.c b/src/backend/catalog/pg_range.c
index d3a4c26..7d4cc5d 100644
--- a/src/backend/catalog/pg_range.c
+++ b/src/backend/catalog/pg_range.c
@@ -59,7 +59,7 @@ RangeCreate(Oid rangeTypeOid, Oid rangeSubType, Oid rangeCollation,
tup = heap_form_tuple(RelationGetDescr(pg_range), values, nulls);
simple_heap_insert(pg_range, tup);
- CatalogUpdateIndexes(pg_range, tup);
+ CatalogUpdateIndexes(pg_range, tup, false, NULL);
heap_freetuple(tup);
/* record type's dependencies on range-related items */
diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 60ed957..3019b4e 100644
--- a/src/backend/catalog/pg_shdepend.c
+++ b/src/backend/catalog/pg_shdepend.c
@@ -253,6 +253,8 @@ shdepChangeDep(Relation sdepRel,
}
else if (oldtup)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Need to update existing entry */
Form_pg_shdepend shForm = (Form_pg_shdepend) GETSTRUCT(oldtup);
@@ -260,10 +262,11 @@ shdepChangeDep(Relation sdepRel,
shForm->refclassid = refclassid;
shForm->refobjid = refobjid;
- simple_heap_update(sdepRel, &oldtup->t_self, oldtup);
+ simple_heap_update(sdepRel, &oldtup->t_self, oldtup, &warm_update,
+ &modified_attrs);
/* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogUpdateIndexes(sdepRel, oldtup, warm_update, modified_attrs);
}
else
{
@@ -290,7 +293,7 @@ shdepChangeDep(Relation sdepRel,
simple_heap_insert(sdepRel, oldtup);
/* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogUpdateIndexes(sdepRel, oldtup, false, NULL);
}
if (oldtup)
@@ -762,7 +765,7 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId)
simple_heap_insert(sdepRel, newtup);
/* Keep indexes current */
- CatalogIndexInsert(indstate, newtup);
+ CatalogIndexInsert(indstate, newtup, false, NULL);
heap_freetuple(newtup);
}
@@ -885,7 +888,7 @@ shdepAddDependency(Relation sdepRel,
simple_heap_insert(sdepRel, tup);
/* keep indexes current */
- CatalogUpdateIndexes(sdepRel, tup);
+ CatalogUpdateIndexes(sdepRel, tup, false, NULL);
/* clean up */
heap_freetuple(tup);
diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index 6d9a324..5857045 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -144,7 +144,7 @@ TypeShellMake(const char *typeName, Oid typeNamespace, Oid ownerId)
*/
typoid = simple_heap_insert(pg_type_desc, tup);
- CatalogUpdateIndexes(pg_type_desc, tup);
+ CatalogUpdateIndexes(pg_type_desc, tup, false, NULL);
/*
* Create dependencies. We can/must skip this in bootstrap mode.
@@ -237,6 +237,8 @@ TypeCreate(Oid newTypeOid,
int i;
Acl *typacl = NULL;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* We assume that the caller validated the arguments individually, but did
@@ -430,7 +432,8 @@ TypeCreate(Oid newTypeOid,
nulls,
replaces);
- simple_heap_update(pg_type_desc, &tup->t_self, tup);
+ simple_heap_update(pg_type_desc, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
typeObjectId = HeapTupleGetOid(tup);
@@ -459,10 +462,12 @@ TypeCreate(Oid newTypeOid,
/* else allow system to assign oid */
typeObjectId = simple_heap_insert(pg_type_desc, tup);
+ warm_update = false;
+ modified_attrs = NULL;
}
/* Update indexes */
- CatalogUpdateIndexes(pg_type_desc, tup);
+ CatalogUpdateIndexes(pg_type_desc, tup, warm_update, modified_attrs);
/*
* Create dependencies. We can/must skip this in bootstrap mode.
@@ -700,6 +705,8 @@ RenameTypeInternal(Oid typeOid, const char *newTypeName, Oid typeNamespace)
HeapTuple tuple;
Form_pg_type typ;
Oid arrayOid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
pg_type_desc = heap_open(TypeRelationId, RowExclusiveLock);
@@ -724,10 +731,11 @@ RenameTypeInternal(Oid typeOid, const char *newTypeName, Oid typeNamespace)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(typ->typname), newTypeName);
- simple_heap_update(pg_type_desc, &tuple->t_self, tuple);
+ simple_heap_update(pg_type_desc, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* update the system catalog indexes */
- CatalogUpdateIndexes(pg_type_desc, tuple);
+ CatalogUpdateIndexes(pg_type_desc, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TypeRelationId, typeOid, 0);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4dfedf8..27bc137 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -487,6 +487,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -517,7 +518,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ee4a182..b48f785 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -349,11 +349,15 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
if (!IsBootstrapProcessingMode())
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* normal case, use a transactional update */
- simple_heap_update(class_rel, &reltup->t_self, reltup);
+ simple_heap_update(class_rel, &reltup->t_self, reltup, &warm_update,
+ &modified_attrs);
/* Keep catalog indexes current */
- CatalogUpdateIndexes(class_rel, reltup);
+ CatalogUpdateIndexes(class_rel, reltup, warm_update, modified_attrs);
}
else
{
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 768fcc8..5c03207 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -172,6 +172,8 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
bool *nulls;
bool *replaces;
NameData nameattrdata;
+ bool warm_update;
+ Bitmapset *modified_attrs;
oldtup = SearchSysCache1(oidCacheId, ObjectIdGetDatum(objectId));
if (!HeapTupleIsValid(oldtup))
@@ -284,8 +286,9 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &oldtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ simple_heap_update(rel, &oldtup->t_self, newtup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(classId, objectId, 0);
@@ -617,6 +620,8 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
Datum *values;
bool *nulls;
bool *replaces;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tup = SearchSysCacheCopy1(oidCacheId, ObjectIdGetDatum(objid));
if (!HeapTupleIsValid(tup)) /* should not happen */
@@ -722,8 +727,8 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &tup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ simple_heap_update(rel, &tup->t_self, newtup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, newtup, warm_update, modified_attrs);
/* Release memory */
pfree(values);
@@ -880,6 +885,8 @@ AlterObjectOwner_internal(Relation rel, Oid objectId, Oid new_ownerId)
Datum *values;
bool *nulls;
bool *replaces;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Superusers can bypass permission checks */
if (!superuser())
@@ -954,8 +961,9 @@ AlterObjectOwner_internal(Relation rel, Oid objectId, Oid new_ownerId)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ simple_heap_update(rel, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtup, warm_update, modified_attrs);
/* Update owner dependency reference */
if (classId == LargeObjectMetadataRelationId)
diff --git a/src/backend/commands/amcmds.c b/src/backend/commands/amcmds.c
index 29061b8..2f33b2c 100644
--- a/src/backend/commands/amcmds.c
+++ b/src/backend/commands/amcmds.c
@@ -88,7 +88,7 @@ CreateAccessMethod(CreateAmStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
amoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
myself.classId = AccessMethodRelationId;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e3e1a53..b9a9ede 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1498,6 +1498,8 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
Datum values[Natts_pg_statistic];
bool nulls[Natts_pg_statistic];
bool replaces[Natts_pg_statistic];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Ignore attr if we weren't able to collect stats */
if (!stats->stats_valid)
@@ -1589,17 +1591,20 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
nulls,
replaces);
ReleaseSysCache(oldtup);
- simple_heap_update(sd, &stup->t_self, stup);
+ simple_heap_update(sd, &stup->t_self, stup, &warm_update,
+ &modified_attrs);
}
else
{
/* No, insert new tuple */
stup = heap_form_tuple(RelationGetDescr(sd), values, nulls);
simple_heap_insert(sd, stup);
+ warm_update = false;
+ modified_attrs = NULL;
}
/* update indexes too */
- CatalogUpdateIndexes(sd, stup);
+ CatalogUpdateIndexes(sd, stup, warm_update, modified_attrs);
heap_freetuple(stup);
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f9309fc..03ed871 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -522,18 +522,28 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
*/
if (indexForm->indisclustered)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
indexForm->indisclustered = false;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ simple_heap_update(pg_index, &indexTuple->t_self, indexTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, indexTuple, warm_update,
+ modified_attrs);
}
else if (thisIndexOid == indexOid)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* this was checked earlier, but let's be real sure */
if (!IndexIsValid(indexForm))
elog(ERROR, "cannot cluster on invalid index %u", indexOid);
indexForm->indisclustered = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ simple_heap_update(pg_index, &indexTuple->t_self, indexTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, indexTuple, warm_update,
+ modified_attrs);
}
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
@@ -1287,13 +1297,18 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
if (!target_is_pg_class)
{
- simple_heap_update(relRelation, &reltup1->t_self, reltup1);
- simple_heap_update(relRelation, &reltup2->t_self, reltup2);
+ bool warm_update1 = false, warm_update2 = false;
+ Bitmapset *modified_attrs1, *modified_attrs2;
+
+ simple_heap_update(relRelation, &reltup1->t_self, reltup1,
+ &warm_update1, &modified_attrs1);
+ simple_heap_update(relRelation, &reltup2->t_self, reltup2,
+ &warm_update2, &modified_attrs2);
/* Keep system catalogs current */
indstate = CatalogOpenIndexes(relRelation);
- CatalogIndexInsert(indstate, reltup1);
- CatalogIndexInsert(indstate, reltup2);
+ CatalogIndexInsert(indstate, reltup1, warm_update1, modified_attrs1);
+ CatalogIndexInsert(indstate, reltup2, warm_update2, modified_attrs2);
CatalogCloseIndexes(indstate);
}
else
@@ -1547,6 +1562,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
+ bool warm_update;
+ Bitmapset *modified_attrs;
relRelation = heap_open(RelationRelationId, RowExclusiveLock);
@@ -1558,8 +1575,9 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
relform->relfrozenxid = frozenXid;
relform->relminmxid = cutoffMulti;
- simple_heap_update(relRelation, &reltup->t_self, reltup);
- CatalogUpdateIndexes(relRelation, reltup);
+ simple_heap_update(relRelation, &reltup->t_self, reltup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(relRelation, reltup, warm_update, modified_attrs);
heap_close(relRelation, RowExclusiveLock);
}
diff --git a/src/backend/commands/comment.c b/src/backend/commands/comment.c
index ada0b03..60b3631 100644
--- a/src/backend/commands/comment.c
+++ b/src/backend/commands/comment.c
@@ -150,6 +150,8 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
bool nulls[Natts_pg_description];
bool replaces[Natts_pg_description];
int i;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Reduce empty-string to NULL case */
if (comment != NULL && strlen(comment) == 0)
@@ -199,7 +201,8 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(description), values,
nulls, replaces);
- simple_heap_update(description, &oldtuple->t_self, newtuple);
+ simple_heap_update(description, &oldtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
}
break; /* Assume there can be only one match */
@@ -214,12 +217,13 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
newtuple = heap_form_tuple(RelationGetDescr(description),
values, nulls);
simple_heap_insert(description, newtuple);
+ warm_update = false;
}
/* Update indexes, if necessary */
if (newtuple != NULL)
{
- CatalogUpdateIndexes(description, newtuple);
+ CatalogUpdateIndexes(description, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
}
@@ -249,6 +253,8 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
bool nulls[Natts_pg_shdescription];
bool replaces[Natts_pg_shdescription];
int i;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Reduce empty-string to NULL case */
if (comment != NULL && strlen(comment) == 0)
@@ -293,7 +299,8 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(shdescription),
values, nulls, replaces);
- simple_heap_update(shdescription, &oldtuple->t_self, newtuple);
+ simple_heap_update(shdescription, &oldtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
}
break; /* Assume there can be only one match */
@@ -308,12 +315,14 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
newtuple = heap_form_tuple(RelationGetDescr(shdescription),
values, nulls);
simple_heap_insert(shdescription, newtuple);
+ warm_update = false;
}
/* Update indexes, if necessary */
if (newtuple != NULL)
{
- CatalogUpdateIndexes(shdescription, newtuple);
+ CatalogUpdateIndexes(shdescription, newtuple, warm_update,
+ modified_attrs);
heap_freetuple(newtuple);
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 77cf8ce..faef5b4 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = (TriggerData *) fcinfo->context;
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c05e14e..55b955a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2669,6 +2669,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2823,6 +2825,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6ad8fd7..db9f9fc 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -549,7 +549,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
simple_heap_insert(pg_database_rel, tuple);
/* Update indexes */
- CatalogUpdateIndexes(pg_database_rel, tuple);
+ CatalogUpdateIndexes(pg_database_rel, tuple, false, NULL);
/*
* Now generate additional catalog entries associated with the new DB
@@ -978,6 +978,8 @@ RenameDatabase(const char *oldname, const char *newname)
int notherbackends;
int npreparedxacts;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Look up the target database's OID, and get exclusive lock on it. We
@@ -1040,8 +1042,9 @@ RenameDatabase(const char *oldname, const char *newname)
if (!HeapTupleIsValid(newtup))
elog(ERROR, "cache lookup failed for database %u", db_id);
namestrcpy(&(((Form_pg_database) GETSTRUCT(newtup))->datname), newname);
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ simple_heap_update(rel, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(DatabaseRelationId, db_id, 0);
@@ -1081,6 +1084,8 @@ movedb(const char *dbname, const char *tblspcname)
DIR *dstdir;
struct dirent *xlde;
movedb_failure_params fparms;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Look up the target database's OID, and get exclusive lock on it. We
@@ -1296,10 +1301,11 @@ movedb(const char *dbname, const char *tblspcname)
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(pgdbrel),
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pgdbrel, &oldtuple->t_self, newtuple);
+ simple_heap_update(pgdbrel, &oldtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* Update indexes */
- CatalogUpdateIndexes(pgdbrel, newtuple);
+ CatalogUpdateIndexes(pgdbrel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1413,6 +1419,8 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
Datum new_record[Natts_pg_database];
bool new_record_nulls[Natts_pg_database];
bool new_record_repl[Natts_pg_database];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Extract options from the statement node tree */
foreach(option, stmt->options)
@@ -1554,10 +1562,11 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
+ simple_heap_update(rel, &tuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
/* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1610,6 +1619,8 @@ AlterDatabaseOwner(const char *dbname, Oid newOwnerId)
SysScanDesc scan;
Form_pg_database datForm;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Get the old tuple. We don't need a lock on the database per se,
@@ -1692,8 +1703,9 @@ AlterDatabaseOwner(const char *dbname, Oid newOwnerId)
}
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ simple_heap_update(rel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 8125537..63cdff8 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -406,7 +406,7 @@ insert_event_trigger_tuple(char *trigname, char *eventname, Oid evtOwner,
/* Insert heap tuple. */
tuple = heap_form_tuple(tgrel->rd_att, values, nulls);
trigoid = simple_heap_insert(tgrel, tuple);
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogUpdateIndexes(tgrel, tuple, false, NULL);
heap_freetuple(tuple);
/* Depend on owner. */
@@ -503,6 +503,8 @@ AlterEventTrigger(AlterEventTrigStmt *stmt)
Oid trigoid;
Form_pg_event_trigger evtForm;
char tgenabled = stmt->tgenabled;
+ bool warm_update;
+ Bitmapset *modified_attrs;
tgrel = heap_open(EventTriggerRelationId, RowExclusiveLock);
@@ -524,8 +526,9 @@ AlterEventTrigger(AlterEventTrigStmt *stmt)
evtForm = (Form_pg_event_trigger) GETSTRUCT(tup);
evtForm->evtenabled = tgenabled;
- simple_heap_update(tgrel, &tup->t_self, tup);
- CatalogUpdateIndexes(tgrel, tup);
+ simple_heap_update(tgrel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(tgrel, tup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(EventTriggerRelationId,
trigoid, 0);
@@ -602,6 +605,8 @@ static void
AlterEventTriggerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
{
Form_pg_event_trigger form;
+ bool warm_update;
+ Bitmapset *modified_attrs;
form = (Form_pg_event_trigger) GETSTRUCT(tup);
@@ -621,8 +626,8 @@ AlterEventTriggerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an event trigger must be a superuser.")));
form->evtowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* Update owner dependency reference */
changeDependencyOnOwner(EventTriggerRelationId,
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 554fdc4..01e6a54 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -1773,7 +1773,7 @@ InsertExtensionTuple(const char *extName, Oid extOwner,
tuple = heap_form_tuple(rel->rd_att, values, nulls);
extensionOid = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -2332,6 +2332,8 @@ pg_extension_config_dump(PG_FUNCTION_ARGS)
bool repl_null[Natts_pg_extension];
bool repl_repl[Natts_pg_extension];
ArrayType *a;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* We only allow this to be called from an extension's SQL script. We
@@ -2484,8 +2486,9 @@ pg_extension_config_dump(PG_FUNCTION_ARGS)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ simple_heap_update(extRel, &extTup->t_self, extTup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(extRel, extTup, warm_update, modified_attrs);
systable_endscan(extScan);
@@ -2516,6 +2519,8 @@ extension_config_remove(Oid extensionoid, Oid tableoid)
bool repl_null[Natts_pg_extension];
bool repl_repl[Natts_pg_extension];
ArrayType *a;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Find the pg_extension tuple */
extRel = heap_open(ExtensionRelationId, RowExclusiveLock);
@@ -2662,8 +2667,9 @@ extension_config_remove(Oid extensionoid, Oid tableoid)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ simple_heap_update(extRel, &extTup->t_self, extTup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(extRel, extTup, warm_update, modified_attrs);
systable_endscan(extScan);
@@ -2691,6 +2697,8 @@ AlterExtensionNamespace(List *names, const char *newschema, Oid *oldschema)
HeapTuple depTup;
ObjectAddresses *objsMoved;
ObjectAddress extAddr;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (list_length(names) != 1)
ereport(ERROR,
@@ -2843,8 +2851,9 @@ AlterExtensionNamespace(List *names, const char *newschema, Oid *oldschema)
/* Now adjust pg_extension.extnamespace */
extForm->extnamespace = nspOid;
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ simple_heap_update(extRel, &extTup->t_self, extTup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(extRel, extTup, warm_update, modified_attrs);
heap_close(extRel, RowExclusiveLock);
@@ -3042,6 +3051,8 @@ ApplyExtensionUpdates(Oid extensionOid,
bool repl[Natts_pg_extension];
ObjectAddress myself;
ListCell *lc;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Fetch parameters for specific version (pcontrol is not changed)
@@ -3090,8 +3101,9 @@ ApplyExtensionUpdates(Oid extensionOid,
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
values, nulls, repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ simple_heap_update(extRel, &extTup->t_self, extTup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(extRel, extTup, warm_update, modified_attrs);
systable_endscan(extScan);
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 476a023..d76ccda 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -234,6 +234,9 @@ AlterForeignDataWrapperOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerI
if (form->fdwowner != newOwnerId)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
memset(repl_null, false, sizeof(repl_null));
memset(repl_repl, false, sizeof(repl_repl));
@@ -256,8 +259,9 @@ AlterForeignDataWrapperOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerI
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignDataWrapperRelationId,
@@ -349,6 +353,9 @@ AlterForeignServerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
if (form->srvowner != newOwnerId)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Superusers can always do it */
if (!superuser())
{
@@ -397,8 +404,9 @@ AlterForeignServerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignServerRelationId, HeapTupleGetOid(tup),
@@ -630,7 +638,7 @@ CreateForeignDataWrapper(CreateFdwStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
fdwId = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
@@ -689,6 +697,8 @@ AlterForeignDataWrapper(AlterFdwStmt *stmt)
Oid fdwhandler;
Oid fdwvalidator;
ObjectAddress myself;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(ForeignDataWrapperRelationId, RowExclusiveLock);
@@ -786,8 +796,8 @@ AlterForeignDataWrapper(AlterFdwStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ simple_heap_update(rel, &tp->t_self, tp, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tp, warm_update, modified_attrs);
heap_freetuple(tp);
@@ -943,7 +953,7 @@ CreateForeignServer(CreateForeignServerStmt *stmt)
srvId = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
@@ -985,6 +995,8 @@ AlterForeignServer(AlterForeignServerStmt *stmt)
Oid srvId;
Form_pg_foreign_server srvForm;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
@@ -1056,8 +1068,8 @@ AlterForeignServer(AlterForeignServerStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ simple_heap_update(rel, &tp->t_self, tp, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tp, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ForeignServerRelationId, srvId, 0);
@@ -1192,7 +1204,7 @@ CreateUserMapping(CreateUserMappingStmt *stmt)
umId = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
@@ -1240,6 +1252,8 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
ForeignServer *srv;
ObjectAddress address;
RoleSpec *role = (RoleSpec *) stmt->user;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(UserMappingRelationId, RowExclusiveLock);
@@ -1307,8 +1321,8 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ simple_heap_update(rel, &tp->t_self, tp, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tp, warm_update, modified_attrs);
ObjectAddressSet(address, UserMappingRelationId, umId);
@@ -1485,7 +1499,7 @@ CreateForeignTable(CreateForeignTableStmt *stmt, Oid relid)
tuple = heap_form_tuple(ftrel->rd_att, values, nulls);
simple_heap_insert(ftrel, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ CatalogUpdateIndexes(ftrel, tuple, false, NULL);
heap_freetuple(tuple);
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 22aecb2..8a48bdc 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -1181,6 +1181,8 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
DefElem *rows_item = NULL;
DefElem *parallel_item = NULL;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(ProcedureRelationId, RowExclusiveLock);
@@ -1295,8 +1297,8 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
procForm->proparallel = interpret_func_parallel(parallel_item);
/* Do the update */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ProcedureRelationId, funcOid, 0);
@@ -1321,6 +1323,8 @@ SetFunctionReturnType(Oid funcOid, Oid newRetType)
Relation pg_proc_rel;
HeapTuple tup;
Form_pg_proc procForm;
+ bool warm_update;
+ Bitmapset *modified_attrs;
pg_proc_rel = heap_open(ProcedureRelationId, RowExclusiveLock);
@@ -1336,9 +1340,10 @@ SetFunctionReturnType(Oid funcOid, Oid newRetType)
procForm->prorettype = newRetType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
+ simple_heap_update(pg_proc_rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateIndexes(pg_proc_rel, tup, warm_update, modified_attrs);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1355,6 +1360,8 @@ SetFunctionArgType(Oid funcOid, int argIndex, Oid newArgType)
Relation pg_proc_rel;
HeapTuple tup;
Form_pg_proc procForm;
+ bool warm_update;
+ Bitmapset *modified_attrs;
pg_proc_rel = heap_open(ProcedureRelationId, RowExclusiveLock);
@@ -1371,9 +1378,10 @@ SetFunctionArgType(Oid funcOid, int argIndex, Oid newArgType)
procForm->proargtypes.values[argIndex] = newArgType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
+ simple_heap_update(pg_proc_rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateIndexes(pg_proc_rel, tup, warm_update, modified_attrs);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1661,7 +1669,7 @@ CreateCast(CreateCastStmt *stmt)
castid = simple_heap_insert(relation, tuple);
- CatalogUpdateIndexes(relation, tuple);
+ CatalogUpdateIndexes(relation, tuple, false, NULL);
/* make dependency entries */
myself.classId = CastRelationId;
@@ -1806,6 +1814,8 @@ CreateTransform(CreateTransformStmt *stmt)
ObjectAddress myself,
referenced;
bool is_replace;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Get the type
@@ -1924,7 +1934,8 @@ CreateTransform(CreateTransformStmt *stmt)
replaces[Anum_pg_transform_trftosql - 1] = true;
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ simple_heap_update(relation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
transformid = HeapTupleGetOid(tuple);
ReleaseSysCache(tuple);
@@ -1935,9 +1946,11 @@ CreateTransform(CreateTransformStmt *stmt)
newtuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
transformid = simple_heap_insert(relation, newtuple);
is_replace = false;
+ warm_update = false;
+ modified_attrs = NULL;
}
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateIndexes(relation, newtuple, warm_update, modified_attrs);
if (is_replace)
deleteDependencyRecordsFor(TransformRelationId, transformid, true);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6b5a9b6..c22173d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -83,6 +83,8 @@ SetMatViewPopulatedState(Relation relation, bool newstate)
{
Relation pgrel;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
Assert(relation->rd_rel->relkind == RELKIND_MATVIEW);
@@ -100,9 +102,10 @@ SetMatViewPopulatedState(Relation relation, bool newstate)
((Form_pg_class) GETSTRUCT(tuple))->relispopulated = newstate;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
+ simple_heap_update(pgrel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateIndexes(pgrel, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index 7cfcc6d..609d92b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -280,7 +280,7 @@ CreateOpFamily(char *amname, char *opfname, Oid namespaceoid, Oid amoid)
opfamilyoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
@@ -657,7 +657,7 @@ DefineOpClass(CreateOpClassStmt *stmt)
opclassoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
@@ -1332,7 +1332,7 @@ storeOperators(List *opfamilyname, Oid amoid,
entryoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
@@ -1443,7 +1443,7 @@ storeProcedures(List *opfamilyname, Oid amoid,
entryoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index a273376..e93a71a 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -400,6 +400,8 @@ AlterOperator(AlterOperatorStmt *stmt)
List *joinName = NIL; /* optional join sel. procedure */
bool updateJoin = false;
Oid joinOid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Look up the operator */
oprId = LookupOperNameTypeNames(NULL, stmt->opername,
@@ -518,8 +520,9 @@ AlterOperator(AlterOperatorStmt *stmt)
tup = heap_modify_tuple(tup, RelationGetDescr(catalog),
values, nulls, replaces);
- simple_heap_update(catalog, &tup->t_self, tup);
- CatalogUpdateIndexes(catalog, tup);
+ simple_heap_update(catalog, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(catalog, tup, warm_update, modified_attrs);
address = makeOperatorDependencies(tup, true);
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 5d9d3a6..a080fc0 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -536,6 +536,8 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
HeapTuple new_tuple;
ObjectAddress target;
ObjectAddress myself;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* zero-clear */
memset(values, 0, sizeof(values));
@@ -614,10 +616,12 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
new_tuple = heap_modify_tuple(tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
+ simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple,
+ &warm_update, &modified_attrs);
/* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateIndexes(pg_policy_rel, new_tuple, warm_update,
+ modified_attrs);
/* Remove all old dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -826,7 +830,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
policy_id = simple_heap_insert(pg_policy_rel, policy_tuple);
/* Update Indexes */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ CatalogUpdateIndexes(pg_policy_rel, policy_tuple, false, NULL);
/* Record Dependencies */
target.classId = RelationRelationId;
@@ -906,6 +910,8 @@ AlterPolicy(AlterPolicyStmt *stmt)
char polcmd;
bool polcmd_isnull;
int i;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Parse role_ids */
if (stmt->roles != NULL)
@@ -1150,10 +1156,12 @@ AlterPolicy(AlterPolicyStmt *stmt)
new_tuple = heap_modify_tuple(policy_tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
+ simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple,
+ &warm_update, &modified_attrs);
/* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateIndexes(pg_policy_rel, new_tuple, warm_update,
+ modified_attrs);
/* Update Dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -1217,6 +1225,8 @@ rename_policy(RenameStmt *stmt)
SysScanDesc sscan;
HeapTuple policy_tuple;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Get id of table. Also handles permissions checks. */
table_id = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock,
@@ -1287,10 +1297,12 @@ rename_policy(RenameStmt *stmt)
namestrcpy(&((Form_pg_policy) GETSTRUCT(policy_tuple))->polname,
stmt->newname);
- simple_heap_update(pg_policy_rel, &policy_tuple->t_self, policy_tuple);
+ simple_heap_update(pg_policy_rel, &policy_tuple->t_self, policy_tuple,
+ &warm_update, &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ CatalogUpdateIndexes(pg_policy_rel, policy_tuple, warm_update,
+ modified_attrs);
InvokeObjectPostAlterHook(PolicyRelationId,
HeapTupleGetOid(policy_tuple), 0);
diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index b684f41..aae5ef6 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -336,6 +336,8 @@ create_proc_lang(const char *languageName, bool replace,
bool is_update;
ObjectAddress myself,
referenced;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(LanguageRelationId, RowExclusiveLock);
tupDesc = RelationGetDescr(rel);
@@ -378,7 +380,8 @@ create_proc_lang(const char *languageName, bool replace,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update,
+ &modified_attrs);
ReleaseSysCache(oldtup);
is_update = true;
@@ -389,10 +392,12 @@ create_proc_lang(const char *languageName, bool replace,
tup = heap_form_tuple(tupDesc, values, nulls);
simple_heap_insert(rel, tup);
is_update = false;
+ warm_update = false;
+ modified_attrs = NULL;
}
/* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/*
* Create dependencies for the new language. If we are updating an
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 63dcc10..4980b36 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -215,7 +215,7 @@ CreatePublication(CreatePublicationStmt *stmt)
/* Insert tuple into catalog. */
puboid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
recordDependencyOnOwner(PublicationRelationId, puboid, GetUserId());
@@ -260,6 +260,8 @@ AlterPublicationOptions(AlterPublicationStmt *stmt, Relation rel,
bool publish_update;
bool publish_delete;
ObjectAddress obj;
+ bool warm_update;
+ Bitmapset *modified_attrs;
parse_publication_options(stmt->options,
&publish_insert_given, &publish_insert,
@@ -294,8 +296,8 @@ AlterPublicationOptions(AlterPublicationStmt *stmt, Relation rel,
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
CommandCounterIncrement();
@@ -666,6 +668,8 @@ PublicationDropTables(Oid pubid, List *rels, bool missing_ok)
AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
{
Form_pg_publication form;
+ bool warm_update;
+ Bitmapset *modified_attrs;
form = (Form_pg_publication) GETSTRUCT(tup);
@@ -685,8 +689,8 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of a publication must be a superuser.")));
form->pubowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* Update owner dependency reference */
changeDependencyOnOwner(PublicationRelationId,
diff --git a/src/backend/commands/schemacmds.c b/src/backend/commands/schemacmds.c
index c3b37b2..d89e093 100644
--- a/src/backend/commands/schemacmds.c
+++ b/src/backend/commands/schemacmds.c
@@ -245,6 +245,8 @@ RenameSchema(const char *oldname, const char *newname)
Relation rel;
AclResult aclresult;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(NamespaceRelationId, RowExclusiveLock);
@@ -281,8 +283,8 @@ RenameSchema(const char *oldname, const char *newname)
/* rename */
namestrcpy(&(((Form_pg_namespace) GETSTRUCT(tup))->nspname), newname);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(NamespaceRelationId, HeapTupleGetOid(tup), 0);
@@ -370,6 +372,8 @@ AlterSchemaOwner_internal(HeapTuple tup, Relation rel, Oid newOwnerId)
bool isNull;
HeapTuple newtuple;
AclResult aclresult;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Otherwise, must be owner of the existing object */
if (!pg_namespace_ownercheck(HeapTupleGetOid(tup), GetUserId()))
@@ -417,8 +421,9 @@ AlterSchemaOwner_internal(HeapTuple tup, Relation rel, Oid newOwnerId)
newtuple = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ simple_heap_update(rel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/seclabel.c b/src/backend/commands/seclabel.c
index 324f2e7..30d7af8 100644
--- a/src/backend/commands/seclabel.c
+++ b/src/backend/commands/seclabel.c
@@ -260,6 +260,8 @@ SetSharedSecurityLabel(const ObjectAddress *object,
Datum values[Natts_pg_shseclabel];
bool nulls[Natts_pg_shseclabel];
bool replaces[Natts_pg_shseclabel];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Prepare to form or update a tuple, if necessary. */
memset(nulls, false, sizeof(nulls));
@@ -299,7 +301,8 @@ SetSharedSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_shseclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_shseclabel),
values, nulls, replaces);
- simple_heap_update(pg_shseclabel, &oldtup->t_self, newtup);
+ simple_heap_update(pg_shseclabel, &oldtup->t_self, newtup,
+ &warm_update, &modified_attrs);
}
}
systable_endscan(scan);
@@ -310,12 +313,13 @@ SetSharedSecurityLabel(const ObjectAddress *object,
newtup = heap_form_tuple(RelationGetDescr(pg_shseclabel),
values, nulls);
simple_heap_insert(pg_shseclabel, newtup);
+ warm_update = false;
}
/* Update indexes, if necessary */
if (newtup != NULL)
{
- CatalogUpdateIndexes(pg_shseclabel, newtup);
+ CatalogUpdateIndexes(pg_shseclabel, newtup, warm_update, modified_attrs);
heap_freetuple(newtup);
}
@@ -339,6 +343,8 @@ SetSecurityLabel(const ObjectAddress *object,
Datum values[Natts_pg_seclabel];
bool nulls[Natts_pg_seclabel];
bool replaces[Natts_pg_seclabel];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Shared objects have their own security label catalog. */
if (IsSharedRelation(object->classId))
@@ -390,7 +396,8 @@ SetSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_seclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_seclabel),
values, nulls, replaces);
- simple_heap_update(pg_seclabel, &oldtup->t_self, newtup);
+ simple_heap_update(pg_seclabel, &oldtup->t_self, newtup,
+ &warm_update, &modified_attrs);
}
}
systable_endscan(scan);
@@ -401,12 +408,13 @@ SetSecurityLabel(const ObjectAddress *object,
newtup = heap_form_tuple(RelationGetDescr(pg_seclabel),
values, nulls);
simple_heap_insert(pg_seclabel, newtup);
+ warm_update = false;
}
/* Update indexes, if necessary */
if (newtup != NULL)
{
- CatalogUpdateIndexes(pg_seclabel, newtup);
+ CatalogUpdateIndexes(pg_seclabel, newtup, warm_update, modified_attrs);
heap_freetuple(newtup);
}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0c673f5..8d4d9a4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -237,7 +237,7 @@ DefineSequence(ParseState *pstate, CreateSeqStmt *seq)
tuple = heap_form_tuple(tupDesc, pgs_values, pgs_nulls);
simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -419,6 +419,8 @@ AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
ObjectAddress address;
Relation rel;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Open and lock sequence. */
relid = RangeVarGetRelid(stmt->sequence, AccessShareLock, stmt->missing_ok);
@@ -504,8 +506,9 @@ AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
relation_close(seqrel, NoLock);
- simple_heap_update(rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ simple_heap_update(rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, tuple, warm_update, modified_attrs);
heap_close(rel, RowExclusiveLock);
return address;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 2b6d322..ccabde1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -277,7 +277,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt)
/* Insert tuple into catalog. */
subid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, false, NULL);
heap_freetuple(tup);
recordDependencyOnOwner(SubscriptionRelationId, subid, owner);
@@ -339,6 +339,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
char *conninfo;
char *slot_name;
List *publications;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
@@ -397,8 +399,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
ObjectAddressSet(myself, SubscriptionRelationId, subid);
@@ -558,6 +560,8 @@ static void
AlterSubscriptionOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
{
Form_pg_subscription form;
+ bool warm_update;
+ Bitmapset *modified_attrs;
form = (Form_pg_subscription) GETSTRUCT(tup);
@@ -577,8 +581,8 @@ AlterSubscriptionOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an subscription must be a superuser.")));
form->subowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* Update owner dependency reference */
changeDependencyOnOwner(SubscriptionRelationId,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c4b0011..2d8d419 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2310,7 +2310,7 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
simple_heap_insert(inhRelation, tuple);
- CatalogUpdateIndexes(inhRelation, tuple);
+ CatalogUpdateIndexes(inhRelation, tuple, false, NULL);
heap_freetuple(tuple);
@@ -2397,11 +2397,16 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
if (classtuple->relhassubclass != relhassubclass)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
classtuple->relhassubclass = relhassubclass;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
+ simple_heap_update(relationRelation, &tuple->t_self, tuple,
+ &warm_update, &modified_attrs);
/* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateIndexes(relationRelation, tuple, warm_update,
+ modified_attrs);
}
else
{
@@ -2477,6 +2482,8 @@ renameatt_internal(Oid myrelid,
HeapTuple atttup;
Form_pg_attribute attform;
AttrNumber attnum;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Grab an exclusive lock on the target table, which we will NOT release
@@ -2592,10 +2599,11 @@ renameatt_internal(Oid myrelid,
/* apply the update */
namestrcpy(&(attform->attname), newattname);
- simple_heap_update(attrelation, &atttup->t_self, atttup);
+ simple_heap_update(attrelation, &atttup->t_self, atttup, &warm_update,
+ &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, atttup);
+ CatalogUpdateIndexes(attrelation, atttup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId, myrelid, attnum);
@@ -2871,6 +2879,8 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal)
HeapTuple reltup;
Form_pg_class relform;
Oid namespaceId;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Grab an exclusive lock on the target table, index, sequence, view,
@@ -2902,10 +2912,11 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal)
*/
namestrcpy(&(relform->relname), newrelname);
- simple_heap_update(relrelation, &reltup->t_self, reltup);
+ simple_heap_update(relrelation, &reltup->t_self, reltup, &warm_update,
+ &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(relrelation, reltup);
+ CatalogUpdateIndexes(relrelation, reltup, warm_update, modified_attrs);
InvokeObjectPostAlterHookArg(RelationRelationId, myrelid, 0,
InvalidOid, is_internal);
@@ -5039,6 +5050,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
ListCell *child;
AclResult aclresult;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* At top level, permission check was done in ATPrepCmd, else do it */
if (recursing)
@@ -5069,6 +5082,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
Oid ctypeId;
int32 ctypmod;
Oid ccollid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Child column must match on type, typmod, and collation */
typenameTypeIdAndMod(NULL, colDef->typeName, &ctypeId, &ctypmod);
@@ -5097,8 +5112,9 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
/* Bump the existing child att's inhcount */
childatt->attinhcount++;
- simple_heap_update(attrdesc, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrdesc, tuple);
+ simple_heap_update(attrdesc, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(attrdesc, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
@@ -5191,10 +5207,11 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
else
((Form_pg_class) GETSTRUCT(reltup))->relnatts = newattnum;
- simple_heap_update(pgclass, &reltup->t_self, reltup);
+ simple_heap_update(pgclass, &reltup->t_self, reltup, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(pgclass, reltup);
+ CatalogUpdateIndexes(pgclass, reltup, warm_update, modified_attrs);
heap_freetuple(reltup);
@@ -5628,12 +5645,16 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
*/
if (((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = FALSE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update, modified_attrs);
ObjectAddressSubSet(address, RelationRelationId,
RelationGetRelid(rel), attnum);
@@ -5706,12 +5727,16 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
*/
if (!((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = TRUE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update, modified_attrs);
/* Tell Phase 3 it needs to test the constraint */
tab->new_notnull = true;
@@ -5833,6 +5858,8 @@ ATExecSetStatistics(Relation rel, const char *colName, Node *newValue, LOCKMODE
Form_pg_attribute attrtuple;
AttrNumber attnum;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
Assert(IsA(newValue, Integer));
newtarget = intVal(newValue);
@@ -5876,10 +5903,11 @@ ATExecSetStatistics(Relation rel, const char *colName, Node *newValue, LOCKMODE
attrtuple->attstattarget = newtarget;
- simple_heap_update(attrelation, &tuple->t_self, tuple);
+ simple_heap_update(attrelation, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateIndexes(attrelation, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -5912,6 +5940,8 @@ ATExecSetOptions(Relation rel, const char *colName, Node *options,
Datum repl_val[Natts_pg_attribute];
bool repl_null[Natts_pg_attribute];
bool repl_repl[Natts_pg_attribute];
+ bool warm_update;
+ Bitmapset *modified_attrs;
attrelation = heap_open(AttributeRelationId, RowExclusiveLock);
@@ -5953,8 +5983,9 @@ ATExecSetOptions(Relation rel, const char *colName, Node *options,
repl_val, repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(attrelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrelation, newtuple);
+ simple_heap_update(attrelation, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(attrelation, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -5986,6 +6017,8 @@ ATExecSetStorage(Relation rel, const char *colName, Node *newValue, LOCKMODE loc
Form_pg_attribute attrtuple;
AttrNumber attnum;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
Assert(IsA(newValue, String));
storagemode = strVal(newValue);
@@ -6037,10 +6070,11 @@ ATExecSetStorage(Relation rel, const char *colName, Node *newValue, LOCKMODE loc
errmsg("column data type %s can only have storage PLAIN",
format_type_be(attrtuple->atttypid))));
- simple_heap_update(attrelation, &tuple->t_self, tuple);
+ simple_heap_update(attrelation, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateIndexes(attrelation, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -6275,13 +6309,18 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Child column must survive my deletion */
childatt->attinhcount--;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple,
+ &warm_update, &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update,
+ modified_attrs);
/* Make update visible */
CommandCounterIncrement();
@@ -6289,6 +6328,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/*
* If we were told to drop ONLY in this table (no recursion),
* we need to mark the inheritors' attributes as locally
@@ -6297,10 +6339,12 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
childatt->attinhcount--;
childatt->attislocal = true;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
+ simple_heap_update(attr_rel, &tuple->t_self, tuple,
+ &warm_update, &modified_attrs);
/* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateIndexes(attr_rel, tuple, warm_update,
+ modified_attrs);
/* Make update visible */
CommandCounterIncrement();
@@ -6333,6 +6377,8 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
Relation class_rel;
Form_pg_class tuple_class;
AlteredTableInfo *tab;
+ bool warm_update;
+ Bitmapset *modified_attrs;
class_rel = heap_open(RelationRelationId, RowExclusiveLock);
@@ -6344,10 +6390,11 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
tuple_class = (Form_pg_class) GETSTRUCT(tuple);
tuple_class->relhasoids = false;
- simple_heap_update(class_rel, &tuple->t_self, tuple);
+ simple_heap_update(class_rel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(class_rel, tuple);
+ CatalogUpdateIndexes(class_rel, tuple, warm_update, modified_attrs);
heap_close(class_rel, RowExclusiveLock);
@@ -7189,6 +7236,8 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
SysScanDesc tgscan;
Relation tgrel;
ListCell *lc;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Now update the catalog, while we have the door open.
@@ -7197,8 +7246,9 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->condeferrable = cmdcon->deferrable;
copy_con->condeferred = cmdcon->initdeferred;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ simple_heap_update(conrel, &copyTuple->t_self, copyTuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(conrel, copyTuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(contuple), 0);
@@ -7223,6 +7273,8 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
{
Form_pg_trigger tgform = (Form_pg_trigger) GETSTRUCT(tgtuple);
Form_pg_trigger copy_tg;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Remember OIDs of other relation(s) involved in FK constraint.
@@ -7251,8 +7303,9 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_tg->tgdeferrable = cmdcon->deferrable;
copy_tg->tginitdeferred = cmdcon->initdeferred;
- simple_heap_update(tgrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(tgrel, copyTuple);
+ simple_heap_update(tgrel, &copyTuple->t_self, copyTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(tgrel, copyTuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tgtuple), 0);
@@ -7351,6 +7404,8 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
{
HeapTuple copyTuple;
Form_pg_constraint copy_con;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (con->contype == CONSTRAINT_FOREIGN)
{
@@ -7438,8 +7493,9 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ simple_heap_update(conrel, &copyTuple->t_self, copyTuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(conrel, copyTuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(tuple), 0);
@@ -8339,10 +8395,14 @@ ATExecDropConstraint(Relation rel, const char *constrName,
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Child constraint must survive my deletion */
con->coninhcount--;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(conrel, copy_tuple, warm_update, modified_attrs);
/* Make update visible */
CommandCounterIncrement();
@@ -8350,6 +8410,9 @@ ATExecDropConstraint(Relation rel, const char *constrName,
}
else
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/*
* If we were told to drop ONLY in this table (no recursion), we
* need to mark the inheritors' constraints as locally defined
@@ -8358,8 +8421,10 @@ ATExecDropConstraint(Relation rel, const char *constrName,
con->coninhcount--;
con->conislocal = true;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(conrel, copy_tuple, warm_update,
+ modified_attrs);
/* Make update visible */
CommandCounterIncrement();
@@ -8675,6 +8740,8 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
SysScanDesc scan;
HeapTuple depTup;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
attrelation = heap_open(AttributeRelationId, RowExclusiveLock);
@@ -9005,10 +9072,11 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
ReleaseSysCache(typeTuple);
- simple_heap_update(attrelation, &heapTup->t_self, heapTup);
+ simple_heap_update(attrelation, &heapTup->t_self, heapTup, &warm_update,
+ &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, heapTup);
+ CatalogUpdateIndexes(attrelation, heapTup, warm_update, modified_attrs);
heap_close(attrelation, RowExclusiveLock);
@@ -9079,6 +9147,8 @@ ATExecAlterColumnGenericOptions(Relation rel,
Form_pg_attribute atttableform;
AttrNumber attnum;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (options == NIL)
return InvalidObjectAddress;
@@ -9146,8 +9216,9 @@ ATExecAlterColumnGenericOptions(Relation rel,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(attrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(attrel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrel, newtuple);
+ simple_heap_update(attrel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(attrel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -9617,6 +9688,8 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
Datum aclDatum;
bool isNull;
HeapTuple newtuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* skip permission checks when recursing to index or toast table */
if (!recursing)
@@ -9667,8 +9740,9 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
newtuple = heap_modify_tuple(tuple, RelationGetDescr(class_rel), repl_val, repl_null, repl_repl);
- simple_heap_update(class_rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(class_rel, newtuple);
+ simple_heap_update(class_rel, &newtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(class_rel, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
@@ -9770,6 +9844,8 @@ change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId)
Datum aclDatum;
bool isNull;
HeapTuple newtuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Ignore dropped columns */
if (att->attisdropped)
@@ -9795,8 +9871,10 @@ change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId)
RelationGetDescr(attRelation),
repl_val, repl_null, repl_repl);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attRelation, newtuple);
+ simple_heap_update(attRelation, &newtuple->t_self, newtuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(attRelation, newtuple, warm_update,
+ modified_attrs);
heap_freetuple(newtuple);
}
@@ -9966,6 +10044,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
bool repl_null[Natts_pg_class];
bool repl_repl[Natts_pg_class];
static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (defList == NIL && operation != AT_ReplaceRelOptions)
return; /* nothing to do */
@@ -10073,9 +10153,10 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
+ simple_heap_update(pgclass, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateIndexes(pgclass, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10088,6 +10169,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
{
Relation toastrel;
Oid toastid = rel->rd_rel->reltoastrelid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
toastrel = heap_open(toastid, lockmode);
@@ -10132,9 +10215,9 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
+ simple_heap_update(pgclass, &newtuple->t_self, newtuple, &warm_update, &modified_attrs);
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateIndexes(pgclass, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHookArg(RelationRelationId,
RelationGetRelid(toastrel), 0,
@@ -10169,6 +10252,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
ForkNumber forkNum;
List *reltoastidxids = NIL;
ListCell *lc;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Need lock here in case we are recursing to toast table or index
@@ -10295,8 +10380,9 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/* update the pg_class row */
rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace;
rd_rel->relfilenode = newrelfilenode;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ simple_heap_update(pg_class, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(pg_class, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10901,6 +10987,9 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
attributeName);
if (HeapTupleIsValid(tuple))
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Check they are same type, typmod, and collation */
Form_pg_attribute childatt = (Form_pg_attribute) GETSTRUCT(tuple);
@@ -10946,8 +11035,9 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ simple_heap_update(attrrel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(attrrel, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
}
else
@@ -10976,6 +11066,8 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
if (HeapTupleIsValid(tuple))
{
Form_pg_attribute childatt = (Form_pg_attribute) GETSTRUCT(tuple);
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* See comments above; these changes should be the same */
childatt->attinhcount++;
@@ -10986,8 +11078,9 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ simple_heap_update(attrrel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(attrrel, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
}
else
@@ -11071,6 +11164,8 @@ MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel)
{
Form_pg_constraint child_con = (Form_pg_constraint) GETSTRUCT(child_tuple);
HeapTuple child_copy;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (child_con->contype != CONSTRAINT_CHECK)
continue;
@@ -11124,8 +11219,10 @@ MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel)
child_con->conislocal = false;
}
- simple_heap_update(catalog_relation, &child_copy->t_self, child_copy);
- CatalogUpdateIndexes(catalog_relation, child_copy);
+ simple_heap_update(catalog_relation, &child_copy->t_self,
+ child_copy, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(catalog_relation, child_copy, warm_update,
+ modified_attrs);
heap_freetuple(child_copy);
found = true;
@@ -11290,13 +11387,17 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
/* Decrement inhcount and possibly set islocal to true */
HeapTuple copyTuple = heap_copytuple(attributeTuple);
Form_pg_attribute copy_att = (Form_pg_attribute) GETSTRUCT(copyTuple);
+ bool warm_update;
+ Bitmapset *modified_attrs;
copy_att->attinhcount--;
if (copy_att->attinhcount == 0)
copy_att->attislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(catalogRelation, copyTuple, warm_update,
+ modified_attrs);
heap_freetuple(copyTuple);
}
}
@@ -11361,6 +11462,8 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
/* Decrement inhcount and possibly set islocal to true */
HeapTuple copyTuple = heap_copytuple(constraintTuple);
Form_pg_constraint copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (copy_con->coninhcount <= 0) /* shouldn't happen */
elog(ERROR, "relation %u has non-inherited constraint \"%s\"",
@@ -11370,8 +11473,10 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
if (copy_con->coninhcount == 0)
copy_con->conislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(catalogRelation, copyTuple, warm_update,
+ modified_attrs);
heap_freetuple(copyTuple);
}
}
@@ -11468,6 +11573,8 @@ ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode)
ObjectAddress tableobj,
typeobj;
HeapTuple classtuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Validate the type. */
typetuple = typenameType(NULL, ofTypename, NULL);
@@ -11571,8 +11678,10 @@ ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode)
if (!HeapTupleIsValid(classtuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(classtuple))->reloftype = typeid;
- simple_heap_update(relationRelation, &classtuple->t_self, classtuple);
- CatalogUpdateIndexes(relationRelation, classtuple);
+ simple_heap_update(relationRelation, &classtuple->t_self, classtuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(relationRelation, classtuple, warm_update,
+ modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11596,6 +11705,8 @@ ATExecDropOf(Relation rel, LOCKMODE lockmode)
Oid relid = RelationGetRelid(rel);
Relation relationRelation;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (!OidIsValid(rel->rd_rel->reloftype))
ereport(ERROR,
@@ -11616,8 +11727,9 @@ ATExecDropOf(Relation rel, LOCKMODE lockmode)
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->reloftype = InvalidOid;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
- CatalogUpdateIndexes(relationRelation, tuple);
+ simple_heap_update(relationRelation, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(relationRelation, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11656,9 +11768,14 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
pg_class_form = (Form_pg_class) GETSTRUCT(pg_class_tuple);
if (pg_class_form->relreplident != ri_type)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
pg_class_form->relreplident = ri_type;
- simple_heap_update(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
- CatalogUpdateIndexes(pg_class, pg_class_tuple);
+ simple_heap_update(pg_class, &pg_class_tuple->t_self, pg_class_tuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_class, pg_class_tuple, warm_update,
+ modified_attrs);
}
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(pg_class_tuple);
@@ -11717,8 +11834,13 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
if (dirty)
{
- simple_heap_update(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
- CatalogUpdateIndexes(pg_index, pg_index_tuple);
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
+ simple_heap_update(pg_index, &pg_index_tuple->t_self,
+ pg_index_tuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_index, pg_index_tuple, warm_update,
+ modified_attrs);
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
InvalidOid, is_internal);
}
@@ -11856,6 +11978,8 @@ ATExecEnableRowSecurity(Relation rel)
Relation pg_class;
Oid relid;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
relid = RelationGetRelid(rel);
@@ -11867,10 +11991,11 @@ ATExecEnableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = true;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
+ simple_heap_update(pg_class, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateIndexes(pg_class, tuple, warm_update, modified_attrs);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11882,6 +12007,8 @@ ATExecDisableRowSecurity(Relation rel)
Relation pg_class;
Oid relid;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
relid = RelationGetRelid(rel);
@@ -11894,10 +12021,11 @@ ATExecDisableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = false;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
+ simple_heap_update(pg_class, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateIndexes(pg_class, tuple, warm_update, modified_attrs);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11912,6 +12040,8 @@ ATExecForceNoForceRowSecurity(Relation rel, bool force_rls)
Relation pg_class;
Oid relid;
HeapTuple tuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
relid = RelationGetRelid(rel);
@@ -11923,10 +12053,11 @@ ATExecForceNoForceRowSecurity(Relation rel, bool force_rls)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relforcerowsecurity = force_rls;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
+ simple_heap_update(pg_class, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateIndexes(pg_class, tuple, warm_update, modified_attrs);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11948,6 +12079,8 @@ ATExecGenericOptions(Relation rel, List *options)
bool repl_repl[Natts_pg_foreign_table];
Datum datum;
Form_pg_foreign_table tableform;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (options == NIL)
return;
@@ -11994,8 +12127,8 @@ ATExecGenericOptions(Relation rel, List *options)
tuple = heap_modify_tuple(tuple, RelationGetDescr(ftrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(ftrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ simple_heap_update(ftrel, &tuple->t_self, tuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(ftrel, tuple, warm_update, modified_attrs);
/*
* Invalidate relcache so that all sessions will refresh any cached plans
@@ -12278,6 +12411,9 @@ AlterRelationNamespaceInternal(Relation classRel, Oid relOid,
already_done = object_address_present(&thisobj, objsMoved);
if (!already_done && oldNspOid != newNspOid)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* check for duplicate name (more friendly than unique-index failure) */
if (get_relname_relid(NameStr(classForm->relname),
newNspOid) != InvalidOid)
@@ -12290,8 +12426,9 @@ AlterRelationNamespaceInternal(Relation classRel, Oid relOid,
/* classTup is a copy, so OK to scribble on */
classForm->relnamespace = newNspOid;
- simple_heap_update(classRel, &classTup->t_self, classTup);
- CatalogUpdateIndexes(classRel, classTup);
+ simple_heap_update(classRel, &classTup->t_self, classTup, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(classRel, classTup, warm_update, modified_attrs);
/* Update dependency on schema if caller said so */
if (hasDependEntry &&
@@ -13499,6 +13636,8 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
new_null[Natts_pg_class],
new_repl[Natts_pg_class];
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
partRel = heap_openrv(name, AccessShareLock);
@@ -13526,8 +13665,9 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
new_val, new_null, new_repl);
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = false;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ simple_heap_update(classRel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(classRel, newtuple, warm_update, modified_attrs);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 651e1b3..bc18cfb 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -346,7 +346,7 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
tablespaceoid = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
heap_freetuple(tuple);
@@ -920,6 +920,8 @@ RenameTableSpace(const char *oldname, const char *newname)
HeapTuple newtuple;
Form_pg_tablespace newform;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Search pg_tablespace */
rel = heap_open(TableSpaceRelationId, RowExclusiveLock);
@@ -971,8 +973,9 @@ RenameTableSpace(const char *oldname, const char *newname)
/* OK, update the entry */
namestrcpy(&(newform->spcname), newname);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ simple_heap_update(rel, &newtuple->t_self, newtuple, &warm_update,
+ &modified_attrs);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TableSpaceRelationId, tspId, 0);
@@ -1001,6 +1004,8 @@ AlterTableSpaceOptions(AlterTableSpaceOptionsStmt *stmt)
bool repl_null[Natts_pg_tablespace];
bool repl_repl[Natts_pg_tablespace];
HeapTuple newtuple;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Search pg_tablespace */
rel = heap_open(TableSpaceRelationId, RowExclusiveLock);
@@ -1044,8 +1049,8 @@ AlterTableSpaceOptions(AlterTableSpaceOptionsStmt *stmt)
repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ simple_heap_update(rel, &newtuple->t_self, newtuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TableSpaceRelationId, HeapTupleGetOid(tup), 0);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3fc3a21..1ed9e4b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -166,6 +166,8 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
referenced;
char *oldtablename = NULL;
char *newtablename = NULL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
if (OidIsValid(relOid))
rel = heap_open(relOid, ShareRowExclusiveLock);
@@ -777,7 +779,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
*/
simple_heap_insert(tgrel, tuple);
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogUpdateIndexes(tgrel, tuple, false, NULL);
heap_freetuple(tuple);
heap_close(tgrel, RowExclusiveLock);
@@ -804,9 +806,10 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
((Form_pg_class) GETSTRUCT(tuple))->relhastriggers = true;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
+ simple_heap_update(pgrel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateIndexes(pgrel, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
@@ -1436,6 +1439,9 @@ renametrig(RenameStmt *stmt)
NULL, 2, key);
if (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
tgoid = HeapTupleGetOid(tuple);
/*
@@ -1446,10 +1452,11 @@ renametrig(RenameStmt *stmt)
namestrcpy(&((Form_pg_trigger) GETSTRUCT(tuple))->tgname,
stmt->newname);
- simple_heap_update(tgrel, &tuple->t_self, tuple);
+ simple_heap_update(tgrel, &tuple->t_self, tuple, &warm_update,
+ &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogUpdateIndexes(tgrel, tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tuple), 0);
@@ -1559,13 +1566,16 @@ EnableDisableTrigger(Relation rel, const char *tgname,
/* need to change this one ... make a copy to scribble on */
HeapTuple newtup = heap_copytuple(tuple);
Form_pg_trigger newtrig = (Form_pg_trigger) GETSTRUCT(newtup);
+ bool warm_update;
+ Bitmapset *modified_attrs;
newtrig->tgenabled = fires_when;
- simple_heap_update(tgrel, &newtup->t_self, newtup);
+ simple_heap_update(tgrel, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
/* Keep catalog indexes current */
- CatalogUpdateIndexes(tgrel, newtup);
+ CatalogUpdateIndexes(tgrel, newtup, warm_update, modified_attrs);
heap_freetuple(newtup);
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 479a160..d8355f6 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -273,7 +273,7 @@ DefineTSParser(List *names, List *parameters)
prsOid = simple_heap_insert(prsRel, tup);
- CatalogUpdateIndexes(prsRel, tup);
+ CatalogUpdateIndexes(prsRel, tup, false, NULL);
address = makeParserDependencies(tup);
@@ -484,7 +484,7 @@ DefineTSDictionary(List *names, List *parameters)
dictOid = simple_heap_insert(dictRel, tup);
- CatalogUpdateIndexes(dictRel, tup);
+ CatalogUpdateIndexes(dictRel, tup, false, NULL);
address = makeDictionaryDependencies(tup);
@@ -540,6 +540,8 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
bool repl_null[Natts_pg_ts_dict];
bool repl_repl[Natts_pg_ts_dict];
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
dictId = get_ts_dict_oid(stmt->dictname, false);
@@ -620,9 +622,10 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
newtup = heap_modify_tuple(tup, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtup->t_self, newtup);
+ simple_heap_update(rel, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateIndexes(rel, newtup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TSDictionaryRelationId, dictId, 0);
@@ -808,7 +811,7 @@ DefineTSTemplate(List *names, List *parameters)
tmplOid = simple_heap_insert(tmplRel, tup);
- CatalogUpdateIndexes(tmplRel, tup);
+ CatalogUpdateIndexes(tmplRel, tup, false, NULL);
address = makeTSTemplateDependencies(tup);
@@ -1068,7 +1071,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
cfgOid = simple_heap_insert(cfgRel, tup);
- CatalogUpdateIndexes(cfgRel, tup);
+ CatalogUpdateIndexes(cfgRel, tup, false, NULL);
if (OidIsValid(sourceOid))
{
@@ -1108,7 +1111,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
simple_heap_insert(mapRel, newmaptup);
- CatalogUpdateIndexes(mapRel, newmaptup);
+ CatalogUpdateIndexes(mapRel, newmaptup, false, NULL);
heap_freetuple(newmaptup);
}
@@ -1398,6 +1401,8 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
bool repl_null[Natts_pg_ts_config_map];
bool repl_repl[Natts_pg_ts_config_map];
HeapTuple newtup;
+ bool warm_update;
+ Bitmapset *modified_attrs;
memset(repl_val, 0, sizeof(repl_val));
memset(repl_null, false, sizeof(repl_null));
@@ -1409,9 +1414,10 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
newtup = heap_modify_tuple(maptup,
RelationGetDescr(relMap),
repl_val, repl_null, repl_repl);
- simple_heap_update(relMap, &newtup->t_self, newtup);
+ simple_heap_update(relMap, &newtup->t_self, newtup,
+ &warm_update, &modified_attrs);
- CatalogUpdateIndexes(relMap, newtup);
+ CatalogUpdateIndexes(relMap, newtup, warm_update, modified_attrs);
}
}
@@ -1437,7 +1443,7 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
tup = heap_form_tuple(relMap->rd_att, values, nulls);
simple_heap_insert(relMap, tup);
- CatalogUpdateIndexes(relMap, tup);
+ CatalogUpdateIndexes(relMap, tup, false, NULL);
heap_freetuple(tup);
}
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 4c33d55..ccbe96f 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -2138,6 +2138,8 @@ AlterDomainDefault(List *names, Node *defaultRaw)
HeapTuple newtuple;
Form_pg_type typTup;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Make a TypeName so we can use standard type lookup machinery */
typename = makeTypeNameFromNameList(names);
@@ -2221,9 +2223,9 @@ AlterDomainDefault(List *names, Node *defaultRaw)
new_record, new_record_nulls,
new_record_repl);
- simple_heap_update(rel, &tup->t_self, newtuple);
+ simple_heap_update(rel, &tup->t_self, newtuple, &warm_update, &modified_attrs);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
/* Rebuild dependencies */
GenerateTypeDependencies(typTup->typnamespace,
@@ -2272,6 +2274,8 @@ AlterDomainNotNull(List *names, bool notNull)
HeapTuple tup;
Form_pg_type typTup;
ObjectAddress address = InvalidObjectAddress;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Make a TypeName so we can use standard type lookup machinery */
typename = makeTypeNameFromNameList(names);
@@ -2360,9 +2364,9 @@ AlterDomainNotNull(List *names, bool notNull)
*/
typTup->typnotnull = notNull;
- simple_heap_update(typrel, &tup->t_self, tup);
+ simple_heap_update(typrel, &tup->t_self, tup, &warm_update, &modified_attrs);
- CatalogUpdateIndexes(typrel, tup);
+ CatalogUpdateIndexes(typrel, tup, warm_update, modified_attrs);
InvokeObjectPostAlterHook(TypeRelationId, domainoid, 0);
@@ -2598,6 +2602,8 @@ AlterDomainValidateConstraint(List *names, char *constrName)
HeapTuple copyTuple;
ScanKeyData key;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Make a TypeName so we can use standard type lookup machinery */
typename = makeTypeNameFromNameList(names);
@@ -2662,8 +2668,8 @@ AlterDomainValidateConstraint(List *names, char *constrName)
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ simple_heap_update(conrel, &copyTuple->t_self, copyTuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(conrel, copyTuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(copyTuple), 0);
@@ -3374,6 +3380,8 @@ AlterTypeOwnerInternal(Oid typeOid, Oid newOwnerId)
Acl *newAcl;
Datum aclDatum;
bool isNull;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(TypeRelationId, RowExclusiveLock);
@@ -3404,9 +3412,9 @@ AlterTypeOwnerInternal(Oid typeOid, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
/* If it has an array type, update that too */
if (OidIsValid(typTup->typarray))
@@ -3561,13 +3569,16 @@ AlterTypeNamespaceInternal(Oid typeOid, Oid nspOid,
if (oldNspOid != nspOid)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* OK, modify the pg_type row */
/* tup is a copy, so we can scribble directly on it */
typform->typnamespace = nspOid;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ simple_heap_update(rel, &tup->t_self, tup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(rel, tup, warm_update, modified_attrs);
}
/*
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index e6fdac3..5d67041 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -434,7 +434,7 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
* Insert new record in the pg_authid table
*/
roleid = simple_heap_insert(pg_authid_rel, tuple);
- CatalogUpdateIndexes(pg_authid_rel, tuple);
+ CatalogUpdateIndexes(pg_authid_rel, tuple, false, NULL);
/*
* Advance command counter so we can see new record; else tests in
@@ -531,6 +531,8 @@ AlterRole(AlterRoleStmt *stmt)
DefElem *dvalidUntil = NULL;
DefElem *dbypassRLS = NULL;
Oid roleid;
+ bool warm_update;
+ Bitmapset *modified_attrs;
check_rolespec_name(stmt->role,
"Cannot alter reserved roles.");
@@ -838,10 +840,11 @@ AlterRole(AlterRoleStmt *stmt)
new_tuple = heap_modify_tuple(tuple, pg_authid_dsc, new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authid_rel, &tuple->t_self, new_tuple);
+ simple_heap_update(pg_authid_rel, &tuple->t_self, new_tuple, &warm_update,
+ &modified_attrs);
/* Update indexes */
- CatalogUpdateIndexes(pg_authid_rel, new_tuple);
+ CatalogUpdateIndexes(pg_authid_rel, new_tuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1149,6 +1152,8 @@ RenameRole(const char *oldname, const char *newname)
Oid roleid;
ObjectAddress address;
Form_pg_authid authform;
+ bool warm_update;
+ Bitmapset *modified_attrs;
rel = heap_open(AuthIdRelationId, RowExclusiveLock);
dsc = RelationGetDescr(rel);
@@ -1243,9 +1248,9 @@ RenameRole(const char *oldname, const char *newname)
}
newtuple = heap_modify_tuple(oldtuple, dsc, repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &oldtuple->t_self, newtuple);
+ simple_heap_update(rel, &oldtuple->t_self, newtuple, &warm_update, &modified_attrs);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateIndexes(rel, newtuple, warm_update, modified_attrs);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1527,13 +1532,16 @@ AddRoleMems(const char *rolename, Oid roleid,
if (HeapTupleIsValid(authmem_tuple))
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
new_record_repl[Anum_pg_auth_members_grantor - 1] = true;
new_record_repl[Anum_pg_auth_members_admin_option - 1] = true;
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_authmem_rel, tuple, warm_update, modified_attrs);
ReleaseSysCache(authmem_tuple);
}
else
@@ -1541,7 +1549,7 @@ AddRoleMems(const char *rolename, Oid roleid,
tuple = heap_form_tuple(pg_authmem_dsc,
new_record, new_record_nulls);
simple_heap_insert(pg_authmem_rel, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogUpdateIndexes(pg_authmem_rel, tuple, false, NULL);
}
/* CCI after each change, in case there are duplicates in list */
@@ -1637,6 +1645,8 @@ DelRoleMems(const char *rolename, Oid roleid,
Datum new_record[Natts_pg_auth_members];
bool new_record_nulls[Natts_pg_auth_members];
bool new_record_repl[Natts_pg_auth_members];
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Build a tuple to update with */
MemSet(new_record, 0, sizeof(new_record));
@@ -1649,8 +1659,10 @@ DelRoleMems(const char *rolename, Oid roleid,
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple,
+ &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_authmem_rel, tuple, warm_update,
+ modified_attrs);
}
ReleaseSysCache(authmem_tuple);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..9e3d0ee 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -2158,6 +2158,22 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is itself a WARM tuple,
+ * there could be multiple index entries pointing to the root of
+ * this chain. We can't do index-only scans for such tuples
+ * without rechecking the index keys, so mark the page as
+ * !all_visible.
+ *
+ * XXX Should we look at the root line pointer and check whether
+ * the WARM flag is set there, or is checking the tuples in the
+ * chain good enough?
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
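For reviewers, the vacuumlazy.c change boils down to the following guard inside heap_page_is_all_visible() (sketch only, using the HeapTupleHeaderIsHeapWarmTuple() test this patch introduces):

/*
 * Sketch of the new all-visible restriction: a tuple that is (or was)
 * part of a WARM chain may be reachable via two index entries on the
 * same root line pointer, only one of which still matches the tuple's
 * key. An index-only scan cannot resolve that without visiting the
 * heap, so such pages must not be marked all-visible.
 */
if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
    all_visible = false;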
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9920f48..94cf92f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use
+ * their existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
@@ -790,6 +803,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
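To make the new ExecInsertIndexTuples() behaviour concrete, this is roughly the filter applied per index (a sketch, not the patch itself; ii_indxattrs is the per-index attribute bitmapset this patch adds to IndexInfo, and the helper name is invented for illustration):

/*
 * Illustrative helper: decide whether an index needs a new entry for
 * the tuple just written. modified_attrs is NULL for inserts and for
 * regular (non-WARM) updates, in which case every index gets an entry.
 * For a WARM update, only indexes whose key columns overlap the
 * modified attributes are touched; the others keep using their
 * existing entries, which point at the root of the chain.
 */
static bool
index_needs_new_entry(IndexInfo *indexInfo, Bitmapset *modified_attrs)
{
    if (modified_attrs == NULL)
        return true;

    return bms_overlap(modified_attrs, indexInfo->ii_indxattrs);
}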
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index a18ae51..fb81633 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -399,6 +399,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
NIL);
@@ -445,6 +446,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +458,31 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple,
+ &warm_update,
+ &modified_attrs);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
+
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid, modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
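The same TID selection logic appears again in nodeModifyTable.c below, so here it is once as a standalone sketch (the helper name is invented; HeapTupleHeaderGetRootOffset() is the accessor added elsewhere in this patch):

/*
 * Illustrative helper: pick the TID that new index entries should
 * carry after an update. A WARM update must index the root line
 * pointer of the chain rather than the new tuple, so that both the
 * old and the new index entries converge on the same root.
 */
static void
choose_index_tid(HeapTuple tuple, bool warm_update, ItemPointer root_tid)
{
    if (warm_update)
        ItemPointerSet(root_tid,
                       ItemPointerGetBlockNumber(&tuple->t_self),
                       HeapTupleHeaderGetRootOffset(tuple->t_data));
    else
        ItemPointerCopy(&tuple->t_self, root_tid);
}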
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f18827d..f81d290 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update,
+ * it's a lossy case
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..c7be366 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,10 +115,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2ac7407..142eb57 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..432dd4b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4085,6 +4087,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5194,6 +5197,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5221,6 +5225,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index d7dda6a..adafe23 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -300,7 +300,7 @@ replorigin_create(char *roname)
tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple, false, NULL);
CommandCounterIncrement();
break;
}
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index 864d45f..59f163a 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -77,6 +77,8 @@ InsertRule(char *rulname,
ObjectAddress myself,
referenced;
bool is_update = false;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Set up *nulls and *values arrays
@@ -124,7 +126,7 @@ InsertRule(char *rulname,
tup = heap_modify_tuple(oldtup, RelationGetDescr(pg_rewrite_desc),
values, nulls, replaces);
- simple_heap_update(pg_rewrite_desc, &tup->t_self, tup);
+ simple_heap_update(pg_rewrite_desc, &tup->t_self, tup, &warm_update, &modified_attrs);
ReleaseSysCache(oldtup);
@@ -136,10 +138,12 @@ InsertRule(char *rulname,
tup = heap_form_tuple(pg_rewrite_desc->rd_att, values, nulls);
rewriteObjectId = simple_heap_insert(pg_rewrite_desc, tup);
+ warm_update = false;
+ modified_attrs = NULL;
}
/* Need to update indexes in either case */
- CatalogUpdateIndexes(pg_rewrite_desc, tup);
+ CatalogUpdateIndexes(pg_rewrite_desc, tup, warm_update, modified_attrs);
heap_freetuple(tup);
@@ -549,6 +553,8 @@ DefineQueryRewrite(char *rulename,
Oid toastrelid;
HeapTuple classTup;
Form_pg_class classForm;
+ bool warm_update;
+ Bitmapset *modified_attrs;
relationRelation = heap_open(RelationRelationId, RowExclusiveLock);
toastrelid = event_relation->rd_rel->reltoastrelid;
@@ -613,8 +619,8 @@ DefineQueryRewrite(char *rulename,
classForm->relminmxid = InvalidMultiXactId;
classForm->relreplident = REPLICA_IDENTITY_NOTHING;
- simple_heap_update(relationRelation, &classTup->t_self, classTup);
- CatalogUpdateIndexes(relationRelation, classTup);
+ simple_heap_update(relationRelation, &classTup->t_self, classTup, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(relationRelation, classTup, warm_update, modified_attrs);
heap_freetuple(classTup);
heap_close(relationRelation, RowExclusiveLock);
@@ -864,12 +870,15 @@ EnableDisableRule(Relation rel, const char *rulename,
if (DatumGetChar(((Form_pg_rewrite) GETSTRUCT(ruletup))->ev_enabled) !=
fires_when)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
((Form_pg_rewrite) GETSTRUCT(ruletup))->ev_enabled =
CharGetDatum(fires_when);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
+ simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup, &warm_update, &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateIndexes(pg_rewrite_desc, ruletup, warm_update, modified_attrs);
changed = true;
}
@@ -938,6 +947,8 @@ RenameRewriteRule(RangeVar *relation, const char *oldName,
Form_pg_rewrite ruleform;
Oid ruleOid;
ObjectAddress address;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/*
* Look up name, check permissions, and acquire lock (which we will NOT
@@ -985,10 +996,10 @@ RenameRewriteRule(RangeVar *relation, const char *oldName,
/* OK, do the update */
namestrcpy(&(ruleform->rulename), newName);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
+ simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup, &warm_update, &modified_attrs);
/* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateIndexes(pg_rewrite_desc, ruletup, warm_update, modified_attrs);
heap_freetuple(ruletup);
heap_close(pg_rewrite_desc, RowExclusiveLock);
diff --git a/src/backend/rewrite/rewriteSupport.c b/src/backend/rewrite/rewriteSupport.c
index 0154072..848ee7a 100644
--- a/src/backend/rewrite/rewriteSupport.c
+++ b/src/backend/rewrite/rewriteSupport.c
@@ -69,13 +69,16 @@ SetRelationRuleStatus(Oid relationId, bool relHasRules)
if (classForm->relhasrules != relHasRules)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/* Do the update */
classForm->relhasrules = relHasRules;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
+ simple_heap_update(relationRelation, &tuple->t_self, tuple, &warm_update, &modified_attrs);
/* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateIndexes(relationRelation, tuple, warm_update, modified_attrs);
}
else
{
diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 262b0b2..7a643bf 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -638,6 +638,9 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
*/
if (olddata != NULL && olddata->pageno == pageno)
{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
/*
* Update an existing page with fresh data.
*
@@ -678,8 +681,9 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ simple_heap_update(lo_heap_r, &newtup->t_self, newtup,
+ &warm_update, &modified_attrs);
+ CatalogIndexInsert(indstate, newtup, warm_update, modified_attrs);
heap_freetuple(newtup);
/*
@@ -722,7 +726,7 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogIndexInsert(indstate, newtup, false, NULL);
heap_freetuple(newtup);
}
pageno++;
@@ -824,6 +828,8 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
bytea *datafield;
int pagelen;
bool pfreeit;
+ bool warm_update;
+ Bitmapset *modified_attrs;
getdatafield(olddata, &datafield, &pagelen, &pfreeit);
memcpy(workb, VARDATA(datafield), pagelen);
@@ -850,8 +856,9 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ simple_heap_update(lo_heap_r, &newtup->t_self, newtup, &warm_update,
+ &modified_attrs);
+ CatalogIndexInsert(indstate, newtup, warm_update, modified_attrs);
heap_freetuple(newtup);
}
else
@@ -889,7 +896,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogIndexInsert(indstate, newtup, false, NULL);
heap_freetuple(newtup);
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 26ff7e1..1976753 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -3419,6 +3420,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
Relation pg_class;
HeapTuple tuple;
Form_pg_class classform;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Indexes, sequences must have Invalid frozenxid; other rels must not */
Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
@@ -3484,8 +3487,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
classform->relminmxid = minmulti;
classform->relpersistence = persistence;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ simple_heap_update(pg_class, &tuple->t_self, tuple, &warm_update, &modified_attrs);
+ CatalogUpdateIndexes(pg_class, tuple, warm_update, modified_attrs);
heap_freetuple(tuple);
@@ -4757,6 +4760,8 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
@@ -4765,6 +4770,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true;/* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4779,6 +4785,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4819,6 +4827,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
@@ -4873,19 +4882,38 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
@@ -4904,7 +4932,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4918,6 +4947,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6a5f279..0de82fa 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -137,6 +138,9 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
/* restore marked scan position */
typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
@@ -196,6 +200,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 69a3873..3e14023 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -364,4 +364,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 22507dc..06e22a3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -160,7 +161,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -175,7 +177,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ bool *warm_update,
+ Bitmapset **modified_attrs);
extern void heap_sync(Relation relation);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a4a1fe1..b4238e5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index 870adf4..0ae223e 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -14,6 +14,7 @@
#ifndef HTUP_H
#define HTUP_H
+#include "nodes/bitmapset.h"
#include "storage/itemptr.h"
/* typedefs and forward declarations for structs defined in htup_details.h */
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index fff1832..2ea4865 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,22 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
+
#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
do { \
AssertMacro(OffsetNumberIsValid(offnum)); \
@@ -763,6 +780,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..98129d6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -750,6 +750,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..1f5b361 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -111,7 +111,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 45605a0..5a8fb70 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -31,8 +31,12 @@ typedef struct ResultRelInfo *CatalogIndexState;
extern CatalogIndexState CatalogOpenIndexes(Relation heapRel);
extern void CatalogCloseIndexes(CatalogIndexState indstate);
extern void CatalogIndexInsert(CatalogIndexState indstate,
- HeapTuple heapTuple);
-extern void CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple);
+ HeapTuple heapTuple,
+ bool warm_update,
+ Bitmapset *modified_attrs);
+extern void CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple,
+ bool warm_update,
+ Bitmapset *modified_attrs);
/*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ab12761..201e8b6 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..2c4d884 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -37,5 +37,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..07f2900 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -62,6 +62,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..ee635be 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1177,7 +1179,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a1750ac..092491f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -138,9 +138,12 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index da36b67..83a7f20 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -50,7 +50,8 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index b01be59..37719c9 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -161,15 +161,15 @@ ALTER SERVER alt_fserv1 RENAME TO alt_fserv3; -- OK
SELECT fdwname FROM pg_foreign_data_wrapper WHERE fdwname like 'alt_fdw%';
fdwname
----------
- alt_fdw2
alt_fdw3
+ alt_fdw2
(2 rows)
SELECT srvname FROM pg_foreign_server WHERE srvname like 'alt_fserv%';
srvname
------------
- alt_fserv2
alt_fserv3
+ alt_fserv2
(2 rows)
--
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 60abcad..42d45a1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1718,6 +1718,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1861,6 +1862,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1904,6 +1906,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1941,7 +1944,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1957,7 +1961,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1979,7 +1984,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..ebbc4ca
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=72)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=4)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1 (cost=0.29..5.16 rows=50 width=4)
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2 (cost=0.14..4.16 rows=1 width=4)
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on updtst_tab3 (cost=0.00..2.25 rows=1 width=4)
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3 (cost=0.14..8.16 rows=1 width=4)
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index e9b2bad..a9a269b 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..b73c278
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,172 @@
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
Reading 0001_track_root_lp_v9.patch again:
+/*
+ * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's t_ctid field
+ * contains the root line pointer. We can't use the same
+ * HeapTupleHeaderIsHeapLatest macro because that also checks for TID-equality
+ * to decide whether a tuple is at the of the chain
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+	((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+	AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+	ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
Interesting stuff; it took me a bit to see why these macros are this
way. I propose the following wording which I think is clearer:
Return whether the tuple has a cached root offset. We don't use
HeapTupleHeaderIsHeapLatest because that one also considers the slow
case of scanning the whole block.
Please flag the macros that have multiple evaluation hazards -- there
are a few of them.
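For instance, a short warning in each such macro's header comment would be
enough; something along these lines (wording is mine, not taken from the
patch):

/*
 * HeapTupleHeaderIsHeapLatest -- is this tuple the latest version in its
 * update chain?
 *
 * Beware of multiple evaluations of the tup and tid arguments.
 */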
+/*
+ * If HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for
+ * clusters which are upgraded from pre-10.0 release, we still check if c_tid
+ * is pointing to itself and declare such tuple as the latest tuple in the
+ * chain
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+	(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+	((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+	 (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
I suggest rewording this comment as:
Starting from PostgreSQL 10, the latest tuple in an update chain has
HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do
not. For those, we determine whether a tuple is latest by testing
that its t_ctid points to itself.
(as discussed, there is no "10.0 release"; it's called the "10 release"
only, no ".0". Feel free to use "v10" or "pg10").
+/*
+ * Get TID of next tuple in the update chain. Caller should have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain whose member this
+ * tuple is.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+	AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+	ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
Actually, I think this macro could just return the TID so that it can be
used as struct assignment, just like ItemPointerCopy does internally --
callers can do
ctid = HeapTupleHeaderGetNextTid(tup);
or more precisely, this pattern
+	if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+		HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+	else
+		ItemPointerCopy(&tp.t_self, &hufd->ctid);
becomes
hufd->ctid = HeapTupleHeaderIsHeapLatest(foo) ?
	tp.t_self : HeapTupleHeaderGetNextTid(foo);
or something like that. I further wonder if it'd make sense to hide
this into yet another macro.
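For instance, a wrapper along these lines would keep the branch out of the
callers (the macro name is invented here just to illustrate the shape):

#define HeapTupleHeaderGetUpdateChainTid(tup, selftid, next_ctid) \
do { \
	if (HeapTupleHeaderIsHeapLatest((tup), (selftid))) \
		ItemPointerCopy((selftid), (next_ctid)); \
	else \
		HeapTupleHeaderGetNextTid((tup), (next_ctid)); \
} while (0)

so the heap_update/heap_delete sites quoted above reduce to

	HeapTupleHeaderGetUpdateChainTid(tp.t_data, &tp.t_self, &hufd->ctid);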
The API of RelationPutHeapTuple appears a bit contorted, where
root_offnum is both input and output. I think it's cleaner to have the
argument be the input, and have the output offset be the return value --
please check whether that simplifies things; for example I think this:
+    root_offnum = InvalidOffsetNumber;
+    RelationPutHeapTuple(relation, buffer, heaptup, false,
+                         &root_offnum);
becomes
root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
InvalidOffsetNumber);
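That is, the prototype would become roughly this (a sketch; the exact
parameter names in the patch may differ):

    extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
                                             HeapTuple tuple, bool token,
                                             OffsetNumber root_offnum);

with the function returning the offset number at which the tuple was
actually placed.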
Please remove the words "must have" in this comment:
+    /*
+     * Also mark both copies as latest and set the root offset information. If
+     * we're doing a HOT/WARM update, then we just copy the information from
+     * old tuple, if available or computed above. For regular updates,
+     * RelationPutHeapTuple must have returned us the actual offset number
+     * where the new version was inserted and we store the same value since the
+     * update resulted in a new HOT-chain
+     */
Many comments lack finishing periods in complete sentences, which looks
odd. Please fix.
I have not looked at the other patch yet.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Looking at your 0002 patch now. It no longer applies, but the conflicts
are trivial to fix. Please rebase and resubmit.
I think the way WARM works has been pretty well hammered by now, other
than the CREATE INDEX CONCURRENTLY issues, so I'm looking at the code
from a maintainability point of view only.
I think we should have some test harness for WARM as part of the source
repository. A test that runs for an hour hammering the machine to
highest possible load cannot be run in "make check", of course, but we
could have some specific Make target to run it manually. We don't have
this for any other feature, but this looks like a decent place to start.
Maybe we should even do it before going any further. The test code you
submitted looks OK to test the feature, but I'm not in love with it
enough to add it to the repo. Maybe I will spend some time trying to
convert it to Perl using PostgresNode.
I think having the "recheck" index methods create an ExecutorState looks
out of place. How difficult is it to pass the estate from the calling
code?
IMO heap_get_root_tuple_one should be called just heap_get_root_tuple().
That function and its plural sibling heap_get_root_tuples() should
indicate in their own comments what the expectations are regarding the
root_offsets output argument, rather than deferring to the comments in
the "internal" function, since they differ on that point; for the rest
of the invariants I think it makes sense to say "Also see the comment
for heap_get_root_tuples_internal". I wonder if heap_get_root_tuple
should just return the ctid instead of assigning the value to a
passed-in pointer, i.e.
OffsetNumber
heap_get_root_tuple(Page page, OffsetNumber target_offnum)
{
OffsetNumber off;
heap_get_root_tuples_internal(page, target_offnum, &off);
return off;
}
The simple_heap_update + CatalogUpdateIndexes pattern is getting
obnoxious. How about creating something like catalog_heap_update which
does both things at once, and stop bothering each callsite with the WARM
stuff? In fact, given that CatalogUpdateIndexes is used in other
places, maybe we should leave its API alone and create another function,
so that we don't have to change the many places that only do
simple_heap_insert. (Places like OperatorCreate which do either insert
or update could just move the index update call into each branch.)
I'm not real sure about the interface between index AM and executor,
namely IndexScanDesc->xs_tuple_recheck. For example, this pattern:
if (!scan->xs_recheck)
scan->xs_tuple_recheck = false;
else
scan->xs_tuple_recheck = true;
can become simply
scan->xs_tuple_recheck = scan->xs_recheck;
which looks odd. I can't pinpoint exactly what the problem is, though.
I'll continue looking at this one.
I wonder if heap_hot_search_buffer() and heap_hot_search() should return
a tri-valued enum instead of boolean; that idea looks reasonable in
theory but callers have to do more work afterwards, so maybe not.
I think heap_hot_search() sometimes leaving the buffer pinned is
confusing. Really, the whole idea of having heap_hot_search have a
buffer output argument is an important API change that should be better
thought out. Maybe it'd be better to return the buffer pinned always, and
the caller is always in charge of unpinning if not InvalidBuffer. Or
perhaps we need a completely new function, given how different it is to
the original? If you tried to document in the comment above
heap_hot_search how it works, you'd find that it's difficult to
describe, which'd be an indicator that it's not well considered.
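To sketch what I mean (names invented, not a concrete proposal), the
function could always return with the pin held and leave cleanup to the
caller:

    /*
     * Sketch only: like heap_hot_search, but the buffer is always returned
     * pinned (when one was read at all) and the caller must release it.
     */
    extern bool heap_hot_search_pinned(ItemPointer tid, Relation relation,
                                       Snapshot snapshot, bool *all_dead,
                                       Buffer *buffer, HeapTuple heapTuple);

    /* caller side */
    found = heap_hot_search_pinned(&tid, rel, snapshot, NULL, &buf, &tup);
    ...
    if (BufferIsValid(buf))
        ReleaseBuffer(buf);     /* caller always unpins */

That would at least make the ownership rule easy to document.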
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
I wonder if heap_hot_search_buffer() and heap_hot_search() should return
a tri-valued enum instead of boolean; that idea looks reasonable in
theory but callers have to do more work afterwards, so maybe not.

I think heap_hot_search() sometimes leaving the buffer pinned is
confusing. Really, the whole idea of having heap_hot_search have a
buffer output argument is an important API change that should be better
thought out. Maybe it'd be better to return the buffer pinned always, and
the caller is always in charge of unpinning if not InvalidBuffer. Or
perhaps we need a completely new function, given how different it is to
the original? If you tried to document in the comment above
heap_hot_search how it works, you'd find that it's difficult to
describe, which'd be an indicator that it's not well considered.
Even before your patch, heap_hot_search claims to have the same API as
heap_hot_search_buffer "except that caller does not provide the buffer."
But this is a lie and has been since 9.2 (more precisely, since commit
4da99ea4231e). I think WARM makes things even worse and we should fix
that. Not yet sure which direction to fix it ...
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 25, 2017 at 4:08 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I think the way WARM works has been pretty well hammered by now, other
than the CREATE INDEX CONCURRENTLY issues, so I'm looking at the code
from a maintainability point of view only.
Which senior hackers have previously reviewed it in detail?
Where would I go to get a good overview of the overall theory of operation?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Jan 25, 2017 at 4:08 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I think the way WARM works has been pretty well hammered by now, other
than the CREATE INDEX CONCURRENTLY issues, so I'm looking at the code
from a maintainability point of view only.

Which senior hackers have previously reviewed it in detail?
The previous thread,
/messages/by-id/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com
contains some discussion of it, which uncovered bugs in the initial idea
and gave rise to the current design.
Where would I go to get a good overview of the overall theory of operation?
The added README file does a pretty good job, I thought.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 25, 2017 at 10:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Reading 0001_track_root_lp_v9.patch again:
Thanks for the review.
+/*
+ * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's t_ctid field
+ * contains the root line pointer. We can't use the same
+ * HeapTupleHeaderIsHeapLatest macro because that also checks for TID-equality
+ * to decide whether a tuple is at the end of the chain.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+    ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+    AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+    ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)

Interesting stuff; it took me a bit to see why these macros are this
way. I propose the following wording which I think is clearer:

Return whether the tuple has a cached root offset. We don't use
HeapTupleHeaderIsHeapLatest because that one also considers the slow
case of scanning the whole block.
Umm, not scanning the whole block, but HeapTupleHeaderIsHeapLatest compares
t_ctid with the passed-in TID and returns true if they match. To know if
the root lp is cached, we only rely on the HEAP_LATEST_TUPLE flag. Though if
the flag is set, that also implies it is the latest tuple.
Please flag the macros that have multiple evaluation hazards -- there
are a few of them.
Can you please tell me an example? I must be missing something.
+/*
+ * Get TID of next tuple in the update chain. Caller should have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain whose member this
+ * tuple is.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+    AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+    ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)

Actually, I think this macro could just return the TID so that it can be
used as struct assignment, just like ItemPointerCopy does internally --
callers can do

ctid = HeapTupleHeaderGetNextTid(tup);
Yes, makes sense. Will fix.
The API of RelationPutHeapTuple appears a bit contorted, where
root_offnum is both input and output. I think it's cleaner to have the
argument be the input, and have the output offset be the return value --
please check whether that simplifies things; for example I think this:

+    root_offnum = InvalidOffsetNumber;
+    RelationPutHeapTuple(relation, buffer, heaptup, false,
+                         &root_offnum);

becomes

root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
                                   InvalidOffsetNumber);
Makes sense. Will fix.
Many comments lack finishing periods in complete sentences, which looks
odd. Please fix.
Sorry, not sure where I picked that style from. I see that the existing
code has both styles, though I will add finishing periods since I like
that style too.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Looking at your 0002 patch now. It no longer applies, but the conflicts
are trivial to fix. Please rebase and resubmit.
Thanks.
Maybe I will spend some time trying to
convert it to Perl using PostgresNode.
Agree. I put together a test harness to hammer the WARM code as much as we
can. This harness has already discovered some bugs, especially around the
index creation part. It also discovered one outstanding bug in master, so
it's been useful. But I agree it should be rewritten in Perl.
I think having the "recheck" index methods create an ExecutorState looks
out of place. How difficult is it to pass the estate from the calling
code?
I couldn't find an easy way given the place where recheck is required. Can
you suggest something?
IMO heap_get_root_tuple_one should be called just heap_get_root_tuple().
That function and its plural sibling heap_get_root_tuples() should
indicate in their own comments what the expectations are regarding the
root_offsets output argument, rather than deferring to the comments in
the "internal" function, since they differ on that point; for the rest
of the invariants I think it makes sense to say "Also see the comment
for heap_get_root_tuples_internal". I wonder if heap_get_root_tuple
should just return the ctid instead of assigning the value to a
passed-in pointer, i.e.
OffsetNumber
heap_get_root_tuple(Page page, OffsetNumber target_offnum)
{
OffsetNumber off;
heap_get_root_tuples_internal(page, target_offnum, &off);
return off;
}
Yes, all of that makes sense. Will fix.
The simple_heap_update + CatalogUpdateIndexes pattern is getting
obnoxious. How about creating something like catalog_heap_update which
does both things at once, and stop bothering each callsite with the WARM
stuff? In fact, given that CatalogUpdateIndexes is used in other
places, maybe we should leave its API alone and create another function,
so that we don't have to change the many places that only do
simple_heap_insert. (Places like OperatorCreate which do either insert
or update could just move the index update call into each branch.)
What I ended up doing is adding two new APIs:
- CatalogUpdateHeapAndIndex
- CatalogInsertHeapAndIndex
I could replace almost all occurrences of simple_heap_update +
CatalogUpdateIndexes with the first API and simple_heap_insert +
CatalogUpdateIndexes with the second API. This looks like a good
improvement to me anyway, since there are about 180 places where these
functions are called in almost the same pattern. Maybe it will also prevent
a bug where someone forgets to update the indexes after inserting into or
updating the heap.
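In its simplest form the update variant is just a thin wrapper, along these
lines (a simplified sketch; the version in the patch also carries the
modified-columns/WARM information through to the index code):

    void
    CatalogUpdateHeapAndIndex(Relation heapRel, ItemPointer otid, HeapTuple tup)
    {
        /* update the heap tuple, then insert any needed index entries */
        simple_heap_update(heapRel, otid, tup);
        CatalogUpdateIndexes(heapRel, tup);
    }

The insert variant is analogous, wrapping simple_heap_insert.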
I wonder if heap_hot_search_buffer() and heap_hot_search() should return
a tri-valued enum instead of boolean; that idea looks reasonable in
theory but callers have to do more work afterwards, so maybe not.
Ok. I'll try to rearrange it a bit. Maybe we should just have one API after
all? There are only a few callers of these APIs.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee wrote:
On Wed, Jan 25, 2017 at 10:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
+( \
+    ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+    AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+    ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)

Interesting stuff; it took me a bit to see why these macros are this
way. I propose the following wording which I think is clearer:

Return whether the tuple has a cached root offset. We don't use
HeapTupleHeaderIsHeapLatest because that one also considers the slow
case of scanning the whole block.

Umm, not scanning the whole block, but HeapTupleHeaderIsHeapLatest compares
t_ctid with the passed-in TID and returns true if they match. To know if
the root lp is cached, we only rely on the HEAP_LATEST_TUPLE flag. Though if
the flag is set, that also implies it is the latest tuple.
Well, I'm just trying to fix the problem that when I saw that macro, I
thought "why is this checking the bitmask directly instead of using the
existing IsHeapLatest macro?". It turned out that IsHeapLatest is not
simply comparing the bitmask; it also does more expensive processing,
which is unwanted in this case. I think the comment on this macro should
explain why the other macro cannot be used.
Please flag the macros that have multiple evaluation hazards -- there
are a few of them.

Can you please tell me an example? I must be missing something.
Any macro that uses an argument more than once is subject to multiple
evaluations of that argument; for example, if you pass a function call to
the macro as one of the parameters, the function is called multiple
times. In many cases this is not a problem because the argument is
always a constant, but sometimes it does become a problem.
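The textbook example (nothing to do with this patch, just to illustrate the
hazard):

    #define SQUARE(x)  ((x) * (x))

    int i = 3;
    int j = SQUARE(i++);    /* expands to ((i++) * (i++)): i is modified
                             * twice without a sequence point, which is
                             * undefined behaviour */

Any of the new macros that mention an argument more than once has the same
exposure if someone ever passes an expression with side effects.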
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Looking at your 0002 patch now. It no longer applies, but the conflicts
are trivial to fix. Please rebase and resubmit.
Please see rebased and updated patches attached.
I think having the "recheck" index methods create an ExecutorState looks
out of place. How difficult is it to pass the estate from the calling
code?
I couldn't find a good way to pass the estate from the calling code. It
would require changes to many other APIs. I saw that all other callers
which need to form index keys do the same. But please suggest if there are
better ways.
OffsetNumber
heap_get_root_tuple(Page page, OffsetNumber target_offnum)
{
OffsetNumber off;
heap_get_root_tuples_internal(page, target_offnum, &off);
return off;
}
Ok. Changed this way. Definitely looks better.
The simple_heap_update + CatalogUpdateIndexes pattern is getting
obnoxious. How about creating something like catalog_heap_update which
does both things at once, and stop bothering each callsite with the WARM
stuff?
What I realised is that there are really two patterns:
1. simple_heap_insert, CatalogUpdateIndexes
2. simple_heap_update, CatalogUpdateIndexes
There are only a couple of places where we already have the indexes open or
have more than one tuple to update, so we call CatalogIndexInsert directly.
What I ended up doing in the attached patch is adding two new APIs, each of
which combines the two steps of one of these patterns. It seems much cleaner
to me and also less buggy for future users. I hope I am not missing a reason
not to combine these steps.
I'm not real sure about the interface between index AM and executor,
namely IndexScanDesc->xs_tuple_recheck. For example, this pattern:
if (!scan->xs_recheck)
scan->xs_tuple_recheck = false;
else
scan->xs_tuple_recheck = true;
can become simply
scan->xs_tuple_recheck = scan->xs_recheck;
Fixed.
which looks odd. I can't pinpoint exactly what's the problem, though.
I'll continue looking at this one.
What we do is: if the index scan is marked to do recheck, we do it for each
tuple anyway. Otherwise, a recheck is required only if the tuple comes from a
WARM chain.
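So conceptually the per-tuple decision is just this (a sketch;
came_from_warm_chain is a stand-in for the flag that heap_hot_search_buffer
reports, not an actual variable in the patch):

    /* recheck if the AM is lossy, or if the tuple came off a WARM chain */
    scan->xs_tuple_recheck = scan->xs_recheck || came_from_warm_chain;

which perhaps also explains why the earlier if/else looked odd.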
I wonder if heap_hot_search_buffer() and heap_hot_search() should return
a tri-valued enum instead of boolean; that idea looks reasonable in
theory but callers have to do more work afterwards, so maybe not.
I did not do anything with this yet. But I agree with you that we need to
make it better/simpler. Will continue to work on that.
I've addressed other review comments on the 0001 patch, except this one.
+/*
+ * Get TID of next tuple in the update chain. Caller should have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain whose member this
+ * tuple is.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+    AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+    ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
Actually, I think this macro could just return the TID so that it can be
used as struct assignment, just like ItemPointerCopy does internally --
callers can do
ctid = HeapTupleHeaderGetNextTid(tup);
While I agree with your proposal, I wonder why we have ItemPointerCopy() in
the first place, given that we freely copy TIDs via struct assignment
elsewhere. Is there a reason for that? And if there is, does it impact this
specific case?
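For reference, the equivalence I mean, as a trivial illustration
(ItemPointerCopy itself lives in itemptr.h):

    ItemPointerData a, b;

    ItemPointerSet(&a, 42, 7);      /* block 42, offset 7 */

    b = a;                          /* plain struct assignment ... */
    ItemPointerCopy(&a, &b);        /* ... has the same effect */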
Other than the review comments, there were a couple of bugs that I discovered
while running the stress test, notably around visibility map handling. The
patch has those fixes. I also ripped out the kludge to record WARM-ness in
the line pointer because that is no longer needed after I reworked the code
a few versions back.
The other critical bug I found, which unfortunately exists in master too, is
the index corruption during CIC. The patch includes the same fix that I've
proposed on the other thread. With these changes, the WARM stress test has
been running fine for the last 24 hours on a decently powerful box. Multiple
CREATE/DROP INDEX cycles and updates via different indexed columns, with a
mix of FOR SHARE/UPDATE and rollbacks, did not produce any consistency
issues. A side note: while performance measurement wasn't a goal of the
stress tests, WARM has done about 67% more transactions than master in a
24-hour period (95M in master vs 156M in WARM, to be precise, on a 30GB
table including indexes). I believe the numbers would be far better had the
test not been dropping and recreating the indexes, which effectively cleans
up all index bloat. Also, the table is small enough to fit in shared
buffers. I'll rerun these tests with a much larger scale factor and without
dropping indexes.
Of course, make check-world, including all TAP tests, passes too.
CREATE INDEX CONCURRENTLY now works. The way we handle this is by ensuring
that no broken WARM chains are created while the initial index build is
happening. We check the list of attributes of indexes currently in progress
(i.e. not yet ready for inserts) and if any of these attributes are being
modified, we don't do a WARM update. This is enough to address the CIC
issue, and all other mechanisms remain the same as for HOT. I've updated the
README to include the CIC algorithm.
There is one issue that bothers me. The current implementation lacks the
ability to convert WARM chains back into HOT chains. README.WARM has a
proposal to do that, but it requires an additional free bit in the tuple
header (which we don't have) and, of course, it needs to be vetted and
implemented. If the heap ends up with many WARM tuples, then index-only
scans will become ineffective, because an index-only scan cannot skip a heap
page if it contains a WARM tuple. Alternate ideas/suggestions and review of
the design are welcome!
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0002_warm_updates_v10.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..7a9a976 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -141,6 +141,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b2afdb7..ef3bfa3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -115,6 +115,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c2247ad..2135ae0 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -92,6 +92,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index ec8ed33..4861957 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -89,6 +89,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -269,6 +270,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -306,8 +309,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index a59ad6f..46a334c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -408,6 +410,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..7b9a712
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,306 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly eliminated redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is the same as for
+HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index for
+the updated tuple, and we are doing a WARM update, the new entry is made
+to point to the root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on each
+of the columns. When a tuple is first inserted into the table, we have
+exactly one index entry pointing to the tuple from each index.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. To do so, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column and hence recheck routine
+for hash AM must first compute the hash value of the heap attribute and
+then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't support the recheck
+routine, WARM updates are disabled on such tables.
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys, both pointing to the same WARM chain. In that case, the same
+valid tuple will be reachable via multiple index keys, yet satisfying
+the index key checks. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements 1 i.e. do not do WARM updates to a tuple
+from a WARM chain. HOT updates are fine because they do not add a new
+index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular UPDATEs is cut down to roughly half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value only differs in the
+case sensitivity. So we can not solely rely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not to cause a new index entry when
+columns referred in the predicate expressions change.
+
+(None of these things is currently implemented; we simply disallow a
+WARM update if a column used in an expression index or in an index
+predicate has changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating, must be the last tuple in the update
+chain. In such cases, the t_ctid field usually points to the tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If UPDATE operation is aborted, the last tuple in the update chain
+becomes dead. The root line pointer information stored in the tuple
+which remains the last valid tuple in the chain is also lost. In such
+rare cases, the root line pointer must be found in a hard way by
+scanning the entire heap page.
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for the index key match (case when old tuple is returned by
+the new index key). So we must follow the update chain every time to the
+end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But this also implies that the benefit of WARM will be
+no more than 50%, which is still significant, but if we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entries pointing to the root of the chain. In other
+words, if we can remove duplicate entry from every index or conclusively
+prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between these two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from index scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only
+Blue pointer and that must not be removed. IOW we should remove Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have 2 index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update or must apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum-aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for presence of multiple Blue pointers and ensure that we don't
+accidentally remove either of the Blue pointers, and that we don't clear
+the WARM chains either.
+
+CREATE INDEX CONCURRENTLY
+-------------------------
+
+Currently CREATE INDEX CONCURRENTLY (CIC) is implemented as a 3-phase
+process. In the first phase, we create catalog entry for the new index
+so that the index is visible to all other backends, but still don't use
+it for either read or write. But we ensure that no new broken HOT
+chains are created by new transactions. In the second phase, we build
+the new index using a MVCC snapshot and then make the index available
+for inserts. We then do another pass over the index and insert any
+missing tuples, each time indexing only its root line pointer. See
+README.HOT for details about how HOT impacts CIC and how various
+challenges are tackled.
+
+WARM poses another challenge because it allows creation of HOT chains
+even when an index key is changed. But since the index is not ready for
+insertion until the second phase is over, we might end up with a
+situation where the HOT chain has tuples with different index columns,
+yet only one of these values is indexed by the new index. Note that
+during the third phase, we only index tuples whose root line pointer is
+missing from the index. But we can't easily check if the existing index
+tuple is actually indexing the heap tuple visible to the new MVCC
+snapshot. Finding that information will require us to query the index
+again for every tuple in the chain, especially if it's a WARM tuple.
+This would require repeated access to the index. Another option would be
+to return index keys along with the heap TIDs when index is scanned for
+collecting all indexed TIDs during third phase. We can then compare the
+heap tuple against the already indexed key and decide whether or not to
+index the new tuple.
+
+We solve this problem more simply by disallowing WARM updates until the
+index is ready for insertion. We don't need to disallow WARM on a
+wholesale basis, but only those updates that change the columns of the
+new index are disallowed to be WARM updates.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5149c07..8be0137 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1957,6 +1957,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update ultimately aborted, we still must do a
+ * recheck because the failing UPDATE may have created index entries
+ * which are now stale, but still referencing this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either WARM or WARM updated tuple signals possible
+ * breakage and the caller must recheck tuple returned from this chain
+ * for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * It can't be a HOT chain if the tuple contains root line pointer
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1976,11 +2048,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2034,9 +2109,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2049,6 +2127,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2097,7 +2185,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2121,18 +2210,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3491,15 +3603,18 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
+ Bitmapset *notready_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3520,6 +3635,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3544,6 +3660,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3565,10 +3685,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+ notready_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_NOTREADY);
+
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, notready_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3620,6 +3747,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3875,6 +4005,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4193,6 +4324,37 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good enough for a first
+ * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by update. This will be
+ * fixed once basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !IsSystemRelation(relation) &&
+ !bms_overlap(notready_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4239,6 +4401,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID> in
+ * indexes, we could look at relaxing this restriction and allow even
+ * more WARM updates.
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4251,12 +4429,35 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4366,7 +4567,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4506,7 +4710,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ Bitmapset **modified_attrs, bool *warm_update)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4515,7 +4720,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs, warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7567,6 +7772,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7578,6 +7784,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7651,6 +7860,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8628,16 +8839,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8697,6 +8914,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8832,6 +9054,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f54337c..c2bd7d6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -834,6 +834,13 @@ heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
continue;
+ /*
+ * If the tuple has a root line pointer, it must be the end of the
+ * chain.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
/* Set up to scan the HOT-chain */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ba27c1e..3cbe1d0 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -75,10 +75,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -233,6 +235,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that the index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if the function was called earlier in the session when we're
+ * here. We can't call it now because doing so risks causing a
+ * deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -534,7 +551,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -573,7 +590,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -600,6 +617,12 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * Reset the per-tuple recheck flag to the scan-wide default: if the
+ * whole scan requires rechecking, then so does every tuple.
+ */
+ scan->xs_tuple_recheck = scan->xs_recheck;
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -609,32 +632,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks
+ * since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case, or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
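To illustrate the recheck path in index_fetch_heap, here is a minimal scenario
(a sketch only; the table, index and column names are illustrative, and on such
a tiny table you may need SET enable_seqscan = off to force the index path):

CREATE TABLE warm_demo (id int, val int, note text);
CREATE INDEX warm_demo_id_idx ON warm_demo (id);
CREATE INDEX warm_demo_val_idx ON warm_demo (val);
INSERT INTO warm_demo VALUES (1, 10, 'x');

-- Candidate for a WARM update: only "val" changes, so only
-- warm_demo_val_idx should receive a new entry; warm_demo_id_idx keeps
-- pointing at the root of the chain.
UPDATE warm_demo SET val = 20 WHERE id = 1;

-- Index scans must return exactly one correct row; the recheck in
-- index_fetch_heap rejects chain members whose recomputed keys no longer
-- match the index entry being followed.
SELECT * FROM warm_demo WHERE id = 1;     -- one row, val = 20
SELECT * FROM warm_demo WHERE val = 20;   -- one row
SELECT * FROM warm_demo WHERE val = 10;   -- zero rows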
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 883d70d..6efccf7 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer.
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and must not consider
+ * this tuple.
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
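To see why the unique-check path must still follow the chain through a reused
index entry, consider a minimal scenario (names are illustrative only): the
primary key entry is not duplicated by the WARM update, yet a later conflicting
insert must still find the live chain member through it.

CREATE TABLE warm_uniq (id int PRIMARY KEY, val int);
CREATE INDEX warm_uniq_val_idx ON warm_uniq (val);
INSERT INTO warm_uniq VALUES (1, 10);

-- Only "val" changes, so the primary key index entry is reused
UPDATE warm_uniq SET val = 20 WHERE id = 1;

-- _bt_check_unique follows the chain from the single pkey entry,
-- rechecks the fetched tuple against the index key, and still reports
-- the conflict:
INSERT INTO warm_uniq VALUES (1, 30);   -- ERROR: duplicate key value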
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 469e7ab..27013f4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -121,6 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -301,8 +303,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..9becaeb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..2236f02 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -71,6 +71,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 00a9aea..477f450 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1252,7 +1252,7 @@ SetDefaultACL(InternalDefaultACL *iacls)
values[Anum_pg_default_acl_defaclacl - 1] = PointerGetDatum(new_acl);
newtuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, newtuple);
+ CatalogInsertHeapAndIndexes(rel, newtuple);
}
else
{
@@ -1262,12 +1262,9 @@ SetDefaultACL(InternalDefaultACL *iacls)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
values, nulls, replaces);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
}
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(rel, newtuple);
-
/* these dependencies don't change in an update */
if (isNew)
{
@@ -1697,10 +1694,7 @@ ExecGrant_Attribute(InternalGrant *istmt, Oid relOid, const char *relname,
newtuple = heap_modify_tuple(attr_tuple, RelationGetDescr(attRelation),
values, nulls, replaces);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(attRelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attRelation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, attnum,
@@ -1963,10 +1957,7 @@ ExecGrant_Relation(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, 0, new_acl);
@@ -2156,10 +2147,7 @@ ExecGrant_Database(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update the shared dependency ACL info */
updateAclDependencies(DatabaseRelationId, HeapTupleGetOid(tuple), 0,
@@ -2281,10 +2269,7 @@ ExecGrant_Fdw(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(fdwid, ForeignDataWrapperRelationId, 0,
@@ -2410,10 +2395,7 @@ ExecGrant_ForeignServer(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(srvid, ForeignServerRelationId, 0, new_acl);
@@ -2537,10 +2519,7 @@ ExecGrant_Function(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(funcId, ProcedureRelationId, 0, new_acl);
@@ -2671,10 +2650,7 @@ ExecGrant_Language(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(langId, LanguageRelationId, 0, new_acl);
@@ -2813,10 +2789,7 @@ ExecGrant_Largeobject(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(loid, LargeObjectRelationId, 0, new_acl);
@@ -2941,10 +2914,7 @@ ExecGrant_Namespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(nspid, NamespaceRelationId, 0, new_acl);
@@ -3068,10 +3038,7 @@ ExecGrant_Tablespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update the shared dependency ACL info */
updateAclDependencies(TableSpaceRelationId, tblId, 0,
@@ -3205,10 +3172,7 @@ ExecGrant_Type(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(typId, TypeRelationId, 0, new_acl);
@@ -5751,10 +5715,7 @@ recordExtensionInitPrivWorker(Oid objoid, Oid classoid, int objsubid, Acl *new_a
oldtuple = heap_modify_tuple(oldtuple, RelationGetDescr(relation),
values, nulls, replace);
- simple_heap_update(relation, &oldtuple->t_self, oldtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, oldtuple);
+ CatalogUpdateHeapAndIndexes(relation, &oldtuple->t_self, oldtuple);
}
else
/* new_acl is NULL, so delete the entry we found. */
@@ -5788,10 +5749,7 @@ recordExtensionInitPrivWorker(Oid objoid, Oid classoid, int objsubid, Acl *new_a
tuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- simple_heap_insert(relation, tuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, tuple);
+ CatalogInsertHeapAndIndexes(relation, tuple);
}
}
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7ce9115..84e9ef5 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -633,9 +633,9 @@ InsertPgAttributeTuple(Relation pg_attribute_rel,
simple_heap_insert(pg_attribute_rel, tup);
if (indstate != NULL)
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
else
- CatalogUpdateIndexes(pg_attribute_rel, tup);
+ CatalogUpdateIndexes(pg_attribute_rel, tup, NULL, false);
heap_freetuple(tup);
}
@@ -824,9 +824,7 @@ InsertPgClassTuple(Relation pg_class_desc,
HeapTupleSetOid(tup, new_rel_oid);
/* finally insert the new tuple, update the indexes, and clean up */
- simple_heap_insert(pg_class_desc, tup);
-
- CatalogUpdateIndexes(pg_class_desc, tup);
+ CatalogInsertHeapAndIndexes(pg_class_desc, tup);
heap_freetuple(tup);
}
@@ -1599,10 +1597,7 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
"........pg.dropped.%d........", attnum);
namestrcpy(&(attStruct->attname), newattname);
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
}
/*
@@ -1731,10 +1726,7 @@ RemoveAttrDefaultById(Oid attrdefId)
((Form_pg_attribute) GETSTRUCT(tuple))->atthasdef = false;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/*
* Our update of the pg_attribute row will force a relcache rebuild, so
@@ -1932,9 +1924,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
adrel = heap_open(AttrDefaultRelationId, RowExclusiveLock);
tuple = heap_form_tuple(adrel->rd_att, values, nulls);
- attrdefOid = simple_heap_insert(adrel, tuple);
-
- CatalogUpdateIndexes(adrel, tuple);
+ attrdefOid = CatalogInsertHeapAndIndexes(adrel, tuple);
defobject.classId = AttrDefaultRelationId;
defobject.objectId = attrdefOid;
@@ -1964,9 +1954,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
if (!attStruct->atthasdef)
{
attStruct->atthasdef = true;
- simple_heap_update(attrrel, &atttup->t_self, atttup);
- /* keep catalog indexes current */
- CatalogUpdateIndexes(attrrel, atttup);
+ CatalogUpdateHeapAndIndexes(attrrel, &atttup->t_self, atttup);
}
heap_close(attrrel, RowExclusiveLock);
heap_freetuple(atttup);
@@ -2561,8 +2549,7 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
Assert(is_local);
con->connoinherit = true;
}
- simple_heap_update(conDesc, &tup->t_self, tup);
- CatalogUpdateIndexes(conDesc, tup);
+ CatalogUpdateHeapAndIndexes(conDesc, &tup->t_self, tup);
break;
}
}
@@ -2602,10 +2589,7 @@ SetRelationNumChecks(Relation rel, int numchecks)
{
relStruct->relchecks = numchecks;
- simple_heap_update(relrel, &reltup->t_self, reltup);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(relrel, reltup);
+ CatalogUpdateHeapAndIndexes(relrel, &reltup->t_self, reltup);
}
else
{
@@ -3145,10 +3129,7 @@ StorePartitionKey(Relation rel,
tuple = heap_form_tuple(RelationGetDescr(pg_partitioned_table), values, nulls);
- simple_heap_insert(pg_partitioned_table, tuple);
-
- /* Update the indexes on pg_partitioned_table */
- CatalogUpdateIndexes(pg_partitioned_table, tuple);
+ CatalogInsertHeapAndIndexes(pg_partitioned_table, tuple);
heap_close(pg_partitioned_table, RowExclusiveLock);
/* Mark this relation as dependent on a few things as follows */
@@ -3265,8 +3246,7 @@ StorePartitionBound(Relation rel, Relation parent, Node *bound)
new_val, new_null, new_repl);
/* Also set the flag */
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = true;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ CatalogUpdateHeapAndIndexes(classRel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 26cbc0e..1b51cbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -649,10 +650,7 @@ UpdateIndexRelation(Oid indexoid,
/*
* insert the tuple into the pg_index catalog
*/
- simple_heap_insert(pg_index, tuple);
-
- /* update the indexes on pg_index */
- CatalogUpdateIndexes(pg_index, tuple);
+ CatalogInsertHeapAndIndexes(pg_index, tuple);
/*
* close the relation and free the tuple
@@ -1324,8 +1322,7 @@ index_constraint_create(Relation heapRelation,
if (dirty)
{
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
InvokeObjectPostAlterHookArg(IndexRelationId, indexRelationId, 0,
InvalidOid, is_internal);
@@ -1691,6 +1688,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
@@ -2103,8 +2114,7 @@ index_build(Relation heapRelation,
Assert(!indexForm->indcheckxmin);
indexForm->indcheckxmin = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
heap_freetuple(indexTuple);
heap_close(pg_index, RowExclusiveLock);
@@ -3448,8 +3458,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
indexForm->indisvalid = true;
indexForm->indisready = true;
indexForm->indislive = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
/*
* Invalidate the relcache for the table, so that after we commit
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 1915ca3..304f742 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -66,10 +66,15 @@ CatalogCloseIndexes(CatalogIndexState indstate)
*
* This should be called for each inserted or updated catalog tuple.
*
+ * If the tuple was WARM updated, modified_attrs contains the set of
+ * columns changed by the update. We must not insert new index entries for
+ * indexes which do not refer to any of the modified columns.
+ *
* This is effectively a cut-down version of ExecInsertIndexTuples.
*/
void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
int i;
int numIndexes;
@@ -79,12 +84,28 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * require inserts into some of the indexes.
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +133,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +168,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
@@ -154,11 +186,43 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
* structures is moderately expensive.
*/
void
-CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple)
+CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
CatalogIndexState indstate;
indstate = CatalogOpenIndexes(heapRel);
- CatalogIndexInsert(indstate, heapTuple);
+ CatalogIndexInsert(indstate, heapTuple, modified_attrs, warm_update);
CatalogCloseIndexes(indstate);
}
+
+/*
+ * A convenience routine which updates the heap tuple (identified by otid) with
+ * tup and also updates all indexes on the table.
+ */
+void
+CatalogUpdateHeapAndIndexes(Relation heapRel, ItemPointer otid, HeapTuple tup)
+{
+ bool warm_update;
+ Bitmapset *modified_attrs;
+
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
+
+ /* Make sure only indexes whose columns are modified receive new entries */
+ CatalogUpdateIndexes(heapRel, tup, modified_attrs, warm_update);
+}
+
+/*
+ * A convenience routine which inserts a new heap tuple and also updates all
+ * indexes on the table.
+ *
+ * The OID of the inserted tuple is returned.
+ */
+Oid
+CatalogInsertHeapAndIndexes(Relation heapRel, HeapTuple tup)
+{
+ Oid oid;
+ oid = simple_heap_insert(heapRel, tup);
+ CatalogUpdateIndexes(heapRel, tup, NULL, false);
+ return oid;
+}
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 3a4e22f..9cab585 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -674,9 +674,7 @@ AggregateCreate(const char *aggName,
tupDesc = aggdesc->rd_att;
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(aggdesc, tup);
-
- CatalogUpdateIndexes(aggdesc, tup);
+ CatalogInsertHeapAndIndexes(aggdesc, tup);
heap_close(aggdesc, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index 694c0f6..ebaf3fd 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -134,12 +134,9 @@ CollationCreate(const char *collname, Oid collnamespace,
tup = heap_form_tuple(tupDesc, values, nulls);
/* insert a new tuple */
- oid = simple_heap_insert(rel, tup);
+ oid = CatalogInsertHeapAndIndexes(rel, tup);
Assert(OidIsValid(oid));
- /* update the index if any */
- CatalogUpdateIndexes(rel, tup);
-
/* set up dependencies for the new collation */
myself.classId = CollationRelationId;
myself.objectId = oid;
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index b5a0ce9..9509cac 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -226,10 +226,7 @@ CreateConstraintEntry(const char *constraintName,
tup = heap_form_tuple(RelationGetDescr(conDesc), values, nulls);
- conOid = simple_heap_insert(conDesc, tup);
-
- /* update catalog indexes */
- CatalogUpdateIndexes(conDesc, tup);
+ conOid = CatalogInsertHeapAndIndexes(conDesc, tup);
conobject.classId = ConstraintRelationId;
conobject.objectId = conOid;
@@ -584,9 +581,7 @@ RemoveConstraintById(Oid conId)
RelationGetRelationName(rel));
classForm->relchecks--;
- simple_heap_update(pgrel, &relTup->t_self, relTup);
-
- CatalogUpdateIndexes(pgrel, relTup);
+ CatalogUpdateHeapAndIndexes(pgrel, &relTup->t_self, relTup);
heap_freetuple(relTup);
@@ -666,10 +661,7 @@ RenameConstraintById(Oid conId, const char *newname)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(con->conname), newname);
- simple_heap_update(conDesc, &tuple->t_self, tuple);
-
- /* update the system catalog indexes */
- CatalogUpdateIndexes(conDesc, tuple);
+ CatalogUpdateHeapAndIndexes(conDesc, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(ConstraintRelationId, conId, 0);
@@ -736,8 +728,7 @@ AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
conform->connamespace = newNspId;
- simple_heap_update(conRel, &tup->t_self, tup);
- CatalogUpdateIndexes(conRel, tup);
+ CatalogUpdateHeapAndIndexes(conRel, &tup->t_self, tup);
/*
* Note: currently, the constraint will not have its own
diff --git a/src/backend/catalog/pg_conversion.c b/src/backend/catalog/pg_conversion.c
index adaf7b8..a942e02 100644
--- a/src/backend/catalog/pg_conversion.c
+++ b/src/backend/catalog/pg_conversion.c
@@ -105,10 +105,7 @@ ConversionCreate(const char *conname, Oid connamespace,
tup = heap_form_tuple(tupDesc, values, nulls);
/* insert a new tuple */
- simple_heap_insert(rel, tup);
-
- /* update the index if any */
- CatalogUpdateIndexes(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
myself.classId = ConversionRelationId;
myself.objectId = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_db_role_setting.c b/src/backend/catalog/pg_db_role_setting.c
index 117cc8d..c206b03 100644
--- a/src/backend/catalog/pg_db_role_setting.c
+++ b/src/backend/catalog/pg_db_role_setting.c
@@ -88,10 +88,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -129,10 +126,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -155,10 +149,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
values[Anum_pg_db_role_setting_setconfig - 1] = PointerGetDatum(a);
newtuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogInsertHeapAndIndexes(rel, newtuple);
}
InvokeObjectPostAlterHookArg(DbRoleSettingRelationId,
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index b71fa1b..85a7622 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -113,7 +113,7 @@ recordMultipleDependencies(const ObjectAddress *depender,
if (indstate == NULL)
indstate = CatalogOpenIndexes(dependDesc);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
heap_freetuple(tup);
}
@@ -362,8 +362,7 @@ changeDependencyFor(Oid classId, Oid objectId,
depform->refobjid = newRefObjectId;
- simple_heap_update(depRel, &tup->t_self, tup);
- CatalogUpdateIndexes(depRel, tup);
+ CatalogUpdateHeapAndIndexes(depRel, &tup->t_self, tup);
heap_freetuple(tup);
}
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 089a9a0..16a4e80 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -125,8 +125,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals)
tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls);
HeapTupleSetOid(tup, oids[elemno]);
- simple_heap_insert(pg_enum, tup);
- CatalogUpdateIndexes(pg_enum, tup);
+ CatalogInsertHeapAndIndexes(pg_enum, tup);
heap_freetuple(tup);
elemno++;
@@ -458,8 +457,7 @@ restart:
values[Anum_pg_enum_enumlabel - 1] = NameGetDatum(&enumlabel);
enum_tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls);
HeapTupleSetOid(enum_tup, newOid);
- simple_heap_insert(pg_enum, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ CatalogInsertHeapAndIndexes(pg_enum, enum_tup);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -543,8 +541,7 @@ RenameEnumLabel(Oid enumTypeOid,
/* Update the pg_enum entry */
namestrcpy(&en->enumlabel, newVal);
- simple_heap_update(pg_enum, &enum_tup->t_self, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ CatalogUpdateHeapAndIndexes(pg_enum, &enum_tup->t_self, enum_tup);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -597,9 +594,7 @@ RenumberEnumType(Relation pg_enum, HeapTuple *existing, int nelems)
{
en->enumsortorder = newsortorder;
- simple_heap_update(pg_enum, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(pg_enum, newtup);
+ CatalogUpdateHeapAndIndexes(pg_enum, &newtup->t_self, newtup);
}
heap_freetuple(newtup);
diff --git a/src/backend/catalog/pg_largeobject.c b/src/backend/catalog/pg_largeobject.c
index 24edf6a..d59d4b7 100644
--- a/src/backend/catalog/pg_largeobject.c
+++ b/src/backend/catalog/pg_largeobject.c
@@ -63,11 +63,9 @@ LargeObjectCreate(Oid loid)
if (OidIsValid(loid))
HeapTupleSetOid(ntup, loid);
- loid_new = simple_heap_insert(pg_lo_meta, ntup);
+ loid_new = CatalogInsertHeapAndIndexes(pg_lo_meta, ntup);
Assert(!OidIsValid(loid) || loid == loid_new);
- CatalogUpdateIndexes(pg_lo_meta, ntup);
-
heap_freetuple(ntup);
heap_close(pg_lo_meta, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_namespace.c b/src/backend/catalog/pg_namespace.c
index f048ad4..4c06873 100644
--- a/src/backend/catalog/pg_namespace.c
+++ b/src/backend/catalog/pg_namespace.c
@@ -76,11 +76,9 @@ NamespaceCreate(const char *nspName, Oid ownerId, bool isTemp)
tup = heap_form_tuple(tupDesc, values, nulls);
- nspoid = simple_heap_insert(nspdesc, tup);
+ nspoid = CatalogInsertHeapAndIndexes(nspdesc, tup);
Assert(OidIsValid(nspoid));
- CatalogUpdateIndexes(nspdesc, tup);
-
heap_close(nspdesc, RowExclusiveLock);
/* Record dependencies */
diff --git a/src/backend/catalog/pg_operator.c b/src/backend/catalog/pg_operator.c
index 556f9fe..d3f71ca 100644
--- a/src/backend/catalog/pg_operator.c
+++ b/src/backend/catalog/pg_operator.c
@@ -262,9 +262,7 @@ OperatorShellMake(const char *operatorName,
/*
* insert our "shell" operator tuple
*/
- operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
-
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ operatorObjectId = CatalogInsertHeapAndIndexes(pg_operator_desc, tup);
/* Add dependencies for the entry */
makeOperatorDependencies(tup, false);
@@ -526,7 +524,7 @@ OperatorCreate(const char *operatorName,
nulls,
replaces);
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
}
else
{
@@ -535,12 +533,9 @@ OperatorCreate(const char *operatorName,
tup = heap_form_tuple(RelationGetDescr(pg_operator_desc),
values, nulls);
- operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
+ operatorObjectId = CatalogInsertHeapAndIndexes(pg_operator_desc, tup);
}
- /* Must update the indexes in either case */
- CatalogUpdateIndexes(pg_operator_desc, tup);
-
/* Add dependencies for the entry */
address = makeOperatorDependencies(tup, isUpdate);
@@ -695,8 +690,7 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_commutator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
/*
* Do CCI to make the updated tuple visible. We must do this in
@@ -741,8 +735,7 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_negator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
/*
* In the deletion case, do CCI to make the updated tuple visible.
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 6ab849c..f35769e 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -572,7 +572,7 @@ ProcedureCreate(const char *procedureName,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ReleaseSysCache(oldtup);
is_update = true;
@@ -590,12 +590,10 @@ ProcedureCreate(const char *procedureName,
nulls[Anum_pg_proc_proacl - 1] = true;
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
is_update = false;
}
- /* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
retval = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 00ed28f..2c7c3b5 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -149,8 +149,7 @@ publication_add_relation(Oid pubid, Relation targetrel,
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- prrelid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ prrelid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
ObjectAddressSet(myself, PublicationRelRelationId, prrelid);
diff --git a/src/backend/catalog/pg_range.c b/src/backend/catalog/pg_range.c
index d3a4c26..c21610d 100644
--- a/src/backend/catalog/pg_range.c
+++ b/src/backend/catalog/pg_range.c
@@ -58,8 +58,7 @@ RangeCreate(Oid rangeTypeOid, Oid rangeSubType, Oid rangeCollation,
tup = heap_form_tuple(RelationGetDescr(pg_range), values, nulls);
- simple_heap_insert(pg_range, tup);
- CatalogUpdateIndexes(pg_range, tup);
+ CatalogInsertHeapAndIndexes(pg_range, tup);
heap_freetuple(tup);
/* record type's dependencies on range-related items */
diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 60ed957..8d1ddab 100644
--- a/src/backend/catalog/pg_shdepend.c
+++ b/src/backend/catalog/pg_shdepend.c
@@ -260,10 +260,7 @@ shdepChangeDep(Relation sdepRel,
shForm->refclassid = refclassid;
shForm->refobjid = refobjid;
- simple_heap_update(sdepRel, &oldtup->t_self, oldtup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogUpdateHeapAndIndexes(sdepRel, &oldtup->t_self, oldtup);
}
else
{
@@ -287,10 +284,7 @@ shdepChangeDep(Relation sdepRel,
* it's certainly a new tuple
*/
oldtup = heap_form_tuple(RelationGetDescr(sdepRel), values, nulls);
- simple_heap_insert(sdepRel, oldtup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogInsertHeapAndIndexes(sdepRel, oldtup);
}
if (oldtup)
@@ -759,10 +753,7 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId)
HeapTuple newtup;
newtup = heap_modify_tuple(tup, sdepDesc, values, nulls, replace);
- simple_heap_insert(sdepRel, newtup);
-
- /* Keep indexes current */
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(sdepRel, newtup);
heap_freetuple(newtup);
}
@@ -882,10 +873,7 @@ shdepAddDependency(Relation sdepRel,
tup = heap_form_tuple(sdepRel->rd_att, values, nulls);
- simple_heap_insert(sdepRel, tup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, tup);
+ CatalogInsertHeapAndIndexes(sdepRel, tup);
/* clean up */
heap_freetuple(tup);
diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index 6d9a324..8dfd5f0 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -142,9 +142,7 @@ TypeShellMake(const char *typeName, Oid typeNamespace, Oid ownerId)
/*
* insert the tuple in the relation and get the tuple's oid.
*/
- typoid = simple_heap_insert(pg_type_desc, tup);
-
- CatalogUpdateIndexes(pg_type_desc, tup);
+ typoid = CatalogInsertHeapAndIndexes(pg_type_desc, tup);
/*
* Create dependencies. We can/must skip this in bootstrap mode.
@@ -430,7 +428,7 @@ TypeCreate(Oid newTypeOid,
nulls,
replaces);
- simple_heap_update(pg_type_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_type_desc, &tup->t_self, tup);
typeObjectId = HeapTupleGetOid(tup);
@@ -458,12 +456,9 @@ TypeCreate(Oid newTypeOid,
}
/* else allow system to assign oid */
- typeObjectId = simple_heap_insert(pg_type_desc, tup);
+ typeObjectId = CatalogInsertHeapAndIndexes(pg_type_desc, tup);
}
- /* Update indexes */
- CatalogUpdateIndexes(pg_type_desc, tup);
-
/*
* Create dependencies. We can/must skip this in bootstrap mode.
*/
@@ -724,10 +719,7 @@ RenameTypeInternal(Oid typeOid, const char *newTypeName, Oid typeNamespace)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(typ->typname), newTypeName);
- simple_heap_update(pg_type_desc, &tuple->t_self, tuple);
-
- /* update the system catalog indexes */
- CatalogUpdateIndexes(pg_type_desc, tuple);
+ CatalogUpdateHeapAndIndexes(pg_type_desc, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(TypeRelationId, typeOid, 0);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 28be27a..92fa6e0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -493,6 +493,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -523,7 +524,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
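With the view changes above, the new counters can be watched directly; for
example (reusing the illustrative warm_demo table from earlier; the cumulative
view is fed by the stats collector, so counts may appear with a short delay):

SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_warm_upd
FROM pg_stat_all_tables
WHERE relname = 'warm_demo';

-- or, within the updating transaction itself:
SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE relname = 'warm_demo';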
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ee4a182..cae1228 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -350,10 +350,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
if (!IsBootstrapProcessingMode())
{
/* normal case, use a transactional update */
- simple_heap_update(class_rel, &reltup->t_self, reltup);
-
- /* Keep catalog indexes current */
- CatalogUpdateIndexes(class_rel, reltup);
+ CatalogUpdateHeapAndIndexes(class_rel, &reltup->t_self, reltup);
}
else
{
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 768fcc8..d8d4bec 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -284,8 +284,7 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &oldtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &oldtup->t_self, newtup);
InvokeObjectPostAlterHook(classId, objectId, 0);
@@ -722,8 +721,7 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &tup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, newtup);
/* Release memory */
pfree(values);
@@ -954,8 +952,7 @@ AlterObjectOwner_internal(Relation rel, Oid objectId, Oid new_ownerId)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
/* Update owner dependency reference */
if (classId == LargeObjectMetadataRelationId)
diff --git a/src/backend/commands/amcmds.c b/src/backend/commands/amcmds.c
index 29061b8..33e207c 100644
--- a/src/backend/commands/amcmds.c
+++ b/src/backend/commands/amcmds.c
@@ -87,8 +87,7 @@ CreateAccessMethod(CreateAmStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- amoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ amoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
myself.classId = AccessMethodRelationId;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c9f6afe..648520e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1589,18 +1589,15 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
nulls,
replaces);
ReleaseSysCache(oldtup);
- simple_heap_update(sd, &stup->t_self, stup);
+ CatalogUpdateHeapAndIndexes(sd, &stup->t_self, stup);
}
else
{
/* No, insert new tuple */
stup = heap_form_tuple(RelationGetDescr(sd), values, nulls);
- simple_heap_insert(sd, stup);
+ CatalogInsertHeapAndIndexes(sd, stup);
}
- /* update indexes too */
- CatalogUpdateIndexes(sd, stup);
-
heap_freetuple(stup);
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f9309fc..8983cdf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -523,8 +523,7 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
if (indexForm->indisclustered)
{
indexForm->indisclustered = false;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
}
else if (thisIndexOid == indexOid)
{
@@ -532,8 +531,7 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
if (!IndexIsValid(indexForm))
elog(ERROR, "cannot cluster on invalid index %u", indexOid);
indexForm->indisclustered = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
}
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
@@ -1287,13 +1285,17 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
if (!target_is_pg_class)
{
- simple_heap_update(relRelation, &reltup1->t_self, reltup1);
- simple_heap_update(relRelation, &reltup2->t_self, reltup2);
+ bool warm_update1, warm_update2;
+ Bitmapset *modified_attrs1, *modified_attrs2;
+ simple_heap_update(relRelation, &reltup1->t_self, reltup1,
+ &modified_attrs1, &warm_update1);
+ simple_heap_update(relRelation, &reltup2->t_self, reltup2,
+ &modified_attrs2, &warm_update2);
/* Keep system catalogs current */
indstate = CatalogOpenIndexes(relRelation);
- CatalogIndexInsert(indstate, reltup1);
- CatalogIndexInsert(indstate, reltup2);
+ CatalogIndexInsert(indstate, reltup1, modified_attrs1, warm_update1);
+ CatalogIndexInsert(indstate, reltup2, modified_attrs2, warm_update2);
CatalogCloseIndexes(indstate);
}
else
@@ -1558,8 +1560,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
relform->relfrozenxid = frozenXid;
relform->relminmxid = cutoffMulti;
- simple_heap_update(relRelation, &reltup->t_self, reltup);
- CatalogUpdateIndexes(relRelation, reltup);
+ CatalogUpdateHeapAndIndexes(relRelation, &reltup->t_self, reltup);
heap_close(relRelation, RowExclusiveLock);
}
diff --git a/src/backend/commands/comment.c b/src/backend/commands/comment.c
index ada0b03..c250385 100644
--- a/src/backend/commands/comment.c
+++ b/src/backend/commands/comment.c
@@ -199,7 +199,7 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(description), values,
nulls, replaces);
- simple_heap_update(description, &oldtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(description, &oldtuple->t_self, newtuple);
}
break; /* Assume there can be only one match */
@@ -213,15 +213,11 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
{
newtuple = heap_form_tuple(RelationGetDescr(description),
values, nulls);
- simple_heap_insert(description, newtuple);
+ CatalogInsertHeapAndIndexes(description, newtuple);
}
- /* Update indexes, if necessary */
if (newtuple != NULL)
- {
- CatalogUpdateIndexes(description, newtuple);
heap_freetuple(newtuple);
- }
/* Done */
@@ -293,7 +289,7 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(shdescription),
values, nulls, replaces);
- simple_heap_update(shdescription, &oldtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(shdescription, &oldtuple->t_self, newtuple);
}
break; /* Assume there can be only one match */
@@ -307,15 +303,11 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
{
newtuple = heap_form_tuple(RelationGetDescr(shdescription),
values, nulls);
- simple_heap_insert(shdescription, newtuple);
+ CatalogInsertHeapAndIndexes(shdescription, newtuple);
}
- /* Update indexes, if necessary */
if (newtuple != NULL)
- {
- CatalogUpdateIndexes(shdescription, newtuple);
heap_freetuple(newtuple);
- }
/* Done */
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index e9eeacd..f199074 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = castNode(TriggerData, fcinfo->context);
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 949844d..38702e5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2680,6 +2680,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2834,6 +2836,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6ad8fd7..b6ef57d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -546,10 +546,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
HeapTupleSetOid(tuple, dboid);
- simple_heap_insert(pg_database_rel, tuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pg_database_rel, tuple);
+ CatalogInsertHeapAndIndexes(pg_database_rel, tuple);
/*
* Now generate additional catalog entries associated with the new DB
@@ -1040,8 +1037,7 @@ RenameDatabase(const char *oldname, const char *newname)
if (!HeapTupleIsValid(newtup))
elog(ERROR, "cache lookup failed for database %u", db_id);
namestrcpy(&(((Form_pg_database) GETSTRUCT(newtup))->datname), newname);
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
InvokeObjectPostAlterHook(DatabaseRelationId, db_id, 0);
@@ -1296,10 +1292,7 @@ movedb(const char *dbname, const char *tblspcname)
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(pgdbrel),
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pgdbrel, &oldtuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pgdbrel, newtuple);
+ CatalogUpdateHeapAndIndexes(pgdbrel, &oldtuple->t_self, newtuple);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1554,10 +1547,7 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1692,8 +1682,7 @@ AlterDatabaseOwner(const char *dbname, Oid newOwnerId)
}
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 8125537..a5460a3 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -405,8 +405,7 @@ insert_event_trigger_tuple(char *trigname, char *eventname, Oid evtOwner,
/* Insert heap tuple. */
tuple = heap_form_tuple(tgrel->rd_att, values, nulls);
- trigoid = simple_heap_insert(tgrel, tuple);
- CatalogUpdateIndexes(tgrel, tuple);
+ trigoid = CatalogInsertHeapAndIndexes(tgrel, tuple);
heap_freetuple(tuple);
/* Depend on owner. */
@@ -524,8 +523,7 @@ AlterEventTrigger(AlterEventTrigStmt *stmt)
evtForm = (Form_pg_event_trigger) GETSTRUCT(tup);
evtForm->evtenabled = tgenabled;
- simple_heap_update(tgrel, &tup->t_self, tup);
- CatalogUpdateIndexes(tgrel, tup);
+ CatalogUpdateHeapAndIndexes(tgrel, &tup->t_self, tup);
InvokeObjectPostAlterHook(EventTriggerRelationId,
trigoid, 0);
@@ -621,8 +619,7 @@ AlterEventTriggerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an event trigger must be a superuser.")));
form->evtowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(EventTriggerRelationId,
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index f23c697..425d14b 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -1773,8 +1773,7 @@ InsertExtensionTuple(const char *extName, Oid extOwner,
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- extensionOid = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ extensionOid = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -2485,8 +2484,7 @@ pg_extension_config_dump(PG_FUNCTION_ARGS)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
@@ -2663,8 +2661,7 @@ extension_config_remove(Oid extensionoid, Oid tableoid)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
@@ -2844,8 +2841,7 @@ AlterExtensionNamespace(List *names, const char *newschema, Oid *oldschema)
/* Now adjust pg_extension.extnamespace */
extForm->extnamespace = nspOid;
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
heap_close(extRel, RowExclusiveLock);
@@ -3091,8 +3087,7 @@ ApplyExtensionUpdates(Oid extensionOid,
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
values, nulls, repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 6ff8b69..a67dc52 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -256,8 +256,7 @@ AlterForeignDataWrapperOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerI
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignDataWrapperRelationId,
@@ -397,8 +396,7 @@ AlterForeignServerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignServerRelationId, HeapTupleGetOid(tup),
@@ -629,8 +627,7 @@ CreateForeignDataWrapper(CreateFdwStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- fdwId = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ fdwId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -786,8 +783,7 @@ AlterForeignDataWrapper(AlterFdwStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
heap_freetuple(tp);
@@ -941,9 +937,7 @@ CreateForeignServer(CreateForeignServerStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- srvId = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ srvId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -1056,8 +1050,7 @@ AlterForeignServer(AlterForeignServerStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
InvokeObjectPostAlterHook(ForeignServerRelationId, srvId, 0);
@@ -1190,9 +1183,7 @@ CreateUserMapping(CreateUserMappingStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- umId = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ umId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -1307,8 +1298,7 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
ObjectAddressSet(address, UserMappingRelationId, umId);
@@ -1484,8 +1474,7 @@ CreateForeignTable(CreateForeignTableStmt *stmt, Oid relid)
tuple = heap_form_tuple(ftrel->rd_att, values, nulls);
- simple_heap_insert(ftrel, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ CatalogInsertHeapAndIndexes(ftrel, tuple);
heap_freetuple(tuple);
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index ec833c3..c58dc26 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -1292,8 +1292,7 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
procForm->proparallel = interpret_func_parallel(parallel_item);
/* Do the update */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
InvokeObjectPostAlterHook(ProcedureRelationId, funcOid, 0);
@@ -1333,9 +1332,7 @@ SetFunctionReturnType(Oid funcOid, Oid newRetType)
procForm->prorettype = newRetType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateHeapAndIndexes(pg_proc_rel, &tup->t_self, tup);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1368,9 +1365,7 @@ SetFunctionArgType(Oid funcOid, int argIndex, Oid newArgType)
procForm->proargtypes.values[argIndex] = newArgType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateHeapAndIndexes(pg_proc_rel, &tup->t_self, tup);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1656,9 +1651,7 @@ CreateCast(CreateCastStmt *stmt)
tuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- castid = simple_heap_insert(relation, tuple);
-
- CatalogUpdateIndexes(relation, tuple);
+ castid = CatalogInsertHeapAndIndexes(relation, tuple);
/* make dependency entries */
myself.classId = CastRelationId;
@@ -1921,7 +1914,7 @@ CreateTransform(CreateTransformStmt *stmt)
replaces[Anum_pg_transform_trftosql - 1] = true;
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
transformid = HeapTupleGetOid(tuple);
ReleaseSysCache(tuple);
@@ -1930,12 +1923,10 @@ CreateTransform(CreateTransformStmt *stmt)
else
{
newtuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- transformid = simple_heap_insert(relation, newtuple);
+ transformid = CatalogInsertHeapAndIndexes(relation, newtuple);
is_replace = false;
}
- CatalogUpdateIndexes(relation, newtuple);
-
if (is_replace)
deleteDependencyRecordsFor(TransformRelationId, transformid, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ed6136c..0fc77b6 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -694,7 +694,14 @@ DefineIndex(Oid relationId,
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * tries to either insert into it or use it for queries. In addition to
+ * that, WARM updates will be disallowed if an update is modifying one of
+ * the columns used by this new index. This is necessary to ensure that we
+ don't create WARM tuples which do not have a corresponding entry in this
+ index. Note that during the second phase we will index only those heap
+ tuples whose root line pointer is not already in the index, hence it's
+ important that all tuples in a given chain have the same value for any
+ indexed column (including the columns of this new index).
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
@@ -742,7 +749,10 @@ DefineIndex(Oid relationId,
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
- * chain have different index keys.
+ chain have different index keys. Also, the new index is consulted when
+ deciding whether a WARM update is possible, and a WARM update is not
+ done if a column used by this index is being updated. This ensures that
+ we don't create WARM tuples which are not indexed by this index.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
@@ -777,7 +787,8 @@ DefineIndex(Oid relationId,
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
- * insert new entries into the index for insertions and non-HOT updates.
+ insert new entries into the index for insertions and non-HOT updates, or
+ for WARM updates where this index needs a new entry.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index b7daf1c..53661a3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -100,9 +100,7 @@ SetMatViewPopulatedState(Relation relation, bool newstate)
((Form_pg_class) GETSTRUCT(tuple))->relispopulated = newstate;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
-
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateHeapAndIndexes(pgrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index bc43483..adb4a7d 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -278,9 +278,7 @@ CreateOpFamily(char *amname, char *opfname, Oid namespaceoid, Oid amoid)
tup = heap_form_tuple(rel->rd_att, values, nulls);
- opfamilyoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ opfamilyoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -654,9 +652,7 @@ DefineOpClass(CreateOpClassStmt *stmt)
tup = heap_form_tuple(rel->rd_att, values, nulls);
- opclassoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ opclassoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -1327,9 +1323,7 @@ storeOperators(List *opfamilyname, Oid amoid,
tup = heap_form_tuple(rel->rd_att, values, nulls);
- entryoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ entryoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -1438,9 +1432,7 @@ storeProcedures(List *opfamilyname, Oid amoid,
tup = heap_form_tuple(rel->rd_att, values, nulls);
- entryoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ entryoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index a273376..eb6b308 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -518,8 +518,7 @@ AlterOperator(AlterOperatorStmt *stmt)
tup = heap_modify_tuple(tup, RelationGetDescr(catalog),
values, nulls, replaces);
- simple_heap_update(catalog, &tup->t_self, tup);
- CatalogUpdateIndexes(catalog, tup);
+ CatalogUpdateHeapAndIndexes(catalog, &tup->t_self, tup);
address = makeOperatorDependencies(tup, true);
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 5d9d3a6..d1513f7 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -614,10 +614,7 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
new_tuple = heap_modify_tuple(tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
-
- /* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &new_tuple->t_self, new_tuple);
/* Remove all old dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -823,10 +820,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
policy_tuple = heap_form_tuple(RelationGetDescr(pg_policy_rel), values,
isnull);
- policy_id = simple_heap_insert(pg_policy_rel, policy_tuple);
-
- /* Update Indexes */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ policy_id = CatalogInsertHeapAndIndexes(pg_policy_rel, policy_tuple);
/* Record Dependencies */
target.classId = RelationRelationId;
@@ -1150,10 +1144,7 @@ AlterPolicy(AlterPolicyStmt *stmt)
new_tuple = heap_modify_tuple(policy_tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
-
- /* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &new_tuple->t_self, new_tuple);
/* Update Dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -1287,10 +1278,7 @@ rename_policy(RenameStmt *stmt)
namestrcpy(&((Form_pg_policy) GETSTRUCT(policy_tuple))->polname,
stmt->newname);
- simple_heap_update(pg_policy_rel, &policy_tuple->t_self, policy_tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &policy_tuple->t_self, policy_tuple);
InvokeObjectPostAlterHook(PolicyRelationId,
HeapTupleGetOid(policy_tuple), 0);
diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index b684f41..f7fa548 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -378,7 +378,7 @@ create_proc_lang(const char *languageName, bool replace,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ReleaseSysCache(oldtup);
is_update = true;
@@ -387,13 +387,10 @@ create_proc_lang(const char *languageName, bool replace,
{
/* Creating a new language */
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
is_update = false;
}
- /* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
-
/*
* Create dependencies for the new language. If we are updating an
* existing language, first delete any existing pg_depend entries.
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 173b076..57543e4 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -215,8 +215,7 @@ CreatePublication(CreatePublicationStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- puboid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ puboid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
recordDependencyOnOwner(PublicationRelationId, puboid, GetUserId());
@@ -295,8 +294,7 @@ AlterPublicationOptions(AlterPublicationStmt *stmt, Relation rel,
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
CommandCounterIncrement();
@@ -686,8 +684,7 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of a publication must be a superuser.")));
form->pubowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(PublicationRelationId,
diff --git a/src/backend/commands/schemacmds.c b/src/backend/commands/schemacmds.c
index c3b37b2..f49767e 100644
--- a/src/backend/commands/schemacmds.c
+++ b/src/backend/commands/schemacmds.c
@@ -281,8 +281,7 @@ RenameSchema(const char *oldname, const char *newname)
/* rename */
namestrcpy(&(((Form_pg_namespace) GETSTRUCT(tup))->nspname), newname);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
InvokeObjectPostAlterHook(NamespaceRelationId, HeapTupleGetOid(tup), 0);
@@ -417,8 +416,7 @@ AlterSchemaOwner_internal(HeapTuple tup, Relation rel, Oid newOwnerId)
newtuple = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/seclabel.c b/src/backend/commands/seclabel.c
index 324f2e7..7e25411 100644
--- a/src/backend/commands/seclabel.c
+++ b/src/backend/commands/seclabel.c
@@ -299,7 +299,7 @@ SetSharedSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_shseclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_shseclabel),
values, nulls, replaces);
- simple_heap_update(pg_shseclabel, &oldtup->t_self, newtup);
+ CatalogUpdateHeapAndIndexes(pg_shseclabel, &oldtup->t_self, newtup);
}
}
systable_endscan(scan);
@@ -309,15 +309,11 @@ SetSharedSecurityLabel(const ObjectAddress *object,
{
newtup = heap_form_tuple(RelationGetDescr(pg_shseclabel),
values, nulls);
- simple_heap_insert(pg_shseclabel, newtup);
+ CatalogInsertHeapAndIndexes(pg_shseclabel, newtup);
}
- /* Update indexes, if necessary */
if (newtup != NULL)
- {
- CatalogUpdateIndexes(pg_shseclabel, newtup);
heap_freetuple(newtup);
- }
heap_close(pg_shseclabel, RowExclusiveLock);
}
@@ -390,7 +386,7 @@ SetSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_seclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_seclabel),
values, nulls, replaces);
- simple_heap_update(pg_seclabel, &oldtup->t_self, newtup);
+ CatalogUpdateHeapAndIndexes(pg_seclabel, &oldtup->t_self, newtup);
}
}
systable_endscan(scan);
@@ -400,15 +396,12 @@ SetSecurityLabel(const ObjectAddress *object,
{
newtup = heap_form_tuple(RelationGetDescr(pg_seclabel),
values, nulls);
- simple_heap_insert(pg_seclabel, newtup);
+ CatalogInsertHeapAndIndexes(pg_seclabel, newtup);
}
/* Update indexes, if necessary */
if (newtup != NULL)
- {
- CatalogUpdateIndexes(pg_seclabel, newtup);
heap_freetuple(newtup);
- }
heap_close(pg_seclabel, RowExclusiveLock);
}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0c673f5..830b600 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -236,8 +236,7 @@ DefineSequence(ParseState *pstate, CreateSeqStmt *seq)
pgs_values[Anum_pg_sequence_seqcache - 1] = Int64GetDatumFast(seqform.seqcache);
tuple = heap_form_tuple(tupDesc, pgs_values, pgs_nulls);
- simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -504,8 +503,7 @@ AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
relation_close(seqrel, NoLock);
- simple_heap_update(rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, tuple);
heap_close(rel, RowExclusiveLock);
return address;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 41ef7a3..853dcd3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -277,8 +277,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- subid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ subid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
recordDependencyOnOwner(SubscriptionRelationId, subid, owner);
@@ -408,8 +407,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ObjectAddressSet(myself, SubscriptionRelationId, subid);
@@ -588,8 +586,7 @@ AlterSubscriptionOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an subscription must be a superuser.")));
form->subowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(SubscriptionRelationId,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 90f2f7f..f62f8d7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2308,9 +2308,7 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
tuple = heap_form_tuple(desc, values, nulls);
- simple_heap_insert(inhRelation, tuple);
-
- CatalogUpdateIndexes(inhRelation, tuple);
+ CatalogInsertHeapAndIndexes(inhRelation, tuple);
heap_freetuple(tuple);
@@ -2398,10 +2396,7 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
if (classtuple->relhassubclass != relhassubclass)
{
classtuple->relhassubclass = relhassubclass;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
}
else
{
@@ -2592,10 +2587,7 @@ renameatt_internal(Oid myrelid,
/* apply the update */
namestrcpy(&(attform->attname), newattname);
- simple_heap_update(attrelation, &atttup->t_self, atttup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, atttup);
+ CatalogUpdateHeapAndIndexes(attrelation, &atttup->t_self, atttup);
InvokeObjectPostAlterHook(RelationRelationId, myrelid, attnum);
@@ -2902,10 +2894,7 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal)
*/
namestrcpy(&(relform->relname), newrelname);
- simple_heap_update(relrelation, &reltup->t_self, reltup);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(relrelation, reltup);
+ CatalogUpdateHeapAndIndexes(relrelation, &reltup->t_self, reltup);
InvokeObjectPostAlterHookArg(RelationRelationId, myrelid, 0,
InvalidOid, is_internal);
@@ -5097,8 +5086,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
/* Bump the existing child att's inhcount */
childatt->attinhcount++;
- simple_heap_update(attrdesc, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrdesc, tuple);
+ CatalogUpdateHeapAndIndexes(attrdesc, &tuple->t_self, tuple);
heap_freetuple(tuple);
@@ -5191,10 +5179,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
else
((Form_pg_class) GETSTRUCT(reltup))->relnatts = newattnum;
- simple_heap_update(pgclass, &reltup->t_self, reltup);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pgclass, reltup);
+ CatalogUpdateHeapAndIndexes(pgclass, &reltup->t_self, reltup);
heap_freetuple(reltup);
@@ -5630,10 +5615,7 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
{
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = FALSE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
ObjectAddressSubSet(address, RelationRelationId,
RelationGetRelid(rel), attnum);
@@ -5708,10 +5690,7 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
{
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = TRUE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Tell Phase 3 it needs to test the constraint */
tab->new_notnull = true;
@@ -5876,10 +5855,7 @@ ATExecSetStatistics(Relation rel, const char *colName, Node *newValue, LOCKMODE
attrtuple->attstattarget = newtarget;
- simple_heap_update(attrelation, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -5952,8 +5928,7 @@ ATExecSetOptions(Relation rel, const char *colName, Node *options,
repl_val, repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(attrelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -6036,10 +6011,7 @@ ATExecSetStorage(Relation rel, const char *colName, Node *newValue, LOCKMODE loc
errmsg("column data type %s can only have storage PLAIN",
format_type_be(attrtuple->atttypid))));
- simple_heap_update(attrelation, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -6277,10 +6249,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
/* Child column must survive my deletion */
childatt->attinhcount--;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -6296,10 +6265,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
childatt->attinhcount--;
childatt->attislocal = true;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -6343,10 +6309,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
tuple_class = (Form_pg_class) GETSTRUCT(tuple);
tuple_class->relhasoids = false;
- simple_heap_update(class_rel, &tuple->t_self, tuple);
-
- /* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(class_rel, tuple);
+ CatalogUpdateHeapAndIndexes(class_rel, &tuple->t_self, tuple);
heap_close(class_rel, RowExclusiveLock);
@@ -7195,8 +7158,7 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->condeferrable = cmdcon->deferrable;
copy_con->condeferred = cmdcon->initdeferred;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(contuple), 0);
@@ -7249,8 +7211,7 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_tg->tgdeferrable = cmdcon->deferrable;
copy_tg->tginitdeferred = cmdcon->initdeferred;
- simple_heap_update(tgrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(tgrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(tgrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tgtuple), 0);
@@ -7436,8 +7397,7 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(tuple), 0);
@@ -8339,8 +8299,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
{
/* Child constraint must survive my deletion */
con->coninhcount--;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copy_tuple->t_self, copy_tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -8356,8 +8315,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
con->coninhcount--;
con->conislocal = true;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copy_tuple->t_self, copy_tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -9003,10 +8961,7 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
ReleaseSysCache(typeTuple);
- simple_heap_update(attrelation, &heapTup->t_self, heapTup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, heapTup);
+ CatalogUpdateHeapAndIndexes(attrelation, &heapTup->t_self, heapTup);
heap_close(attrelation, RowExclusiveLock);
@@ -9144,8 +9099,7 @@ ATExecAlterColumnGenericOptions(Relation rel,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(attrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(attrel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrel, newtuple);
+ CatalogUpdateHeapAndIndexes(attrel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -9661,8 +9615,7 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
newtuple = heap_modify_tuple(tuple, RelationGetDescr(class_rel), repl_val, repl_null, repl_repl);
- simple_heap_update(class_rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(class_rel, newtuple);
+ CatalogUpdateHeapAndIndexes(class_rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
@@ -9789,8 +9742,7 @@ change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId)
RelationGetDescr(attRelation),
repl_val, repl_null, repl_repl);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attRelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attRelation, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
}
@@ -10067,9 +10019,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateHeapAndIndexes(pgclass, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10126,9 +10076,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateHeapAndIndexes(pgclass, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHookArg(RelationRelationId,
RelationGetRelid(toastrel), 0,
@@ -10289,8 +10237,7 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/* update the pg_class row */
rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace;
rd_rel->relfilenode = newrelfilenode;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10940,8 +10887,7 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ CatalogUpdateHeapAndIndexes(attrrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
}
else
@@ -10980,8 +10926,7 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ CatalogUpdateHeapAndIndexes(attrrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
}
else
@@ -11118,8 +11063,7 @@ MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel)
child_con->conislocal = false;
}
- simple_heap_update(catalog_relation, &child_copy->t_self, child_copy);
- CatalogUpdateIndexes(catalog_relation, child_copy);
+ CatalogUpdateHeapAndIndexes(catalog_relation, &child_copy->t_self, child_copy);
heap_freetuple(child_copy);
found = true;
@@ -11289,8 +11233,7 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
if (copy_att->attinhcount == 0)
copy_att->attislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ CatalogUpdateHeapAndIndexes(catalogRelation, &copyTuple->t_self, copyTuple);
heap_freetuple(copyTuple);
}
}
@@ -11364,8 +11307,7 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
if (copy_con->coninhcount == 0)
copy_con->conislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ CatalogUpdateHeapAndIndexes(catalogRelation, &copyTuple->t_self, copyTuple);
heap_freetuple(copyTuple);
}
}
@@ -11565,8 +11507,7 @@ ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode)
if (!HeapTupleIsValid(classtuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(classtuple))->reloftype = typeid;
- simple_heap_update(relationRelation, &classtuple->t_self, classtuple);
- CatalogUpdateIndexes(relationRelation, classtuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &classtuple->t_self, classtuple);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11610,8 +11551,7 @@ ATExecDropOf(Relation rel, LOCKMODE lockmode)
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->reloftype = InvalidOid;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11651,8 +11591,7 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
if (pg_class_form->relreplident != ri_type)
{
pg_class_form->relreplident = ri_type;
- simple_heap_update(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
- CatalogUpdateIndexes(pg_class, pg_class_tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
}
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(pg_class_tuple);
@@ -11711,8 +11650,7 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
if (dirty)
{
- simple_heap_update(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
- CatalogUpdateIndexes(pg_index, pg_index_tuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
InvalidOid, is_internal);
}
@@ -11861,10 +11799,7 @@ ATExecEnableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = true;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11888,10 +11823,7 @@ ATExecDisableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = false;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11917,10 +11849,7 @@ ATExecForceNoForceRowSecurity(Relation rel, bool force_rls)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relforcerowsecurity = force_rls;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11988,8 +11917,7 @@ ATExecGenericOptions(Relation rel, List *options)
tuple = heap_modify_tuple(tuple, RelationGetDescr(ftrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(ftrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ CatalogUpdateHeapAndIndexes(ftrel, &tuple->t_self, tuple);
/*
* Invalidate relcache so that all sessions will refresh any cached plans
@@ -12284,8 +12212,7 @@ AlterRelationNamespaceInternal(Relation classRel, Oid relOid,
/* classTup is a copy, so OK to scribble on */
classForm->relnamespace = newNspOid;
- simple_heap_update(classRel, &classTup->t_self, classTup);
- CatalogUpdateIndexes(classRel, classTup);
+ CatalogUpdateHeapAndIndexes(classRel, &classTup->t_self, classTup);
/* Update dependency on schema if caller said so */
if (hasDependEntry &&
@@ -13520,8 +13447,7 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
new_val, new_null, new_repl);
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = false;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ CatalogUpdateHeapAndIndexes(classRel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 651e1b3..f3c7436 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -344,9 +344,7 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- tablespaceoid = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ tablespaceoid = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -971,8 +969,7 @@ RenameTableSpace(const char *oldname, const char *newname)
/* OK, update the entry */
namestrcpy(&(newform->spcname), newname);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(TableSpaceRelationId, tspId, 0);
@@ -1044,8 +1041,7 @@ AlterTableSpaceOptions(AlterTableSpaceOptionsStmt *stmt)
repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(TableSpaceRelationId, HeapTupleGetOid(tup), 0);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index f067d0a..1cc67ef 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -773,9 +773,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
/*
* Insert tuple into pg_trigger.
*/
- simple_heap_insert(tgrel, tuple);
-
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogInsertHeapAndIndexes(tgrel, tuple);
heap_freetuple(tuple);
heap_close(tgrel, RowExclusiveLock);
@@ -802,9 +800,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
((Form_pg_class) GETSTRUCT(tuple))->relhastriggers = true;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
-
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateHeapAndIndexes(pgrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
@@ -1444,10 +1440,7 @@ renametrig(RenameStmt *stmt)
namestrcpy(&((Form_pg_trigger) GETSTRUCT(tuple))->tgname,
stmt->newname);
- simple_heap_update(tgrel, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogUpdateHeapAndIndexes(tgrel, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tuple), 0);
@@ -1560,10 +1553,7 @@ EnableDisableTrigger(Relation rel, const char *tgname,
newtrig->tgenabled = fires_when;
- simple_heap_update(tgrel, &newtup->t_self, newtup);
-
- /* Keep catalog indexes current */
- CatalogUpdateIndexes(tgrel, newtup);
+ CatalogUpdateHeapAndIndexes(tgrel, &newtup->t_self, newtup);
heap_freetuple(newtup);
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 479a160..b9929a5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -271,9 +271,7 @@ DefineTSParser(List *names, List *parameters)
tup = heap_form_tuple(prsRel->rd_att, values, nulls);
- prsOid = simple_heap_insert(prsRel, tup);
-
- CatalogUpdateIndexes(prsRel, tup);
+ prsOid = CatalogInsertHeapAndIndexes(prsRel, tup);
address = makeParserDependencies(tup);
@@ -482,9 +480,7 @@ DefineTSDictionary(List *names, List *parameters)
tup = heap_form_tuple(dictRel->rd_att, values, nulls);
- dictOid = simple_heap_insert(dictRel, tup);
-
- CatalogUpdateIndexes(dictRel, tup);
+ dictOid = CatalogInsertHeapAndIndexes(dictRel, tup);
address = makeDictionaryDependencies(tup);
@@ -620,9 +616,7 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
newtup = heap_modify_tuple(tup, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
InvokeObjectPostAlterHook(TSDictionaryRelationId, dictId, 0);
@@ -806,9 +800,7 @@ DefineTSTemplate(List *names, List *parameters)
tup = heap_form_tuple(tmplRel->rd_att, values, nulls);
- tmplOid = simple_heap_insert(tmplRel, tup);
-
- CatalogUpdateIndexes(tmplRel, tup);
+ tmplOid = CatalogInsertHeapAndIndexes(tmplRel, tup);
address = makeTSTemplateDependencies(tup);
@@ -1066,9 +1058,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
tup = heap_form_tuple(cfgRel->rd_att, values, nulls);
- cfgOid = simple_heap_insert(cfgRel, tup);
-
- CatalogUpdateIndexes(cfgRel, tup);
+ cfgOid = CatalogInsertHeapAndIndexes(cfgRel, tup);
if (OidIsValid(sourceOid))
{
@@ -1106,9 +1096,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
newmaptup = heap_form_tuple(mapRel->rd_att, mapvalues, mapnulls);
- simple_heap_insert(mapRel, newmaptup);
-
- CatalogUpdateIndexes(mapRel, newmaptup);
+ CatalogInsertHeapAndIndexes(mapRel, newmaptup);
heap_freetuple(newmaptup);
}
@@ -1409,9 +1397,7 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
newtup = heap_modify_tuple(maptup,
RelationGetDescr(relMap),
repl_val, repl_null, repl_repl);
- simple_heap_update(relMap, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(relMap, newtup);
+ CatalogUpdateHeapAndIndexes(relMap, &newtup->t_self, newtup);
}
}
@@ -1436,8 +1422,7 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
values[Anum_pg_ts_config_map_mapdict - 1] = ObjectIdGetDatum(dictIds[j]);
tup = heap_form_tuple(relMap->rd_att, values, nulls);
- simple_heap_insert(relMap, tup);
- CatalogUpdateIndexes(relMap, tup);
+ CatalogInsertHeapAndIndexes(relMap, tup);
heap_freetuple(tup);
}
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 4c33d55..68e93fc 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -2221,9 +2221,7 @@ AlterDomainDefault(List *names, Node *defaultRaw)
new_record, new_record_nulls,
new_record_repl);
- simple_heap_update(rel, &tup->t_self, newtuple);
-
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, newtuple);
/* Rebuild dependencies */
GenerateTypeDependencies(typTup->typnamespace,
@@ -2360,9 +2358,7 @@ AlterDomainNotNull(List *names, bool notNull)
*/
typTup->typnotnull = notNull;
- simple_heap_update(typrel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(typrel, tup);
+ CatalogUpdateHeapAndIndexes(typrel, &tup->t_self, tup);
InvokeObjectPostAlterHook(TypeRelationId, domainoid, 0);
@@ -2662,8 +2658,7 @@ AlterDomainValidateConstraint(List *names, char *constrName)
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(copyTuple), 0);
@@ -3404,9 +3399,7 @@ AlterTypeOwnerInternal(Oid typeOid, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* If it has an array type, update that too */
if (OidIsValid(typTup->typarray))
@@ -3566,8 +3559,7 @@ AlterTypeNamespaceInternal(Oid typeOid, Oid nspOid,
/* tup is a copy, so we can scribble directly on it */
typform->typnamespace = nspOid;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
}
/*
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index b746982..46e3a66 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -433,8 +433,7 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
/*
* Insert new record in the pg_authid table
*/
- roleid = simple_heap_insert(pg_authid_rel, tuple);
- CatalogUpdateIndexes(pg_authid_rel, tuple);
+ roleid = CatalogInsertHeapAndIndexes(pg_authid_rel, tuple);
/*
* Advance command counter so we can see new record; else tests in
@@ -838,10 +837,7 @@ AlterRole(AlterRoleStmt *stmt)
new_tuple = heap_modify_tuple(tuple, pg_authid_dsc, new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authid_rel, &tuple->t_self, new_tuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pg_authid_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_authid_rel, &tuple->t_self, new_tuple);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1243,9 +1239,7 @@ RenameRole(const char *oldname, const char *newname)
}
newtuple = heap_modify_tuple(oldtuple, dsc, repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &oldtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &oldtuple->t_self, newtuple);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1530,16 +1524,14 @@ AddRoleMems(const char *rolename, Oid roleid,
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogUpdateHeapAndIndexes(pg_authmem_rel, &tuple->t_self, tuple);
ReleaseSysCache(authmem_tuple);
}
else
{
tuple = heap_form_tuple(pg_authmem_dsc,
new_record, new_record_nulls);
- simple_heap_insert(pg_authmem_rel, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogInsertHeapAndIndexes(pg_authmem_rel, tuple);
}
/* CCI after each change, in case there are duplicates in list */
@@ -1647,8 +1639,7 @@ DelRoleMems(const char *rolename, Oid roleid,
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogUpdateHeapAndIndexes(pg_authmem_rel, &tuple->t_self, tuple);
}
ReleaseSysCache(authmem_tuple);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..1388be1 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1032,6 +1032,19 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM
+ * tuple, there could be multiple index entries
+ * pointing to the root of this chain. We can't do
+ * index-only scans for such tuples without re-checking
+ * the index keys. So mark the page as !all_visible
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ break;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, visibility_cutoff_xid))
visibility_cutoff_xid = xmin;
@@ -2158,6 +2171,18 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without re-checking the index keys. So mark
+ * the page as !all_visible
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9920f48..94cf92f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose key columns have changed. All other indexes can use
+ * their existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
@@ -790,6 +803,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index a8bd583..b6c115d 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -399,6 +399,8 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate, false, NULL,
NIL);
@@ -445,6 +447,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +459,30 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple, &modified_attrs, &warm_update);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self,
+ &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f18827d..f81d290 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update,
+ * it's a lossy case
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..c7be366 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,10 +115,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..a1f3440 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..432dd4b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4085,6 +4087,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5194,6 +5197,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5221,6 +5225,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index d7dda6a..7048f73 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -299,8 +299,7 @@ replorigin_create(char *roname)
values[Anum_pg_replication_origin_roname - 1] = roname_d;
tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogInsertHeapAndIndexes(rel, tuple);
CommandCounterIncrement();
break;
}
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index 481868b..33d73c2 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -124,7 +124,7 @@ InsertRule(char *rulname,
tup = heap_modify_tuple(oldtup, RelationGetDescr(pg_rewrite_desc),
values, nulls, replaces);
- simple_heap_update(pg_rewrite_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &tup->t_self, tup);
ReleaseSysCache(oldtup);
@@ -135,11 +135,9 @@ InsertRule(char *rulname,
{
tup = heap_form_tuple(pg_rewrite_desc->rd_att, values, nulls);
- rewriteObjectId = simple_heap_insert(pg_rewrite_desc, tup);
+ rewriteObjectId = CatalogInsertHeapAndIndexes(pg_rewrite_desc, tup);
}
- /* Need to update indexes in either case */
- CatalogUpdateIndexes(pg_rewrite_desc, tup);
heap_freetuple(tup);
@@ -613,8 +611,7 @@ DefineQueryRewrite(char *rulename,
classForm->relminmxid = InvalidMultiXactId;
classForm->relreplident = REPLICA_IDENTITY_NOTHING;
- simple_heap_update(relationRelation, &classTup->t_self, classTup);
- CatalogUpdateIndexes(relationRelation, classTup);
+ CatalogUpdateHeapAndIndexes(relationRelation, &classTup->t_self, classTup);
heap_freetuple(classTup);
heap_close(relationRelation, RowExclusiveLock);
@@ -866,10 +863,7 @@ EnableDisableRule(Relation rel, const char *rulename,
{
((Form_pg_rewrite) GETSTRUCT(ruletup))->ev_enabled =
CharGetDatum(fires_when);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &ruletup->t_self, ruletup);
changed = true;
}
@@ -985,10 +979,7 @@ RenameRewriteRule(RangeVar *relation, const char *oldName,
/* OK, do the update */
namestrcpy(&(ruleform->rulename), newName);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &ruletup->t_self, ruletup);
heap_freetuple(ruletup);
heap_close(pg_rewrite_desc, RowExclusiveLock);
diff --git a/src/backend/rewrite/rewriteSupport.c b/src/backend/rewrite/rewriteSupport.c
index 0154072..fc76fab 100644
--- a/src/backend/rewrite/rewriteSupport.c
+++ b/src/backend/rewrite/rewriteSupport.c
@@ -72,10 +72,7 @@ SetRelationRuleStatus(Oid relationId, bool relHasRules)
/* Do the update */
classForm->relhasrules = relHasRules;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
-
- /* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
}
else
{
diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 262b0b2..de35e03 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -678,8 +678,7 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogUpdateHeapAndIndexes(lo_heap_r, &newtup->t_self, newtup);
heap_freetuple(newtup);
/*
@@ -721,8 +720,7 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
values[Anum_pg_largeobject_pageno - 1] = Int32GetDatum(pageno);
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
- simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(lo_heap_r, newtup);
heap_freetuple(newtup);
}
pageno++;
@@ -850,8 +848,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogUpdateHeapAndIndexes(lo_heap_r, &newtup->t_self, newtup);
heap_freetuple(newtup);
}
else
@@ -888,8 +885,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
values[Anum_pg_largeobject_pageno - 1] = Int32GetDatum(pageno);
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
- simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(lo_heap_r, newtup);
heap_freetuple(newtup);
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 26ff7e1..43781fb 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -3484,8 +3485,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
classform->relminmxid = minmulti;
classform->relpersistence = persistence;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_freetuple(tuple);
@@ -4352,6 +4352,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
@@ -4757,14 +4764,18 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
+ Bitmapset *indxnotreadyattrs; /* columns in not ready indexes */
List *indexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4779,6 +4790,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return bms_copy(relation->rd_indxnotreadyattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4819,9 +4834,11 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
+ indxnotreadyattrs = NULL;
foreach(l, indexoidlist)
{
Oid indexOid = lfirst_oid(l);
@@ -4858,6 +4875,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_member(indxnotreadyattrs,
+ attrnum - FirstLowInvalidHeapAttributeNumber);
+
if (isKey)
uindexattrs = bms_add_member(uindexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
@@ -4873,25 +4894,51 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_members(indxnotreadyattrs,
+ exprindexattrs);
+
+ /*
+ * Check whether the index AM defines the amrecheck method. If it does
+ * not, the index cannot support WARM updates, so completely disable
+ * WARM updates on such tables.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
+
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
relation->rd_pkattr = NULL;
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
+ bms_free(relation->rd_indxnotreadyattr);
+ relation->rd_indxnotreadyattr = NULL;
/*
* Now save copies of the bitmaps in the relcache entry. We intentionally
@@ -4904,7 +4951,9 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
+ relation->rd_indxnotreadyattr = bms_copy(indxnotreadyattrs);
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4918,6 +4967,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return indxnotreadyattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
@@ -5530,6 +5583,7 @@ load_relcache_init_file(bool shared)
rel->rd_keyattr = NULL;
rel->rd_pkattr = NULL;
rel->rd_idattr = NULL;
+ rel->rd_indxnotreadyattr = NULL;
rel->rd_pubactions = NULL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e91e41d..34430a9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -150,6 +151,10 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -213,6 +218,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WARM */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 69a3873..3e14023 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -364,4 +364,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 95aa976..9412c3a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -161,7 +162,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -176,7 +178,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ Bitmapset **modified_attrs,
+ bool *warm_update);
extern void heap_sync(Relation relation);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a4a1fe1..b4238e5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7552186..ddbdbcd 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
/*
* Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
* the TID of the tuple itself in t_ctid field to mark the end of the chain.
@@ -785,6 +801,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..98129d6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -750,6 +750,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce3ca8d..12d3b0c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -112,7 +112,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a3635a4..7e29df3 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -31,8 +31,13 @@ typedef struct ResultRelInfo *CatalogIndexState;
extern CatalogIndexState CatalogOpenIndexes(Relation heapRel);
extern void CatalogCloseIndexes(CatalogIndexState indstate);
extern void CatalogIndexInsert(CatalogIndexState indstate,
- HeapTuple heapTuple);
-extern void CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple);
+ HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update);
+extern void CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update);
+extern void CatalogUpdateHeapAndIndexes(Relation heapRel, ItemPointer otid,
+ HeapTuple tup);
+extern Oid CatalogInsertHeapAndIndexes(Relation heapRel, HeapTuple tup);
/*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 05652e8..c132b10 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..2c4d884 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -37,5 +37,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..07f2900 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -62,6 +62,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..ee635be 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1177,7 +1179,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7c..fbac7c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -138,9 +138,14 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
+ Bitmapset *rd_indxnotreadyattr; /* columns used by indexes not yet
+ ready */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm; /* True if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index da36b67..d18bd09 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -50,7 +50,9 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE,
+ INDEX_ATTR_BITMAP_NOTREADY
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index de5ae00..7656e6e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1728,6 +1728,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1871,6 +1872,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1914,6 +1916,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1951,7 +1954,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1967,7 +1971,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1989,7 +1994,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa3bb7
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- This should be a HOT update since only a non-index column is updated, but
+-- the page won't have any free space, so it will probably be a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=72)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=4)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1 (cost=0.29..9.16 rows=50 width=4)
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2 (cost=0.14..4.16 rows=1 width=4)
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on updtst_tab3 (cost=0.00..2.25 rows=1 width=4)
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3 (cost=0.14..8.16 rows=1 width=4)
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index edeb2d6..2268705 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..b73c278
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,172 @@
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This should be a HOT update since only a non-index column is updated, but
+-- the page won't have any free space, so it will probably be a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both the old and new values using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
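As an aside (not part of the test file above), one way to sanity-check whether the updates in these tests really were HOT is to look at the table's cumulative statistics; a minimal sketch using the standard pg_stat_user_tables view (the counters are maintained asynchronously, so they may lag the updates slightly):

-- hypothetical check, run in the same session after the updates
SELECT n_tup_upd, n_tup_hot_upd
  FROM pg_stat_user_tables
 WHERE relname = 'updtst_tab1';
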
Attachment: 0001_track_root_lp_v10.patch (application/octet-stream)
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index 84447f0..5149c07 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2384,6 +2385,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2422,8 +2424,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
- RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup,
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
+
+ /* We must not overwrite the speculative insertion token. */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2651,6 +2658,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2721,7 +2729,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
+
+ /* Mark this tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2729,7 +2742,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
+ /* Mark each tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -3001,6 +3017,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3011,6 +3028,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3052,7 +3070,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3182,7 +3201,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain.
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3231,6 +3260,22 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+	 * heap_get_root_tuple() may call palloc, which is disallowed once we
+	 * enter the critical section. So check if the root offset is cached in the
+	 * tuple and, if not, fetch that information the hard way before entering
+	 * the critical section.
+	 *
+	 * Most often, unless we are dealing with a pg-upgraded cluster, the
+	 * root offset information should already be cached, so there should not be
+	 * much overhead in fetching it. Also, once a tuple is updated, the
+	 * information is copied to the new version, so we are not going to pay
+	 * this price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tp.t_self));
+
START_CRIT_SECTION();
/*
@@ -3258,8 +3303,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain. */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3460,6 +3507,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3522,6 +3571,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3806,7 +3856,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3946,6 +4001,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3973,6 +4029,14 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available.
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self));
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3987,7 +4051,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4145,6 +4210,10 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4170,6 +4239,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+	 * the information must be obtained the hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4177,10 +4257,22 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
+ root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+	 * we're doing a HOT/WARM update, then we just copy the information from
+	 * the old tuple (if available) or use the value computed above. For regular
+	 * updates, RelationPutHeapTuple will have returned the actual offset number
+	 * where the new version was inserted, and we store that value since the
+ * update resulted in a new HOT-chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4193,7 +4285,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4232,6 +4324,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4512,7 +4605,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4521,9 +4615,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4543,6 +4639,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4570,7 +4667,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5008,7 +5109,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5056,6 +5162,10 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self));
+
START_CRIT_SECTION();
/*
@@ -5084,7 +5194,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5598,6 +5711,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5606,6 +5720,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5835,7 +5951,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5844,7 +5960,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5961,7 +6077,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6087,8 +6203,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7436,6 +7551,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7556,6 +7672,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8210,7 +8329,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, xlrec->offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8300,7 +8425,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8435,8 +8561,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8572,7 +8698,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8705,13 +8831,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored.
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8774,6 +8904,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself. */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8837,11 +8970,17 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git b/src/backend/access/heap/hio.c a/src/backend/access/heap/hio.c
index 6529fe3..8052519 100644
--- b/src/backend/access/heap/hio.c
+++ a/src/backend/access/heap/hio.c
@@ -31,12 +31,20 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it
+ * is known. The former is used while updating an existing tuple, where the
+ * caller tells us about the root line pointer of the chain. The latter is used
+ * during insertion of a new row, in which case the root line pointer is set to
+ * the offset where this tuple is inserted.
*/
-void
+OffsetNumber
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,17 +68,24 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead).
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number. */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(root_offnum))
+ root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, root_offnum);
}
+
+ return root_offnum;
}
/*
diff --git b/src/backend/access/heap/pruneheap.c a/src/backend/access/heap/pruneheap.c
index d69a266..f54337c 100644
--- b/src/backend/access/heap/pruneheap.c
+++ a/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple to be the last tuple
+ * in the chain, without clearing the HOT-updated flag. So we must
+ * check if this is the last tuple in the chain and stop following the
+ * CTID, else we risk getting into an infinite recursion (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,47 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is InvalidOffsetNumber, the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line pointers
+ * and tuples on the page in order to find the root line pointers. To minimize
+ * the cost, we break early if target_offnum is specified and the root line
+ * pointer for target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +807,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return.
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+			 * root_offsets array may not even have room for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping. */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +869,65 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return.
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+			 * root_offsets array may not even have room for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item. */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple to be the last tuple in the chain
+ * and store root offset in CTID, without clearing the HOT-updated
+ * flag. So we must check if CTID is actually root offset and break
+ * to avoid infinite recursion.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple.
+ */
+OffsetNumber
+heap_get_root_tuple(Page page, OffsetNumber target_offnum)
+{
+ OffsetNumber offnum = InvalidOffsetNumber;
+ heap_get_root_tuples_internal(page, target_offnum, &offnum);
+ return offnum;
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+	heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+								  root_offsets);
+}
diff --git b/src/backend/access/heap/rewriteheap.c a/src/backend/access/heap/rewriteheap.c
index 90ab6f2..e11b4a2 100644
--- b/src/backend/access/heap/rewriteheap.c
+++ a/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain.
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest.
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git b/src/backend/executor/execIndexing.c a/src/backend/executor/execIndexing.c
index 8d119f6..9920f48 100644
--- b/src/backend/executor/execIndexing.c
+++ a/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git b/src/backend/executor/execMain.c a/src/backend/executor/execMain.c
index 3a5b5b2..12476e7 100644
--- b/src/backend/executor/execMain.c
+++ a/src/backend/executor/execMain.c
@@ -2589,7 +2589,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2597,7 +2597,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git b/src/include/access/heapam.h a/src/include/access/heapam.h
index a864f78..95aa976 100644
--- b/src/include/access/heapam.h
+++ a/src/include/access/heapam.h
@@ -189,6 +189,7 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern OffsetNumber heap_get_root_tuple(Page page, OffsetNumber target_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git b/src/include/access/heapam_xlog.h a/src/include/access/heapam_xlog.h
index 52f28b8..a4a1fe1 100644
--- b/src/include/access/heapam_xlog.h
+++ a/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git b/src/include/access/hio.h a/src/include/access/hio.h
index 2824f23..921cb37 100644
--- b/src/include/access/hio.h
+++ a/src/include/access/hio.h
@@ -35,8 +35,8 @@ typedef struct BulkInsertStateData
} BulkInsertStateData;
-extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git b/src/include/access/htup_details.h a/src/include/access/htup_details.h
index a6c7e31..7552186 100644
--- b/src/include/access/htup_details.h
+++ a/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,43 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+/*
+ * Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
+ * the TID of the tuple itself in t_ctid field to mark the end of the chain.
+ * But starting PG v10, we use a special flag HEAP_LATEST_TUPLE to identify the
+ * last tuple and store the root line pointer of the HOT chain in t_ctid field
+ * instead.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * Starting from PostgreSQL 10, the latest tuple in an update chain has
+ * HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do not.
+ * For those, we determine whether a tuple is latest by testing that its t_ctid
+ * points to itself.
+ *
+ * Note: beware of multiple evaluations of "tup" and "tid" arguments.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +585,56 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * now have a new tuple in the chain and this is no longer the last tuple of
+ * the chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller must have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+/*
+ * Get the root line pointer of the HOT chain. The caller should have confirmed
+ * that the root offset is cached before calling this macro.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * Return whether the tuple has a cached root offset. We don't use
+ * HeapTupleHeaderIsHeapLatest because that one also considers the case of
+ * t_ctid pointing to itself, for tuples migrated from pre v10 clusters. Here
+ * we are only interested in the tuples which are marked with HEAP_LATEST_TUPLE
+ * flag.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
Attachment: 0000_interesting_attrs.patch (application/octet-stream)
commit 5fc696cf695f3bc488ba8f4544166b1be44998e3
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sun Jan 1 16:29:10 2017 +0530
Alvaro's patch on interesting attrs
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5fd7f1e..84447f0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -95,11 +95,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3454,6 +3451,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3471,9 +3470,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3500,21 +3496,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3535,7 +3540,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3561,6 +3566,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3572,10 +3581,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3814,6 +3820,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4118,7 +4126,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4133,7 +4141,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4281,13 +4291,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4321,7 +4333,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4366,114 +4378,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
- *
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
-
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
+ int attnum;
+ Bitmapset *modified = NULL;
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
-
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
Pavan Deolasee wrote:
On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
The simple_heap_update + CatalogUpdateIndexes pattern is getting
obnoxious. How about creating something like catalog_heap_update which
does both things at once, and stop bothering each callsite with the WARM
stuff?

What I realised is that there are really 2 patterns:
1. simple_heap_insert, CatalogUpdateIndexes
2. simple_heap_update, CatalogUpdateIndexes

There are only a couple of places where we already have indexes open or have
more than one tuple to update, so we call CatalogIndexInsert directly. What
I ended up doing in the attached patch is to add two new APIs which combine
the two steps of each of these patterns. It seems much cleaner to me and
also less buggy for future users. I hope I am not missing a reason not to
combine these steps.
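
For concreteness, the repeated pattern under discussion looks roughly like this at a typical catalog call site (a sketch only; the combined helper name is the one used in the patch posted later in the thread, not something already in core):

/* current pattern, repeated at many catalog call sites */
simple_heap_update(rel, &newtuple->t_self, newtuple);
CatalogUpdateIndexes(rel, newtuple);

/* proposed combined form */
CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
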
CatalogUpdateIndexes was just added as a convenience function on top of
a very common pattern. If we now have a reason to create a second one
because there are now two very common patterns, it seems reasonable to
have two functions. I think I would commit the refactoring to create
these functions ahead of the larger WARM patch, since I think it'd be
bulky and largely mechanical. (I'm going from this description; didn't
read your actual code.)
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+	AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+	ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)

Actually, I think this macro could just return the TID so that it can be
used as struct assignment, just like ItemPointerCopy does internally --
callers can do

ctid = HeapTupleHeaderGetNextTid(tup);

While I agree with your proposal, I wonder why we have ItemPointerCopy() in
the first place because we freely copy TIDs as struct assignment. Is there
a reason for that? And if there is, does it impact this specific case?

I dunno. This macro is present in our very first commit d31084e9d1118b.
Maybe it's an artifact from the Lisp to C conversion. Even then, we had
some cases of iptrs being copied by struct assignment, so it's not like
it didn't work. Perhaps somebody envisioned that the internal details
could change, but that hasn't happened in two decades so why should we
worry about it now? If somebody needs it later, it can be changed then.
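
For illustration, a value-returning variant along the lines suggested above might look roughly like this (a sketch only, relying on plain struct assignment of ItemPointerData; it is not what the posted patch implements, and the name is hypothetical to avoid clashing with the macro):

/* hypothetical value-returning variant of the macro discussed above */
static inline ItemPointerData
HeapTupleHeaderGetNextTidVal(HeapTupleHeader tup)
{
	Assert(!(tup->t_infomask2 & HEAP_LATEST_TUPLE));
	return tup->t_ctid;		/* struct assignment does the copy at the caller */
}

so that callers can simply write ctid = HeapTupleHeaderGetNextTidVal(tup);
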
There is one issue that bothers me. The current implementation lacks the
ability to convert WARM chains back into HOT chains. The README.WARM has a
proposal to do that, but it requires an additional free bit in the tuple
header (which we don't have) and, of course, it needs to be vetted and
implemented. If the heap ends up with many WARM tuples, then index-only scans
will become ineffective, because an index-only scan cannot skip a heap page if
it contains a WARM tuple. Alternate ideas/suggestions and review of the design
are welcome!
t_infomask2 contains one last unused bit, and we could reuse vacuum
full's bits (HEAP_MOVED_OUT, HEAP_MOVED_IN), but that will need some
thinking ahead. Maybe now's the time to start versioning relations so
that we can ensure clusters upgraded to pg10 do not contain any of those
bits in any tuple headers.
I don't have any ideas regarding the estate passed to recheck yet --
haven't looked at the callsites in detail. I'll give this another look
later.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 31, 2017 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Pavan Deolasee wrote:
On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera <
alvherre@2ndquadrant.com>
wrote:
The simple_heap_update + CatalogUpdateIndexes pattern is getting
obnoxious. How about creating something like catalog_heap_update which
does both things at once, and stop bothering each callsite with the WARM
stuff?
What I realised is that there are really 2 patterns:
1. simple_heap_insert, CatalogUpdateIndexes
2. simple_heap_update, CatalogUpdateIndexes

There are only a couple of places where we already have indexes open or
have more than one tuple to update, so we call CatalogIndexInsert directly.
What I ended up doing in the attached patch is to add two new APIs which
combine the two steps of each of these patterns. It seems much cleaner to
me and also less buggy for future users. I hope I am not missing a reason
not to combine these steps.

CatalogUpdateIndexes was just added as a convenience function on top of
a very common pattern. If we now have a reason to create a second one
because there are now two very common patterns, it seems reasonable to
have two functions. I think I would commit the refactoring to create
these functions ahead of the larger WARM patch, since I think it'd be
bulky and largely mechanical. (I'm going from this description; didn't
read your actual code.)
Sounds good. Should I submit that as a separate patch on current master?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee wrote:
On Tue, Jan 31, 2017 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
CatalogUpdateIndexes was just added as a convenience function on top of
a very common pattern. If we now have a reason to create a second one
because there are now two very common patterns, it seems reasonable to
have two functions. I think I would commit the refactoring to create
these functions ahead of the larger WARM patch, since I think it'd be
bulky and largely mechanical. (I'm going from this description; didn't
read your actual code.)

Sounds good. Should I submit that as a separate patch on current master?
Yes, please.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 31, 2017 at 7:37 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Pavan Deolasee wrote:
Sounds good. Should I submit that as a separate patch on current master?
Yes, please.
Attached.
Two new APIs added:
- CatalogInsertHeapAndIndexes, which does a simple_heap_insert followed by
the catalog index updates
- CatalogUpdateHeapAndIndexes, which does a simple_heap_update followed by
the catalog index updates
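For illustration, a typical callsite changes roughly like this (a sketch
distilled from the attached diff; rel, tup and oid stand for whatever
relation, tuple and Oid variables the callsite already has):

/* before: every callsite pairs the heap operation with an index update */
oid = simple_heap_insert(rel, tup);
CatalogUpdateIndexes(rel, tup);

simple_heap_update(rel, &tup->t_self, tup);
CatalogUpdateIndexes(rel, tup);

/* after: one combined call per pattern */
oid = CatalogInsertHeapAndIndexes(rel, tup);

CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);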
There are only a handful of callers remaining for simple_heap_insert/update
after this patch. They typically work with already opened indexes and hence
I left them unchanged.
make check-world passes with the patch.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
catalog_update.patch (application/octet-stream)
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 00a9aea..477f450 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1252,7 +1252,7 @@ SetDefaultACL(InternalDefaultACL *iacls)
values[Anum_pg_default_acl_defaclacl - 1] = PointerGetDatum(new_acl);
newtuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, newtuple);
+ CatalogInsertHeapAndIndexes(rel, newtuple);
}
else
{
@@ -1262,12 +1262,9 @@ SetDefaultACL(InternalDefaultACL *iacls)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
values, nulls, replaces);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
}
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(rel, newtuple);
-
/* these dependencies don't change in an update */
if (isNew)
{
@@ -1697,10 +1694,7 @@ ExecGrant_Attribute(InternalGrant *istmt, Oid relOid, const char *relname,
newtuple = heap_modify_tuple(attr_tuple, RelationGetDescr(attRelation),
values, nulls, replaces);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(attRelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attRelation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, attnum,
@@ -1963,10 +1957,7 @@ ExecGrant_Relation(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(relOid, RelationRelationId, 0, new_acl);
@@ -2156,10 +2147,7 @@ ExecGrant_Database(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update the shared dependency ACL info */
updateAclDependencies(DatabaseRelationId, HeapTupleGetOid(tuple), 0,
@@ -2281,10 +2269,7 @@ ExecGrant_Fdw(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(fdwid, ForeignDataWrapperRelationId, 0,
@@ -2410,10 +2395,7 @@ ExecGrant_ForeignServer(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(srvid, ForeignServerRelationId, 0, new_acl);
@@ -2537,10 +2519,7 @@ ExecGrant_Function(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(funcId, ProcedureRelationId, 0, new_acl);
@@ -2671,10 +2650,7 @@ ExecGrant_Language(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(langId, LanguageRelationId, 0, new_acl);
@@ -2813,10 +2789,7 @@ ExecGrant_Largeobject(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation),
values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(loid, LargeObjectRelationId, 0, new_acl);
@@ -2941,10 +2914,7 @@ ExecGrant_Namespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(nspid, NamespaceRelationId, 0, new_acl);
@@ -3068,10 +3038,7 @@ ExecGrant_Tablespace(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update the shared dependency ACL info */
updateAclDependencies(TableSpaceRelationId, tblId, 0,
@@ -3205,10 +3172,7 @@ ExecGrant_Type(InternalGrant *istmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
/* Update initial privileges for extensions */
recordExtensionInitPriv(typId, TypeRelationId, 0, new_acl);
@@ -5751,10 +5715,7 @@ recordExtensionInitPrivWorker(Oid objoid, Oid classoid, int objsubid, Acl *new_a
oldtuple = heap_modify_tuple(oldtuple, RelationGetDescr(relation),
values, nulls, replace);
- simple_heap_update(relation, &oldtuple->t_self, oldtuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, oldtuple);
+ CatalogUpdateHeapAndIndexes(relation, &oldtuple->t_self, oldtuple);
}
else
/* new_acl is NULL, so delete the entry we found. */
@@ -5788,10 +5749,7 @@ recordExtensionInitPrivWorker(Oid objoid, Oid classoid, int objsubid, Acl *new_a
tuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- simple_heap_insert(relation, tuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relation, tuple);
+ CatalogInsertHeapAndIndexes(relation, tuple);
}
}
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7ce9115..de2ba12 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -824,9 +824,7 @@ InsertPgClassTuple(Relation pg_class_desc,
HeapTupleSetOid(tup, new_rel_oid);
/* finally insert the new tuple, update the indexes, and clean up */
- simple_heap_insert(pg_class_desc, tup);
-
- CatalogUpdateIndexes(pg_class_desc, tup);
+ CatalogInsertHeapAndIndexes(pg_class_desc, tup);
heap_freetuple(tup);
}
@@ -1599,10 +1597,7 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
"........pg.dropped.%d........", attnum);
namestrcpy(&(attStruct->attname), newattname);
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
}
/*
@@ -1731,10 +1726,7 @@ RemoveAttrDefaultById(Oid attrdefId)
((Form_pg_attribute) GETSTRUCT(tuple))->atthasdef = false;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/*
* Our update of the pg_attribute row will force a relcache rebuild, so
@@ -1932,9 +1924,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
adrel = heap_open(AttrDefaultRelationId, RowExclusiveLock);
tuple = heap_form_tuple(adrel->rd_att, values, nulls);
- attrdefOid = simple_heap_insert(adrel, tuple);
-
- CatalogUpdateIndexes(adrel, tuple);
+ attrdefOid = CatalogInsertHeapAndIndexes(adrel, tuple);
defobject.classId = AttrDefaultRelationId;
defobject.objectId = attrdefOid;
@@ -1964,9 +1954,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum,
if (!attStruct->atthasdef)
{
attStruct->atthasdef = true;
- simple_heap_update(attrrel, &atttup->t_self, atttup);
- /* keep catalog indexes current */
- CatalogUpdateIndexes(attrrel, atttup);
+ CatalogUpdateHeapAndIndexes(attrrel, &atttup->t_self, atttup);
}
heap_close(attrrel, RowExclusiveLock);
heap_freetuple(atttup);
@@ -2561,8 +2549,7 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
Assert(is_local);
con->connoinherit = true;
}
- simple_heap_update(conDesc, &tup->t_self, tup);
- CatalogUpdateIndexes(conDesc, tup);
+ CatalogUpdateHeapAndIndexes(conDesc, &tup->t_self, tup);
break;
}
}
@@ -2602,10 +2589,7 @@ SetRelationNumChecks(Relation rel, int numchecks)
{
relStruct->relchecks = numchecks;
- simple_heap_update(relrel, &reltup->t_self, reltup);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(relrel, reltup);
+ CatalogUpdateHeapAndIndexes(relrel, &reltup->t_self, reltup);
}
else
{
@@ -3145,10 +3129,7 @@ StorePartitionKey(Relation rel,
tuple = heap_form_tuple(RelationGetDescr(pg_partitioned_table), values, nulls);
- simple_heap_insert(pg_partitioned_table, tuple);
-
- /* Update the indexes on pg_partitioned_table */
- CatalogUpdateIndexes(pg_partitioned_table, tuple);
+ CatalogInsertHeapAndIndexes(pg_partitioned_table, tuple);
heap_close(pg_partitioned_table, RowExclusiveLock);
/* Mark this relation as dependent on a few things as follows */
@@ -3265,8 +3246,7 @@ StorePartitionBound(Relation rel, Relation parent, Node *bound)
new_val, new_null, new_repl);
/* Also set the flag */
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = true;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ CatalogUpdateHeapAndIndexes(classRel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 26cbc0e..33ca96a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -649,10 +649,7 @@ UpdateIndexRelation(Oid indexoid,
/*
* insert the tuple into the pg_index catalog
*/
- simple_heap_insert(pg_index, tuple);
-
- /* update the indexes on pg_index */
- CatalogUpdateIndexes(pg_index, tuple);
+ CatalogInsertHeapAndIndexes(pg_index, tuple);
/*
* close the relation and free the tuple
@@ -1324,8 +1321,7 @@ index_constraint_create(Relation heapRelation,
if (dirty)
{
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
InvokeObjectPostAlterHookArg(IndexRelationId, indexRelationId, 0,
InvalidOid, is_internal);
@@ -2103,8 +2099,7 @@ index_build(Relation heapRelation,
Assert(!indexForm->indcheckxmin);
indexForm->indcheckxmin = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
heap_freetuple(indexTuple);
heap_close(pg_index, RowExclusiveLock);
@@ -3448,8 +3443,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
indexForm->indisvalid = true;
indexForm->indisready = true;
indexForm->indislive = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
/*
* Invalidate the relcache for the table, so that after we commit
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 1915ca3..bad9fb0 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -162,3 +162,31 @@ CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple)
CatalogIndexInsert(indstate, heapTuple);
CatalogCloseIndexes(indstate);
}
+
+/*
+ * A convenience routine which updates the heap tuple (identified by otid) with
+ * tup and also updates all indexes on the table.
+ */
+void
+CatalogUpdateHeapAndIndexes(Relation heapRel, ItemPointer otid, HeapTuple tup)
+{
+ simple_heap_update(heapRel, otid, tup);
+
+ /* Make sure only indexes whose columns are modified receive new entries */
+ CatalogUpdateIndexes(heapRel, tup);
+}
+
+/*
+ * A convenience routine which inserts a new heap tuple and also updates all
+ * indexes on the table.
+ *
+ * The Oid of the inserted tuple is returned.
+ */
+Oid
+CatalogInsertHeapAndIndexes(Relation heapRel, HeapTuple tup)
+{
+ Oid oid;
+ oid = simple_heap_insert(heapRel, tup);
+ CatalogUpdateIndexes(heapRel, tup);
+ return oid;
+}
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 3a4e22f..9cab585 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -674,9 +674,7 @@ AggregateCreate(const char *aggName,
tupDesc = aggdesc->rd_att;
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(aggdesc, tup);
-
- CatalogUpdateIndexes(aggdesc, tup);
+ CatalogInsertHeapAndIndexes(aggdesc, tup);
heap_close(aggdesc, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index 694c0f6..ebaf3fd 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -134,12 +134,9 @@ CollationCreate(const char *collname, Oid collnamespace,
tup = heap_form_tuple(tupDesc, values, nulls);
/* insert a new tuple */
- oid = simple_heap_insert(rel, tup);
+ oid = CatalogInsertHeapAndIndexes(rel, tup);
Assert(OidIsValid(oid));
- /* update the index if any */
- CatalogUpdateIndexes(rel, tup);
-
/* set up dependencies for the new collation */
myself.classId = CollationRelationId;
myself.objectId = oid;
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index b5a0ce9..9509cac 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -226,10 +226,7 @@ CreateConstraintEntry(const char *constraintName,
tup = heap_form_tuple(RelationGetDescr(conDesc), values, nulls);
- conOid = simple_heap_insert(conDesc, tup);
-
- /* update catalog indexes */
- CatalogUpdateIndexes(conDesc, tup);
+ conOid = CatalogInsertHeapAndIndexes(conDesc, tup);
conobject.classId = ConstraintRelationId;
conobject.objectId = conOid;
@@ -584,9 +581,7 @@ RemoveConstraintById(Oid conId)
RelationGetRelationName(rel));
classForm->relchecks--;
- simple_heap_update(pgrel, &relTup->t_self, relTup);
-
- CatalogUpdateIndexes(pgrel, relTup);
+ CatalogUpdateHeapAndIndexes(pgrel, &relTup->t_self, relTup);
heap_freetuple(relTup);
@@ -666,10 +661,7 @@ RenameConstraintById(Oid conId, const char *newname)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(con->conname), newname);
- simple_heap_update(conDesc, &tuple->t_self, tuple);
-
- /* update the system catalog indexes */
- CatalogUpdateIndexes(conDesc, tuple);
+ CatalogUpdateHeapAndIndexes(conDesc, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(ConstraintRelationId, conId, 0);
@@ -736,8 +728,7 @@ AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
conform->connamespace = newNspId;
- simple_heap_update(conRel, &tup->t_self, tup);
- CatalogUpdateIndexes(conRel, tup);
+ CatalogUpdateHeapAndIndexes(conRel, &tup->t_self, tup);
/*
* Note: currently, the constraint will not have its own
diff --git a/src/backend/catalog/pg_conversion.c b/src/backend/catalog/pg_conversion.c
index adaf7b8..a942e02 100644
--- a/src/backend/catalog/pg_conversion.c
+++ b/src/backend/catalog/pg_conversion.c
@@ -105,10 +105,7 @@ ConversionCreate(const char *conname, Oid connamespace,
tup = heap_form_tuple(tupDesc, values, nulls);
/* insert a new tuple */
- simple_heap_insert(rel, tup);
-
- /* update the index if any */
- CatalogUpdateIndexes(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
myself.classId = ConversionRelationId;
myself.objectId = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_db_role_setting.c b/src/backend/catalog/pg_db_role_setting.c
index 117cc8d..c206b03 100644
--- a/src/backend/catalog/pg_db_role_setting.c
+++ b/src/backend/catalog/pg_db_role_setting.c
@@ -88,10 +88,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -129,10 +126,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
}
else
simple_heap_delete(rel, &tuple->t_self);
@@ -155,10 +149,7 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
values[Anum_pg_db_role_setting_setconfig - 1] = PointerGetDatum(a);
newtuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogInsertHeapAndIndexes(rel, newtuple);
}
InvokeObjectPostAlterHookArg(DbRoleSettingRelationId,
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index b71fa1b..49f3bf3 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -362,8 +362,7 @@ changeDependencyFor(Oid classId, Oid objectId,
depform->refobjid = newRefObjectId;
- simple_heap_update(depRel, &tup->t_self, tup);
- CatalogUpdateIndexes(depRel, tup);
+ CatalogUpdateHeapAndIndexes(depRel, &tup->t_self, tup);
heap_freetuple(tup);
}
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 089a9a0..16a4e80 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -125,8 +125,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals)
tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls);
HeapTupleSetOid(tup, oids[elemno]);
- simple_heap_insert(pg_enum, tup);
- CatalogUpdateIndexes(pg_enum, tup);
+ CatalogInsertHeapAndIndexes(pg_enum, tup);
heap_freetuple(tup);
elemno++;
@@ -458,8 +457,7 @@ restart:
values[Anum_pg_enum_enumlabel - 1] = NameGetDatum(&enumlabel);
enum_tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls);
HeapTupleSetOid(enum_tup, newOid);
- simple_heap_insert(pg_enum, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ CatalogInsertHeapAndIndexes(pg_enum, enum_tup);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -543,8 +541,7 @@ RenameEnumLabel(Oid enumTypeOid,
/* Update the pg_enum entry */
namestrcpy(&en->enumlabel, newVal);
- simple_heap_update(pg_enum, &enum_tup->t_self, enum_tup);
- CatalogUpdateIndexes(pg_enum, enum_tup);
+ CatalogUpdateHeapAndIndexes(pg_enum, &enum_tup->t_self, enum_tup);
heap_freetuple(enum_tup);
heap_close(pg_enum, RowExclusiveLock);
@@ -597,9 +594,7 @@ RenumberEnumType(Relation pg_enum, HeapTuple *existing, int nelems)
{
en->enumsortorder = newsortorder;
- simple_heap_update(pg_enum, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(pg_enum, newtup);
+ CatalogUpdateHeapAndIndexes(pg_enum, &newtup->t_self, newtup);
}
heap_freetuple(newtup);
diff --git a/src/backend/catalog/pg_largeobject.c b/src/backend/catalog/pg_largeobject.c
index 24edf6a..d59d4b7 100644
--- a/src/backend/catalog/pg_largeobject.c
+++ b/src/backend/catalog/pg_largeobject.c
@@ -63,11 +63,9 @@ LargeObjectCreate(Oid loid)
if (OidIsValid(loid))
HeapTupleSetOid(ntup, loid);
- loid_new = simple_heap_insert(pg_lo_meta, ntup);
+ loid_new = CatalogInsertHeapAndIndexes(pg_lo_meta, ntup);
Assert(!OidIsValid(loid) || loid == loid_new);
- CatalogUpdateIndexes(pg_lo_meta, ntup);
-
heap_freetuple(ntup);
heap_close(pg_lo_meta, RowExclusiveLock);
diff --git a/src/backend/catalog/pg_namespace.c b/src/backend/catalog/pg_namespace.c
index f048ad4..4c06873 100644
--- a/src/backend/catalog/pg_namespace.c
+++ b/src/backend/catalog/pg_namespace.c
@@ -76,11 +76,9 @@ NamespaceCreate(const char *nspName, Oid ownerId, bool isTemp)
tup = heap_form_tuple(tupDesc, values, nulls);
- nspoid = simple_heap_insert(nspdesc, tup);
+ nspoid = CatalogInsertHeapAndIndexes(nspdesc, tup);
Assert(OidIsValid(nspoid));
- CatalogUpdateIndexes(nspdesc, tup);
-
heap_close(nspdesc, RowExclusiveLock);
/* Record dependencies */
diff --git a/src/backend/catalog/pg_operator.c b/src/backend/catalog/pg_operator.c
index 556f9fe..d3f71ca 100644
--- a/src/backend/catalog/pg_operator.c
+++ b/src/backend/catalog/pg_operator.c
@@ -262,9 +262,7 @@ OperatorShellMake(const char *operatorName,
/*
* insert our "shell" operator tuple
*/
- operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
-
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ operatorObjectId = CatalogInsertHeapAndIndexes(pg_operator_desc, tup);
/* Add dependencies for the entry */
makeOperatorDependencies(tup, false);
@@ -526,7 +524,7 @@ OperatorCreate(const char *operatorName,
nulls,
replaces);
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
}
else
{
@@ -535,12 +533,9 @@ OperatorCreate(const char *operatorName,
tup = heap_form_tuple(RelationGetDescr(pg_operator_desc),
values, nulls);
- operatorObjectId = simple_heap_insert(pg_operator_desc, tup);
+ operatorObjectId = CatalogInsertHeapAndIndexes(pg_operator_desc, tup);
}
- /* Must update the indexes in either case */
- CatalogUpdateIndexes(pg_operator_desc, tup);
-
/* Add dependencies for the entry */
address = makeOperatorDependencies(tup, isUpdate);
@@ -695,8 +690,7 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_commutator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
/*
* Do CCI to make the updated tuple visible. We must do this in
@@ -741,8 +735,7 @@ OperatorUpd(Oid baseId, Oid commId, Oid negId, bool isDelete)
/* If any columns were found to need modification, update tuple. */
if (update_negator)
{
- simple_heap_update(pg_operator_desc, &tup->t_self, tup);
- CatalogUpdateIndexes(pg_operator_desc, tup);
+ CatalogUpdateHeapAndIndexes(pg_operator_desc, &tup->t_self, tup);
/*
* In the deletion case, do CCI to make the updated tuple visible.
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 6ab849c..f35769e 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -572,7 +572,7 @@ ProcedureCreate(const char *procedureName,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ReleaseSysCache(oldtup);
is_update = true;
@@ -590,12 +590,10 @@ ProcedureCreate(const char *procedureName,
nulls[Anum_pg_proc_proacl - 1] = true;
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
is_update = false;
}
- /* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
retval = HeapTupleGetOid(tup);
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 00ed28f..2c7c3b5 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -149,8 +149,7 @@ publication_add_relation(Oid pubid, Relation targetrel,
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- prrelid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ prrelid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
ObjectAddressSet(myself, PublicationRelRelationId, prrelid);
diff --git a/src/backend/catalog/pg_range.c b/src/backend/catalog/pg_range.c
index d3a4c26..c21610d 100644
--- a/src/backend/catalog/pg_range.c
+++ b/src/backend/catalog/pg_range.c
@@ -58,8 +58,7 @@ RangeCreate(Oid rangeTypeOid, Oid rangeSubType, Oid rangeCollation,
tup = heap_form_tuple(RelationGetDescr(pg_range), values, nulls);
- simple_heap_insert(pg_range, tup);
- CatalogUpdateIndexes(pg_range, tup);
+ CatalogInsertHeapAndIndexes(pg_range, tup);
heap_freetuple(tup);
/* record type's dependencies on range-related items */
diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 60ed957..8d1ddab 100644
--- a/src/backend/catalog/pg_shdepend.c
+++ b/src/backend/catalog/pg_shdepend.c
@@ -260,10 +260,7 @@ shdepChangeDep(Relation sdepRel,
shForm->refclassid = refclassid;
shForm->refobjid = refobjid;
- simple_heap_update(sdepRel, &oldtup->t_self, oldtup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogUpdateHeapAndIndexes(sdepRel, &oldtup->t_self, oldtup);
}
else
{
@@ -287,10 +284,7 @@ shdepChangeDep(Relation sdepRel,
* it's certainly a new tuple
*/
oldtup = heap_form_tuple(RelationGetDescr(sdepRel), values, nulls);
- simple_heap_insert(sdepRel, oldtup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, oldtup);
+ CatalogInsertHeapAndIndexes(sdepRel, oldtup);
}
if (oldtup)
@@ -759,10 +753,7 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId)
HeapTuple newtup;
newtup = heap_modify_tuple(tup, sdepDesc, values, nulls, replace);
- simple_heap_insert(sdepRel, newtup);
-
- /* Keep indexes current */
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(sdepRel, newtup);
heap_freetuple(newtup);
}
@@ -882,10 +873,7 @@ shdepAddDependency(Relation sdepRel,
tup = heap_form_tuple(sdepRel->rd_att, values, nulls);
- simple_heap_insert(sdepRel, tup);
-
- /* keep indexes current */
- CatalogUpdateIndexes(sdepRel, tup);
+ CatalogInsertHeapAndIndexes(sdepRel, tup);
/* clean up */
heap_freetuple(tup);
diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index 6d9a324..8dfd5f0 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -142,9 +142,7 @@ TypeShellMake(const char *typeName, Oid typeNamespace, Oid ownerId)
/*
* insert the tuple in the relation and get the tuple's oid.
*/
- typoid = simple_heap_insert(pg_type_desc, tup);
-
- CatalogUpdateIndexes(pg_type_desc, tup);
+ typoid = CatalogInsertHeapAndIndexes(pg_type_desc, tup);
/*
* Create dependencies. We can/must skip this in bootstrap mode.
@@ -430,7 +428,7 @@ TypeCreate(Oid newTypeOid,
nulls,
replaces);
- simple_heap_update(pg_type_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_type_desc, &tup->t_self, tup);
typeObjectId = HeapTupleGetOid(tup);
@@ -458,12 +456,9 @@ TypeCreate(Oid newTypeOid,
}
/* else allow system to assign oid */
- typeObjectId = simple_heap_insert(pg_type_desc, tup);
+ typeObjectId = CatalogInsertHeapAndIndexes(pg_type_desc, tup);
}
- /* Update indexes */
- CatalogUpdateIndexes(pg_type_desc, tup);
-
/*
* Create dependencies. We can/must skip this in bootstrap mode.
*/
@@ -724,10 +719,7 @@ RenameTypeInternal(Oid typeOid, const char *newTypeName, Oid typeNamespace)
/* OK, do the rename --- tuple is a copy, so OK to scribble on it */
namestrcpy(&(typ->typname), newTypeName);
- simple_heap_update(pg_type_desc, &tuple->t_self, tuple);
-
- /* update the system catalog indexes */
- CatalogUpdateIndexes(pg_type_desc, tuple);
+ CatalogUpdateHeapAndIndexes(pg_type_desc, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(TypeRelationId, typeOid, 0);
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ee4a182..cae1228 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -350,10 +350,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
if (!IsBootstrapProcessingMode())
{
/* normal case, use a transactional update */
- simple_heap_update(class_rel, &reltup->t_self, reltup);
-
- /* Keep catalog indexes current */
- CatalogUpdateIndexes(class_rel, reltup);
+ CatalogUpdateHeapAndIndexes(class_rel, &reltup->t_self, reltup);
}
else
{
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 768fcc8..d8d4bec 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -284,8 +284,7 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &oldtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &oldtup->t_self, newtup);
InvokeObjectPostAlterHook(classId, objectId, 0);
@@ -722,8 +721,7 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &tup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, newtup);
/* Release memory */
pfree(values);
@@ -954,8 +952,7 @@ AlterObjectOwner_internal(Relation rel, Oid objectId, Oid new_ownerId)
values, nulls, replaces);
/* Perform actual update */
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
/* Update owner dependency reference */
if (classId == LargeObjectMetadataRelationId)
diff --git a/src/backend/commands/amcmds.c b/src/backend/commands/amcmds.c
index 29061b8..33e207c 100644
--- a/src/backend/commands/amcmds.c
+++ b/src/backend/commands/amcmds.c
@@ -87,8 +87,7 @@ CreateAccessMethod(CreateAmStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- amoid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ amoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
myself.classId = AccessMethodRelationId;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c9f6afe..648520e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1589,18 +1589,15 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
nulls,
replaces);
ReleaseSysCache(oldtup);
- simple_heap_update(sd, &stup->t_self, stup);
+ CatalogUpdateHeapAndIndexes(sd, &stup->t_self, stup);
}
else
{
/* No, insert new tuple */
stup = heap_form_tuple(RelationGetDescr(sd), values, nulls);
- simple_heap_insert(sd, stup);
+ CatalogInsertHeapAndIndexes(sd, stup);
}
- /* update indexes too */
- CatalogUpdateIndexes(sd, stup);
-
heap_freetuple(stup);
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f9309fc..8060758 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -523,8 +523,7 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
if (indexForm->indisclustered)
{
indexForm->indisclustered = false;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
}
else if (thisIndexOid == indexOid)
{
@@ -532,8 +531,7 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
if (!IndexIsValid(indexForm))
elog(ERROR, "cannot cluster on invalid index %u", indexOid);
indexForm->indisclustered = true;
- simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
- CatalogUpdateIndexes(pg_index, indexTuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &indexTuple->t_self, indexTuple);
}
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
@@ -1558,8 +1556,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
relform->relfrozenxid = frozenXid;
relform->relminmxid = cutoffMulti;
- simple_heap_update(relRelation, &reltup->t_self, reltup);
- CatalogUpdateIndexes(relRelation, reltup);
+ CatalogUpdateHeapAndIndexes(relRelation, &reltup->t_self, reltup);
heap_close(relRelation, RowExclusiveLock);
}
diff --git a/src/backend/commands/comment.c b/src/backend/commands/comment.c
index ada0b03..c250385 100644
--- a/src/backend/commands/comment.c
+++ b/src/backend/commands/comment.c
@@ -199,7 +199,7 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(description), values,
nulls, replaces);
- simple_heap_update(description, &oldtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(description, &oldtuple->t_self, newtuple);
}
break; /* Assume there can be only one match */
@@ -213,15 +213,11 @@ CreateComments(Oid oid, Oid classoid, int32 subid, char *comment)
{
newtuple = heap_form_tuple(RelationGetDescr(description),
values, nulls);
- simple_heap_insert(description, newtuple);
+ CatalogInsertHeapAndIndexes(description, newtuple);
}
- /* Update indexes, if necessary */
if (newtuple != NULL)
- {
- CatalogUpdateIndexes(description, newtuple);
heap_freetuple(newtuple);
- }
/* Done */
@@ -293,7 +289,7 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
{
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(shdescription),
values, nulls, replaces);
- simple_heap_update(shdescription, &oldtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(shdescription, &oldtuple->t_self, newtuple);
}
break; /* Assume there can be only one match */
@@ -307,15 +303,11 @@ CreateSharedComments(Oid oid, Oid classoid, char *comment)
{
newtuple = heap_form_tuple(RelationGetDescr(shdescription),
values, nulls);
- simple_heap_insert(shdescription, newtuple);
+ CatalogInsertHeapAndIndexes(shdescription, newtuple);
}
- /* Update indexes, if necessary */
if (newtuple != NULL)
- {
- CatalogUpdateIndexes(shdescription, newtuple);
heap_freetuple(newtuple);
- }
/* Done */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6ad8fd7..b6ef57d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -546,10 +546,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
HeapTupleSetOid(tuple, dboid);
- simple_heap_insert(pg_database_rel, tuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pg_database_rel, tuple);
+ CatalogInsertHeapAndIndexes(pg_database_rel, tuple);
/*
* Now generate additional catalog entries associated with the new DB
@@ -1040,8 +1037,7 @@ RenameDatabase(const char *oldname, const char *newname)
if (!HeapTupleIsValid(newtup))
elog(ERROR, "cache lookup failed for database %u", db_id);
namestrcpy(&(((Form_pg_database) GETSTRUCT(newtup))->datname), newname);
- simple_heap_update(rel, &newtup->t_self, newtup);
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
InvokeObjectPostAlterHook(DatabaseRelationId, db_id, 0);
@@ -1296,10 +1292,7 @@ movedb(const char *dbname, const char *tblspcname)
newtuple = heap_modify_tuple(oldtuple, RelationGetDescr(pgdbrel),
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pgdbrel, &oldtuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pgdbrel, newtuple);
+ CatalogUpdateHeapAndIndexes(pgdbrel, &oldtuple->t_self, newtuple);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1554,10 +1547,7 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(rel, &tuple->t_self, newtuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, newtuple);
InvokeObjectPostAlterHook(DatabaseRelationId,
HeapTupleGetOid(newtuple), 0);
@@ -1692,8 +1682,7 @@ AlterDatabaseOwner(const char *dbname, Oid newOwnerId)
}
newtuple = heap_modify_tuple(tuple, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 8125537..a5460a3 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -405,8 +405,7 @@ insert_event_trigger_tuple(char *trigname, char *eventname, Oid evtOwner,
/* Insert heap tuple. */
tuple = heap_form_tuple(tgrel->rd_att, values, nulls);
- trigoid = simple_heap_insert(tgrel, tuple);
- CatalogUpdateIndexes(tgrel, tuple);
+ trigoid = CatalogInsertHeapAndIndexes(tgrel, tuple);
heap_freetuple(tuple);
/* Depend on owner. */
@@ -524,8 +523,7 @@ AlterEventTrigger(AlterEventTrigStmt *stmt)
evtForm = (Form_pg_event_trigger) GETSTRUCT(tup);
evtForm->evtenabled = tgenabled;
- simple_heap_update(tgrel, &tup->t_self, tup);
- CatalogUpdateIndexes(tgrel, tup);
+ CatalogUpdateHeapAndIndexes(tgrel, &tup->t_self, tup);
InvokeObjectPostAlterHook(EventTriggerRelationId,
trigoid, 0);
@@ -621,8 +619,7 @@ AlterEventTriggerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an event trigger must be a superuser.")));
form->evtowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(EventTriggerRelationId,
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index f23c697..425d14b 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -1773,8 +1773,7 @@ InsertExtensionTuple(const char *extName, Oid extOwner,
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- extensionOid = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ extensionOid = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -2485,8 +2484,7 @@ pg_extension_config_dump(PG_FUNCTION_ARGS)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
@@ -2663,8 +2661,7 @@ extension_config_remove(Oid extensionoid, Oid tableoid)
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
repl_val, repl_null, repl_repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
@@ -2844,8 +2841,7 @@ AlterExtensionNamespace(List *names, const char *newschema, Oid *oldschema)
/* Now adjust pg_extension.extnamespace */
extForm->extnamespace = nspOid;
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
heap_close(extRel, RowExclusiveLock);
@@ -3091,8 +3087,7 @@ ApplyExtensionUpdates(Oid extensionOid,
extTup = heap_modify_tuple(extTup, RelationGetDescr(extRel),
values, nulls, repl);
- simple_heap_update(extRel, &extTup->t_self, extTup);
- CatalogUpdateIndexes(extRel, extTup);
+ CatalogUpdateHeapAndIndexes(extRel, &extTup->t_self, extTup);
systable_endscan(extScan);
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 6ff8b69..a67dc52 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -256,8 +256,7 @@ AlterForeignDataWrapperOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerI
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignDataWrapperRelationId,
@@ -397,8 +396,7 @@ AlterForeignServerOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(ForeignServerRelationId, HeapTupleGetOid(tup),
@@ -629,8 +627,7 @@ CreateForeignDataWrapper(CreateFdwStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- fdwId = simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ fdwId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -786,8 +783,7 @@ AlterForeignDataWrapper(AlterFdwStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
heap_freetuple(tp);
@@ -941,9 +937,7 @@ CreateForeignServer(CreateForeignServerStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- srvId = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ srvId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -1056,8 +1050,7 @@ AlterForeignServer(AlterForeignServerStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
InvokeObjectPostAlterHook(ForeignServerRelationId, srvId, 0);
@@ -1190,9 +1183,7 @@ CreateUserMapping(CreateUserMappingStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- umId = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ umId = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -1307,8 +1298,7 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
tp = heap_modify_tuple(tp, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &tp->t_self, tp);
- CatalogUpdateIndexes(rel, tp);
+ CatalogUpdateHeapAndIndexes(rel, &tp->t_self, tp);
ObjectAddressSet(address, UserMappingRelationId, umId);
@@ -1484,8 +1474,7 @@ CreateForeignTable(CreateForeignTableStmt *stmt, Oid relid)
tuple = heap_form_tuple(ftrel->rd_att, values, nulls);
- simple_heap_insert(ftrel, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ CatalogInsertHeapAndIndexes(ftrel, tuple);
heap_freetuple(tuple);
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index ec833c3..c58dc26 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -1292,8 +1292,7 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
procForm->proparallel = interpret_func_parallel(parallel_item);
/* Do the update */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
InvokeObjectPostAlterHook(ProcedureRelationId, funcOid, 0);
@@ -1333,9 +1332,7 @@ SetFunctionReturnType(Oid funcOid, Oid newRetType)
procForm->prorettype = newRetType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateHeapAndIndexes(pg_proc_rel, &tup->t_self, tup);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1368,9 +1365,7 @@ SetFunctionArgType(Oid funcOid, int argIndex, Oid newArgType)
procForm->proargtypes.values[argIndex] = newArgType;
/* update the catalog and its indexes */
- simple_heap_update(pg_proc_rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(pg_proc_rel, tup);
+ CatalogUpdateHeapAndIndexes(pg_proc_rel, &tup->t_self, tup);
heap_close(pg_proc_rel, RowExclusiveLock);
}
@@ -1656,9 +1651,7 @@ CreateCast(CreateCastStmt *stmt)
tuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- castid = simple_heap_insert(relation, tuple);
-
- CatalogUpdateIndexes(relation, tuple);
+ castid = CatalogInsertHeapAndIndexes(relation, tuple);
/* make dependency entries */
myself.classId = CastRelationId;
@@ -1921,7 +1914,7 @@ CreateTransform(CreateTransformStmt *stmt)
replaces[Anum_pg_transform_trftosql - 1] = true;
newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values, nulls, replaces);
- simple_heap_update(relation, &newtuple->t_self, newtuple);
+ CatalogUpdateHeapAndIndexes(relation, &newtuple->t_self, newtuple);
transformid = HeapTupleGetOid(tuple);
ReleaseSysCache(tuple);
@@ -1930,12 +1923,10 @@ CreateTransform(CreateTransformStmt *stmt)
else
{
newtuple = heap_form_tuple(RelationGetDescr(relation), values, nulls);
- transformid = simple_heap_insert(relation, newtuple);
+ transformid = CatalogInsertHeapAndIndexes(relation, newtuple);
is_replace = false;
}
- CatalogUpdateIndexes(relation, newtuple);
-
if (is_replace)
deleteDependencyRecordsFor(TransformRelationId, transformid, true);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index b7daf1c..53661a3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -100,9 +100,7 @@ SetMatViewPopulatedState(Relation relation, bool newstate)
((Form_pg_class) GETSTRUCT(tuple))->relispopulated = newstate;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
-
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateHeapAndIndexes(pgrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index bc43483..adb4a7d 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -278,9 +278,7 @@ CreateOpFamily(char *amname, char *opfname, Oid namespaceoid, Oid amoid)
tup = heap_form_tuple(rel->rd_att, values, nulls);
- opfamilyoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ opfamilyoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -654,9 +652,7 @@ DefineOpClass(CreateOpClassStmt *stmt)
tup = heap_form_tuple(rel->rd_att, values, nulls);
- opclassoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ opclassoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -1327,9 +1323,7 @@ storeOperators(List *opfamilyname, Oid amoid,
tup = heap_form_tuple(rel->rd_att, values, nulls);
- entryoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ entryoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
@@ -1438,9 +1432,7 @@ storeProcedures(List *opfamilyname, Oid amoid,
tup = heap_form_tuple(rel->rd_att, values, nulls);
- entryoid = simple_heap_insert(rel, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ entryoid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index a273376..eb6b308 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -518,8 +518,7 @@ AlterOperator(AlterOperatorStmt *stmt)
tup = heap_modify_tuple(tup, RelationGetDescr(catalog),
values, nulls, replaces);
- simple_heap_update(catalog, &tup->t_self, tup);
- CatalogUpdateIndexes(catalog, tup);
+ CatalogUpdateHeapAndIndexes(catalog, &tup->t_self, tup);
address = makeOperatorDependencies(tup, true);
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 5d9d3a6..d1513f7 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -614,10 +614,7 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
new_tuple = heap_modify_tuple(tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
-
- /* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &new_tuple->t_self, new_tuple);
/* Remove all old dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -823,10 +820,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
policy_tuple = heap_form_tuple(RelationGetDescr(pg_policy_rel), values,
isnull);
- policy_id = simple_heap_insert(pg_policy_rel, policy_tuple);
-
- /* Update Indexes */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ policy_id = CatalogInsertHeapAndIndexes(pg_policy_rel, policy_tuple);
/* Record Dependencies */
target.classId = RelationRelationId;
@@ -1150,10 +1144,7 @@ AlterPolicy(AlterPolicyStmt *stmt)
new_tuple = heap_modify_tuple(policy_tuple,
RelationGetDescr(pg_policy_rel),
values, isnull, replaces);
- simple_heap_update(pg_policy_rel, &new_tuple->t_self, new_tuple);
-
- /* Update Catalog Indexes */
- CatalogUpdateIndexes(pg_policy_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &new_tuple->t_self, new_tuple);
/* Update Dependencies. */
deleteDependencyRecordsFor(PolicyRelationId, policy_id, false);
@@ -1287,10 +1278,7 @@ rename_policy(RenameStmt *stmt)
namestrcpy(&((Form_pg_policy) GETSTRUCT(policy_tuple))->polname,
stmt->newname);
- simple_heap_update(pg_policy_rel, &policy_tuple->t_self, policy_tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_policy_rel, policy_tuple);
+ CatalogUpdateHeapAndIndexes(pg_policy_rel, &policy_tuple->t_self, policy_tuple);
InvokeObjectPostAlterHook(PolicyRelationId,
HeapTupleGetOid(policy_tuple), 0);
diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index b684f41..f7fa548 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -378,7 +378,7 @@ create_proc_lang(const char *languageName, bool replace,
/* Okay, do it... */
tup = heap_modify_tuple(oldtup, tupDesc, values, nulls, replaces);
- simple_heap_update(rel, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ReleaseSysCache(oldtup);
is_update = true;
@@ -387,13 +387,10 @@ create_proc_lang(const char *languageName, bool replace,
{
/* Creating a new language */
tup = heap_form_tuple(tupDesc, values, nulls);
- simple_heap_insert(rel, tup);
+ CatalogInsertHeapAndIndexes(rel, tup);
is_update = false;
}
- /* Need to update indexes for either the insert or update case */
- CatalogUpdateIndexes(rel, tup);
-
/*
* Create dependencies for the new language. If we are updating an
* existing language, first delete any existing pg_depend entries.
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 173b076..57543e4 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -215,8 +215,7 @@ CreatePublication(CreatePublicationStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- puboid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ puboid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
recordDependencyOnOwner(PublicationRelationId, puboid, GetUserId());
@@ -295,8 +294,7 @@ AlterPublicationOptions(AlterPublicationStmt *stmt, Relation rel,
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
CommandCounterIncrement();
@@ -686,8 +684,7 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of a publication must be a superuser.")));
form->pubowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(PublicationRelationId,
diff --git a/src/backend/commands/schemacmds.c b/src/backend/commands/schemacmds.c
index c3b37b2..f49767e 100644
--- a/src/backend/commands/schemacmds.c
+++ b/src/backend/commands/schemacmds.c
@@ -281,8 +281,7 @@ RenameSchema(const char *oldname, const char *newname)
/* rename */
namestrcpy(&(((Form_pg_namespace) GETSTRUCT(tup))->nspname), newname);
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
InvokeObjectPostAlterHook(NamespaceRelationId, HeapTupleGetOid(tup), 0);
@@ -417,8 +416,7 @@ AlterSchemaOwner_internal(HeapTuple tup, Relation rel, Oid newOwnerId)
newtuple = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
diff --git a/src/backend/commands/seclabel.c b/src/backend/commands/seclabel.c
index 324f2e7..7e25411 100644
--- a/src/backend/commands/seclabel.c
+++ b/src/backend/commands/seclabel.c
@@ -299,7 +299,7 @@ SetSharedSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_shseclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_shseclabel),
values, nulls, replaces);
- simple_heap_update(pg_shseclabel, &oldtup->t_self, newtup);
+ CatalogUpdateHeapAndIndexes(pg_shseclabel, &oldtup->t_self, newtup);
}
}
systable_endscan(scan);
@@ -309,15 +309,11 @@ SetSharedSecurityLabel(const ObjectAddress *object,
{
newtup = heap_form_tuple(RelationGetDescr(pg_shseclabel),
values, nulls);
- simple_heap_insert(pg_shseclabel, newtup);
+ CatalogInsertHeapAndIndexes(pg_shseclabel, newtup);
}
- /* Update indexes, if necessary */
if (newtup != NULL)
- {
- CatalogUpdateIndexes(pg_shseclabel, newtup);
heap_freetuple(newtup);
- }
heap_close(pg_shseclabel, RowExclusiveLock);
}
@@ -390,7 +386,7 @@ SetSecurityLabel(const ObjectAddress *object,
replaces[Anum_pg_seclabel_label - 1] = true;
newtup = heap_modify_tuple(oldtup, RelationGetDescr(pg_seclabel),
values, nulls, replaces);
- simple_heap_update(pg_seclabel, &oldtup->t_self, newtup);
+ CatalogUpdateHeapAndIndexes(pg_seclabel, &oldtup->t_self, newtup);
}
}
systable_endscan(scan);
@@ -400,15 +396,12 @@ SetSecurityLabel(const ObjectAddress *object,
{
newtup = heap_form_tuple(RelationGetDescr(pg_seclabel),
values, nulls);
- simple_heap_insert(pg_seclabel, newtup);
+ CatalogInsertHeapAndIndexes(pg_seclabel, newtup);
}
/* Update indexes, if necessary */
if (newtup != NULL)
- {
- CatalogUpdateIndexes(pg_seclabel, newtup);
heap_freetuple(newtup);
- }
heap_close(pg_seclabel, RowExclusiveLock);
}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0c673f5..830b600 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -236,8 +236,7 @@ DefineSequence(ParseState *pstate, CreateSeqStmt *seq)
pgs_values[Anum_pg_sequence_seqcache - 1] = Int64GetDatumFast(seqform.seqcache);
tuple = heap_form_tuple(tupDesc, pgs_values, pgs_nulls);
- simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
heap_close(rel, RowExclusiveLock);
@@ -504,8 +503,7 @@ AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
relation_close(seqrel, NoLock);
- simple_heap_update(rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogUpdateHeapAndIndexes(rel, &tuple->t_self, tuple);
heap_close(rel, RowExclusiveLock);
return address;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 41ef7a3..853dcd3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -277,8 +277,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt)
tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
/* Insert tuple into catalog. */
- subid = simple_heap_insert(rel, tup);
- CatalogUpdateIndexes(rel, tup);
+ subid = CatalogInsertHeapAndIndexes(rel, tup);
heap_freetuple(tup);
recordDependencyOnOwner(SubscriptionRelationId, subid, owner);
@@ -408,8 +407,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
replaces);
/* Update the catalog. */
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
ObjectAddressSet(myself, SubscriptionRelationId, subid);
@@ -588,8 +586,7 @@ AlterSubscriptionOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
errhint("The owner of an subscription must be a superuser.")));
form->subowner = newOwnerId;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* Update owner dependency reference */
changeDependencyOnOwner(SubscriptionRelationId,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 90f2f7f..f62f8d7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2308,9 +2308,7 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
tuple = heap_form_tuple(desc, values, nulls);
- simple_heap_insert(inhRelation, tuple);
-
- CatalogUpdateIndexes(inhRelation, tuple);
+ CatalogInsertHeapAndIndexes(inhRelation, tuple);
heap_freetuple(tuple);
@@ -2398,10 +2396,7 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
if (classtuple->relhassubclass != relhassubclass)
{
classtuple->relhassubclass = relhassubclass;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
-
- /* keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
}
else
{
@@ -2592,10 +2587,7 @@ renameatt_internal(Oid myrelid,
/* apply the update */
namestrcpy(&(attform->attname), newattname);
- simple_heap_update(attrelation, &atttup->t_self, atttup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, atttup);
+ CatalogUpdateHeapAndIndexes(attrelation, &atttup->t_self, atttup);
InvokeObjectPostAlterHook(RelationRelationId, myrelid, attnum);
@@ -2902,10 +2894,7 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal)
*/
namestrcpy(&(relform->relname), newrelname);
- simple_heap_update(relrelation, &reltup->t_self, reltup);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(relrelation, reltup);
+ CatalogUpdateHeapAndIndexes(relrelation, &reltup->t_self, reltup);
InvokeObjectPostAlterHookArg(RelationRelationId, myrelid, 0,
InvalidOid, is_internal);
@@ -5097,8 +5086,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
/* Bump the existing child att's inhcount */
childatt->attinhcount++;
- simple_heap_update(attrdesc, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrdesc, tuple);
+ CatalogUpdateHeapAndIndexes(attrdesc, &tuple->t_self, tuple);
heap_freetuple(tuple);
@@ -5191,10 +5179,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
else
((Form_pg_class) GETSTRUCT(reltup))->relnatts = newattnum;
- simple_heap_update(pgclass, &reltup->t_self, reltup);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pgclass, reltup);
+ CatalogUpdateHeapAndIndexes(pgclass, &reltup->t_self, reltup);
heap_freetuple(reltup);
@@ -5630,10 +5615,7 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
{
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = FALSE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
ObjectAddressSubSet(address, RelationRelationId,
RelationGetRelid(rel), attnum);
@@ -5708,10 +5690,7 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
{
((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull = TRUE;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Tell Phase 3 it needs to test the constraint */
tab->new_notnull = true;
@@ -5876,10 +5855,7 @@ ATExecSetStatistics(Relation rel, const char *colName, Node *newValue, LOCKMODE
attrtuple->attstattarget = newtarget;
- simple_heap_update(attrelation, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -5952,8 +5928,7 @@ ATExecSetOptions(Relation rel, const char *colName, Node *options,
repl_val, repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(attrelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -6036,10 +6011,7 @@ ATExecSetStorage(Relation rel, const char *colName, Node *newValue, LOCKMODE loc
errmsg("column data type %s can only have storage PLAIN",
format_type_be(attrtuple->atttypid))));
- simple_heap_update(attrelation, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, tuple);
+ CatalogUpdateHeapAndIndexes(attrelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -6277,10 +6249,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
/* Child column must survive my deletion */
childatt->attinhcount--;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -6296,10 +6265,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
childatt->attinhcount--;
childatt->attislocal = true;
- simple_heap_update(attr_rel, &tuple->t_self, tuple);
-
- /* keep the system catalog indexes current */
- CatalogUpdateIndexes(attr_rel, tuple);
+ CatalogUpdateHeapAndIndexes(attr_rel, &tuple->t_self, tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -6343,10 +6309,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
tuple_class = (Form_pg_class) GETSTRUCT(tuple);
tuple_class->relhasoids = false;
- simple_heap_update(class_rel, &tuple->t_self, tuple);
-
- /* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(class_rel, tuple);
+ CatalogUpdateHeapAndIndexes(class_rel, &tuple->t_self, tuple);
heap_close(class_rel, RowExclusiveLock);
@@ -7195,8 +7158,7 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->condeferrable = cmdcon->deferrable;
copy_con->condeferred = cmdcon->initdeferred;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(contuple), 0);
@@ -7249,8 +7211,7 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
copy_tg->tgdeferrable = cmdcon->deferrable;
copy_tg->tginitdeferred = cmdcon->initdeferred;
- simple_heap_update(tgrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(tgrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(tgrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tgtuple), 0);
@@ -7436,8 +7397,7 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(tuple), 0);
@@ -8339,8 +8299,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
{
/* Child constraint must survive my deletion */
con->coninhcount--;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copy_tuple->t_self, copy_tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -8356,8 +8315,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
con->coninhcount--;
con->conislocal = true;
- simple_heap_update(conrel, &copy_tuple->t_self, copy_tuple);
- CatalogUpdateIndexes(conrel, copy_tuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copy_tuple->t_self, copy_tuple);
/* Make update visible */
CommandCounterIncrement();
@@ -9003,10 +8961,7 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
ReleaseSysCache(typeTuple);
- simple_heap_update(attrelation, &heapTup->t_self, heapTup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(attrelation, heapTup);
+ CatalogUpdateHeapAndIndexes(attrelation, &heapTup->t_self, heapTup);
heap_close(attrelation, RowExclusiveLock);
@@ -9144,8 +9099,7 @@ ATExecAlterColumnGenericOptions(Relation rel,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(attrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(attrel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attrel, newtuple);
+ CatalogUpdateHeapAndIndexes(attrel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId,
RelationGetRelid(rel),
@@ -9661,8 +9615,7 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
newtuple = heap_modify_tuple(tuple, RelationGetDescr(class_rel), repl_val, repl_null, repl_repl);
- simple_heap_update(class_rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(class_rel, newtuple);
+ CatalogUpdateHeapAndIndexes(class_rel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
@@ -9789,8 +9742,7 @@ change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId)
RelationGetDescr(attRelation),
repl_val, repl_null, repl_repl);
- simple_heap_update(attRelation, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(attRelation, newtuple);
+ CatalogUpdateHeapAndIndexes(attRelation, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
}
@@ -10067,9 +10019,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateHeapAndIndexes(pgclass, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10126,9 +10076,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
newtuple = heap_modify_tuple(tuple, RelationGetDescr(pgclass),
repl_val, repl_null, repl_repl);
- simple_heap_update(pgclass, &newtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(pgclass, newtuple);
+ CatalogUpdateHeapAndIndexes(pgclass, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHookArg(RelationRelationId,
RelationGetRelid(toastrel), 0,
@@ -10289,8 +10237,7 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/* update the pg_class row */
rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace;
rd_rel->relfilenode = newrelfilenode;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
@@ -10940,8 +10887,7 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ CatalogUpdateHeapAndIndexes(attrrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
}
else
@@ -10980,8 +10926,7 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
childatt->attislocal = false;
}
- simple_heap_update(attrrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(attrrel, tuple);
+ CatalogUpdateHeapAndIndexes(attrrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
}
else
@@ -11118,8 +11063,7 @@ MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel)
child_con->conislocal = false;
}
- simple_heap_update(catalog_relation, &child_copy->t_self, child_copy);
- CatalogUpdateIndexes(catalog_relation, child_copy);
+ CatalogUpdateHeapAndIndexes(catalog_relation, &child_copy->t_self, child_copy);
heap_freetuple(child_copy);
found = true;
@@ -11289,8 +11233,7 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
if (copy_att->attinhcount == 0)
copy_att->attislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ CatalogUpdateHeapAndIndexes(catalogRelation, &copyTuple->t_self, copyTuple);
heap_freetuple(copyTuple);
}
}
@@ -11364,8 +11307,7 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
if (copy_con->coninhcount == 0)
copy_con->conislocal = true;
- simple_heap_update(catalogRelation, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(catalogRelation, copyTuple);
+ CatalogUpdateHeapAndIndexes(catalogRelation, &copyTuple->t_self, copyTuple);
heap_freetuple(copyTuple);
}
}
@@ -11565,8 +11507,7 @@ ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode)
if (!HeapTupleIsValid(classtuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(classtuple))->reloftype = typeid;
- simple_heap_update(relationRelation, &classtuple->t_self, classtuple);
- CatalogUpdateIndexes(relationRelation, classtuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &classtuple->t_self, classtuple);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11610,8 +11551,7 @@ ATExecDropOf(Relation rel, LOCKMODE lockmode)
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->reloftype = InvalidOid;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
@@ -11651,8 +11591,7 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
if (pg_class_form->relreplident != ri_type)
{
pg_class_form->relreplident = ri_type;
- simple_heap_update(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
- CatalogUpdateIndexes(pg_class, pg_class_tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
}
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(pg_class_tuple);
@@ -11711,8 +11650,7 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
if (dirty)
{
- simple_heap_update(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
- CatalogUpdateIndexes(pg_index, pg_index_tuple);
+ CatalogUpdateHeapAndIndexes(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
InvalidOid, is_internal);
}
@@ -11861,10 +11799,7 @@ ATExecEnableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = true;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11888,10 +11823,7 @@ ATExecDisableRowSecurity(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = false;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11917,10 +11849,7 @@ ATExecForceNoForceRowSecurity(Relation rel, bool force_rls)
elog(ERROR, "cache lookup failed for relation %u", relid);
((Form_pg_class) GETSTRUCT(tuple))->relforcerowsecurity = force_rls;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
-
- /* keep catalog indexes current */
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_close(pg_class, RowExclusiveLock);
heap_freetuple(tuple);
@@ -11988,8 +11917,7 @@ ATExecGenericOptions(Relation rel, List *options)
tuple = heap_modify_tuple(tuple, RelationGetDescr(ftrel),
repl_val, repl_null, repl_repl);
- simple_heap_update(ftrel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(ftrel, tuple);
+ CatalogUpdateHeapAndIndexes(ftrel, &tuple->t_self, tuple);
/*
* Invalidate relcache so that all sessions will refresh any cached plans
@@ -12284,8 +12212,7 @@ AlterRelationNamespaceInternal(Relation classRel, Oid relOid,
/* classTup is a copy, so OK to scribble on */
classForm->relnamespace = newNspOid;
- simple_heap_update(classRel, &classTup->t_self, classTup);
- CatalogUpdateIndexes(classRel, classTup);
+ CatalogUpdateHeapAndIndexes(classRel, &classTup->t_self, classTup);
/* Update dependency on schema if caller said so */
if (hasDependEntry &&
@@ -13520,8 +13447,7 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
new_val, new_null, new_repl);
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = false;
- simple_heap_update(classRel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(classRel, newtuple);
+ CatalogUpdateHeapAndIndexes(classRel, &newtuple->t_self, newtuple);
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 651e1b3..f3c7436 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -344,9 +344,7 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
tuple = heap_form_tuple(rel->rd_att, values, nulls);
- tablespaceoid = simple_heap_insert(rel, tuple);
-
- CatalogUpdateIndexes(rel, tuple);
+ tablespaceoid = CatalogInsertHeapAndIndexes(rel, tuple);
heap_freetuple(tuple);
@@ -971,8 +969,7 @@ RenameTableSpace(const char *oldname, const char *newname)
/* OK, update the entry */
namestrcpy(&(newform->spcname), newname);
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(TableSpaceRelationId, tspId, 0);
@@ -1044,8 +1041,7 @@ AlterTableSpaceOptions(AlterTableSpaceOptionsStmt *stmt)
repl_null, repl_repl);
/* Update system catalog. */
- simple_heap_update(rel, &newtuple->t_self, newtuple);
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &newtuple->t_self, newtuple);
InvokeObjectPostAlterHook(TableSpaceRelationId, HeapTupleGetOid(tup), 0);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index f067d0a..1cc67ef 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -773,9 +773,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
/*
* Insert tuple into pg_trigger.
*/
- simple_heap_insert(tgrel, tuple);
-
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogInsertHeapAndIndexes(tgrel, tuple);
heap_freetuple(tuple);
heap_close(tgrel, RowExclusiveLock);
@@ -802,9 +800,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
((Form_pg_class) GETSTRUCT(tuple))->relhastriggers = true;
- simple_heap_update(pgrel, &tuple->t_self, tuple);
-
- CatalogUpdateIndexes(pgrel, tuple);
+ CatalogUpdateHeapAndIndexes(pgrel, &tuple->t_self, tuple);
heap_freetuple(tuple);
heap_close(pgrel, RowExclusiveLock);
@@ -1444,10 +1440,7 @@ renametrig(RenameStmt *stmt)
namestrcpy(&((Form_pg_trigger) GETSTRUCT(tuple))->tgname,
stmt->newname);
- simple_heap_update(tgrel, &tuple->t_self, tuple);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(tgrel, tuple);
+ CatalogUpdateHeapAndIndexes(tgrel, &tuple->t_self, tuple);
InvokeObjectPostAlterHook(TriggerRelationId,
HeapTupleGetOid(tuple), 0);
@@ -1560,10 +1553,7 @@ EnableDisableTrigger(Relation rel, const char *tgname,
newtrig->tgenabled = fires_when;
- simple_heap_update(tgrel, &newtup->t_self, newtup);
-
- /* Keep catalog indexes current */
- CatalogUpdateIndexes(tgrel, newtup);
+ CatalogUpdateHeapAndIndexes(tgrel, &newtup->t_self, newtup);
heap_freetuple(newtup);
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 479a160..b9929a5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -271,9 +271,7 @@ DefineTSParser(List *names, List *parameters)
tup = heap_form_tuple(prsRel->rd_att, values, nulls);
- prsOid = simple_heap_insert(prsRel, tup);
-
- CatalogUpdateIndexes(prsRel, tup);
+ prsOid = CatalogInsertHeapAndIndexes(prsRel, tup);
address = makeParserDependencies(tup);
@@ -482,9 +480,7 @@ DefineTSDictionary(List *names, List *parameters)
tup = heap_form_tuple(dictRel->rd_att, values, nulls);
- dictOid = simple_heap_insert(dictRel, tup);
-
- CatalogUpdateIndexes(dictRel, tup);
+ dictOid = CatalogInsertHeapAndIndexes(dictRel, tup);
address = makeDictionaryDependencies(tup);
@@ -620,9 +616,7 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
newtup = heap_modify_tuple(tup, RelationGetDescr(rel),
repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(rel, newtup);
+ CatalogUpdateHeapAndIndexes(rel, &newtup->t_self, newtup);
InvokeObjectPostAlterHook(TSDictionaryRelationId, dictId, 0);
@@ -806,9 +800,7 @@ DefineTSTemplate(List *names, List *parameters)
tup = heap_form_tuple(tmplRel->rd_att, values, nulls);
- tmplOid = simple_heap_insert(tmplRel, tup);
-
- CatalogUpdateIndexes(tmplRel, tup);
+ tmplOid = CatalogInsertHeapAndIndexes(tmplRel, tup);
address = makeTSTemplateDependencies(tup);
@@ -1066,9 +1058,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
tup = heap_form_tuple(cfgRel->rd_att, values, nulls);
- cfgOid = simple_heap_insert(cfgRel, tup);
-
- CatalogUpdateIndexes(cfgRel, tup);
+ cfgOid = CatalogInsertHeapAndIndexes(cfgRel, tup);
if (OidIsValid(sourceOid))
{
@@ -1106,9 +1096,7 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied)
newmaptup = heap_form_tuple(mapRel->rd_att, mapvalues, mapnulls);
- simple_heap_insert(mapRel, newmaptup);
-
- CatalogUpdateIndexes(mapRel, newmaptup);
+ CatalogInsertHeapAndIndexes(mapRel, newmaptup);
heap_freetuple(newmaptup);
}
@@ -1409,9 +1397,7 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
newtup = heap_modify_tuple(maptup,
RelationGetDescr(relMap),
repl_val, repl_null, repl_repl);
- simple_heap_update(relMap, &newtup->t_self, newtup);
-
- CatalogUpdateIndexes(relMap, newtup);
+ CatalogUpdateHeapAndIndexes(relMap, &newtup->t_self, newtup);
}
}
@@ -1436,8 +1422,7 @@ MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
values[Anum_pg_ts_config_map_mapdict - 1] = ObjectIdGetDatum(dictIds[j]);
tup = heap_form_tuple(relMap->rd_att, values, nulls);
- simple_heap_insert(relMap, tup);
- CatalogUpdateIndexes(relMap, tup);
+ CatalogInsertHeapAndIndexes(relMap, tup);
heap_freetuple(tup);
}
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 4c33d55..68e93fc 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -2221,9 +2221,7 @@ AlterDomainDefault(List *names, Node *defaultRaw)
new_record, new_record_nulls,
new_record_repl);
- simple_heap_update(rel, &tup->t_self, newtuple);
-
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, newtuple);
/* Rebuild dependencies */
GenerateTypeDependencies(typTup->typnamespace,
@@ -2360,9 +2358,7 @@ AlterDomainNotNull(List *names, bool notNull)
*/
typTup->typnotnull = notNull;
- simple_heap_update(typrel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(typrel, tup);
+ CatalogUpdateHeapAndIndexes(typrel, &tup->t_self, tup);
InvokeObjectPostAlterHook(TypeRelationId, domainoid, 0);
@@ -2662,8 +2658,7 @@ AlterDomainValidateConstraint(List *names, char *constrName)
copyTuple = heap_copytuple(tuple);
copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
copy_con->convalidated = true;
- simple_heap_update(conrel, &copyTuple->t_self, copyTuple);
- CatalogUpdateIndexes(conrel, copyTuple);
+ CatalogUpdateHeapAndIndexes(conrel, &copyTuple->t_self, copyTuple);
InvokeObjectPostAlterHook(ConstraintRelationId,
HeapTupleGetOid(copyTuple), 0);
@@ -3404,9 +3399,7 @@ AlterTypeOwnerInternal(Oid typeOid, Oid newOwnerId)
tup = heap_modify_tuple(tup, RelationGetDescr(rel), repl_val, repl_null,
repl_repl);
- simple_heap_update(rel, &tup->t_self, tup);
-
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
/* If it has an array type, update that too */
if (OidIsValid(typTup->typarray))
@@ -3566,8 +3559,7 @@ AlterTypeNamespaceInternal(Oid typeOid, Oid nspOid,
/* tup is a copy, so we can scribble directly on it */
typform->typnamespace = nspOid;
- simple_heap_update(rel, &tup->t_self, tup);
- CatalogUpdateIndexes(rel, tup);
+ CatalogUpdateHeapAndIndexes(rel, &tup->t_self, tup);
}
/*
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index b746982..46e3a66 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -433,8 +433,7 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
/*
* Insert new record in the pg_authid table
*/
- roleid = simple_heap_insert(pg_authid_rel, tuple);
- CatalogUpdateIndexes(pg_authid_rel, tuple);
+ roleid = CatalogInsertHeapAndIndexes(pg_authid_rel, tuple);
/*
* Advance command counter so we can see new record; else tests in
@@ -838,10 +837,7 @@ AlterRole(AlterRoleStmt *stmt)
new_tuple = heap_modify_tuple(tuple, pg_authid_dsc, new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authid_rel, &tuple->t_self, new_tuple);
-
- /* Update indexes */
- CatalogUpdateIndexes(pg_authid_rel, new_tuple);
+ CatalogUpdateHeapAndIndexes(pg_authid_rel, &tuple->t_self, new_tuple);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1243,9 +1239,7 @@ RenameRole(const char *oldname, const char *newname)
}
newtuple = heap_modify_tuple(oldtuple, dsc, repl_val, repl_null, repl_repl);
- simple_heap_update(rel, &oldtuple->t_self, newtuple);
-
- CatalogUpdateIndexes(rel, newtuple);
+ CatalogUpdateHeapAndIndexes(rel, &oldtuple->t_self, newtuple);
InvokeObjectPostAlterHook(AuthIdRelationId, roleid, 0);
@@ -1530,16 +1524,14 @@ AddRoleMems(const char *rolename, Oid roleid,
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogUpdateHeapAndIndexes(pg_authmem_rel, &tuple->t_self, tuple);
ReleaseSysCache(authmem_tuple);
}
else
{
tuple = heap_form_tuple(pg_authmem_dsc,
new_record, new_record_nulls);
- simple_heap_insert(pg_authmem_rel, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogInsertHeapAndIndexes(pg_authmem_rel, tuple);
}
/* CCI after each change, in case there are duplicates in list */
@@ -1647,8 +1639,7 @@ DelRoleMems(const char *rolename, Oid roleid,
tuple = heap_modify_tuple(authmem_tuple, pg_authmem_dsc,
new_record,
new_record_nulls, new_record_repl);
- simple_heap_update(pg_authmem_rel, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_authmem_rel, tuple);
+ CatalogUpdateHeapAndIndexes(pg_authmem_rel, &tuple->t_self, tuple);
}
ReleaseSysCache(authmem_tuple);
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index d7dda6a..7048f73 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -299,8 +299,7 @@ replorigin_create(char *roname)
values[Anum_pg_replication_origin_roname - 1] = roname_d;
tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
- simple_heap_insert(rel, tuple);
- CatalogUpdateIndexes(rel, tuple);
+ CatalogInsertHeapAndIndexes(rel, tuple);
CommandCounterIncrement();
break;
}
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index 481868b..33d73c2 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -124,7 +124,7 @@ InsertRule(char *rulname,
tup = heap_modify_tuple(oldtup, RelationGetDescr(pg_rewrite_desc),
values, nulls, replaces);
- simple_heap_update(pg_rewrite_desc, &tup->t_self, tup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &tup->t_self, tup);
ReleaseSysCache(oldtup);
@@ -135,11 +135,9 @@ InsertRule(char *rulname,
{
tup = heap_form_tuple(pg_rewrite_desc->rd_att, values, nulls);
- rewriteObjectId = simple_heap_insert(pg_rewrite_desc, tup);
+ rewriteObjectId = CatalogInsertHeapAndIndexes(pg_rewrite_desc, tup);
}
- /* Need to update indexes in either case */
- CatalogUpdateIndexes(pg_rewrite_desc, tup);
heap_freetuple(tup);
@@ -613,8 +611,7 @@ DefineQueryRewrite(char *rulename,
classForm->relminmxid = InvalidMultiXactId;
classForm->relreplident = REPLICA_IDENTITY_NOTHING;
- simple_heap_update(relationRelation, &classTup->t_self, classTup);
- CatalogUpdateIndexes(relationRelation, classTup);
+ CatalogUpdateHeapAndIndexes(relationRelation, &classTup->t_self, classTup);
heap_freetuple(classTup);
heap_close(relationRelation, RowExclusiveLock);
@@ -866,10 +863,7 @@ EnableDisableRule(Relation rel, const char *rulename,
{
((Form_pg_rewrite) GETSTRUCT(ruletup))->ev_enabled =
CharGetDatum(fires_when);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &ruletup->t_self, ruletup);
changed = true;
}
@@ -985,10 +979,7 @@ RenameRewriteRule(RangeVar *relation, const char *oldName,
/* OK, do the update */
namestrcpy(&(ruleform->rulename), newName);
- simple_heap_update(pg_rewrite_desc, &ruletup->t_self, ruletup);
-
- /* keep system catalog indexes current */
- CatalogUpdateIndexes(pg_rewrite_desc, ruletup);
+ CatalogUpdateHeapAndIndexes(pg_rewrite_desc, &ruletup->t_self, ruletup);
heap_freetuple(ruletup);
heap_close(pg_rewrite_desc, RowExclusiveLock);
diff --git a/src/backend/rewrite/rewriteSupport.c b/src/backend/rewrite/rewriteSupport.c
index 0154072..fc76fab 100644
--- a/src/backend/rewrite/rewriteSupport.c
+++ b/src/backend/rewrite/rewriteSupport.c
@@ -72,10 +72,7 @@ SetRelationRuleStatus(Oid relationId, bool relHasRules)
/* Do the update */
classForm->relhasrules = relHasRules;
- simple_heap_update(relationRelation, &tuple->t_self, tuple);
-
- /* Keep the catalog indexes up to date */
- CatalogUpdateIndexes(relationRelation, tuple);
+ CatalogUpdateHeapAndIndexes(relationRelation, &tuple->t_self, tuple);
}
else
{
diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 262b0b2..de35e03 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -678,8 +678,7 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogUpdateHeapAndIndexes(lo_heap_r, &newtup->t_self, newtup);
heap_freetuple(newtup);
/*
@@ -721,8 +720,7 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
values[Anum_pg_largeobject_pageno - 1] = Int32GetDatum(pageno);
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
- simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(lo_heap_r, newtup);
heap_freetuple(newtup);
}
pageno++;
@@ -850,8 +848,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
replace[Anum_pg_largeobject_data - 1] = true;
newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
values, nulls, replace);
- simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogUpdateHeapAndIndexes(lo_heap_r, &newtup->t_self, newtup);
heap_freetuple(newtup);
}
else
@@ -888,8 +885,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int64 len)
values[Anum_pg_largeobject_pageno - 1] = Int32GetDatum(pageno);
values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
- simple_heap_insert(lo_heap_r, newtup);
- CatalogIndexInsert(indstate, newtup);
+ CatalogInsertHeapAndIndexes(lo_heap_r, newtup);
heap_freetuple(newtup);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 26ff7e1..fe1ecbc 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3484,8 +3484,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
classform->relminmxid = minmulti;
classform->relpersistence = persistence;
- simple_heap_update(pg_class, &tuple->t_self, tuple);
- CatalogUpdateIndexes(pg_class, tuple);
+ CatalogUpdateHeapAndIndexes(pg_class, &tuple->t_self, tuple);
heap_freetuple(tuple);
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a3635a4..1620d7a 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -33,6 +33,9 @@ extern void CatalogCloseIndexes(CatalogIndexState indstate);
extern void CatalogIndexInsert(CatalogIndexState indstate,
HeapTuple heapTuple);
extern void CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple);
+extern void CatalogUpdateHeapAndIndexes(Relation heapRel, ItemPointer otid,
+ HeapTuple tup);
+extern Oid CatalogInsertHeapAndIndexes(Relation heapRel, HeapTuple tup);
/*
Pavan Deolasee wrote:
Two new APIs added.
- CatalogInsertHeapAndIndex which does a simple_heap_insert followed by
catalog updates
- CatalogUpdateHeapAndIndex which does a simple_heap_update followed by
catalog updates
There are only a handful of callers remaining for simple_heap_insert/update after
this patch. They are typically working with already opened indexes and
hence I left them unchanged.
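For reference, a minimal sketch of what these two wrappers amount to (illustrative only, not necessarily the exact committed code):

Oid
CatalogInsertHeapAndIndexes(Relation heapRel, HeapTuple tup)
{
	Oid		oid;

	/* insert the tuple into the heap, then keep the catalog indexes in sync */
	oid = simple_heap_insert(heapRel, tup);
	CatalogUpdateIndexes(heapRel, tup);

	return oid;
}

void
CatalogUpdateHeapAndIndexes(Relation heapRel, ItemPointer otid, HeapTuple tup)
{
	/* update the heap tuple, then keep the catalog indexes in sync */
	simple_heap_update(heapRel, otid, tup);
	CatalogUpdateIndexes(heapRel, tup);
}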
Hmm, I was thinking we would get rid of CatalogUpdateIndexes altogether.
Two of the callers are in the new routines (which I propose to rename to
CatalogTupleInsert and CatalogTupleUpdate); the only remaining one is in
InsertPgAttributeTuple. I propose that we inline the three lines into
all those places and just remove CatalogUpdateIndexes. Half the out-of-
core places that are using this function will be broken as soon as WARM
lands anyway. I see no reason to keep it. (I have already modified the
patch this way -- no need to resend).
Unless there are objections I will push this later this afternoon.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
Unless there are objections I will push this later this afternoon.
Done. Let's get on with the show -- please post a rebased WARM.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-01-31 14:10:01 -0300, Alvaro Herrera wrote:
Pavan Deolasee wrote:
Two new APIs added.
- CatalogInsertHeapAndIndex which does a simple_heap_insert followed by
catalog updates
- CatalogUpdateHeapAndIndex which does a simple_heap_update followed by
catalog updates
There are only a handful of callers remaining for simple_heap_insert/update after
this patch. They are typically working with already opened indexes and
hence I left them unchanged.
Hmm, I was thinking we would get rid of CatalogUpdateIndexes altogether.
Two of the callers are in the new routines (which I propose to rename to
CatalogTupleInsert and CatalogTupleUpdate); the only remaining one is in
InsertPgAttributeTuple. I propose that we inline the three lines into
all those places and just remove CatalogUpdateIndexes. Half the out-of-
core places that are using this function will be broken as soon as WARM
lands anyway. I see no reason to keep it. (I have already modified the
patch this way -- no need to resend).
Unless there are objections I will push this later this afternoon.
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
Andres
Andres Freund wrote:
On 2017-01-31 14:10:01 -0300, Alvaro Herrera wrote:
Hmm, I was thinking we would get rid of CatalogUpdateIndexes altogether.
Two of the callers are in the new routines (which I propose to rename to
CatalogTupleInsert and CatalogTupleUpdate); the only remaining one is in
InsertPgAttributeTuple. I propose that we inline the three lines into
all those places and just remove CatalogUpdateIndexes. Half the out-of-
core places that are using this function will be broken as soon as WARM
lands anyway. I see no reason to keep it. (I have already modified the
patch this way -- no need to resend).
Unless there are objections I will push this later this afternoon.
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
Yeah, I can put it back if there's pushback about the removal, but I
think it's going to break due to WARM anyway.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-01-31 19:10:05 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2017-01-31 14:10:01 -0300, Alvaro Herrera wrote:
Hmm, I was thinking we would get rid of CatalogUpdateIndexes altogether.
Two of the callers are in the new routines (which I propose to rename to
CatalogTupleInsert and CatalogTupleUpdate); the only remaining one is in
InsertPgAttributeTuple. I propose that we inline the three lines into
all those places and just remove CatalogUpdateIndexes. Half the out-of-
core places that are using this function will be broken as soon as WARM
lands anyway. I see no reason to keep it. (I have already modified the
patch this way -- no need to resend).
Unless there are objections I will push this later this afternoon.
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
Yeah, I can put it back if there's pushback about the removal, but I
think it's going to break due to WARM anyway.
I'm a bit doubtful (but not extremely so) that that's ok.
Andres Freund <andres@anarazel.de> writes:
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
If an extension is doing that, it is probably constructing tuples to put
into the catalog, which means it'd be equally (and much more quietly)
broken by any change to the catalog's schema. We've never considered
such an argument as a reason not to change catalog schemas, though.
In short, I've got mighty little sympathy for that argument.
(I'm a little more concerned by Alvaro's apparent position that WARM
is a done deal; I didn't think so. This particular change seems like
good cleanup anyhow, however.)
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Andres Freund <andres@anarazel.de> writes:
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
If an extension is doing that, it is probably constructing tuples to put
into the catalog, which means it'd be equally (and much more quietly)
broken by any change to the catalog's schema. We've never considered
such an argument as a reason not to change catalog schemas, though.
In short, I've got mighty little sympathy for that argument.
+1
(I'm a little more concerned by Alvaro's apparent position that WARM
is a done deal; I didn't think so. This particular change seems like
good cleanup anyhow, however.)
Agreed.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
(I'm a little more concerned by Alvaro's apparent position that WARM
is a done deal; I didn't think so. This particular change seems like
good cleanup anyhow, however.)
Agreed.
BTW, the reason I think it's good cleanup is that it's something that my
colleagues at Salesforce also had to do as part of putting PG on top of a
different storage engine that had different ideas about index handling.
Essentially it's providing a bit of abstraction as to whether catalog
storage is exactly heaps or not (a topic I've noticed Robert is starting
to take some interest in, as well). However, the patch misses an
important part of such an abstraction layer by not also converting
catalog-related simple_heap_delete() calls into some sort of
CatalogTupleDelete() operation. It is certainly a peculiarity of
PG heaps that deletions don't require any immediate index work --- most
other storage engines would need that.
I propose that we should finish the job by inventing CatalogTupleDelete(),
which for the moment would be a trivial wrapper around
simple_heap_delete(), maybe just a macro for it.
If there's no objections I'll go make that happen in a day or two.
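A trivial sketch of what that wrapper could look like (illustrative only; it might equally be a macro):

void
CatalogTupleDelete(Relation heapRel, ItemPointer tid)
{
	/* heap deletions need no immediate index work, so just delegate */
	simple_heap_delete(heapRel, tid);
}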
regards, tom lane
On 2017-01-31 17:21:28 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Hm, sorry for missing this earlier. I think CatalogUpdateIndexes() is
fairly widely used in extensions - it seems like a pretty harsh change
to not leave some backward compatibility layer in place.
If an extension is doing that, it is probably constructing tuples to put
into the catalog, which means it'd be equally (and much more quietly)
broken by any change to the catalog's schema. We've never considered
such an argument as a reason not to change catalog schemas, though.
I know of several extensions that use CatalogUpdateIndexes() to update
their own tables. Citus included (It's trivial to change on our side, so
that's not a reason to do or not do something). There really is no
convenient API to do so without it.
(I'm a little more concerned by Alvaro's apparent position that WARM
is a done deal; I didn't think so. This particular change seems like
good cleanup anyhow, however.)
Yea, I don't think we're even close to that either.
Andres
Tom Lane wrote:
BTW, the reason I think it's good cleanup is that it's something that my
colleagues at Salesforce also had to do as part of putting PG on top of a
different storage engine that had different ideas about index handling.
Essentially it's providing a bit of abstraction as to whether catalog
storage is exactly heaps or not (a topic I've noticed Robert is starting
to take some interest in, as well).
Yeah, I remembered that too. Of course, we'd need to change the whole
idea of mapping tuples to C structs too, but this seemed a nice step
forward. (I renamed Pavan's proposed routine precisely to avoid the
word "Heap" in it.)
However, the patch misses an
important part of such an abstraction layer by not also converting
catalog-related simple_heap_delete() calls into some sort of
CatalogTupleDelete() operation. It is certainly a peculiarity of
PG heaps that deletions don't require any immediate index work --- most
other storage engines would need that.
I propose that we should finish the job by inventing CatalogTupleDelete(),
which for the moment would be a trivial wrapper around
simple_heap_delete(), maybe just a macro for it.
If there's no objections I'll go make that happen in a day or two.
Sounds good.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 1, 2017 at 9:36 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I propose that we should finish the job by inventing CatalogTupleDelete(),
which for the moment would be a trivial wrapper around
simple_heap_delete(), maybe just a macro for it.
If there's no objections I'll go make that happen in a day or two.
Sounds good.
As you are on it, I have moved the patch to CF 2017-03.
--
Michael
On Tue, Jan 31, 2017 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+	AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+	ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
Actually, I think this macro could just return the TID so that it can be
used as struct assignment, just like ItemPointerCopy does internally --
callers can do
	ctid = HeapTupleHeaderGetNextTid(tup);
While I agree with your proposal, I wonder why we have ItemPointerCopy() in
the first place because we freely copy TIDs as struct assignment. Is there
a reason for that? And if there is, does it impact this specific case?
I dunno. This macro is present in our very first commit d31084e9d1118b.
Maybe it's an artifact from the Lisp to C conversion. Even then, we had
some cases of iptrs being copied by struct assignment, so it's not like
it didn't work. Perhaps somebody envisioned that the internal details
could change, but that hasn't happened in two decades so why should we
worry about it now? If somebody needs it later, it can be changed then.
May I suggest in that case that we apply the attached patch which removes
all references to ItemPointerCopy and its definition as well? This will
avoid confusion in future too. No issues noticed in regression tests.
There is one issue that bothers me. The current implementation lacks
ability to convert WARM chains into HOT chains. The README.WARM has some
proposal to do that. But it requires additional free bit in tuple header
(which we don't have) and of course, it needs to be vetted and implemented.
If the heap ends up with many WARM tuples, then index-only-scans will
become ineffective because index-only-scan can not skip a heap page, if it
contains a WARM tuple. Alternate ideas/suggestions and review of the design
are welcome!
t_infomask2 contains one last unused bit,
Umm, WARM is using 2 unused bits from t_infomask2. You mean there is
another free bit after that too?
and we could reuse vacuum
full's bits (HEAP_MOVED_OUT, HEAP_MOVED_IN), but that will need some
thinking ahead. Maybe now's the time to start versioning relations so
that we can ensure clusters upgraded to pg10 do not contain any of those
bits in any tuple headers.
Yeah, IIRC old VACUUM FULL was removed in 9.0, which is a good 6 years old.
Obviously, there still a chance that a pre-9.0 binary upgraded cluster
exists and upgrades to 10. So we still need to do something about them if
we reuse these bits. I'm surprised to see that we don't have any mechanism
in place to clear those bits. So may be we add something to do that.
I had some other ideas (and a patch too) to reuse bits from t_ctid.ip_posid,
given that offset numbers can be represented in just 13 bits, even with the
maximum block size. Can look at that if it comes to finding more bits.
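Purely as an illustration of that idea (all names below are invented, not from any posted patch): even a 32kB page cannot hold more than roughly 8k line pointers, so the real offset fits in the low 13 bits of ip_posid and the top 3 bits could in principle carry flags.

/* hypothetical sketch; macro and mask names are made up for illustration */
#define HEAP_TID_OFFSET_MASK	0x1FFF	/* low 13 bits: the real offset number */
#define HEAP_TID_FLAG_MASK		0xE000	/* high 3 bits: available for flags */

#define HeapTidGetOffsetNumber(tid) \
	((OffsetNumber) ((tid)->ip_posid & HEAP_TID_OFFSET_MASK))
#define HeapTidGetFlags(tid) \
	((tid)->ip_posid & HEAP_TID_FLAG_MASK)
#define HeapTidSetFlags(tid, flags) \
	((tid)->ip_posid |= ((uint16) (flags) & HEAP_TID_FLAG_MASK))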
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
remove_itempointercopy.patch
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5fd7f1e..2bbd59c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4642,7 +4642,7 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+ t_ctid = tuple->t_data->t_ctid;
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5671,14 +5671,14 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
- ItemPointerCopy(tid, &tupid);
+ tupid = *tid;
for (;;)
{
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
- ItemPointerCopy(&tupid, &(mytup.t_self));
+ mytup.t_self = tupid;
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
{
@@ -5916,7 +5916,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ tupid = mytup.t_data->t_ctid;
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index b3e89a4..19b3d80 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3792,7 +3792,7 @@ AfterTriggerExecute(AfterTriggerEvent event,
default:
if (ItemPointerIsValid(&(event->ate_ctid1)))
{
- ItemPointerCopy(&(event->ate_ctid1), &(tuple1.t_self));
+ tuple1.t_self = event->ate_ctid1;
if (!heap_fetch(rel, SnapshotAny, &tuple1, &buffer1, false, NULL))
elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
LocTriggerData.tg_trigtuple = &tuple1;
@@ -3809,7 +3809,7 @@ AfterTriggerExecute(AfterTriggerEvent event,
AFTER_TRIGGER_2CTID &&
ItemPointerIsValid(&(event->ate_ctid2)))
{
- ItemPointerCopy(&(event->ate_ctid2), &(tuple2.t_self));
+ tuple2.t_self = event->ate_ctid2;
if (!heap_fetch(rel, SnapshotAny, &tuple2, &buffer2, false, NULL))
elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
LocTriggerData.tg_newtuple = &tuple2;
@@ -5152,7 +5152,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
{
Assert(oldtup == NULL);
Assert(newtup != NULL);
- ItemPointerCopy(&(newtup->t_self), &(new_event.ate_ctid1));
+ new_event.ate_ctid1 = newtup->t_self;
ItemPointerSetInvalid(&(new_event.ate_ctid2));
}
else
@@ -5169,7 +5169,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
{
Assert(oldtup != NULL);
Assert(newtup == NULL);
- ItemPointerCopy(&(oldtup->t_self), &(new_event.ate_ctid1));
+ new_event.ate_ctid1 = oldtup->t_self;
ItemPointerSetInvalid(&(new_event.ate_ctid2));
}
else
@@ -5186,8 +5186,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
{
Assert(oldtup != NULL);
Assert(newtup != NULL);
- ItemPointerCopy(&(oldtup->t_self), &(new_event.ate_ctid1));
- ItemPointerCopy(&(newtup->t_self), &(new_event.ate_ctid2));
+ new_event.ate_ctid1 = oldtup->t_self;
+ new_event.ate_ctid2 = newtup->t_self;
}
else
{
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index a8bd583..7ea8e44 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -170,7 +170,7 @@ retry:
HTSU_Result res;
HeapTupleData locktup;
- ItemPointerCopy(&outslot->tts_tuple->t_self, &locktup.t_self);
+ locktup.t_self = outslot->tts_tuple->t_self;
PushActiveSnapshot(GetLatestSnapshot());
@@ -317,7 +317,7 @@ retry:
HTSU_Result res;
HeapTupleData locktup;
- ItemPointerCopy(&outslot->tts_tuple->t_self, &locktup.t_self);
+ locktup.t_self = outslot->tts_tuple->t_self;
PushActiveSnapshot(GetLatestSnapshot());
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d805ef4..3a5d8fc 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1233,8 +1233,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
key.relnode = change->data.tuplecid.node;
- ItemPointerCopy(&change->data.tuplecid.tid,
- &key.tid);
+ key.tid = change->data.tuplecid.tid;
ent = (ReorderBufferTupleCidEnt *)
hash_search(txn->tuplecid_hash,
@@ -3106,9 +3105,7 @@ ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
(int32) sizeof(LogicalRewriteMappingData))));
key.relnode = map.old_node;
- ItemPointerCopy(&map.old_tid,
- &key.tid);
-
+ key.tid = map.old_tid;
ent = (ReorderBufferTupleCidEnt *)
hash_search(tuplecid_data,
@@ -3121,8 +3118,7 @@ ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
continue;
key.relnode = map.new_node;
- ItemPointerCopy(&map.new_tid,
- &key.tid);
+ key.tid = map.new_tid;
new_ent = (ReorderBufferTupleCidEnt *)
hash_search(tuplecid_data,
@@ -3297,8 +3293,7 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
Assert(forkno == MAIN_FORKNUM);
Assert(blockno == ItemPointerGetBlockNumber(&htup->t_self));
- ItemPointerCopy(&htup->t_self,
- &key.tid);
+ key.tid = htup->t_self;
restart:
ent = (ReorderBufferTupleCidEnt *)
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index a3b372f..1eda80d 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -354,7 +354,7 @@ currtid_byreloid(PG_FUNCTION_ARGS)
if (rel->rd_rel->relkind == RELKIND_VIEW)
return currtid_for_view(rel, tid);
- ItemPointerCopy(tid, result);
+ *result = *tid;
snapshot = RegisterSnapshot(GetLatestSnapshot());
heap_get_latest_tid(rel, snapshot, result);
@@ -389,7 +389,7 @@ currtid_byrelname(PG_FUNCTION_ARGS)
return currtid_for_view(rel, tid);
result = (ItemPointer) palloc(sizeof(ItemPointerData));
- ItemPointerCopy(tid, result);
+ *result = *tid;
snapshot = RegisterSnapshot(GetLatestSnapshot());
heap_get_latest_tid(rel, snapshot, result);
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
However, the patch misses an
important part of such an abstraction layer by not also converting
catalog-related simple_heap_delete() calls into some sort of
CatalogTupleDelete() operation. It is certainly a peculiarity of
PG heaps that deletions don't require any immediate index work --- most
other storage engines would need that.
I propose that we should finish the job by inventing CatalogTupleDelete(),
which for the moment would be a trivial wrapper around
simple_heap_delete(), maybe just a macro for it.
If there's no objections I'll go make that happen in a day or two.
Sounds good.
So while I was working on this I got quite unhappy with the
already-committed patch: it's a leaky abstraction in more ways than
this, and it's created a possibly-serious performance regression
for large objects (and maybe other places).
The source of both of those problems is that in some places, we
did CatalogOpenIndexes and then used the CatalogIndexState for
multiple tuple inserts/updates before doing CatalogCloseIndexes.
The patch dealt with these either by not touching them, just
leaving the simple_heap_insert/update calls in place (thus failing
to create any abstraction), or by blithely ignoring the optimization
and doing s/simple_heap_insert/CatalogTupleInsert/ anyway. For example,
in inv_api.c we are now doing a CatalogOpenIndexes/CatalogCloseIndexes
cycle for each chunk of the large object ... and just to add insult to
injury, the now-useless open/close calls outside the loop are still there.
I think what we ought to do about this is invent additional API
functions, say
Oid CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
CatalogIndexState indstate);
void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid,
HeapTuple tup, CatalogIndexState indstate);
and use these in place of simple_heap_foo plus CatalogIndexInsert
in the places where this optimization had been applied.
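For concreteness, a rough sketch of what the insert-side wrapper and its
intended use in a multi-insert loop (such as the one in inv_api.c) might look
like; this is only a sketch of the proposal, using the prototypes above plus
the existing CatalogOpenIndexes/CatalogIndexInsert/CatalogCloseIndexes calls,
and get_next_tuple() is a hypothetical stand-in for the caller's loop:

    Oid
    CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
                               CatalogIndexState indstate)
    {
        Oid oid = simple_heap_insert(heapRel, tup);

        CatalogIndexInsert(indstate, tup);
        return oid;
    }

    /* caller opens the index info once and reuses it for every tuple */
    indstate = CatalogOpenIndexes(heapRel);
    while ((tup = get_next_tuple()) != NULL)   /* hypothetical per-chunk loop */
        CatalogTupleInsertWithInfo(heapRel, tup, indstate);
    CatalogCloseIndexes(indstate);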
An alternative but much more complicated fix would be to get rid of
the necessity for callers to worry about this at all, by caching
a CatalogIndexState in the catalog's relcache entry. That might be
worth doing eventually (because it would allow sharing index info
collection across unrelated operations) but I don't want to do it today.
Objections, better naming ideas?
regards, tom lane
Tom Lane wrote:
The source of both of those problems is that in some places, we
did CatalogOpenIndexes and then used the CatalogIndexState for
multiple tuple inserts/updates before doing CatalogCloseIndexes.
The patch dealt with these either by not touching them, just
leaving the simple_heap_insert/update calls in place (thus failing
to create any abstraction), or by blithely ignoring the optimization
and doing s/simple_heap_insert/CatalogTupleInsert/ anyway. For example,
in inv_api.c we are now doing a CatalogOpenIndexes/CatalogCloseIndexes
cycle for each chunk of the large object ... and just to add insult to
injury, the now-useless open/close calls outside the loop are still there.
Ouch. You're right, I missed that.
I think what we ought to do about this is invent additional API
functions, say
Oid CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
CatalogIndexState indstate);
void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid,
HeapTuple tup, CatalogIndexState indstate);
and use these in place of simple_heap_foo plus CatalogIndexInsert
in the places where this optimization had been applied.
This looks reasonable enough to me.
An alternative but much more complicated fix would be to get rid of
the necessity for callers to worry about this at all, by caching
a CatalogIndexState in the catalog's relcache entry. That might be
worth doing eventually (because it would allow sharing index info
collection across unrelated operations) but I don't want to do it today.
Hmm, interesting idea. No disagreement on postponing.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
I think what we ought to do about this is invent additional API
functions, say
Oid CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
CatalogIndexState indstate);
void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid,
HeapTuple tup, CatalogIndexState indstate);
and use these in place of simple_heap_foo plus CatalogIndexInsert
in the places where this optimization had been applied.
This looks reasonable enough to me.
Done.
regards, tom lane
On Thu, Feb 2, 2017 at 3:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
I think what we ought to do about this is invent additional API
functions, say
Oid CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
CatalogIndexState indstate);
void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid,
HeapTuple tup, CatalogIndexState indstate);
and use these in place of simple_heap_foo plus CatalogIndexInsert
in the places where this optimization had been applied.
This looks reasonable enough to me.
Done.
Thanks for taking care of this. Shame that I missed this because I'd
specifically noted the special casing for large objects etc. But it looks like,
while changing 180+ call sites, I forgot my notes.
Thanks again,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 1, 2017 at 3:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Alvaro Herrera wrote:
Unless there are objections I will push this later this afternoon.
Done. Let's get on with the show -- please post a rebased WARM.
Please see rebased patches attached. There is not much change other than
the fact that the patch now uses the new catalog maintenance API.
Do you think we should apply the patch to remove ItemPointerCopy()? I will
rework the HeapTupleHeaderGetNextTid() after that. Not that it depends on
removing ItemPointerCopy(), but decided to postpone it until we make a call
on that patch.
BTW I've now run long stress tests with the patch applied and see no new
issues, even when indexes are dropped and recreated concurrently (this includes
my patch to fix the CIC bug in master, though). In another 24-hour test,
WARM could do 274M transactions whereas master did 164M transactions. I
did not drop and recreate indexes during this run.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0002_warm_updates_v11.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..7a9a976 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -141,6 +141,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b2afdb7..ef3bfa3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -115,6 +115,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c2247ad..2135ae0 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -92,6 +92,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index ec8ed33..4861957 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -89,6 +89,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -269,6 +270,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -306,8 +309,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index a59ad6f..46a334c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -408,6 +410,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..7b9a712
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,306 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for satisfying HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is the same requirement
+as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted into an index for
+the updated tuple during a WARM update, the new entry is made to point to
+the root of the WARM chain.
+
+For example, consider a table with two columns and one index on each
+column. When a tuple is first inserted into the table, each index has
+exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any new
+entry, and Index2's new entry still points to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without any need for corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for a match against the index key.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, and hence the recheck
+routine for the hash AM must first compute the hash value of the heap
+attribute and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
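+
+Judging from the hashrecheck and btrecheck routines added by this patch,
+the callback exposed through IndexAmRoutine has the following shape:
+
+    bool (*amrecheck) (Relation indexRel, IndexTuple indexTuple,
+                       Relation heapRel, HeapTuple heapTuple);
+
+It returns true if the heap tuple still satisfies the key stored in the
+given index tuple, and false otherwise.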
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as there are no duplicate
+index keys, both pointing to the same WARM chain. In that case, the same
+valid tuple will be reachable via multiple index keys, yet satisfying
+the index key checks. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not allow a WARM
+update to a tuple that is already part of a WARM chain. HOT updates are
+fine because they do not add a new index entry.
+
+Even with this restriction, this is a significant improvement because the
+number of regular UPDATEs is cut roughly in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value differs only in case.
+So we cannot rely solely on the heap column check to decide whether or
+not to insert a new index entry for expression indexes. Similarly, for
+partial indexes, the predicate expression must be evaluated to decide
+whether or not a new index entry is needed when columns referred to in
+the predicate expression change.
+
+(None of this is currently implemented; we simply disallow a WARM
+update if a column used in an expression index or in an index predicate
+has changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain. In such cases, the t_ctid field usually points to the tuple itself.
+So in theory, we could use t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which remains the last valid tuple in the
+chain no longer carries the root line pointer information. In such
+rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
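+
+In code this works out to roughly the following pattern (macro and
+function names as used elsewhere in this patch):
+
+    if (HeapTupleHeaderHasRootOffset(tup->t_data))
+        root_offnum = HeapTupleHeaderGetRootOffset(tup->t_data);
+    else
+        /* rare case: scan the page to find the chain's root the hard way */
+        root_offnum = heap_get_root_tuple(page,
+                          ItemPointerGetOffsetNumber(&tup->t_self));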
+
+Tracking WARM Chains
+--------------------
+
+The old and every subsequent tuple in the chain is marked with a special
+HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for the index key match (the case when an old tuple is
+returned via the new index key). So we must follow the update chain to
+the end every time to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But this also implies that the benefit of WARM will be
+no more than 50%, which is still significant, but if we could return
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between these two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from index scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only a
+Blue pointer and that must not be removed. IOW we should remove the Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have two index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply the heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for the presence of multiple Blue pointers and ensure that we don't
+accidentally remove either of the Blue pointers, and that we don't clear
+such WARM chains either.
+
+CREATE INDEX CONCURRENTLY
+-------------------------
+
+Currently CREATE INDEX CONCURRENTLY (CIC) is implemented as a 3-phase
+process. In the first phase, we create catalog entry for the new index
+so that the index is visible to all other backends, but still don't use
+it for either read or write. But we ensure that no new broken HOT
+chains are created by new transactions. In the second phase, we build
+the new index using a MVCC snapshot and then make the index available
+for inserts. We then do another pass over the index and insert any
+missing tuples, each time indexing only the tuple's root line pointer. See
+README.HOT for details about how HOT impacts CIC and how the various
+challenges are tackled.
+
+WARM poses another challenge because it allows creation of HOT chains
+even when an index key is changed. But since the index is not ready for
+insertion until the second phase is over, we might end up with a
+situation where the HOT chain has tuples with different index columns,
+yet only one of these values is indexed by the new index. Note that
+during the third phase, we only index tuples whose root line pointer is
+missing from the index. But we can't easily check if the existing index
+tuple is actually indexing the heap tuple visible to the new MVCC
+snapshot. Finding that information will require us to query the index
+again for every tuple in the chain, especially if it's a WARM tuple.
+This would require repeated access to the index. Another option would be
+to return index keys along with the heap TIDs when index is scanned for
+collecting all indexed TIDs during third phase. We can then compare the
+heap tuple against the already indexed key and decide whether or not to
+index the new tuple.
+
+We solve this problem more simply by disallowing WARM updates until the
+index is ready for insertion. We don't need to disallow WARM on a
+wholesale basis; only those updates that change the columns of the
+new index are prevented from being WARM updates.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5149c07..8be0137 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1957,6 +1957,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update ultimately aborted, we still must do a
+ * recheck because the failing UPDATE may have created index entries
+ * which are now stale, but still reference this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage, and the caller must recheck any tuple returned from this
+ * chain for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * It can't be a HOT chain if the tuple contains root line pointer
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1976,11 +2048,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2034,9 +2109,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2049,6 +2127,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2097,7 +2185,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2121,18 +2210,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3491,15 +3603,18 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
+ Bitmapset *notready_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3520,6 +3635,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3544,6 +3660,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3565,10 +3685,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+ notready_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_NOTREADY);
+
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, notready_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3620,6 +3747,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3875,6 +4005,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4193,6 +4324,37 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good to do after a first
+ * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This will be
+ * fixed once the basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !IsSystemRelation(relation) &&
+ !bms_overlap(notready_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4239,6 +4401,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID> in
+ * indexes, we could look at relaxing this restriction and allow even
+ * more WARM updates
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4251,12 +4429,35 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4366,7 +4567,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4506,7 +4710,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ Bitmapset **modified_attrs, bool *warm_update)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4515,7 +4720,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs, warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7567,6 +7772,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7578,6 +7784,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7651,6 +7860,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8628,16 +8839,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8697,6 +8914,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple has a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8832,6 +9054,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple has a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f54337c..c2bd7d6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -834,6 +834,13 @@ heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
continue;
+ /*
+ * If the tuple has root line pointer, it must be the end of the
+ * chain
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
/* Set up to scan the HOT-chain */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ba27c1e..3cbe1d0 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -75,10 +75,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -233,6 +235,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if the function was called earlier in the session when we're
+ * here. We can't call it now because there exists a risk of causing
+ * deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -534,7 +551,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -573,7 +590,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -600,6 +617,12 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple.
+ * Otherwise we must recheck every tuple.
+ */
+ scan->xs_tuple_recheck = scan->xs_recheck;
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -609,32 +632,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks,
+ * since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or is CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 883d70d..6efccf7 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,11 +19,14 @@
#include "access/nbtree.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -249,6 +252,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -308,6 +314,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -325,112 +333,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 469e7ab..27013f4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -121,6 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -301,8 +303,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..9becaeb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2065,3 +2069,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check whether the index tuple's key matches the key computed from the
+ * given heap tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..2236f02 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -71,6 +71,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 815a694..1e8cdbd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1687,6 +1688,20 @@ BuildIndexInfo(Relation index)
ii->ii_Concurrent = false;
ii->ii_BrokenHotChain = false;
+ /* build a bitmap of all table attributes referred to by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 76268e1..b2bfa10 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -66,10 +66,15 @@ CatalogCloseIndexes(CatalogIndexState indstate)
*
* This should be called for each inserted or updated catalog tuple.
*
+ * If the tuple was WARM updated, modified_attrs contains the set of
+ * columns changed by the update. We must not insert new index entries for
+ * indexes that do not reference any of the modified columns.
+ *
* This is effectively a cut-down version of ExecInsertIndexTuples.
*/
static void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
int i;
int numIndexes;
@@ -79,12 +84,28 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * need them for some indexes.
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +133,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +168,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
@@ -167,7 +199,7 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
CatalogCloseIndexes(indstate);
return oid;
@@ -189,7 +221,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
@@ -209,12 +241,14 @@ void
CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
{
CatalogIndexState indstate;
+ bool warm_update;
+ Bitmapset *modified_attrs;
indstate = CatalogOpenIndexes(heapRel);
- simple_heap_update(heapRel, otid, tup);
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
CatalogCloseIndexes(indstate);
}
@@ -230,9 +264,12 @@ void
CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
CatalogIndexState indstate)
{
- simple_heap_update(heapRel, otid, tup);
+ Bitmapset *modified_attrs;
+ bool warm_update;
+
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 28be27a..92fa6e0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -493,6 +493,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -523,7 +524,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index e9eeacd..f199074 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = castNode(TriggerData, fcinfo->context);
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 949844d..38702e5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2680,6 +2680,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2834,6 +2836,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ed6136c..0fc77b6 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -694,7 +694,14 @@ DefineIndex(Oid relationId,
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * tries to either insert into it or use it for queries. In addition,
+ * WARM updates will be disallowed if an update modifies one of the
+ * columns used by this new index. This is necessary to ensure that we
+ * don't create WARM tuples which do not have a corresponding entry in
+ * this index. Note that during the second phase we will index only
+ * those heap tuples whose root line pointer is not already in the index,
+ * hence it's important that all tuples in a given chain have the same
+ * value for any indexed column (including this new index).
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
@@ -742,7 +749,10 @@ DefineIndex(Oid relationId,
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
- * chain have different index keys.
+ * chain have different index keys. Also, the new index is consulted for
+ * deciding whether a WARM update is possible, and a WARM update is not done
+ * if a column used by this index is being updated. This ensures that we
+ * don't create WARM tuples which are not indexed by this index.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
@@ -777,7 +787,8 @@ DefineIndex(Oid relationId,
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
- * insert new entries into the index for insertions and non-HOT updates.
+ * insert new entries into the index for insertions and non-HOT updates, or
+ * for WARM updates where this index needs a new entry.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..1388be1 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1032,6 +1032,19 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM
+ * tuple, there could be multiple index entries
+ * pointing to the root of this chain. We can't do
+ * index-only scans for such tuples without rechecking
+ * the index keys, so mark the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ break;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, visibility_cutoff_xid))
visibility_cutoff_xid = xmin;
@@ -2158,6 +2171,18 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without rechecking the index keys, so mark
+ * the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9920f48..94cf92f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique); /* type of uniqueness check to do */
@@ -790,6 +803,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index a8bd583..b6c115d 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -399,6 +399,8 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate, false, NULL,
NIL);
@@ -445,6 +447,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +459,30 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple, &modified_attrs, &warm_update);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
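+
+ /*
+ * For a WARM update the new index entries must point at the root
+ * line pointer of the chain; otherwise index the tuple's own TID
+ * and clear modified_attrs so that every index gets a new entry.
+ */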
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self,
+ &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f18827d..f81d290 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update,
+ * it's a lossy case
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..c7be366 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,10 +115,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..a1f3440 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
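+ /*
+ * Regular non-HOT update: index the tuple's own TID and clear
+ * modified_attrs so that entries are inserted into all indexes.
+ */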
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..432dd4b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4085,6 +4087,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5194,6 +5197,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5221,6 +5225,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..5801703 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -4351,6 +4352,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
@@ -4756,14 +4764,18 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
+ Bitmapset *indxnotreadyattrs; /* columns in not ready indexes */
List *indexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* true if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4778,6 +4790,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return bms_copy(relation->rd_indxnotreadyattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4818,9 +4834,11 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
+ indxnotreadyattrs = NULL;
foreach(l, indexoidlist)
{
Oid indexOid = lfirst_oid(l);
@@ -4857,6 +4875,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
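+ /*
+ * Also track columns used by indexes that are not yet ready for
+ * inserts (such as an index being built concurrently); updates that
+ * modify these columns must not be performed as WARM updates.
+ */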
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_member(indxnotreadyattrs,
+ attrnum - FirstLowInvalidHeapAttributeNumber);
+
if (isKey)
uindexattrs = bms_add_member(uindexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
@@ -4872,25 +4894,51 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_members(indxnotreadyattrs,
+ exprindexattrs);
+
+ /*
+ * Check whether the index AM provides the amrecheck method. If it
+ * does not, the index cannot support WARM, so disable WARM updates
+ * on such tables entirely.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
+
index_close(indexDesc, AccessShareLock);
}
list_free(indexoidlist);
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
relation->rd_pkattr = NULL;
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
+ bms_free(relation->rd_indxnotreadyattr);
+ relation->rd_indxnotreadyattr = NULL;
/*
* Now save copies of the bitmaps in the relcache entry. We intentionally
@@ -4903,7 +4951,9 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
+ relation->rd_indxnotreadyattr = bms_copy(indxnotreadyattrs);
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4917,6 +4967,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return indxnotreadyattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
@@ -5529,6 +5583,7 @@ load_relcache_init_file(bool shared)
rel->rd_keyattr = NULL;
rel->rd_pkattr = NULL;
rel->rd_idattr = NULL;
+ rel->rd_indxnotreadyattr = NULL;
rel->rd_pubactions = NULL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e91e41d..34430a9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -150,6 +151,10 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -213,6 +218,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WARM */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 69a3873..3e14023 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -364,4 +364,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 95aa976..9412c3a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -161,7 +162,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -176,7 +178,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ Bitmapset **modified_attrs,
+ bool *warm_update);
extern void heap_sync(Relation relation);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a4a1fe1..b4238e5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7552186..ddbdbcd 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
/*
* Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
* the TID of the tuple itself in t_ctid field to mark the end of the chain.
@@ -785,6 +801,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..98129d6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -750,6 +750,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce3ca8d..12d3b0c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -112,7 +112,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 05652e8..c132b10 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..2c4d884 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -37,5 +37,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..07f2900 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -62,6 +62,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..ee635be 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1177,7 +1179,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7c..fbac7c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -138,9 +138,14 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
+ Bitmapset *rd_indxnotreadyattr; /* columns used by indexes not yet
+ ready */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm; /* true if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index da36b67..d18bd09 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -50,7 +50,9 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE,
+ INDEX_ATTR_BITMAP_NOTREADY
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index de5ae00..7656e6e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1728,6 +1728,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1871,6 +1872,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1914,6 +1916,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1951,7 +1954,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1967,7 +1971,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1989,7 +1994,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa3bb7
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- Only a non-index column is updated, so this could be a HOT update, but
+-- the page won't have any free space, so it is probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=72)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=4)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1 (cost=0.29..9.16 rows=50 width=4)
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2 (cost=0.14..4.16 rows=1 width=4)
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on updtst_tab3 (cost=0.00..2.25 rows=1 width=4)
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3 (cost=0.14..8.16 rows=1 width=4)
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index edeb2d6..2268705 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..b73c278
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,172 @@
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
Attachment: 0001_track_root_lp_v11.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 84447f0..5149c07 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -93,7 +93,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2247,13 +2248,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2384,6 +2385,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2422,8 +2424,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
- RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup,
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
+
+ /* We must not overwrite the speculative insertion token. */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2651,6 +2658,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2721,7 +2729,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
+
+ /* Mark this tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2729,7 +2742,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
+ /* Mark each tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -3001,6 +3017,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3011,6 +3028,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3052,7 +3070,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3182,7 +3201,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain.
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3231,6 +3260,22 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+	 * heap_get_root_tuple() may call palloc, which is disallowed once we
+	 * enter the critical section. So check if the root offset is cached in the
+	 * tuple and if not, fetch that information the hard way before entering the
+	 * critical section.
+ *
+ * Most often and unless we are dealing with a pg-upgraded cluster, the
+ * root offset information should be cached. So there should not be too
+ * much overhead of fetching this information. Also, once a tuple is
+ * updated, the information will be copied to the new version. So it's not
+ * as if we're going to pay this price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tp.t_self));
+
START_CRIT_SECTION();
/*
@@ -3258,8 +3303,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain. */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3460,6 +3507,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3522,6 +3571,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3806,7 +3856,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3946,6 +4001,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3973,6 +4029,14 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available.
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self));
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3987,7 +4051,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4145,6 +4210,10 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4170,6 +4239,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+	 * the information must be obtained the hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4177,10 +4257,22 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
+ root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, then we just copy the information from
+	 * the old tuple if available, or use the value computed above. For regular updates,
+ * RelationPutHeapTuple must have returned us the actual offset number
+ * where the new version was inserted and we store the same value since the
+ * update resulted in a new HOT-chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4193,7 +4285,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4232,6 +4324,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4512,7 +4605,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4521,9 +4615,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4543,6 +4639,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4570,7 +4667,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5008,7 +5109,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5056,6 +5162,10 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self));
+
START_CRIT_SECTION();
/*
@@ -5084,7 +5194,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5598,6 +5711,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5606,6 +5720,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5835,7 +5951,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5844,7 +5960,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5961,7 +6077,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6087,8 +6203,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7436,6 +7551,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7556,6 +7672,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8210,7 +8329,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, xlrec->offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8300,7 +8425,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8435,8 +8561,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8572,7 +8698,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8705,13 +8831,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored.
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8774,6 +8904,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself. */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8837,11 +8970,17 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..8052519 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,20 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it's
+ * known. The former is used while updating an existing tuple where the caller
+ * tells us about the root line pointer of the chain. The latter is used
+ * during insertion of a new row, hence root line pointer is set to the offset
+ * where this tuple is inserted.
*/
-void
+OffsetNumber
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,17 +68,24 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead).
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number. */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(root_offnum))
+ root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, root_offnum);
}
+
+ return root_offnum;
}
/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..f54337c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple to be the last tuple
+ * in the chain, without clearing the HOT-updated flag. So we must
+ * check if this is the last tuple in the chain and stop following the
+ * CTID, else we risk getting into an infinite recursion (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,47 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is a InvalidOffsetNumber then the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line pointers
+ * and tuples on the page in order to find the root line pointers. To minimize
+ * the cost, we break early if target_offnum is specified and root line pointer
+ * to target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +807,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return.
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+			 * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping. */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +869,65 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return.
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+			 * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item. */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple to be the last tuple in the chain
+ * and store root offset in CTID, without clearing the HOT-updated
+ * flag. So we must check if CTID is actually root offset and break
+ * to avoid infinite recursion.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple.
+ */
+OffsetNumber
+heap_get_root_tuple(Page page, OffsetNumber target_offnum)
+{
+ OffsetNumber offnum = InvalidOffsetNumber;
+ heap_get_root_tuples_internal(page, target_offnum, &offnum);
+ return offnum;
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+	heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+			root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 90ab6f2..e11b4a2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain.
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest.
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 8d119f6..9920f48 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -788,7 +788,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3a5b5b2..12476e7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2589,7 +2589,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2597,7 +2597,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a864f78..95aa976 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -189,6 +189,7 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern OffsetNumber heap_get_root_tuple(Page page, OffsetNumber target_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 52f28b8..a4a1fe1 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 2824f23..921cb37 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -35,8 +35,8 @@ typedef struct BulkInsertStateData
} BulkInsertStateData;
-extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index a6c7e31..7552186 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+										 * This is the last tuple in the chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,43 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+/*
+ * Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
+ * the TID of the tuple itself in t_ctid field to mark the end of the chain.
+ * But starting PG v10, we use a special flag HEAP_LATEST_TUPLE to identify the
+ * last tuple and store the root line pointer of the HOT chain in t_ctid field
+ * instead.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * Starting from PostgreSQL 10, the latest tuple in an update chain has
+ * HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do not.
+ * For those, we determine whether a tuple is latest by testing that its t_ctid
+ * points to itself.
+ *
+ * Note: beware of multiple evaluations of "tup" and "tid" arguments.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +585,56 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * now have a new tuple in the chain and this is no longer the last tuple of
+ * the chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller must have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+/*
+ * Get the root line pointer of the HOT chain. The caller should have confirmed
+ * that the root offset is cached before calling this macro.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * Return whether the tuple has a cached root offset. We don't use
+ * HeapTupleHeaderIsHeapLatest because that one also considers the case of
+ * t_ctid pointing to itself, for tuples migrated from pre v10 clusters. Here
+ * we are only interested in the tuples which are marked with HEAP_LATEST_TUPLE
+ * flag.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
Attachment: 0000_interesting_attrs.patch (application/octet-stream)
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index 5fd7f1e..84447f0 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -95,11 +95,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3454,6 +3451,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3471,9 +3470,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3500,21 +3496,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3535,7 +3540,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3561,6 +3566,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3572,10 +3581,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
 * serendipitously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3814,6 +3820,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4118,7 +4126,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4133,7 +4141,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4281,13 +4291,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4321,7 +4333,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4366,114 +4378,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
- *
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
-
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
+ int attnum;
+ Bitmapset *modified = NULL;
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
-
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
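To make the refactoring above easier to follow, here is a minimal sketch (not
part of the patch; variable names and the exact call sites are illustrative)
of how heap_update would typically consume the bitmapset returned by
HeapDetermineModifiedColumns, using the existing bms_* and
RelationGetIndexAttrBitmap() helpers:

/*
 * Illustrative fragment only: build the union of interesting columns,
 * compute which of them were modified, and reduce HOT/key decisions to
 * simple overlap tests.
 */
Bitmapset  *hot_attrs = RelationGetIndexAttrBitmap(relation,
                                                   INDEX_ATTR_BITMAP_ALL);
Bitmapset  *key_attrs = RelationGetIndexAttrBitmap(relation,
                                                   INDEX_ATTR_BITMAP_KEY);
Bitmapset  *interesting_attrs = NULL;
Bitmapset  *modified_attrs;
bool        use_hot_update;

interesting_attrs = bms_add_members(interesting_attrs, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);

/* interesting_attrs is destroyed by the call, as noted in the comment above */
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
                                              &oldtup, newtup);

/* HOT is only possible if no indexed column changed */
use_hot_update = !bms_overlap(modified_attrs, hot_attrs);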
Pavan Deolasee wrote:
Do you think we should apply the patch to remove ItemPointerCopy()? I will
rework the HeapTupleHeaderGetNextTid() after that. Not that it depends on
removing ItemPointerCopy(), but I decided to postpone it until we make a call
on that patch.
My inclination is not to. We don't really know where we are going with
storage layer reworks in the near future, and we might end up changing
this in other ways. We might find ourselves needing this kind of
abstraction again. I don't think this means we need to follow it
completely in new code, since it's already broken in other places, but
let's not destroy it completely just yet.
BTW I've now run long stress tests with the patch applied and see no new
issues, even when indexes are dropped and recreated concurrently (the run
includes my patch to fix the CIC bug in master, though). In another 24-hour
test, WARM could do 274M transactions whereas master did 164M transactions. I
did not drop and recreate indexes during this run.
Eh, that's a 67% performance improvement. Nice.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 2, 2017 at 6:17 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Please see rebased patches attached. There is not much change other than
that the patch now uses the new catalog maintenance API.
Another rebase on current master.
This time I am also attaching a proof-of-concept patch to demonstrate chain
conversion. The proposed algorithm is described in README.WARM, but I'll
briefly explain it here.
The chain conversion works in two phases and requires another index pass
during vacuum. During the first heap scan, we collect candidate chains for
conversion. A chain qualifies for conversion if all of its tuples have
matching index keys with respect to all current indexes (i.e. the chain has
effectively become HOT again). WARM chains become HOT as and when old
versions retire (or new versions retire in the case of aborts). But before we
can mark them HOT again, we must first remove duplicate (and potentially
wrong) index pointers. This algorithm deals with that.
When a WARM update occurs and we insert a new index entry in one or more
indexes, we mark the new index pointer with a special RED flag. The heap
tuple created by this UPDATE is also marked RED. If that tuple is then
HOT-updated, subsequent versions are marked RED as well. In other words, each
WARM chain has two HOT chains inside it, and these are identified as the BLUE
and RED chains. An index pointer whose key matches the RED chain is marked
RED too.
When we collect candidate WARM chains in the first heap scan, we also
remember the color of each chain.
During the first index scan, we delete all known-dead index pointers (same as
lazy_tid_reaped). We also count the number of RED and BLUE pointers to each
candidate chain.
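Schematically, and simplified from lazy_indexvac_phase1 in the attached patch
(the helper names below are made up for the illustration), the first index
pass handles each index pointer roughly like this:

/* Sketch of the per-pointer logic in the first index pass */
if (tid_is_known_dead(itemptr))          /* same test as lazy_tid_reaped */
    return IBDCR_DELETE;

chain = find_candidate_chain(itemptr);   /* bsearch over collected chains */
if (chain != NULL)
{
    if (is_red)                          /* color of this index pointer */
        chain->num_red_pointers++;
    else
        chain->num_blue_pointers++;
}
return IBDCR_KEEP;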
The next index scan will either 1. remove an index pointer that is known to
be useless, or 2. color a RED pointer BLUE.
- A BLUE pointer to a RED chain is removed when there exists a RED pointer
to the chain. If there is no RED pointer, we can't remove the BLUE pointer,
because it is the only path to the heap tuple (the case where the WARM update
did not create a new index entry); instead we color the heap tuples BLUE.
- A BLUE pointer to a BLUE chain is always retained.
- A RED pointer to a BLUE chain is always removed (aborted updates).
- A RED pointer to a RED chain is colored BLUE (we will color the heap
tuples BLUE in the second heap scan).
Once the index pointers are taken care of such that exactly one pointer
remains for a chain, the chain can be converted into a HOT chain by clearing
the WARM and RED flags.
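Put together, the rules above reduce the second index pass to the following
decision for each index pointer to a candidate chain (an illustrative sketch
only; the real logic lives in lazy_indexvac_phase2 in the attached patch, and
the multi-BLUE-pointer case is omitted here):

/* Sketch: fate of one index pointer during the second index pass */
if (chain->is_red_chain)
{
    if (is_red)                            /* RED pointer to a RED chain */
        return IBDCR_COLOR_BLUE;           /* heap tuples recolored later */
    if (chain->num_red_pointers > 0)       /* redundant BLUE pointer */
        return IBDCR_DELETE;
    return IBDCR_KEEP;                     /* only path to the heap tuple */
}
else
{
    if (is_red)                            /* RED pointer to a BLUE chain */
        return IBDCR_DELETE;               /* leftover from an aborted update */
    return IBDCR_KEEP;                     /* BLUE pointer to a BLUE chain */
}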
There is one corner case involving aborted vacuums. If a crash happens after
coloring a RED pointer BLUE, but before we can clear the heap tuples, we
might end up with two BLUE pointers to a RED chain. This case will require
recheck logic and is not yet implemented.
The POC only works with BTREEs because the unused bit in the IndexTuple's
t_info is already used by HASH indexes. For heap tuples, we can reuse one of
the HEAP_MOVED_IN/OFF bits for marking tuples RED, since the flag is only
required for WARM tuples; the bit can then be checked along with the WARM bit.
Unless there is an objection to the design or someone thinks it cannot work,
I'll look at some alternative mechanism to free up more bits in the tuple
header, or at least in the index tuples. One idea is to free up 3 bits from
ip_posid, knowing that an OffsetNumber can never really need more than 13
bits with the other constraints in place. We could use some bit-field magic
to do that with minimal changes. The thing that concerns me is whether there
is a guaranteed way to make that work on all hardware without breaking the
on-disk layout.
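To illustrate the ip_posid idea (this is not in the attached patch; the macro
names are invented for the example), plain masks on the 16-bit field would
look roughly like the following. Masks sidestep the bit-field layout question
because the arithmetic is done on the integer value itself:

/* Illustrative only: real offsets stay well below 2^13, so the top 3 bits
 * of ip_posid could carry per-pointer flags without widening the struct. */
#define IP_POSID_OFFSET_MASK  0x1FFF    /* low 13 bits: the actual offset */
#define IP_POSID_FLAG_MASK    0xE000    /* high 3 bits: flag space */

#define ItemPointerGetOffsetNumberNoFlags(ip) \
    ((OffsetNumber) ((ip)->ip_posid & IP_POSID_OFFSET_MASK))
#define ItemPointerGetFlags(ip) \
    ((ip)->ip_posid & IP_POSID_FLAG_MASK)
#define ItemPointerSetFlags(ip, f) \
    ((ip)->ip_posid |= ((f) & IP_POSID_FLAG_MASK))

Existing item pointers would simply keep the flag bits zero, so the on-disk
representation of current data would not change under this sketch.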
Comments/suggestions?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0003_convert_chains_v12.patch (application/octet-stream)
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 04abd0f..ff50361 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -88,7 +88,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
while (itup < itupEnd)
{
/* Do we have to delete this tuple? */
- if (callback(&itup->heapPtr, callback_state))
+ if (callback(&itup->heapPtr, false, callback_state) == IBDCR_DELETE)
{
/* Yes; adjust count of tuples that will be left on page */
BloomPageGetOpaque(page)->maxoff--;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index c9ccfee..8ed71c5 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -56,7 +56,8 @@ ginVacuumItemPointers(GinVacuumState *gvs, ItemPointerData *items,
*/
for (i = 0; i < nitem; i++)
{
- if (gvs->callback(items + i, gvs->callback_state))
+ if (gvs->callback(items + i, false, gvs->callback_state) ==
+ IBDCR_DELETE)
{
gvs->result->tuples_removed += 1;
if (!tmpitems)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..0955db6 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -202,7 +202,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
iid = PageGetItemId(page, i);
idxtuple = (IndexTuple) PageGetItem(page, iid);
- if (callback(&(idxtuple->t_tid), callback_state))
+ if (callback(&(idxtuple->t_tid), false, callback_state) ==
+ IBDCR_DELETE)
todelete[ntodelete++] = i;
else
stats->num_index_tuples += 1;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6645160..8b7a8aa 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -764,7 +764,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
* To remove the dead tuples, we strictly want to rely on results
* of callback function. refer btvacuumpage for detailed reason.
*/
- if (callback && callback(htup, callback_state))
+ if (callback && callback(htup, false, callback_state) ==
+ IBDCR_DELETE)
{
kill_tuple = true;
if (tuples_removed)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9c4522a..1f8f3eb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1958,17 +1958,32 @@ heap_fetch(Relation relation,
}
/*
- * Check if the HOT chain containing this tid is actually a WARM chain.
- * Note that even if the WARM update ultimately aborted, we still must do a
- * recheck because the failing UPDATE when have inserted created index entries
- * which are now stale, but still referencing this chain.
+ * Check status of a (possibly) WARM chain.
+ *
+ * This function looks at a HOT/WARM chain starting at tid and return a bitmask
+ * of information. We only follow the chain as long as it's known to be valid
+ * HOT chain. Information returned by the function consists of:
+ *
+ * HCWC_WARM_TUPLE - a warm tuple is found somewhere in the chain. Note that
+ * when a tuple is WARM updated, both old and new versions
+ * of the tuple are treated as WARM tuple
+ *
+ * HCWC_RED_TUPLE - a warm tuple part of the Red chain is found somewhere in
+ * the chain.
+ *
+ * HCWC_BLUE_TUPLE - a warm tuple part of the Blue chain is found somewhere in
+ * the chain.
+ *
+ * If stop_at_warm is true, we stop when the first WARM tuple is found and
+ * return information collected so far.
*/
-static bool
-hot_check_warm_chain(Page dp, ItemPointer tid)
+HeapCheckWarmChainStatus
+heap_check_warm_chain(Page dp, ItemPointer tid, bool stop_at_warm)
{
- TransactionId prev_xmax = InvalidTransactionId;
- OffsetNumber offnum;
- HeapTupleData heapTuple;
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ HeapCheckWarmChainStatus status = 0;
offnum = ItemPointerGetOffsetNumber(tid);
heapTuple.t_self = *tid;
@@ -1985,7 +2000,16 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
/* check for unused, dead, or redirected items */
if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
break;
+ }
heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
@@ -2000,13 +2024,113 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ /* We found a WARM tuple */
+ status |= HCWC_WARM_TUPLE;
+
+ /*
+ * If we've been told to stop at the first WARM tuple, just return
+ * whatever information collected so far.
+ */
+ if (stop_at_warm)
+ return status;
+
+ /* We found a tuple belonging to the Red chain */
+ if (HeapTupleHeaderIsWarmRed(heapTuple.t_data))
+ status |= HCWC_RED_TUPLE;
+ }
+ else
+ /* Must be a tuple belonging to the Blue chain */
+ status |= HCWC_BLUE_TUPLE;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
/*
- * Presence of either WARM or WARM updated tuple signals possible
- * breakage and the caller must recheck tuple returned from this chain
- * for index satisfaction
+ * It can't be a HOT chain if the tuple contains root line pointer
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return status;
+}
+
+/*
+ * Scan through the WARM chain starting at tid and reset all WARM related
+ * flags. At the end, the chain will have all characteristics of a regular HOT
+ * chain.
+ *
+ * Return the number of cleared offnums. Cleared offnums are returned in the
+ * passed-in cleared_offnums array. The caller must ensure that the array is
+ * large enough to hold maximum offnums that can be cleared by this invokation
+ * of heap_clear_warm_chain().
+ */
+int
+heap_clear_warm_chain(Page dp, ItemPointer tid, OffsetNumber *cleared_offnums)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ int num_cleared = 0;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
+ break;
+ }
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Clear WARM and Red flags
*/
if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
- return true;
+ {
+ HeapTupleHeaderClearHeapWarmTuple(heapTuple.t_data);
+ HeapTupleHeaderClearWarmRed(heapTuple.t_data);
+ cleared_offnums[num_cleared++] = offnum;
+ }
/*
* Check to see if HOT chain continues past this tuple; if so fetch
@@ -2025,8 +2149,7 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
}
- /* All OK. No need to recheck */
- return false;
+ return num_cleared;
}
/*
@@ -2135,7 +2258,11 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* possible improvements here
*/
if (recheck && *recheck == false)
- *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+ {
+ HeapCheckWarmChainStatus status;
+ status = heap_check_warm_chain(dp, &heapTuple->t_self, true);
+ *recheck = HCWC_IS_WARM(status);
+ }
/*
* When first_call is true (and thus, skip is initially false) we'll
@@ -3409,7 +3536,9 @@ l1:
}
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tp.t_data->t_infomask |= new_infomask;
tp.t_data->t_infomask2 |= new_infomask2;
@@ -4172,7 +4301,9 @@ l2:
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleClearHotUpdated(&oldtup);
/* ... and store info about transaction updating this tuple */
@@ -4419,6 +4550,16 @@ l2:
}
/*
+ * If the old tuple is already a member of the Red chain, mark the new
+ * tuple with the same flag
+ */
+ if (HeapTupleIsHeapWarmTupleRed(&oldtup))
+ {
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+ }
+
+ /*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
* Usually this information will be available in the corresponding
@@ -4435,12 +4576,20 @@ l2:
/* Mark the old tuple as HOT-updated */
HeapTupleSetHotUpdated(&oldtup);
HeapTupleSetHeapWarmTuple(&oldtup);
+
/* And mark the new tuple as heap-only */
HeapTupleSetHeapOnly(heaptup);
+ /* Mark the new tuple as WARM tuple */
HeapTupleSetHeapWarmTuple(heaptup);
+ /* This update also starts a Red chain */
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ Assert(!HeapTupleIsHeapWarmTupleRed(&oldtup));
+
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
HeapTupleSetHeapWarmTuple(newtup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
else
@@ -4459,6 +4608,8 @@ l2:
HeapTupleClearHeapOnly(newtup);
HeapTupleClearHeapWarmTuple(heaptup);
HeapTupleClearHeapWarmTuple(newtup);
+ HeapTupleClearHeapWarmTupleRed(heaptup);
+ HeapTupleClearHeapWarmTupleRed(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4477,7 +4628,9 @@ l2:
HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/* ... and store info about transaction updating this tuple */
Assert(TransactionIdIsValid(xmax_old_tuple));
@@ -6398,7 +6551,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
PageSetPrunable(page, RecentGlobalXmin);
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/*
@@ -6972,7 +7127,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* Old-style VACUUM FULL is gone, but we have to keep this code as long as
* we support having MOVED_OFF/MOVED_IN tuples in the database.
*/
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
@@ -6991,7 +7146,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* have failed; whereas a non-dead MOVED_IN tuple must mean the
* xvac transaction succeeded.
*/
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
frz->frzflags |= XLH_INVALID_XVAC;
else
frz->frzflags |= XLH_FREEZE_XVAC;
@@ -7461,7 +7616,7 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid))
@@ -7544,7 +7699,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid) &&
@@ -7570,7 +7725,7 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId xmax = HeapTupleHeaderGetUpdateXid(tuple);
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
if (TransactionIdPrecedes(*latestRemovedXid, xvac))
*latestRemovedXid = xvac;
@@ -8523,7 +8678,9 @@ heap_xlog_delete(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleHeaderClearHotUpdated(htup);
fix_infomask_from_infobits(xlrec->infobits_set,
@@ -9186,7 +9343,9 @@ heap_xlog_lock(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
@@ -9265,7 +9424,9 @@ heap_xlog_lock_updated(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 19e7048..47b01eb 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1620,7 +1620,8 @@ toast_save_datum(Relation rel, Datum value,
toastrel,
toastidxs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- NULL);
+ NULL,
+ false);
}
/*
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f56c58f..e8027f8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -199,7 +199,8 @@ index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
RELATION_CHECKS;
CHECK_REL_PROCEDURE(aminsert);
@@ -209,6 +210,12 @@ index_insert(Relation indexRelation,
(HeapTuple) NULL,
InvalidBuffer);
+ if (warm_update)
+ {
+ Assert(indexRelation->rd_amroutine->amwarminsert != NULL);
+ return indexRelation->rd_amroutine->amwarminsert(indexRelation, values,
+ isnull, heap_t_ctid, heapRelation, checkUnique, indexInfo);
+ }
return indexRelation->rd_amroutine->aminsert(indexRelation, values, isnull,
heap_t_ctid, heapRelation,
checkUnique, indexInfo);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 952ed8f..4988f47 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
amroutine->aminsert = btinsert;
+ amroutine->amwarminsert = btwarminsert;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -317,11 +318,12 @@ btbuildempty(Relation index)
* Descend the tree recursively, find the appropriate location for our
* new tuple, and put it there.
*/
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
+static bool
+btinsert_internal(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
bool result;
IndexTuple itup;
@@ -330,6 +332,11 @@ btinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
+ if (warm_update)
+ itup->t_info |= INDEX_RED_CHAIN;
+ else
+ itup->t_info &= ~INDEX_RED_CHAIN;
+
result = _bt_doinsert(rel, itup, checkUnique, heapRel);
pfree(itup);
@@ -337,6 +344,26 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, false);
+}
+
+bool
+btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, true);
+}
+
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -1253,6 +1280,8 @@ restart:
{
IndexTuple itup;
ItemPointer htup;
+ IndexBulkDeleteCallbackResult result;
+ bool is_red = false;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
@@ -1279,8 +1308,29 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
+ if (itup->t_info & INDEX_RED_CHAIN)
+ is_red = true;
+
+ if (is_red)
+ stats->num_red_pointers++;
+ else
+ stats->num_blue_pointers++;
+
+ result = callback(htup, is_red, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ if (is_red)
+ stats->red_pointers_removed++;
+ else
+ stats->blue_pointers_removed++;
deletable[ndeletable++] = offnum;
+ }
+ else if (result == IBDCR_COLOR_BLUE)
+ {
+ stats->pointers_colored++;
+ itup->t_info &= ~INDEX_RED_CHAIN;
+ }
+ /* XXX XLOG stuff for converted pointers */
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cce9b3f..5343b10 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -155,7 +155,8 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
{
Assert(ItemPointerIsValid(<->heapPtr));
- if (bds->callback(<->heapPtr, bds->callback_state))
+ if (bds->callback(<->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
deletable[i] = true;
@@ -425,7 +426,8 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
{
Assert(ItemPointerIsValid(<->heapPtr));
- if (bds->callback(<->heapPtr, bds->callback_state))
+ if (bds->callback(<->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
toDelete[xlrec.nDelete] = i;
@@ -902,10 +904,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/* Dummy callback to delete no tuples during spgvacuumcleanup */
-static bool
-dummy_callback(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+dummy_callback(ItemPointer itemptr, bool is_red, void *state)
{
- return false;
+ return IBDCR_KEEP;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bba52ec..ab37b43 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -115,7 +115,7 @@ static void IndexCheckExclusion(Relation heapRelation,
IndexInfo *indexInfo);
static inline int64 itemptr_encode(ItemPointer itemptr);
static inline void itemptr_decode(ItemPointer itemptr, int64 encoded);
-static bool validate_index_callback(ItemPointer itemptr, void *opaque);
+static IndexBulkDeleteCallbackResult validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque);
static void validate_index_heapscan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
@@ -2949,15 +2949,15 @@ itemptr_decode(ItemPointer itemptr, int64 encoded)
/*
* validate_index_callback - bulkdelete callback to collect the index TIDs
*/
-static bool
-validate_index_callback(ItemPointer itemptr, void *opaque)
+static IndexBulkDeleteCallbackResult
+validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque)
{
v_i_state *state = (v_i_state *) opaque;
int64 encoded = itemptr_encode(itemptr);
tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
state->itups += 1;
- return false; /* never actually delete anything */
+ return IBDCR_KEEP; /* never actually delete anything */
}
/*
@@ -3178,7 +3178,8 @@ validate_index_heapscan(Relation heapRelation,
heapRelation,
indexInfo->ii_Unique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ false);
state->tups_inserted += 1;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index e5355a8..5b6efcf 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -172,7 +172,8 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ warm_update);
}
ExecDropSingleTupleTableSlot(slot);
@@ -222,7 +223,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup, false, NULL);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index d9c0fe7..330b661 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -168,7 +168,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
*/
index_insert(indexRel, values, isnull, &(new_row->t_self),
trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
- indexInfo);
+ indexInfo,
+ false);
}
else
{
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 1388be1..33b1ac3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -104,6 +104,25 @@
*/
#define PREFETCH_SIZE ((BlockNumber) 32)
+/*
+ * Structure to track WARM chains that can be converted into HOT chains during
+ * this run.
+ *
+ * To reduce space requirement, we're using bitfields. But the way things are
+ * laid down, we're still wasting 1-byte per candidate chain.
+ */
+typedef struct LVRedBlueChain
+{
+ ItemPointerData chain_tid; /* root of the chain */
+ uint8 is_red_chain:1; /* is the WARM chain complete red ? */
+ uint8 keep_warm_chain:1; /* this chain can't be cleared of WARM
+ * tuples */
+ uint8 num_blue_pointers:2;/* number of blue pointers found so
+ * far */
+ uint8 num_red_pointers:2; /* number of red pointers found so far
+ * in the current index */
+} LVRedBlueChain;
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -121,6 +140,16 @@ typedef struct LVRelStats
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ double num_warm_chains; /* number of warm chains seen so far */
+
+ /* List of WARM chains that can be converted into HOT chains */
+ /* NB: this list is ordered by TID of the root pointers */
+ int num_redblue_chains; /* current # of entries */
+ int max_redblue_chains; /* # slots allocated in array */
+ LVRedBlueChain *redblue_chains; /* array of LVRedBlueChain */
+ double num_non_convertible_warm_chains;
+
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
@@ -149,6 +178,7 @@ static void lazy_scan_heap(Relation onerel, int options,
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
static void lazy_cleanup_index(Relation indrel,
@@ -156,6 +186,10 @@ static void lazy_cleanup_index(Relation indrel,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static int lazy_warmclear_page(Relation onerel, BlockNumber blkno,
+ Buffer buffer, int chainindex, LVRelStats *vacrelstats,
+ Buffer *vmbuffer);
+static void lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats);
static bool should_attempt_truncation(LVRelStats *vacrelstats);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -163,8 +197,15 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
-static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static void lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static IndexBulkDeleteCallbackResult lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
+static int vac_cmp_redblue_chain(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -683,8 +724,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* If we are close to overrunning the available space for dead-tuple
* TIDs, pause and do a cycle of vacuuming before we tackle this page.
*/
- if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
- vacrelstats->num_dead_tuples > 0)
+ if (((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_dead_tuples > 0) ||
+ ((vacrelstats->max_redblue_chains - vacrelstats->num_redblue_chains) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_redblue_chains > 0))
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -714,6 +757,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -736,6 +780,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* valid.
*/
vacrelstats->num_dead_tuples = 0;
+ vacrelstats->num_redblue_chains = 0;
+ memset(vacrelstats->redblue_chains, 0,
+ vacrelstats->max_redblue_chains * sizeof (LVRedBlueChain));
vacrelstats->num_index_scans++;
/* Report that we are once again scanning the heap */
@@ -939,15 +986,33 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
continue;
}
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
/* Redirect items mustn't be touched */
if (ItemIdIsRedirected(itemid))
{
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either complete Red or Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
hastup = true; /* this page won't be truncatable */
continue;
}
- ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
/*
* DEAD item pointers are to be vacuumed normally; but we don't
* count them in tups_vacuumed, else we'd be double-counting (at
@@ -967,6 +1032,28 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
tuple.t_len = ItemIdGetLength(itemid);
tuple.t_tableOid = RelationGetRelid(onerel);
+ if (!HeapTupleIsHeapOnly(&tuple))
+ {
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either complete Red or Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
+ }
+
tupgone = false;
switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
@@ -1287,7 +1374,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* If any tuples need to be deleted, perform final vacuum cycle */
/* XXX put a threshold on min number of tuples here? */
- if (vacrelstats->num_dead_tuples > 0)
+ if (vacrelstats->num_dead_tuples > 0 || vacrelstats->num_redblue_chains > 0)
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -1305,6 +1392,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -1372,7 +1460,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
*
* This routine marks dead tuples as unused and compacts out free
* space on their pages. Pages not having dead tuples recorded from
- * lazy_scan_heap are not visited at all.
+ * lazy_scan_heap are not visited at all. This routine also converts
+ * candidate WARM chains to HOT chains by clearing WARM related flags. The
+ * candidate chains are determined by the preceeding index scans after
+ * looking at the data collected by the first heap scan.
*
* Note: the reason for doing this as a second pass is we cannot remove
* the tuples until we've removed their index entries, and we want to
@@ -1381,7 +1472,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
static void
lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
{
- int tupindex;
+ int tupindex, chainindex;
int npages;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
@@ -1390,33 +1481,66 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
npages = 0;
tupindex = 0;
- while (tupindex < vacrelstats->num_dead_tuples)
+ chainindex = 0;
+ while (tupindex < vacrelstats->num_dead_tuples ||
+ chainindex < vacrelstats->num_redblue_chains)
{
- BlockNumber tblk;
+ BlockNumber tblk, chainblk, vacblk;
Buffer buf;
Page page;
Size freespace;
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
- buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+ tblk = chainblk = InvalidBlockNumber;
+ if (chainindex < vacrelstats->num_redblue_chains)
+ chainblk =
+ ItemPointerGetBlockNumber(&(vacrelstats->redblue_chains[chainindex].chain_tid));
+
+ if (tupindex < vacrelstats->num_dead_tuples)
+ tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
+
+ if (tblk == InvalidBlockNumber)
+ vacblk = chainblk;
+ else if (chainblk == InvalidBlockNumber)
+ vacblk = tblk;
+ else
+ vacblk = Min(chainblk, tblk);
+
+ Assert(vacblk != InvalidBlockNumber);
+
+ buf = ReadBufferExtended(onerel, MAIN_FORKNUM, vacblk, RBM_NORMAL,
vac_strategy);
- if (!ConditionalLockBufferForCleanup(buf))
+
+
+ if (vacblk == chainblk)
+ LockBufferForCleanup(buf);
+ else if (!ConditionalLockBufferForCleanup(buf))
{
ReleaseBuffer(buf);
++tupindex;
continue;
}
- tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
- &vmbuffer);
+
+ /*
+ * Convert WARM chains on this page. This should be done before
+ * vacuuming the page to ensure that we can correctly set visibility
+ * bits after clearing WARM chains
+ */
+ if (vacblk == chainblk)
+ chainindex = lazy_warmclear_page(onerel, chainblk, buf, chainindex,
+ vacrelstats, &vmbuffer);
+
+ if (vacblk == tblk)
+ tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+ &vmbuffer);
/* Now that we've compacted the page, record its available space */
page = BufferGetPage(buf);
freespace = PageGetHeapFreeSpace(page);
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, tblk, freespace);
+ RecordPageWithFreeSpace(onerel, vacblk, freespace);
npages++;
}
@@ -1435,6 +1559,63 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
}
/*
+ * lazy_warmclear_page() -- clear WARM flag and mark chains blue when possible
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * chainindex is the index in vacrelstats->redblue_chains of the first
+ * candidate chain for this page. We assume the rest follow sequentially.
+ * The return value is the first chainindex after the chains of this page.
+ */
+static int
+lazy_warmclear_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+ int chainindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+{
+ Page page = BufferGetPage(buffer);
+ OffsetNumber cleared_offnums[MaxHeapTuplesPerPage];
+ int num_cleared = 0;
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED, blkno);
+
+ START_CRIT_SECTION();
+
+ for (; chainindex < vacrelstats->num_redblue_chains ; chainindex++)
+ {
+ BlockNumber tblk;
+ LVRedBlueChain *chain;
+
+ chain = &vacrelstats->redblue_chains[chainindex];
+
+ tblk = ItemPointerGetBlockNumber(&chain->chain_tid);
+ if (tblk != blkno)
+ break; /* past end of tuples for this block */
+
+ /*
+ * Since a heap page can have no more than MaxHeapTuplesPerPage
+ * offnums and we process each offnum only once, MaxHeapTuplesPerPage
+ * size array should be enough to hold all cleared tuples in this page.
+ */
+ if (!chain->keep_warm_chain)
+ num_cleared += heap_clear_warm_chain(page, &chain->chain_tid,
+ cleared_offnums + num_cleared);
+ }
+
+ /*
+ * Mark buffer dirty before we write WAL.
+ */
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(onerel))
+ {
+ }
+
+ END_CRIT_SECTION();
+
+ return chainindex;
+}
+
+/*
* lazy_vacuum_page() -- free dead tuples on a page
* and repair its fragmentation.
*
@@ -1587,6 +1768,16 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
return false;
}
+static void
+lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats)
+{
+ int i;
+ for (i = 0; i < vacrelstats->num_redblue_chains; i++)
+ {
+ LVRedBlueChain *chain = &vacrelstats->redblue_chains[i];
+ chain->num_blue_pointers = chain->num_red_pointers = 0;
+ }
+}
/*
* lazy_vacuum_index() -- vacuum one index relation.
@@ -1596,6 +1787,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
*/
static void
lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats)
{
@@ -1611,15 +1803,81 @@ lazy_vacuum_index(Relation indrel,
ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
ivinfo.strategy = vac_strategy;
- /* Do bulk deletion */
- *stats = index_bulk_delete(&ivinfo, *stats,
- lazy_tid_reaped, (void *) vacrelstats);
+ /*
+ * If told, convert WARM chains into HOT chains.
+ *
+ * We must have already collected candidate WARM chains, i.e. chains which
+ * have either only Red or only Blue tuples, but not a mix of both.
+ *
+ * This works in two phases. In the first phase, we do a complete index
+ * scan and collect information about index pointers to the candidate
+ * chains, but we don't do conversion. To be precise, we count the number
+ * of Blue and Red index pointers to each candidate chain and use that
+ * knowledge to arrive at a decision and do the actual conversion during
+ * the second phase (we kill known dead pointers though in this phase).
+ *
+ * In the second phase, for each Red chain we check if we have seen a Red
+ * index pointer. For such chains, we kill the Blue pointer and color the
+ * Red pointer Blue. the heap tuples are marked Blue in the second heap
+ * scan. If we did not find any Red pointer to a Red chain, that means that
+ * the chain is reachable from the Blue pointer (because say WARM update
+ * did not added a new entry for this index). In that case, we do nothing.
+ * There is a third case where we find more than one Blue pointers to a Red
+ * chain. This can happen because of aborted vacuums. We don't handle that
+ * case yet, but it should be possible to apply the same recheck logic and
+ * find which of the Blue pointers is redundant and should be removed.
+ *
+ * For Blue chains, we just kill the Red pointer, if it exists and keep the
+ * Blue pointer.
+ */
+ if (clear_warm)
+ {
+ lazy_reset_redblue_pointer_count(vacrelstats);
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase1, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row version, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples,
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed)));
+
+ (*stats)->num_red_pointers = 0;
+ (*stats)->num_blue_pointers = 0;
+ (*stats)->red_pointers_removed = 0;
+ (*stats)->blue_pointers_removed = 0;
+ (*stats)->pointers_colored = 0;
+
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase2, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to convert red pointers, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers, "
+ "colored %0.f red pointers blue",
+ RelationGetRelationName(indrel),
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed,
+ (*stats)->pointers_colored)));
+ }
+ else
+ {
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_tid_reaped, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row versions",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples),
+ errdetail("%s.", pg_rusage_show(&ru0))));
+ }
- ereport(elevel,
- (errmsg("scanned index \"%s\" to remove %d row versions",
- RelationGetRelationName(indrel),
- vacrelstats->num_dead_tuples),
- errdetail("%s.", pg_rusage_show(&ru0))));
}
/*
@@ -1993,9 +2251,11 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
if (vacrelstats->hasindex)
{
- maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
+ maxtuples = (vac_work_mem * 1024L) / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain));
maxtuples = Min(maxtuples, INT_MAX);
- maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+ maxtuples = Min(maxtuples, MaxAllocSize / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain)));
/* curious coding here to ensure the multiplication can't overflow */
if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -2013,6 +2273,57 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
vacrelstats->max_dead_tuples = (int) maxtuples;
vacrelstats->dead_tuples = (ItemPointer)
palloc(maxtuples * sizeof(ItemPointerData));
+
+ /*
+ * XXX Cheat for now and allocate the same size array for tracking blue and
+ * red chains. maxtuples must have been already adjusted above to ensure we
+ * don't cross vac_work_mem.
+ */
+ vacrelstats->num_redblue_chains = 0;
+ vacrelstats->max_redblue_chains = (int) maxtuples;
+ vacrelstats->redblue_chains = (LVRedBlueChain *)
+ palloc0(maxtuples * sizeof(LVRedBlueChain));
+
+}
+
+/*
+ * lazy_record_blue_chain - remember one blue chain
+ */
+static void
+lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 0;
+ vacrelstats->num_redblue_chains++;
+ }
+}
+
+/*
+ * lazy_record_red_chain - remember one red chain
+ */
+static void
+lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 1;
+ vacrelstats->num_redblue_chains++;
+ }
}
/*
@@ -2043,8 +2354,8 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
*
* Assumes dead_tuples array is in sorted order.
*/
-static bool
-lazy_tid_reaped(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state)
{
LVRelStats *vacrelstats = (LVRelStats *) state;
ItemPointer res;
@@ -2055,7 +2366,152 @@ lazy_tid_reaped(ItemPointer itemptr, void *state)
sizeof(ItemPointerData),
vac_cmp_itemptr);
- return (res != NULL);
+ return (res != NULL) ? IBDCR_DELETE : IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase1() -- run first pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ ItemPointer res;
+ LVRedBlueChain *chain;
+
+ res = (ItemPointer) bsearch((void *) itemptr,
+ (void *) vacrelstats->dead_tuples,
+ vacrelstats->num_dead_tuples,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+
+ if (res != NULL)
+ return IBDCR_DELETE;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+ if (chain != NULL)
+ {
+ if (is_red)
+ chain->num_red_pointers++;
+ else
+ chain->num_blue_pointers++;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase2() -- run second pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ LVRedBlueChain *chain;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+
+ if (chain != NULL && (chain->keep_warm_chain != 1))
+ {
+ if (chain->is_red_chain == 1)
+ {
+ /*
+ * For Red chains, color Red index pointer Blue and kill the Blue
+ * pointer if we have a Red index pointer.
+ */
+ if (is_red)
+ {
+ Assert(chain->num_red_pointers == 1);
+ chain->keep_warm_chain = 0;
+ return IBDCR_COLOR_BLUE;
+ }
+ else
+ {
+ if (chain->num_red_pointers > 0)
+ {
+ chain->keep_warm_chain = 0;
+ return IBDCR_DELETE;
+ }
+ else if (chain->num_blue_pointers == 1)
+ {
+ chain->keep_warm_chain = 0;
+ return IBDCR_KEEP;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * For Blue chains, kill the Red pointer
+ */
+ if (chain->num_red_pointers > 0)
+ {
+ chain->keep_warm_chain = 0;
+ return IBDCR_DELETE;
+ }
+
+ /*
+ * If this is the only surviving Blue pointer, keep it but convert
+ * the chain.
+ */
+ if (chain->num_blue_pointers == 1)
+ {
+ chain->keep_warm_chain = 0;
+ return IBDCR_KEEP;
+ }
+
+ /*
+ * If there are more than 1 Blue pointers to this chain, we can
+ * apply the recheck logic and kill the redundant Blue pointer and
+ * convert the chain. But that's not yet done.
+ */
+ }
+ chain->keep_warm_chain = 1;
+ return IBDCR_KEEP;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * Comparator routines for use with qsort() and bsearch(). Similar to
+ * vac_cmp_itemptr, but right hand argument is LVRedBlueChain struct pointer.
+ */
+static int
+vac_cmp_redblue_chain(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
}
/*
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index d62d2de..3e49a8f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -405,7 +405,8 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
- indexInfo); /* index AM may need this */
+ indexInfo, /* index AM may need this */
+ (modified_attrs != NULL)); /* was this a WARM update? */
/*
* If the index has an associated exclusion constraint, check that.
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index baff998..6a2e2f2 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -106,7 +106,7 @@ HeapTupleHeaderGetCmin(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)));
if (tup->t_infomask & HEAP_COMBOCID)
@@ -120,7 +120,7 @@ HeapTupleHeaderGetCmax(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
/*
* Because GetUpdateXid() performs memory allocations if xmax is a
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 703bdce..0df5a44 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -186,7 +186,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -205,7 +205,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -377,7 +377,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -396,7 +396,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -471,7 +471,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return HeapTupleInvisible;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -490,7 +490,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -753,7 +753,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -772,7 +772,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -974,7 +974,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -993,7 +993,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1180,7 +1180,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
if (HeapTupleHeaderXminInvalid(tuple))
return HEAPTUPLE_DEAD;
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_OFF)
+ else if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1198,7 +1198,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
InvalidTransactionId);
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d7702e5..68859f2 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -75,6 +75,14 @@ typedef bool (*aminsert_function) (Relation indexRelation,
Relation heapRelation,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+/* insert this WARM tuple */
+typedef bool (*amwarminsert_function) (Relation indexRelation,
+ Datum *values,
+ bool *isnull,
+ ItemPointer heap_tid,
+ Relation heapRelation,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
/* bulk delete */
typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -203,6 +211,7 @@ typedef struct IndexAmRoutine
ambuild_function ambuild;
ambuildempty_function ambuildempty;
aminsert_function aminsert;
+ amwarminsert_function amwarminsert;
ambulkdelete_function ambulkdelete;
amvacuumcleanup_function amvacuumcleanup;
amcanreturn_function amcanreturn; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index f467b18..bf1e6bd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,12 +75,29 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
+ double num_red_pointers; /* # red pointers found */
+ double num_blue_pointers; /* # blue pointers found */
+ double pointers_colored; /* # red pointers colored blue */
+ double red_pointers_removed; /* # red pointers removed */
+ double blue_pointers_removed; /* # blue pointers removed */
BlockNumber pages_deleted; /* # unused pages in index */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
+/*
+ * IndexBulkDeleteCallback should return one of the following
+ */
+typedef enum IndexBulkDeleteCallbackResult
+{
+ IBDCR_KEEP, /* index tuple should be preserved */
+ IBDCR_DELETE, /* index tuple should be deleted */
+ IBDCR_COLOR_BLUE /* index tuple should be colored blue */
+} IndexBulkDeleteCallbackResult;
+
/* Typedef for callback function to determine if a tuple is bulk-deletable */
-typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
+typedef IndexBulkDeleteCallbackResult (*IndexBulkDeleteCallback) (
+ ItemPointer itemptr,
+ bool is_red, void *state);
/* struct definitions appear in relscan.h */
typedef struct IndexScanDescData *IndexScanDesc;
@@ -135,7 +152,8 @@ extern bool index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- struct IndexInfo *indexInfo);
+ struct IndexInfo *indexInfo,
+ bool warm_update);
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9412c3a..719a725 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,20 @@ typedef struct HeapUpdateFailureData
CommandId cmax;
} HeapUpdateFailureData;
+typedef int HeapCheckWarmChainStatus;
+
+#define HCWC_BLUE_TUPLE 0x0001
+#define HCWC_RED_TUPLE 0x0002
+#define HCWC_WARM_TUPLE 0x0004
+
+#define HCWC_IS_MIXED(status) \
+ (((status) & (HCWC_BLUE_TUPLE | HCWC_RED_TUPLE)) != 0)
+#define HCWC_IS_ALL_RED(status) \
+ (((status) & HCWC_BLUE_TUPLE) == 0)
+#define HCWC_IS_ALL_BLUE(status) \
+ (((status) & HCWC_RED_TUPLE) == 0)
+#define HCWC_IS_WARM(status) \
+ (((status) & HCWC_WARM_TUPLE) != 0)
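+
+/*
+ * Illustrative example: a chain whose status sets HCWC_WARM_TUPLE and only
+ * one of HCWC_BLUE_TUPLE/HCWC_RED_TUPLE (i.e. HCWC_IS_ALL_BLUE() or
+ * HCWC_IS_ALL_RED() holds) contains WARM tuples of a single color and is
+ * therefore a candidate for being converted back into a plain HOT chain;
+ * a chain containing both colors is not.
+ */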
/* ----------------
* function prototypes for heap access method
@@ -183,6 +197,10 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
bool *warm_update);
extern void heap_sync(Relation relation);
+extern HeapCheckWarmChainStatus heap_check_warm_chain(Page dp,
+ ItemPointer tid, bool stop_at_warm);
+extern int heap_clear_warm_chain(Page dp, ItemPointer tid,
+ OffsetNumber *cleared_offnums);
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index ddbdbcd..45fe12c 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -201,6 +201,21 @@ struct HeapTupleHeaderData
* upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
+/*
+ * A WARM chain usually consists of two parts. Each part is a HOT chain in
+ * itself, i.e. all indexed columns have the same value within the part, but
+ * a WARM update separates the two parts. We call these parts the Blue chain
+ * and the Red chain. We need a mechanism to identify which part a tuple
+ * belongs to. We can't just check HeapTupleHeaderIsHeapWarmTuple(), because
+ * during a WARM update both the old and the new tuple are marked as WARM
+ * tuples.
+ *
+ * We need another infomask bit for this, and we reuse the bit that was
+ * earlier used by old-style VACUUM FULL. This is safe because the
+ * HEAP_WARM_TUPLE flag is always set along with HEAP_WARM_RED. So if both
+ * HEAP_WARM_TUPLE and HEAP_WARM_RED are set, we know the tuple belongs to
+ * the Red part of the WARM chain.
+ */
+#define HEAP_WARM_RED 0x4000
#define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */
/*
@@ -397,7 +412,7 @@ struct HeapTupleHeaderData
/* SetCmin is reasonably simple since we never need a combo CID */
#define HeapTupleHeaderSetCmin(tup, cid) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
(tup)->t_infomask &= ~HEAP_COMBOCID; \
} while (0)
@@ -405,7 +420,7 @@ do { \
/* SetCmax must be used after HeapTupleHeaderAdjustCmax; see combocid.c */
#define HeapTupleHeaderSetCmax(tup, cid, iscombo) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
if (iscombo) \
(tup)->t_infomask |= HEAP_COMBOCID; \
@@ -415,7 +430,7 @@ do { \
#define HeapTupleHeaderGetXvac(tup) \
( \
- ((tup)->t_infomask & HEAP_MOVED) ? \
+ HeapTupleHeaderIsMoved(tup) ? \
(tup)->t_choice.t_heap.t_field3.t_xvac \
: \
InvalidTransactionId \
@@ -423,7 +438,7 @@ do { \
#define HeapTupleHeaderSetXvac(tup, xid) \
do { \
- Assert((tup)->t_infomask & HEAP_MOVED); \
+ Assert(HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_xvac = (xid); \
} while (0)
@@ -651,6 +666,58 @@ do { \
)
/*
+ * Macros to check whether a tuple was moved off/in by old-style VACUUM FULL
+ * from the pre-9.0 era. Such a tuple must not have the HEAP_WARM_TUPLE flag
+ * set.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsMovedOff(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_OFF) \
+)
+
+#define HeapTupleHeaderIsMovedIn(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_IN) \
+)
+
+#define HeapTupleHeaderIsMoved(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED) \
+)
+
+/*
+ * Check if tuple belongs to the Red part of the WARM chain.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsWarmRed(tuple) \
+( \
+ HeapTupleHeaderIsHeapWarmTuple(tuple) && \
+ (((tuple)->t_infomask & HEAP_WARM_RED) != 0) \
+)
+
+/*
+ * Mark the tuple as a member of the Red chain. This must only be done on a
+ * tuple which is already marked as a WARM tuple.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderSetWarmRed(tuple) \
+( \
+ AssertMacro(HeapTupleHeaderIsHeapWarmTuple(tuple)), \
+ (tuple)->t_infomask |= HEAP_WARM_RED \
+)
+
+#define HeapTupleHeaderClearWarmRed(tuple) \
+( \
+ (tuple)->t_infomask &= ~HEAP_WARM_RED \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
@@ -810,6 +877,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapWarmTuple(tuple) \
HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+#define HeapTupleIsHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderIsWarmRed((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderSetWarmRed((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderClearWarmRed((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 08d056d..40be895 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -427,6 +427,9 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+/* This index tuple points to the red part of the WARM chain */
+#define INDEX_RED_CHAIN 0x2000
+
/*
* external entry points for btree, in nbtree.c
*/
@@ -437,6 +440,10 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+extern bool btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9472ecc..b355b61 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -25,6 +25,7 @@
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED 7
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
index 0aa3bb7..6391891 100644
--- a/src/test/regress/expected/warm.out
+++ b/src/test/regress/expected/warm.out
@@ -26,12 +26,12 @@ SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
-- Even when seqscan is disabled and indexscan is forced
SET enable_seqscan = false;
-EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
- QUERY PLAN
-----------------------------------------------------------------------------
- Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=72)
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
Recheck Cond: (b = 140001)
- -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ -> Bitmap Index Scan on updtst_indx1
Index Cond: (b = 140001)
(4 rows)
@@ -42,12 +42,12 @@ SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
(1 row)
-- Check if index only scan works correctly
-EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
- QUERY PLAN
-----------------------------------------------------------------------------
- Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=4)
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
Recheck Cond: (b = 140001)
- -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ -> Bitmap Index Scan on updtst_indx1
Index Cond: (b = 140001)
(4 rows)
@@ -59,10 +59,10 @@ SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
-- Table must be vacuumed to force index-only scan
VACUUM updtst_tab1;
-EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Index Only Scan using updtst_indx1 on updtst_tab1 (cost=0.29..9.16 rows=50 width=4)
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1
Index Cond: (b = 140001)
(2 rows)
@@ -99,12 +99,12 @@ SELECT * FROM updtst_tab2 WHERE c = 'foo6';
1 | 701 | foo6 | bar
(1 row)
-EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
- QUERY PLAN
----------------------------------------------------------------------------
- Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
Recheck Cond: (b = 701)
- -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ -> Bitmap Index Scan on updtst_indx2
Index Cond: (b = 701)
(4 rows)
@@ -115,12 +115,12 @@ SELECT * FROM updtst_tab2 WHERE a = 1;
(1 row)
SET enable_seqscan = false;
-EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
- QUERY PLAN
----------------------------------------------------------------------------
- Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
Recheck Cond: (b = 701)
- -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ -> Bitmap Index Scan on updtst_indx2
Index Cond: (b = 701)
(4 rows)
@@ -131,10 +131,10 @@ SELECT * FROM updtst_tab2 WHERE b = 701;
(1 row)
VACUUM updtst_tab2;
-EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
- QUERY PLAN
--------------------------------------------------------------------------------------
- Index Only Scan using updtst_indx2 on updtst_tab2 (cost=0.14..4.16 rows=1 width=4)
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2
Index Cond: (b = 701)
(2 rows)
@@ -212,10 +212,10 @@ SELECT * FROM updtst_tab3 WHERE b = 1421;
(1 row)
VACUUM updtst_tab3;
-EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
- QUERY PLAN
------------------------------------------------------------
- Seq Scan on updtst_tab3 (cost=0.00..2.25 rows=1 width=4)
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-------------------------
+ Seq Scan on updtst_tab3
Filter: (b = 701)
(2 rows)
@@ -293,10 +293,10 @@ SELECT * FROM updtst_tab3 WHERE b = 1422;
(1 row)
VACUUM updtst_tab3;
-EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
- QUERY PLAN
--------------------------------------------------------------------------------------
- Index Only Scan using updtst_indx3 on updtst_tab3 (cost=0.14..8.16 rows=1 width=4)
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3
Index Cond: (b = 702)
(2 rows)
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
index b73c278..f31127c 100644
--- a/src/test/regress/sql/warm.sql
+++ b/src/test/regress/sql/warm.sql
@@ -23,16 +23,16 @@ SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
-- Even when seqscan is disabled and indexscan is forced
SET enable_seqscan = false;
-EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
-- Check if index only scan works correctly
-EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
-- Table must be vacuumed to force index-only scan
VACUUM updtst_tab1;
-EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
SET enable_seqscan = true;
@@ -58,15 +58,15 @@ UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
SELECT * FROM updtst_tab2 WHERE c = 'foo6';
-EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
SELECT * FROM updtst_tab2 WHERE a = 1;
SET enable_seqscan = false;
-EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
SELECT * FROM updtst_tab2 WHERE b = 701;
VACUUM updtst_tab2;
-EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
SELECT b FROM updtst_tab2 WHERE b = 701;
SET enable_seqscan = true;
@@ -109,7 +109,7 @@ SELECT * FROM updtst_tab3 WHERE b = 701;
SELECT * FROM updtst_tab3 WHERE b = 1421;
VACUUM updtst_tab3;
-EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
SELECT b FROM updtst_tab3 WHERE b = 701;
SELECT b FROM updtst_tab3 WHERE b = 1421;
@@ -146,7 +146,7 @@ SELECT * FROM updtst_tab3 WHERE b = 702;
SELECT * FROM updtst_tab3 WHERE b = 1422;
VACUUM updtst_tab3;
-EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
SELECT b FROM updtst_tab3 WHERE b = 702;
SELECT b FROM updtst_tab3 WHERE b = 1422;
0002_warm_updates_v12.patch
diff --git b/contrib/bloom/blutils.c a/contrib/bloom/blutils.c
index f2eda67..b356e2b 100644
--- b/contrib/bloom/blutils.c
+++ a/contrib/bloom/blutils.c
@@ -142,6 +142,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/brin/brin.c a/src/backend/access/brin/brin.c
index b22563b..b4a1465 100644
--- b/src/backend/access/brin/brin.c
+++ a/src/backend/access/brin/brin.c
@@ -116,6 +116,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/gist/gist.c a/src/backend/access/gist/gist.c
index 6593771..843389b 100644
--- b/src/backend/access/gist/gist.c
+++ a/src/backend/access/gist/gist.c
@@ -94,6 +94,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/access/hash/hash.c a/src/backend/access/hash/hash.c
index 24510e7..6645160 100644
--- b/src/backend/access/hash/hash.c
+++ a/src/backend/access/hash/hash.c
@@ -90,6 +90,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -271,6 +272,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -308,8 +311,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git b/src/backend/access/hash/hashsearch.c a/src/backend/access/hash/hashsearch.c
index 9e5d7e4..60e941d 100644
--- b/src/backend/access/hash/hashsearch.c
+++ a/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -363,6 +365,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git b/src/backend/access/hash/hashutil.c a/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- b/src/backend/access/hash/hashutil.c
+++ a/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do the comparison.
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git b/src/backend/access/heap/README.WARM a/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..7b9a712
--- /dev/null
+++ a/src/backend/access/heap/README.WARM
@@ -0,0 +1,306 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced the number of redundant
+index entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the update
+must not change any column used in any of the indexes on the table.
+This condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, even if only one or two indexed columns are updated, the
+regular non-HOT update will still insert a new index entry into every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block has enough
+space to store the new version of the tuple. This is the same
+requirement as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted into an index
+for the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, suppose we have a table with two columns and an index on
+each of them. When a tuple is first inserted into the table, we have
+exactly one index entry pointing to the tuple from each index.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. To do so, Index1 does not get any new
+entry and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without any corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple that does not match the scan's index
+key may be returned via a stale index entry. In the above example, tuple
+[1111, bbbb] is reachable from both key (aaaa) and key (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for an index key match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every index AM has its own notion of index tuples, each index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, so the recheck routine
+for the hash AM must first compute the hash value of the heap attribute
+and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index whose AM doesn't provide a recheck
+routine, WARM updates are disabled on that table.
+
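+As an illustration, a btree-style recheck can follow the same pattern as
+the hashrecheck routine included in this patch, except that the heap
+values are compared directly against the index tuple instead of being
+hashed first. The sketch below is illustrative only (the function name
+and the omitted expression-index setup are not part of the patch); see
+hashrecheck in hashutil.c for the complete pattern:
+
+    static bool
+    btrecheck_sketch(Relation indexRel, IndexTuple indexTuple,
+                     Relation heapRel, HeapTuple heapTuple)
+    {
+        IndexInfo  *indexInfo = BuildIndexInfo(indexRel);
+        TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+        Datum       values[INDEX_MAX_KEYS];
+        bool        isnull[INDEX_MAX_KEYS];
+        bool        equal = true;
+        int         i;
+
+        ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+        /* recompute the indexed values from the heap tuple; an expression
+         * index would additionally need an EState, as in hashrecheck */
+        FormIndexDatum(indexInfo, slot, NULL, values, isnull);
+
+        for (i = 1; i <= indexRel->rd_rel->relnatts; i++)
+        {
+            bool        indxisnull;
+            Datum       indxvalue = index_getattr(indexTuple, i,
+                                                  indexRel->rd_att, &indxisnull);
+            Form_pg_attribute att = indexRel->rd_att->attrs[i - 1];
+
+            if (isnull[i - 1] && indxisnull)
+                continue;               /* both NULL: treat as equal */
+            if (isnull[i - 1] || indxisnull ||
+                !datumIsEqual(values[i - 1], indxvalue,
+                              att->attbyval, att->attlen))
+            {
+                equal = false;          /* NULL mismatch or datum mismatch */
+                break;
+            }
+        }
+
+        ExecDropSingleTupleTableSlot(slot);
+        return equal;
+    }
+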
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works only as long as no two index entries
+with the same key point to the same WARM chain. Otherwise, the same
+valid tuple would be reachable via multiple index entries, each passing
+the index key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it never performs a WARM
+update on a tuple that is already part of a WARM chain. HOT updates are
+fine because they do not add a new index entry.
+
+Even with this restriction, WARM is a significant improvement because
+the number of regular (non-HOT) updates is cut roughly in half.
+
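+Concretely, when an update modifies at least one indexed column (so it
+cannot be a HOT update), heap_update in this patch allows it to be a
+WARM update only if all of the following hold (simplified excerpt; the
+expression-index and not-yet-ready-index conditions are explained in the
+sections below):
+
+    if (relation->rd_supportswarm &&            /* all index AMs provide amrecheck */
+        !bms_overlap(modified_attrs, exprindx_attrs) &&
+        !bms_is_subset(hot_attrs, modified_attrs) &&
+        !IsSystemRelation(relation) &&
+        !bms_overlap(notready_attrs, modified_attrs) &&
+        !HeapTupleIsHeapWarmTuple(&oldtup))     /* at most one WARM update per chain */
+        use_warm_update = true;
+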
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)", which
+will produce the same value if the new heap value differs only in case.
+So we cannot rely solely on the heap column check to decide whether or
+not to insert a new index entry for expression indexes. Similarly, for
+partial indexes, the predicate expression must be evaluated to decide
+whether or not a new index entry is needed when columns referenced in
+the predicate expression change.
+
+(Neither of these is currently implemented; we simply disallow a WARM
+update if a column used in an expression index or in an index predicate
+has changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of
+the tuple being updated. Note that the t_ctid field in the heap tuple
+header is normally used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain, and in that case the t_ctid field usually points to the tuple
+itself. So in theory, we can use t_ctid to store additional information
+in the last tuple of the update chain, as long as the fact that the
+tuple is the last one is recorded elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when the HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple that remains the last valid tuple in the
+chain does not carry the root line pointer information. In such rare
+cases, the root line pointer must be found the hard way, by scanning
+the entire heap page.
+
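+For illustration, the accessors used elsewhere in the patch behave
+roughly as follows (illustrative sketch; the real definitions live in
+htup_details.h and also handle setting the flag and the offset):
+
+    /* does this tuple carry the root line pointer offset? */
+    #define HeapTupleHeaderHasRootOffset(tup) \
+        (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0)
+
+    /* fetch the root offset stored in the OffsetNumber part of t_ctid */
+    #define HeapTupleHeaderGetRootOffset(tup) \
+    ( \
+        AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+        ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+    )
+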
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in
+t_infomask2 to store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still be
+rechecked for an index key match (this covers the case where an old
+tuple is returned via the new index key). So we must follow the update
+chain to the end every time to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This handles the most
+common case, where a WARM chain is reduced to a redirected line pointer
+and a single remaining tuple.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans, but it also implies that the benefit of WARM will be
+no more than 50%. That is still significant, but if we could convert
+WARM chains back to normal status, we could do far more WARM updates.
+
+A distinguishing property of a WARM chain is that at least one index has
+more than one live index entry pointing to the root of the chain. In
+other words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but certain
+index keys may not match between the two parts. Let's say we mark heap
+tuples in each part with a special Red/Blue flag, and the same flag is
+replicated in the index tuples. For example, when new rows are inserted
+into a table, they are marked with the Blue flag and the index entries
+associated with those rows are also marked with the Blue flag. When a row
+is WARM updated, the new version is marked with the Red flag and the new
+index entry created by the update is also marked with the Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap tuple
+with the Blue flag will be reachable from the Blue pointer and one with
+the Red flag will be reachable from the Red pointer. But for indexes
+which did not create a new entry, both Blue and Red tuples are reachable
+from the Blue pointer (there is no Red pointer in such indexes). So, as a
+side note, matching Red and Blue flags is not enough from an index scan
+perspective.
+
+During the first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are marked
+with the Blue flag or all with the Red flag (but not a mix of Red and
+Blue), then the chain is a candidate for HOT conversion. We remember the
+root line pointer and the Red/Blue flag of the WARM chain in a separate
+array.
+
+If we have a Red WARM chain, then our goal is to remove the Blue
+pointers, and vice versa. But there is a catch. For Index2 above, there
+is only a Blue pointer and that must not be removed. In other words, we
+should remove a Blue pointer only if a Red pointer exists. Since index
+vacuum may visit Red and Blue pointers in any order, I think we will
+need another index pass to remove dead index pointers. So in the first
+index pass we check which WARM candidates have two index pointers. In
+the second pass, we remove the dead pointer and reset the Red flag if
+the surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and resetting the Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing a Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for the case with multiple Blue pointers.
+We can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not
+common, I am inclined to leave this case unhandled, but we must still
+check for the presence of multiple Blue pointers and ensure that we
+neither accidentally remove either of the Blue pointers nor clear the
+WARM flag on such chains.
+
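+To support this, the patch extends IndexBulkDeleteCallback (see genam.h)
+to receive the pointer's color and to return one of IBDCR_KEEP,
+IBDCR_DELETE or IBDCR_COLOR_BLUE. A second-pass callback could then
+classify pointers roughly as sketched below; the helper functions are
+purely hypothetical placeholders for the bookkeeping that the real
+vacuum code would do:
+
+    static IndexBulkDeleteCallbackResult
+    warm_cleanup_callback(ItemPointer itemptr, bool is_red, void *state)
+    {
+        /* pointer_is_dead(), chain_is_all_red() and chain_has_red_pointer()
+         * are hypothetical helpers, standing in for lookups into the state
+         * collected during the heap scan and the first index pass */
+        if (pointer_is_dead(itemptr, state))
+            return IBDCR_DELETE;        /* ordinary dead pointer */
+
+        if (chain_is_all_red(itemptr, state))
+        {
+            /* drop the old Blue pointer only if a Red pointer also exists */
+            if (!is_red && chain_has_red_pointer(itemptr, state))
+                return IBDCR_DELETE;
+            /* recolor the surviving Red pointer so the chain can become
+             * a plain HOT chain again */
+            if (is_red)
+                return IBDCR_COLOR_BLUE;
+        }
+        return IBDCR_KEEP;
+    }
+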
+CREATE INDEX CONCURRENTLY
+-------------------------
+
+Currently CREATE INDEX CONCURRENTLY (CIC) is implemented as a 3-phase
+process. In the first phase, we create the catalog entry for the new
+index so that the index is visible to all other backends, but we still
+don't use it for either reads or writes. We do, however, ensure that no
+new broken HOT chains are created by new transactions. In the second
+phase, we build the new index using an MVCC snapshot and then make the
+index available for inserts. We then do another pass over the index and
+insert any missing tuples, each time indexing only the root line
+pointer. See README.HOT for details about how HOT impacts CIC and how
+the various challenges are tackled.
+
+WARM poses another challenge because it allows the creation of HOT
+chains even when an index key is changed. But since the index is not
+ready for insertion until the second phase is over, we might end up with
+a situation where the HOT chain has tuples with different index column
+values, yet only one of those values is indexed by the new index. Note
+that during the third phase, we only index tuples whose root line
+pointer is missing from the index. But we can't easily check whether the
+existing index tuple actually indexes the heap tuple visible to the new
+MVCC snapshot. Finding that out would require us to query the index
+again for every tuple in the chain, especially if it's a WARM tuple,
+which means repeated access to the index. Another option would be to
+return index keys along with the heap TIDs when the index is scanned to
+collect all indexed TIDs during the third phase. We could then compare
+the heap tuple against the already-indexed key and decide whether or not
+to index the new tuple.
+
+We solve this problem more simply by disallowing WARM updates until the
+index is ready for insertion. We don't need to disallow WARM wholesale;
+only those updates that change the columns of the new index are
+prevented from being WARM updates.
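+
+In the patch this reduces to one extra bitmap in heap_update: the set of
+columns used by indexes that are not yet ready for inserts is obtained
+with
+
+    notready_attrs = RelationGetIndexAttrBitmap(relation,
+                                                INDEX_ATTR_BITMAP_NOTREADY);
+
+and an update is allowed to be WARM only if it does not modify any of
+those columns, i.e. only if !bms_overlap(notready_attrs, modified_attrs)
+holds (see the eligibility test sketched earlier in this README).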
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index 064909a..9c4522a 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -1958,6 +1958,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update ultimately aborted, we must still do a
+ * recheck because the failed UPDATE may have created index entries which are
+ * now stale, but still reference this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * The presence of a WARM or WARM-updated tuple signals a possible
+ * mismatch, and the caller must recheck any tuple returned from this
+ * chain against the index key.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * If the tuple stores the root line pointer, it must be the end of the chain
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1977,11 +2049,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set to false on entry by the caller, and will be set to
+ * true on exit if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2035,9 +2110,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple, in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2050,6 +2128,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2098,7 +2186,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2122,18 +2211,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested the "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases.
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller-supplied tid to the actual location of the tuple being
+ * returned.
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3492,15 +3604,18 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
+ Bitmapset *notready_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3521,6 +3636,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3545,6 +3661,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3566,10 +3686,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+ notready_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_NOTREADY);
+
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, notready_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3621,6 +3748,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3876,6 +4006,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4194,6 +4325,37 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both WARM and WARM-updated tuples because, if a
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out.
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good to do after a
+ * first round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to
+ * the caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by the update. This
+ * will be fixed once the basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !IsSystemRelation(relation) &&
+ !bms_overlap(notready_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4240,6 +4402,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root, and a third entry could create duplicates.
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID>
+ * entries in indexes, we could look at relaxing this restriction and
+ * allow even more WARM updates.
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4252,12 +4430,35 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4367,7 +4568,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4507,7 +4711,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ Bitmapset **modified_attrs, bool *warm_update)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4516,7 +4721,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs, warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7568,6 +7773,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7579,6 +7785,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7652,6 +7861,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8629,16 +8840,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8698,6 +8915,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8833,6 +9055,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git b/src/backend/access/heap/pruneheap.c a/src/backend/access/heap/pruneheap.c
index f54337c..c2bd7d6 100644
--- b/src/backend/access/heap/pruneheap.c
+++ a/src/backend/access/heap/pruneheap.c
@@ -834,6 +834,13 @@ heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
continue;
+ /*
+ * If the tuple has a root line pointer, it must be the end of the
+ * chain.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
/* Set up to scan the HOT-chain */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
diff --git b/src/backend/access/index/indexam.c a/src/backend/access/index/indexam.c
index 4e7eca7..f56c58f 100644
--- b/src/backend/access/index/indexam.c
+++ a/src/backend/access/index/indexam.c
@@ -75,10 +75,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that the index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know whether the function was called earlier in the session, and
+ * we can't call it now because of the risk of causing a deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -535,7 +552,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -574,7 +591,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -601,6 +618,12 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * Initialize the per-tuple recheck flag from the scan-level flag; if
+ * the scan-level flag is set, every tuple must be rechecked.
+ */
+ scan->xs_tuple_recheck = scan->xs_recheck;
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -610,32 +633,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of
+ * a WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional
+ * checks, since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck
+ * is added to the table? Do we need to handle this case, or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git b/src/backend/access/nbtree/nbtinsert.c a/src/backend/access/nbtree/nbtinsert.c
index 6dca810..b5cb619 100644
--- b/src/backend/access/nbtree/nbtinsert.c
+++ a/src/backend/access/nbtree/nbtinsert.c
@@ -20,11 +20,14 @@
#include "access/nbtxlog.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -250,6 +253,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -309,6 +315,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -326,112 +334,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may see our own
+ * tuple again. Since WARM updates don't create new index
+ * entries, our own tuple is reachable only via the old
+ * index pointer.
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck that the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * the wrong index pointer and must not consider
+ * this tuple.
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git b/src/backend/access/nbtree/nbtree.c a/src/backend/access/nbtree/nbtree.c
index 775f2ff..952ed8f 100644
--- b/src/backend/access/nbtree/nbtree.c
+++ a/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/indexfsm.h"
@@ -163,6 +164,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = btestimateparallelscan;
amroutine->aminitparallelscan = btinitparallelscan;
amroutine->amparallelrescan = btparallelrescan;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -344,8 +346,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git b/src/backend/access/nbtree/nbtutils.c a/src/backend/access/nbtree/nbtutils.c
index 5b259a3..c376c1b 100644
--- b/src/backend/access/nbtree/nbtutils.c
+++ a/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2069,3 +2073,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check whether the index tuple's key matches the key computed from the given
+ * heap tuple's attributes.
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git b/src/backend/access/spgist/spgutils.c a/src/backend/access/spgist/spgutils.c
index e57ac49..59ef7f3 100644
--- b/src/backend/access/spgist/spgutils.c
+++ a/src/backend/access/spgist/spgutils.c
@@ -72,6 +72,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git b/src/backend/catalog/index.c a/src/backend/catalog/index.c
index f8d9214..bba52ec 100644
--- b/src/backend/catalog/index.c
+++ a/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_AmCache = NULL;
ii->ii_Context = CurrentMemoryContext;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git b/src/backend/catalog/indexing.c a/src/backend/catalog/indexing.c
index abc344a..e5355a8 100644
--- b/src/backend/catalog/indexing.c
+++ a/src/backend/catalog/indexing.c
@@ -66,10 +66,15 @@ CatalogCloseIndexes(CatalogIndexState indstate)
*
* This should be called for each inserted or updated catalog tuple.
*
+ * If the tuple was WARM updated, modified_attrs contains the set of
+ * columns changed by the update. We must not insert new index entries for
+ * indexes which do not refer to any of the modified columns.
+ *
* This is effectively a cut-down version of ExecInsertIndexTuples.
*/
static void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
int i;
int numIndexes;
@@ -79,12 +84,28 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * still need them for some indexes.
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +133,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +168,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
@@ -168,7 +200,7 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
CatalogCloseIndexes(indstate);
return oid;
@@ -190,7 +222,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
@@ -210,12 +242,14 @@ void
CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
{
CatalogIndexState indstate;
+ bool warm_update;
+ Bitmapset *modified_attrs;
indstate = CatalogOpenIndexes(heapRel);
- simple_heap_update(heapRel, otid, tup);
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
CatalogCloseIndexes(indstate);
}
@@ -231,9 +265,12 @@ void
CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
CatalogIndexState indstate)
{
- simple_heap_update(heapRel, otid, tup);
+ Bitmapset *modified_attrs;
+ bool warm_update;
+
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
}
/*
diff --git b/src/backend/catalog/system_views.sql a/src/backend/catalog/system_views.sql
index 38be9cf..7fb1295 100644
--- b/src/backend/catalog/system_views.sql
+++ a/src/backend/catalog/system_views.sql
@@ -498,6 +498,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -528,7 +529,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
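
As a quick aside for reviewers: once the patch is applied, the new counter can be
watched next to the existing HOT counter with a plain query against the stats
views changed above. This is only an illustration, not part of the patch; note
that counts reach the stats collector asynchronously, so they may lag a little:

  SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_warm_upd
    FROM pg_stat_user_tables
   WHERE relname = 'updtst_tab1';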
diff --git b/src/backend/commands/constraint.c a/src/backend/commands/constraint.c
index e2544e5..d9c0fe7 100644
--- b/src/backend/commands/constraint.c
+++ a/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = castNode(TriggerData, fcinfo->context);
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git b/src/backend/commands/copy.c a/src/backend/commands/copy.c
index 949844d..38702e5 100644
--- b/src/backend/commands/copy.c
+++ a/src/backend/commands/copy.c
@@ -2680,6 +2680,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2834,6 +2836,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git b/src/backend/commands/indexcmds.c a/src/backend/commands/indexcmds.c
index 72bb06c..d8f033d 100644
--- b/src/backend/commands/indexcmds.c
+++ a/src/backend/commands/indexcmds.c
@@ -699,7 +699,14 @@ DefineIndex(Oid relationId,
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * tries to either insert into it or use it for queries. In addition,
+ * WARM updates will be disallowed if an update modifies one of the
+ * columns used by this new index. This is necessary to ensure that we
+ * don't create WARM tuples which do not have a corresponding entry in
+ * this index. Note that during the second phase we will index only
+ * those heap tuples whose root line pointer is not already in the index,
+ * hence it's important that all tuples in a given chain have the same
+ * value for any indexed column (including the columns of this new index).
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
@@ -747,7 +754,10 @@ DefineIndex(Oid relationId,
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
- * chain have different index keys.
+ * chain have different index keys. Also, the new index is consulted for
+ * deciding whether a WARM update is possible, and a WARM update is not done
+ * if a column used by this index is being updated. This ensures that we
+ * don't create WARM tuples which are not indexed by this index.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
@@ -782,7 +792,8 @@ DefineIndex(Oid relationId,
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
- * insert new entries into the index for insertions and non-HOT updates.
+ * insert new entries into the index for insertions and non-HOT updates, or
+ * for WARM updates where this index needs a new entry.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
diff --git b/src/backend/commands/vacuumlazy.c a/src/backend/commands/vacuumlazy.c
index 005440e..1388be1 100644
--- b/src/backend/commands/vacuumlazy.c
+++ a/src/backend/commands/vacuumlazy.c
@@ -1032,6 +1032,19 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM
+ * tuple, there could be multiple index entries
+ * pointing to the root of this chain. We can't do
+ * index-only scans for such tuples without verifying
+ * the index keys. So mark the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ break;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, visibility_cutoff_xid))
visibility_cutoff_xid = xmin;
@@ -2158,6 +2171,18 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index keys. So mark
+ * the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git b/src/backend/executor/execIndexing.c a/src/backend/executor/execIndexing.c
index 2142273..d62d2de 100644
--- b/src/backend/executor/execIndexing.c
+++ a/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
indexInfo); /* index AM may need this */
@@ -791,6 +804,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git b/src/backend/executor/execReplication.c a/src/backend/executor/execReplication.c
index ebf3f6b..1fa13a5 100644
--- b/src/backend/executor/execReplication.c
+++ a/src/backend/executor/execReplication.c
@@ -399,6 +399,8 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate, false, NULL,
NIL);
@@ -445,6 +447,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +459,30 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple, &modified_attrs, &warm_update);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self,
+ &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
diff --git b/src/backend/executor/nodeBitmapHeapscan.c a/src/backend/executor/nodeBitmapHeapscan.c
index f18827d..f81d290 100644
--- b/src/backend/executor/nodeBitmapHeapscan.c
+++ a/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -362,11 +363,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update, treat
+ * the bitmap page as lossy so that the quals get rechecked.
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git b/src/backend/executor/nodeIndexscan.c a/src/backend/executor/nodeIndexscan.c
index 0a9dfdb..38c7827 100644
--- b/src/backend/executor/nodeIndexscan.c
+++ a/src/backend/executor/nodeIndexscan.c
@@ -118,10 +118,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git b/src/backend/executor/nodeModifyTable.c a/src/backend/executor/nodeModifyTable.c
index 95e1589..a1f3440 100644
--- b/src/backend/executor/nodeModifyTable.c
+++ a/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
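
To spell out what the executor change above means in practice, here is a rough
SQL-level sketch (illustration only, with table and index names invented; the
warm regression test at the end of the patch exercises the same scenario in
detail). Only the indexes whose columns were modified receive new entries, and
those entries point at the root line pointer of the chain rather than at the
new tuple:

  CREATE TABLE t (a integer, b integer, c text);
  CREATE INDEX t_a_idx ON t (a);
  CREATE INDEX t_b_idx ON t (b);
  INSERT INTO t VALUES (1, 1, 'foo');

  -- No indexed column changes: HOT update (given free space on the page),
  -- so no index entries are inserted at all.
  UPDATE t SET c = 'bar' WHERE a = 1;

  -- Only the column of t_b_idx changes: WARM update, a new entry is
  -- inserted into t_b_idx only, pointing at the root of the chain;
  -- t_a_idx keeps serving the row through its existing entry.
  UPDATE t SET b = 2 WHERE a = 1;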
diff --git b/src/backend/postmaster/pgstat.c a/src/backend/postmaster/pgstat.c
index ada374c..308ae8c 100644
--- b/src/backend/postmaster/pgstat.c
+++ a/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4088,6 +4090,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5197,6 +5200,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5224,6 +5228,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git b/src/backend/utils/adt/pgstatfuncs.c a/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- b/src/backend/utils/adt/pgstatfuncs.c
+++ a/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git b/src/backend/utils/cache/relcache.c a/src/backend/utils/cache/relcache.c
index 9001e20..c85898c 100644
--- b/src/backend/utils/cache/relcache.c
+++ a/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -4352,6 +4353,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we had better also invalidate the
+ * index attribute bitmap (which should automatically invalidate the other
+ * bitmaps such as the primary key and replica identity attributes).
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
@@ -4759,15 +4767,19 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
+ Bitmapset *indxnotreadyattrs; /* columns in not-yet-ready indexes */
List *indexoidlist;
List *newindexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4782,6 +4794,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return bms_copy(relation->rd_indxnotreadyattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4822,9 +4838,11 @@ restart:
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
+ indxnotreadyattrs = NULL;
foreach(l, indexoidlist)
{
Oid indexOid = lfirst_oid(l);
@@ -4861,6 +4879,10 @@ restart:
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_member(indxnotreadyattrs,
+ attrnum - FirstLowInvalidHeapAttributeNumber);
+
if (isKey)
uindexattrs = bms_add_member(uindexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
@@ -4876,10 +4898,29 @@ restart:
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_members(indxnotreadyattrs,
+ exprindexattrs);
+
+ /*
+ * Check if the index has the amrecheck method defined. If the method
+ * is not defined, the index does not support WARM, so completely
+ * disable WARM updates on such tables.
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
+
index_close(indexDesc, AccessShareLock);
}
@@ -4912,15 +4953,22 @@ restart:
goto restart;
}
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
relation->rd_pkattr = NULL;
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
+ bms_free(relation->rd_indxnotreadyattr);
+ relation->rd_indxnotreadyattr = NULL;
/*
* Now save copies of the bitmaps in the relcache entry. We intentionally
@@ -4933,7 +4981,9 @@ restart:
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
+ relation->rd_indxnotreadyattr = bms_copy(indxnotreadyattrs);
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4947,6 +4997,10 @@ restart:
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return indxnotreadyattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
@@ -5559,6 +5613,7 @@ load_relcache_init_file(bool shared)
rel->rd_keyattr = NULL;
rel->rd_pkattr = NULL;
rel->rd_idattr = NULL;
+ rel->rd_indxnotreadyattr = NULL;
rel->rd_pubactions = NULL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
diff --git b/src/include/access/amapi.h a/src/include/access/amapi.h
index f919cf8..d7702e5 100644
--- b/src/include/access/amapi.h
+++ a/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -152,6 +153,10 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -217,6 +222,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WARM */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git b/src/include/access/hash.h a/src/include/access/hash.h
index 3bf587b..bc9c8fe 100644
--- b/src/include/access/hash.h
+++ a/src/include/access/hash.h
@@ -385,4 +385,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git b/src/include/access/heapam.h a/src/include/access/heapam.h
index 95aa976..9412c3a 100644
--- b/src/include/access/heapam.h
+++ a/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -161,7 +162,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -176,7 +178,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ Bitmapset **modified_attrs,
+ bool *warm_update);
extern void heap_sync(Relation relation);
diff --git b/src/include/access/heapam_xlog.h a/src/include/access/heapam_xlog.h
index e6019d5..9b081bf 100644
--- b/src/include/access/heapam_xlog.h
+++ a/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git b/src/include/access/htup_details.h a/src/include/access/htup_details.h
index 7552186..ddbdbcd 100644
--- b/src/include/access/htup_details.h
+++ a/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is part of a WARM chain */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
/*
* Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
* the TID of the tuple itself in t_ctid field to mark the end of the chain.
@@ -785,6 +801,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git b/src/include/access/nbtree.h a/src/include/access/nbtree.h
index 6289ffa..08d056d 100644
--- b/src/include/access/nbtree.h
+++ a/src/include/access/nbtree.h
@@ -538,6 +538,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git b/src/include/access/relscan.h a/src/include/access/relscan.h
index ce3ca8d..12d3b0c 100644
--- b/src/include/access/relscan.h
+++ a/src/include/access/relscan.h
@@ -112,7 +112,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git b/src/include/catalog/pg_proc.h a/src/include/catalog/pg_proc.h
index bb7053a..21d0789 100644
--- b/src/include/catalog/pg_proc.h
+++ a/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git b/src/include/executor/executor.h a/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- b/src/include/executor/executor.h
+++ a/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git b/src/include/executor/nodeIndexscan.h a/src/include/executor/nodeIndexscan.h
index ea3f3a5..ebeec74 100644
--- b/src/include/executor/nodeIndexscan.h
+++ a/src/include/executor/nodeIndexscan.h
@@ -41,5 +41,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git b/src/include/nodes/execnodes.h a/src/include/nodes/execnodes.h
index 1c1cb80..fb00b96 100644
--- b/src/include/nodes/execnodes.h
+++ a/src/include/nodes/execnodes.h
@@ -64,6 +64,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git b/src/include/pgstat.h a/src/include/pgstat.h
index 8b710ec..2ee690b 100644
--- b/src/include/pgstat.h
+++ a/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1178,7 +1180,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git b/src/include/utils/rel.h a/src/include/utils/rel.h
index a617a7c..fbac7c0 100644
--- b/src/include/utils/rel.h
+++ a/src/include/utils/rel.h
@@ -138,9 +138,14 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
+ Bitmapset *rd_indxnotreadyattr; /* columns used by indexes not yet
+ ready */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm; /* True if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git b/src/include/utils/relcache.h a/src/include/utils/relcache.h
index da36b67..d18bd09 100644
--- b/src/include/utils/relcache.h
+++ a/src/include/utils/relcache.h
@@ -50,7 +50,9 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE,
+ INDEX_ATTR_BITMAP_NOTREADY
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git b/src/test/regress/expected/rules.out a/src/test/regress/expected/rules.out
index c661f1d..561d9579 100644
--- b/src/test/regress/expected/rules.out
+++ a/src/test/regress/expected/rules.out
@@ -1732,6 +1732,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1875,6 +1876,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,6 +1920,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1955,7 +1958,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1971,7 +1975,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1993,7 +1998,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git b/src/test/regress/expected/warm.out a/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..0aa3bb7
--- /dev/null
+++ a/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- This could be a HOT update since only a non-index column is updated, but the
+-- page won't have any free space, so it will probably be a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=72)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab1 (cost=4.45..47.23 rows=22 width=4)
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1 (cost=0.00..4.45 rows=22 width=0)
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1 (cost=0.29..9.16 rows=50 width=4)
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Bitmap Heap Scan on updtst_tab2 (cost=4.18..12.64 rows=4 width=72)
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2 (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2 (cost=0.14..4.16 rows=1 width=4)
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on updtst_tab3 (cost=0.00..2.25 rows=1 width=4)
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new values using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3 (cost=0.14..8.16 rows=1 width=4)
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git b/src/test/regress/parallel_schedule a/src/test/regress/parallel_schedule
index edeb2d6..2268705 100644
--- b/src/test/regress/parallel_schedule
+++ a/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git b/src/test/regress/sql/warm.sql a/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..b73c278
--- /dev/null
+++ a/src/test/regress/sql/warm.sql
@@ -0,0 +1,172 @@
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This could be a HOT update since only a non-index column is updated, but the
+-- page won't have any free space, so it will probably be a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both old and new values using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
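The hunks above widen pgstat_count_heap_update() with a third boolean and surface the new counter as n_tup_warm_upd in the statistics views. As a reviewer's aid, here is a minimal sketch of how an update path could report its outcome through the widened signature; the wrapper count_update_outcome() and its boolean arguments are hypothetical and not part of the patch.

#include "postgres.h"
#include "pgstat.h"
#include "utils/rel.h"

/*
 * Reviewer's sketch, not part of the patch: report the outcome of an UPDATE
 * through the widened pgstat_count_heap_update() declared above.  The wrapper
 * and its boolean arguments are hypothetical.
 */
static void
count_update_outcome(Relation relation, bool use_hot_update, bool use_warm_update)
{
	if (use_hot_update)
		pgstat_count_heap_update(relation, true, false);	/* HOT: no index inserts */
	else
		pgstat_count_heap_update(relation, false, use_warm_update);	/* WARM or regular */
}

Monitoring tools can then distinguish the three kinds of updates by comparing n_tup_upd, n_tup_hot_upd and n_tup_warm_upd in pg_stat_all_tables.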
Attachment: 0001_track_root_lp_v12.patch (application/octet-stream)
diff --git b/src/backend/access/heap/heapam.c a/src/backend/access/heap/heapam.c
index 74fb09c..064909a 100644
--- b/src/backend/access/heap/heapam.c
+++ a/src/backend/access/heap/heapam.c
@@ -94,7 +94,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2248,13 +2249,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2385,6 +2386,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2423,8 +2425,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
- RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup,
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
+
+ /* We must not overwrite the speculative insertion token. */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2652,6 +2659,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2722,7 +2730,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
+
+ /* Mark this tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2730,7 +2743,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
+ /* Mark each tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -3002,6 +3018,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3012,6 +3029,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3053,7 +3071,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3183,7 +3202,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain.
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3232,6 +3261,22 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+ * heap_get_root_tuple() may call palloc, which is disallowed once we
+ * enter the critical section. So check if the root offset is cached in the
+ * tuple and if not, fetch that information the hard way before entering the
+ * critical section.
+ *
+ * Unless we are dealing with a pg_upgrade'd cluster, the root offset
+ * information should usually be cached, so there should not be much
+ * overhead in fetching it. Also, once a tuple is updated, the
+ * information is copied to the new version, so we do not pay this
+ * price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tp.t_self));
+
START_CRIT_SECTION();
/*
@@ -3259,8 +3304,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain. */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3461,6 +3508,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3523,6 +3572,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3807,7 +3857,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3947,6 +4002,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3974,6 +4030,14 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available.
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self));
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3988,7 +4052,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4146,6 +4211,10 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4171,6 +4240,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained by hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4178,10 +4258,22 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
+ root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, we just copy the information from the old
+ * tuple if available, or use the value computed above. For regular updates,
+ * RelationPutHeapTuple must have returned the actual offset number where
+ * the new version was inserted, and we store that value since the update
+ * starts a new HOT chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4194,7 +4286,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4233,6 +4325,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4513,7 +4606,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4522,9 +4616,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4544,6 +4640,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4571,7 +4668,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5009,7 +5110,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5057,6 +5163,10 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self));
+
START_CRIT_SECTION();
/*
@@ -5085,7 +5195,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5599,6 +5712,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5607,6 +5721,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5836,7 +5952,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5845,7 +5961,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5962,7 +6078,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6088,8 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7437,6 +7552,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7557,6 +7673,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8211,7 +8330,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, xlrec->offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8301,7 +8426,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8436,8 +8562,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8573,7 +8699,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8706,13 +8832,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored.
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8775,6 +8905,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself. */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8838,11 +8971,17 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git b/src/backend/access/heap/hio.c a/src/backend/access/heap/hio.c
index 6529fe3..8052519 100644
--- b/src/backend/access/heap/hio.c
+++ a/src/backend/access/heap/hio.c
@@ -31,12 +31,20 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it
+ * is known. The former is used while updating an existing tuple, where the
+ * caller tells us the root line pointer of the chain. The latter is used when
+ * inserting a new row, in which case the root line pointer is set to the
+ * offset at which the tuple is inserted.
*/
-void
+OffsetNumber
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,17 +68,24 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead).
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number. */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(root_offnum))
+ root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, root_offnum);
}
+
+ return root_offnum;
}
/*
diff --git b/src/backend/access/heap/pruneheap.c a/src/backend/access/heap/pruneheap.c
index d69a266..f54337c 100644
--- b/src/backend/access/heap/pruneheap.c
+++ a/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple as the last tuple
+ * in the chain, without clearing the HOT-updated flag. So we must
+ * check if this is the last tuple in the chain and stop following the
+ * CTID, else we risk getting into an infinite recursion (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,47 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is InvalidOffsetNumber, the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line pointers
+ * and tuples on the page in order to find the root line pointers. To minimize
+ * the cost, we break out early if target_offnum is specified and the root line
+ * pointer for target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +807,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return.
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping. */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +869,65 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return.
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful about not writing past the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item. */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple as the last tuple in the chain
+ * and store the root offset in CTID, without clearing the HOT-updated
+ * flag. So we must check whether CTID actually holds the root offset and
+ * break to avoid infinite recursion.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple.
+ */
+OffsetNumber
+heap_get_root_tuple(Page page, OffsetNumber target_offnum)
+{
+ OffsetNumber offnum = InvalidOffsetNumber;
+ heap_get_root_tuples_internal(page, target_offnum, &offnum);
+ return offnum;
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git b/src/backend/access/heap/rewriteheap.c a/src/backend/access/heap/rewriteheap.c
index c7b283c..6ced1e7 100644
--- b/src/backend/access/heap/rewriteheap.c
+++ a/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain.
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest.
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git b/src/backend/executor/execIndexing.c a/src/backend/executor/execIndexing.c
index 5242dee..2142273 100644
--- b/src/backend/executor/execIndexing.c
+++ a/src/backend/executor/execIndexing.c
@@ -789,7 +789,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git b/src/backend/executor/execMain.c a/src/backend/executor/execMain.c
index a666391..bd72ad3 100644
--- b/src/backend/executor/execMain.c
+++ a/src/backend/executor/execMain.c
@@ -2585,7 +2585,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2593,7 +2593,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git b/src/include/access/heapam.h a/src/include/access/heapam.h
index a864f78..95aa976 100644
--- b/src/include/access/heapam.h
+++ a/src/include/access/heapam.h
@@ -189,6 +189,7 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern OffsetNumber heap_get_root_tuple(Page page, OffsetNumber target_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git b/src/include/access/heapam_xlog.h a/src/include/access/heapam_xlog.h
index b285f17..e6019d5 100644
--- b/src/include/access/heapam_xlog.h
+++ a/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git b/src/include/access/hio.h a/src/include/access/hio.h
index 2824f23..921cb37 100644
--- b/src/include/access/hio.h
+++ a/src/include/access/hio.h
@@ -35,8 +35,8 @@ typedef struct BulkInsertStateData
} BulkInsertStateData;
-extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git b/src/include/access/htup_details.h a/src/include/access/htup_details.h
index a6c7e31..7552186 100644
--- b/src/include/access/htup_details.h
+++ a/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in the
+ * update chain and ip_posid points
+ * to the root line pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,43 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+/*
+ * Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
+ * the TID of the tuple itself in the t_ctid field to mark the end of the chain.
+ * Starting with PG v10, we instead use the HEAP_LATEST_TUPLE flag to identify
+ * the last tuple and store the root line pointer of the HOT chain in the
+ * t_ctid field.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * Starting from PostgreSQL 10, the latest tuple in an update chain has
+ * HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do not.
+ * For those, we determine whether a tuple is latest by testing that its t_ctid
+ * points to itself.
+ *
+ * Note: beware of multiple evaluations of "tup" and "tid" arguments.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +585,56 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * now have a new tuple in the chain and this is no longer the last tuple of
+ * the chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller must have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+/*
+ * Get the root line pointer of the HOT chain. The caller should have confirmed
+ * that the root offset is cached before calling this macro.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * Return whether the tuple has a cached root offset. We don't use
+ * HeapTupleHeaderIsHeapLatest because that one also considers the case of
+ * t_ctid pointing to itself, for tuples migrated from pre-v10 clusters. Here
+ * we are only interested in tuples that are marked with the HEAP_LATEST_TUPLE
+ * flag.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
Attachment: 0000_interesting_attrs.patch (application/octet-stream)
commit 2c2e2be0a6459521ad1aebb285a3555649cc02ba
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sun Jan 1 16:29:10 2017 +0530
Alvaro's patch on interesting attrs
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index af25836..74fb09c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,11 +96,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3455,6 +3452,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3472,9 +3471,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3501,21 +3497,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3536,7 +3541,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3562,6 +3567,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3573,10 +3582,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3815,6 +3821,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4119,7 +4127,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4134,7 +4142,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4282,13 +4292,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4322,7 +4334,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4367,114 +4379,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
- *
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
-
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
+ int attnum;
+ Bitmapset *modified = NULL;
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
-
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
Hi Tom,
On Wed, Feb 1, 2017 at 3:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(I'm a little more concerned by Alvaro's apparent position that WARM
is a done deal; I didn't think so.
Are there any specific aspects of the design that you're not comfortable
with? I'm sure there could be some rough edges in the implementation that
I'm hoping will get handled during the further review process. But if there
are some obvious things I'm overlooking please let me know.
Probably the same question to Andres/Robert, who have flagged concerns. On my
side, I've run some very long tests with data validation and haven't found
any new issues with the most recent patches.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 31, 2017 at 04:52:39PM +0530, Pavan Deolasee wrote:
The other critical bug I found, which unfortunately exists in the master too,
is the index corruption during CIC. The patch includes the same fix that I've
proposed on the other thread. With these changes, the WARM stress test has been
running fine for the last 24 hours on a decently powerful box. Multiple
CREATE/DROP INDEX cycles and updates via different indexed columns, with a mix
of FOR SHARE/UPDATE and rollbacks, did not produce any consistency issues. A
side note: while performance measurement wasn't a goal of the stress tests,
WARM has done about 67% more transactions than master in a 24-hour period (95M
in master vs 156M in WARM, to be precise, on a 30GB table including indexes).
I believe the numbers would be far better had the test not been dropping and
recreating the indexes, thus effectively cleaning up all index bloat. Also, the
table is small enough to fit in shared buffers. I'll rerun these tests with a
much larger scale factor and without dropping indexes.
Thanks for setting up the test harness. I know it is hard but
in this case it has found an existing bug and given good performance
numbers. :-)
I have what might be a stupid question. As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Wed, Feb 1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote:
contains a WARM tuple. Alternate ideas/suggestions and review of the design
are welcome!
t_infomask2 contains one last unused bit,
Umm, WARM is using 2 unused bits from t_infomask2. You mean there is another
free bit after that too?
We are obviously going to use several heap or item pointer bits for
WARM, and once we do that it is going to be hard to undo that. Pavan,
are you saying you could do more with WARM if you had more bits? Are we
sure we have given you all the bits we can? Do we want to commit to a
lesser feature because the bits are not available?
and we could reuse vacuum
full's bits (HEAP_MOVED_OUT, HEAP_MOVED_IN), but that will need some
thinking ahead. Maybe now's the time to start versioning relations so
that we can ensure clusters upgraded to pg10 do not contain any of those
bits in any tuple headers.
Yeah, IIRC old VACUUM FULL was removed in 9.0, which is a good 6 years old.
Obviously, there is still a chance that a pre-9.0 binary-upgraded cluster
exists and upgrades to 10. So we still need to do something about them if we
reuse these bits. I'm surprised to see that we don't have any mechanism in
place to clear those bits. So maybe we should add something to do that.
Yeah, good question. :-( We have talked about adding some page,
table, or cluster-level version number so we could identify if a given
tuple _could_ be using those bits, but never did it.
I had some other ideas (and a patch too) to reuse bits from t_ctid.ip_posid,
given that offset numbers can be represented in just 13 bits, even with the
maximum block size. I can look at that if it comes to finding more bits.
OK, so it seems more bits is not a blocker to enhancements, yet.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.
Right, before this patch they would be two independent HOT chains. It
still seems like an unexpectedly-high performance win. Are two
independent HOT chains that much more expensive than joining them via
WARM?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce Momjian wrote:
On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.
Right, before this patch they would be two independent HOT chains.
No, they would be a regular update chain, not HOT updates.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 23, 2017 at 03:26:09PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.Right, before this patch they would be two independent HOT chains.
No, they would be a regular update chain, not HOT updates.
Well, let's walk through this. Let's suppose you have three updates
that stay on the same page and don't update any indexed columns --- that
would produce a HOT chain of four tuples. If you then do an update that
changes an indexed column, prior to this patch, you get a normal update,
and more HOT updates can be added to this. With WARM, we can join those
chains and potentially trim the first HOT chain as those tuples become
invisible.
Am I missing something?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce Momjian wrote:
Well, let's walk through this. Let's suppose you have three updates
that stay on the same page and don't update any indexed columns --- that
would produce a HOT chain of four tuples. If you then do an update that
changes an indexed column, prior to this patch, you get a normal update,
and more HOT updates can be added to this. With WARM, we can join those
chains
With WARM, what happens is that the first three updates are HOT updates
just like currently, and the fourth one is a WARM update.
and potentially trim the first HOT chain as those tuples become
invisible.
That can already happen even without WARM, no?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
Well, let's walk through this. Let's suppose you have three updates
that stay on the same page and don't update any indexed columns --- that
would produce a HOT chain of four tuples. If you then do an update that
changes an indexed column, prior to this patch, you get a normal update,
and more HOT updates can be added to this. With WARM, we can join those
chainsWith WARM, what happens is that the first three updates are HOT updates
just like currently, and the fourth one is a WARM update.
Right.
and potentially trim the first HOT chain as those tuples become
invisible.
That can already happen even without WARM, no?
Uh, the point is that with WARM those four early tuples can be removed
via a prune, rather than requiring a VACUUM. Without WARM, the fourth
tuple can't be removed until the index is cleared by VACUUM.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce Momjian wrote:
On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
and potentially trim the first HOT chain as those tuples become
invisible.
That can already happen even without WARM, no?
Uh, the point is that with WARM those four early tuples can be removed
via a prune, rather than requiring a VACUUM. Without WARM, the fourth
tuple can't be removed until the index is cleared by VACUUM.
I *think* that the WARM-updated one cannot be pruned either, because
it's pointed to by at least one index (otherwise it'd have been a HOT
update). The ones prior to that can be removed either way.
I think the part you want (be able to prune the WARM updated tuple) is
part of what Pavan calls "turning the WARM chain into a HOT chain", so
not part of the initial patch. Pavan can explain this part better, and
also set me straight in case I'm wrong in the above :-)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 23, 2017 at 03:58:59PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
and potentially trim the first HOT chain as those tuples become
invisible.
That can already happen even without WARM, no?
Uh, the point is that with WARM those four early tuples can be removed
via a prune, rather than requiring a VACUUM. Without WARM, the fourth
tuple can't be removed until the index is cleared by VACUUM.
I *think* that the WARM-updated one cannot be pruned either, because
it's pointed to by at least one index (otherwise it'd have been a HOT
update). The ones prior to that can be removed either way.
Well, if you can't prune across index-column changes, how is a WARM
update different than just two HOT chains with no WARM linkage?
I think the part you want (be able to prune the WARM updated tuple) is
part of what Pavan calls "turning the WARM chain into a HOT chain", so
not part of the initial patch. Pavan can explain this part better, and
also set me straight in case I'm wrong in the above :-)
VACUUM can already remove entire HOT chains that have expired. What
his VACUUM patch does, I think, is to remove the index entries that no
longer point to values in the HOT/WARM chain, turning the chain into a
fully HOT chain, so another WARM addition to the chain can happen.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Feb 23, 2017 at 11:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Feb 1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote:
contains a WARM tuple. Alternate ideas/suggestions and review of the design
are welcome!
t_infomask2 contains one last unused bit,
Umm, WARM is using 2 unused bits from t_infomask2. You mean there is another
free bit after that too?
We are obviously going to use several heap or item pointer bits for
WARM, and once we do that it is going to be hard to undo that. Pavan,
are you saying you could do more with WARM if you had more bits? Are we
sure we have given you all the bits we can? Do we want to commit to a
lesser feature because the bits are not available?
The btree implementation is as complete as I would like (there are a few
TODOs, but no show stoppers), at least for the first release. There is a
free bit in the btree index tuple header that I could use for chain conversion.
In the heap tuples, I can reuse HEAP_MOVED_OFF because that bit will only
be set along with the HEAP_WARM_TUPLE bit. Since none of the upgraded clusters
can have the HEAP_WARM_TUPLE bit set, I think we are safe.
WARM currently also supports hash indexes, but there is no free bit left in
hash index tuple header. I think I can work around that by using a bit from
ip_posid (not yet implemented/tested, but seems doable).
IMHO if we can do that i.e. support btree and hash indexes to start with,
we should be good to go for the first release. We can try to support other
popular index AMs in the subsequent release.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Jan 31, 2017 at 04:52:39PM +0530, Pavan Deolasee wrote:
The other critical bug I found, which unfortunately exists in the master too,
is the index corruption during CIC. The patch includes the same fix that I've
proposed on the other thread. With these changes, the WARM stress test has been
running fine for the last 24 hours on a decently powerful box. Multiple
CREATE/DROP INDEX cycles and updates via different indexed columns, with a mix
of FOR SHARE/UPDATE and rollbacks, did not produce any consistency issues. A
side note: while performance measurement wasn't a goal of the stress tests,
WARM has done about 67% more transactions than master in a 24-hour period (95M
in master vs 156M in WARM, to be precise, on a 30GB table including indexes).
I believe the numbers would be far better had the test not been dropping and
recreating the indexes, thus effectively cleaning up all index bloat. Also, the
table is small enough to fit in shared buffers. I'll rerun these tests with a
much larger scale factor and without dropping indexes.
Thanks for setting up the test harness. I know it is hard but
in this case it has found an existing bug and given good performance
numbers. :-)
I have what might be a stupid question. As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
I'm not sure how the test case is set up. If the table has multiple
indexes, each on a different column, and only one of the indexes is
updated, then you figure to win because now the other indexes need
less maintenance (and get less bloated). If you have only a single
index, then I don't see how WARM can be any better than HOT, but maybe
I just don't understand the situation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.
Right, before this patch they would be two independent HOT chains. It
still seems like an unexpectedly-high performance win. Are two
independent HOT chains that much more expensive than joining them via
WARM?
In these tests, there are zero HOT updates, since every update modifies
some index column. With WARM, we could reduce regular updates to half, even
when we allow only one WARM update per chain (the chain really has a single
tuple for this discussion). IOW, approximately half the updates insert a new
index entry in *every* index and half insert a new index entry *only* in the
affected index. That by itself does a good bit for performance.
So to answer your question: yes, joining two HOT chains via WARM is much
cheaper because it results in creating new index entries just for affected
indexes.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 24, 2017 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have what might be a stupid question. As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
I'm not sure how the test case is set up. If the table has multiple
indexes, each on a different column, and only one of the indexes is
updated, then you figure to win because now the other indexes need
less maintenance (and get less bloated). If you have only a single
index, then I don't see how WARM can be any better than HOT, but maybe
I just don't understand the situation.
That's correct. If you have just one index and the UPDATE modifies the
indexed column, the UPDATE won't be a WARM update and the patch gives you
no benefit. OTOH if the UPDATE doesn't modify any indexed columns, then it
will be a HOT update and again the patch gives you no benefit. It might be
worthwhile to see if patch causes any regression in these scenarios, though
I think it will be minimal or zero.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 24, 2017 at 12:28 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Bruce Momjian wrote:
On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
and potentially trim the first HOT chain as those tuples become
invisible.
That can already happen even without WARM, no?
Uh, the point is that with WARM those four early tuples can be removed
via a prune, rather than requiring a VACUUM. Without WARM, the fourth
tuple can't be removed until the index is cleared by VACUUM.
I *think* that the WARM-updated one cannot be pruned either, because
it's pointed to by at least one index (otherwise it'd have been a HOT
update). The ones prior to that can be removed either way.
No, even the WARM-updated tuple can be pruned, and if there are further HOT
updates, those can be pruned too. All indexes, and even multiple pointers
from the same index, always point to the root of the WARM chain, and
that line pointer does not go away unless the entire chain becomes dead. The
only material difference between HOT and WARM is that, since there can be two
index pointers from the same index to the same root line pointer, we must
do a recheck. But HOT-pruning and all such things remain the same.
Let's take an example. Say, we have a table (a int, b int, c text) and two
indexes on the first two columns.
(1, 100, 'foo') --H--> (1, 100, 'bar') --W--> (1, 200, 'bar') --H--> (1, 200, 'foo')
The first update will be a HOT update, the second update will be a WARM
update and the third update will again be a HOT update. The first and third
update do not create any new index entry, though the second update will
create a new index entry in the second index. Any further WARM updates to
this chain are not allowed, but further HOT updates are OK.
If all but the last version become DEAD, HOT-prune will remove all of them
and turn the first line pointer into a REDIRECT line pointer. At this point,
the first index has one index pointer and the second index has two index
pointers, but all pointing to the same root line pointer, which has now
become a REDIRECT line pointer.
o ----Redirect----> (1, 200, 'foo')
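To make the recheck requirement concrete, here is a minimal sketch in C of
what an index scan might do when it follows a pointer into a WARM chain. This
is not the patch code: HeapTupleHeaderIsWarmTuple() and
warm_recheck_index_keys() are hypothetical names standing in for whatever the
real patch uses to test the WARM flag and to recompute and compare the indexed
columns.

/*
 * Conceptual sketch only (helper names are invented, not from the patch).
 * When an index scan reaches a heap tuple through a WARM chain, there may
 * be two pointers from the same index to the same root line pointer, so we
 * must verify that this particular index tuple's key still matches the heap
 * tuple before returning it; otherwise the tuple could be returned twice.
 */
static bool
warm_tuple_matches_index(Relation heapRel, Relation indexRel,
                         IndexTuple itup, HeapTuple htup)
{
    /* Plain HOT chains never need the recheck. */
    if (!HeapTupleHeaderIsWarmTuple(htup->t_data))
        return true;

    /*
     * WARM chain: recompute the indexed columns from the heap tuple and
     * compare them with the key stored in the index tuple.  Only the
     * pointer whose key matches may return this tuple.
     */
    return warm_recheck_index_keys(heapRel, indexRel, itup, htup);
}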
I think the part you want (be able to prune the WARM updated tuple) is
part of what Pavan calls "turning the WARM chain into a HOT chain", so
not part of the initial patch. Pavan can explain this part better, and
also set me straight in case I'm wrong in the above :-)
Umm.. it's a bit different. Without chain conversion, we still don't allow
further WARM updates to the above chain because that might create a third
index pointer and our recheck logic can't cope with duplicate scans. HOT
updates are allowed though.
The latest patch that I proposed will handle this case and convert such
chains into regular HOT-pruned chains. To do that, we must remove the
duplicate (and now wrong) index pointer to the chain. Once we do that and
change the state on the heap tuple, we can once again do WARM update to
this tuple. Note that in this example the chain has just one tuple, which
will be the case typically, but the algorithm can deal with the case where
there are multiple tuples but with matching index keys.
Hope this helps.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 24, 2017 at 2:42 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Let's take an example. Say, we have a table (a int, b int, c text) and two
indexes on the first two columns.
(1, 100, 'foo') --H--> (1, 100, 'bar') --W--> (1, 200, 'bar') --H--> (1, 200, 'foo')
The first update will be a HOT update, the second update will be a WARM
update and the third update will again be a HOT update. The first and third
update do not create any new index entry, though the second update will
create a new index entry in the second index. Any further WARM updates to
this chain are not allowed, but further HOT updates are OK.
If all but the last version become DEAD, HOT-prune will remove all of them
and turn the first line pointer into a REDIRECT line pointer.
So, when you do the WARM update, the new index entries still point at
the original root, which they don't match, not the version where that
new value first appeared?
I don't immediately see how this will work with index-only scans. If
the tuple is HOT updated several times, HOT-pruned back to a single
version, and then the page is all-visible, the index entries are
guaranteed to agree with the remaining tuple, so it's fine to believe
the data in the index tuple. But with WARM, that would no longer be
true, unless you have some trick for that...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't immediately see how this will work with index-only scans. If
the tuple is HOT updated several times, HOT-pruned back to a single
version, and then the page is all-visible, the index entries are
guaranteed to agree with the remaining tuple, so it's fine to believe
the data in the index tuple. But with WARM, that would no longer be
true, unless you have some trick for that...
Well the trick is to not allow index-only scans on such pages by not
marking them all-visible. That's why when a tuple is WARM updated, we carry
that information in the subsequent versions even when later updates are HOT
updates. The chain conversion algorithm will handle this by clearing those
bits and thus allowing index-only scans again.
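For illustration, a rough sketch of that gating as I read it; this is not the
patch itself. HEAP_WARM_TUPLE is the flag discussed upthread, its exact name
and location in t_infomask2 should be treated as assumptions, and the helper
name is made up:

/*
 * Rough sketch, not the patch code: a page that still contains tuples
 * carrying the WARM flag must not be marked all-visible, because an
 * index-only scan could otherwise trust a stale index key.
 */
static bool
warm_page_allows_all_visible(Page page)
{
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        HeapTupleHeader tuphdr;

        if (!ItemIdIsNormal(itemid))
            continue;

        tuphdr = (HeapTupleHeader) PageGetItem(page, itemid);
        if (tuphdr->t_infomask2 & HEAP_WARM_TUPLE)
            return false;       /* chain not yet converted back to HOT */
    }

    /* The usual visibility checks still apply on top of this. */
    return true;
}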
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 24, 2017 at 3:31 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't immediately see how this will work with index-only scans. If
the tuple is HOT updated several times, HOT-pruned back to a single
version, and then the page is all-visible, the index entries are
guaranteed to agree with the remaining tuple, so it's fine to believe
the data in the index tuple. But with WARM, that would no longer be
true, unless you have some trick for that...
Well the trick is to not allow index-only scans on such pages by not marking
them all-visible. That's why when a tuple is WARM updated, we carry that
information in the subsequent versions even when later updates are HOT
updates. The chain conversion algorithm will handle this by clearing those
bits and thus allowing index-only scans again.
Wow, OK. In my view, that makes the chain conversion code pretty much
essential, because if you had WARM without chain conversion then the
visibility map gets more or less irrevocably less effective over time,
which sounds terrible. But it sounds to me like even with the chain
conversion, it might take multiple vacuum passes before all visibility
map bits are set, which isn't such a great property (thus e.g.
fdf9e21196a6f58c6021c967dc5776a16190f295).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 24, 2017 at 3:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Wow, OK. In my view, that makes the chain conversion code pretty much
essential, because if you had WARM without chain conversion then the
visibility map gets more or less irrevocably less effective over time,
which sounds terrible.
Yes. I decided to complete the chain conversion patch when I realised that IOS
will otherwise become completely useless if a large percentage of rows is
updated just once. So I agree. It's not an optional patch and should get in
with the main WARM patch.
But it sounds to me like even with the chain
conversion, it might take multiple vacuum passes before all visibility
map bits are set, which isn't such a great property (thus e.g.
fdf9e21196a6f58c6021c967dc5776a16190f295).
The chain conversion algorithm first converts the chains during vacuum and
then checks if the page can be set all-visible. So I'm not sure why it
would take multiple vacuums before a page is set all-visible. The commit
you quote was written to ensure that we make another attempt to set the
page all-visible after all dead tuples are removed from the page. Similarly,
we will convert all WARM chains to HOT chains and then check for
all-visibility of the page.
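In other words, the per-page ordering within a single vacuum pass looks
roughly like this (pseudo-code; none of these helpers exist under these
names, this only spells out the sequence being described):

/*
 * Pseudo-code for the ordering described above: chain conversion happens
 * before the all-visible test, within the same vacuum pass.
 */
static void
lazy_vacuum_warm_page(Relation onerel, Buffer buf)
{
    /* 1. Prune and remove dead tuples, as vacuum already does today. */
    prune_and_remove_dead_tuples(onerel, buf);

    /* 2. Convert WARM chains back into plain HOT chains, clearing flags. */
    convert_warm_chains_to_hot(onerel, buf);

    /* 3. Only then test whether the page can be marked all-visible. */
    if (heap_page_is_all_visible_now(onerel, buf))
        mark_page_all_visible(onerel, buf);
}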
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 24, 2017 at 4:06 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Wow, OK. In my view, that makes the chain conversion code pretty much
essential, because if you had WARM without chain conversion then the
visibility map gets more or less irrevocably less effective over time,
which sounds terrible.
Yes. I decided to complete the chain conversion patch when I realised that IOS
will otherwise become completely useless if a large percentage of rows is
updated just once. So I agree. It's not an optional patch and should get in
with the main WARM patch.
Right, and it's not just index-only scans. VACUUM gets permanently
more expensive, too, which is probably a much worse problem.
But it sounds to me like even with the chain
conversion, it might take multiple vacuum passes before all visibility
map bits are set, which isn't such a great property (thus e.g.
fdf9e21196a6f58c6021c967dc5776a16190f295).
The chain conversion algorithm first converts the chains during vacuum and
then checks if the page can be set all-visible. So I'm not sure why it would
take multiple vacuums before a page is set all-visible. The commit you quote
was written to ensure that we make another attempt to set the page
all-visible after all dead tuples are removed from the page. Similarly, we
will convert all WARM chains to HOT chains and then check for all-visibility
of the page.
OK, that sounds good. And there are no bugs, right? :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 24, 2017 at 02:14:23PM +0530, Pavan Deolasee wrote:
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
As I remember, WARM only allows
a single index-column change in the chain. Why are you seeing such a
large performance improvement? I would have thought it would be that
high if we allowed an unlimited number of index changes in the chain.
The second update in a chain creates another non-warm-updated tuple, so
the third update can be a warm update again, and so on.
Right, before this patch they would be two independent HOT chains. It
still seems like an unexpectedly-high performance win. Are two
independent HOT chains that much more expensive than joining them via
WARM?
In these tests, there are zero HOT updates, since every update modifies some
index column. With WARM, we could reduce regular updates to half, even when we
allow only one WARM update per chain (the chain really has a single tuple for
this discussion). IOW, approximately half the updates insert a new index entry
in *every* index and half insert a new index entry *only* in the affected
index. That by itself does a good bit for performance.
So to answer your question: yes, joining two HOT chains via WARM is much
cheaper because it results in creating new index entries just for affected
indexes.
OK, all my questions have been answered, including the use of flag bits.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
And there are no bugs, right? :-)
Yeah yeah absolutely nothing. Just like any other feature committed to
Postgres so far ;-)
I need to polish the chain conversion patch a bit and also add missing
support for redo, hash indexes etc. Support for hash indexes will need
overloading of ip_posid bits in the index tuple (since there are no free
bits left in hash tuples). I plan to work on that next and submit a fully
functional patch, hopefully before the commit-fest starts.
(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that
nobody has read those emails seriously or there is general acceptance to
that idea.. I am assuming the latter :-))
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 25, 2017 at 10:50:57AM +0530, Pavan Deolasee wrote:
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
And there are no bugs, right?� :-)Yeah yeah absolutely nothing. Just like any other feature committed to Postgres
so far ;-)
I need to polish the chain conversion patch a bit and also add missing support
for redo, hash indexes etc. Support for hash indexes will need overloading of
ip_posid bits in the index tuple (since there are no free bits left in hash
tuples). I plan to work on that next and submit a fully functional patch,
hopefully before the commit-fest starts.
(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that nobody
has read those emails seriously or there is general acceptance to that idea.. I
am assuming latter :-))
Yes, I think it is the latter.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Sat, Feb 25, 2017 at 10:50 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
And there are no bugs, right? :-)
Yeah yeah absolutely nothing. Just like any other feature committed to
Postgres so far ;-)
Fair point, but I've already said why I think the stakes for this
particular feature are pretty high.
I need to polish the chain conversion patch a bit and also add missing
support for redo, hash indexes etc. Support for hash indexes will need
overloading of ip_posid bits in the index tuple (since there are no free
bits left in hash tuples). I plan to work on that next and submit a fully
functional patch, hopefully before the commit-fest starts.
(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that nobody
has read those emails seriously or there is general acceptance to that
idea.. I am assuming latter :-))
I'm not sure about that. I'm not really sure I have an opinion on
that yet, without seeing the patch. The discussion upthread was a bit
vague:
"One idea is to free up 3 bits from ip_posid knowing that OffsetNumber
can never really need more than 13 bits with the other constraints in
place."
Not sure exactly what "the other constraints" are.
/me goes off, tries to figure it out.
If I'm reading the definition of MaxIndexTuplesPerPage correctly, it
thinks that the minimum number of bytes per index tuple is at least
16: I think sizeof(IndexTupleData) will be 8, so when you add 1 and
MAXALIGN, you get to 12, and then ItemIdData is another 4. So an 8k
page (2^13 bytes) could have, on a platform with MAXIMUM_ALIGNOF == 4,
as many as 2^9 tuples. To store more than 2^13 tuples, we'd need a
block size > 128k, but it seems 32k is the most we support. So that
seems OK, if I haven't gotten confused about the logic.
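Just to double-check the arithmetic, here is a throwaway program that uses the
rough per-tuple footprint described above rather than the real
MaxIndexTuplesPerPage definition; the 16-byte minimum is an assumption taken
from that estimate:

#include <stdio.h>

/*
 * Back-of-the-envelope check of the bit budget: with at least ~16 bytes per
 * index tuple (MAXALIGN'd IndexTupleData plus ItemIdData) and a block size
 * capped at 32kB, the largest possible offset number easily fits in 13 bits,
 * leaving 3 high bits of ip_posid free.
 */
int
main(void)
{
    int     blcksz = 32 * 1024;     /* largest supported block size */
    int     min_itup = 12 + 4;      /* assumed minimum footprint per tuple */
    int     max_offsets = blcksz / min_itup;
    int     bits = 0;

    for (int v = max_offsets; v > 0; v >>= 1)
        bits++;

    printf("max offsets per 32kB page: %d (needs %d bits; 13 available)\n",
           max_offsets, bits);
    return 0;
}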
I suppose the only other point of concern about stealing some bits
there is that it might make some operations a little more expensive,
because they've got to start masking out the high bits. But that's
*probably* negligible.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 26, 2017 at 2:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Fair point, but I've already said why I think the stakes for this
particular feature are pretty high.
I understand your concerns and am not trying to downplay them. I'm doing my
best to test the patch in different ways to ensure we can catch most of the
bugs before the patch is committed. Hopefully with additional reviews and
tests we can plug remaining holes, if any, and be in a comfortable state.
(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that nobody
has read those emails seriously or there is general acceptance to that
idea.. I am assuming the latter :-))
I'm not sure about that. I'm not really sure I have an opinion on
that yet, without seeing the patch. The discussion upthread was a bit
vague:
Attached is a complete set of rebased and finished patches. Patches 0002
and 0003 do what I have in mind as far as the OffsetNumber bits are concerned.
AFAICS this version is a fully functional implementation of WARM, ready for
serious review/test. The chain conversion is now fully functional and
tested with btrees. I've also added support for chain conversion in hash
indexes by overloading ip_posid high order bits. Even though there is a
free bit available in the btree index tuple header, the patch now uses the same
ip_posid bit even for btree indexes.
A short summary of all attached patches.
0000_interesting_attrs_v15.patch:
This is Alvaro's patch to refactor HeapSatisfiesHOTandKeyUpdate. We now
return a set of modified attributes and let the caller consume that
information in a way it wants. The main WARM patch uses this refactored API.
0001_track_root_lp_v15.patch:
This implements the logic to store the root offset of the HOT chain in the
t_ctid.ip_posid field. We use a free bit in the heap tuple header to mark that
a particular tuple is at the end of the chain and store the root offset in
the ip_posid. For pg_upgraded clusters, this information could be missing,
and we do the hard work of going through the page tuples to find the root
offset.
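A sketch of how a reader of the heap might consume this: the two macros are
the ones from the 0001 patch quoted earlier in the thread, while
heap_find_root_offset_by_scan() is an invented name standing in for the
fallback that walks the page when the tuple predates the new format:

/*
 * Illustration only: fetch the root line pointer of a HOT/WARM chain under
 * the 0001 patch.  HeapTupleHeaderHasRootOffset/GetRootOffset are the macros
 * shown earlier; the slow-path helper is hypothetical.
 */
static OffsetNumber
get_chain_root_offset(Page page, OffsetNumber offnum, HeapTupleHeader tuphdr)
{
    if (HeapTupleHeaderHasRootOffset(tuphdr))
        return HeapTupleHeaderGetRootOffset(tuphdr);    /* cached in t_ctid */

    /* Upgraded tuple without the flag: scan the page's HOT chains. */
    return heap_find_root_offset_by_scan(page, offnum);
}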
0002_clear_ip_posid_blkid_refs_v15.patch:
This is mostly a cleanup patch which removes direct references to ip_posid
and ip_blkid from various places and replace them with appropriate
ItemPointer[Get|Set][Offset|Block]Number macros.
0003_freeup_3bits_ip_posid_v15.patch:
This patch frees up the high order 3 bits from ip_posid and makes them
available for other uses. As noted, we only need 13 bits to represent
OffsetNumber and hence the high order bits are unused. This patch should
only be applied along with 0002_clear_ip_posid_blkid_refs_v15.patch
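To make the bit layout concrete, something along these lines; the macro names
below are illustrative only, not necessarily what 0003 actually defines:

/*
 * Illustrative only: since an OffsetNumber never needs more than 13 bits,
 * the top 3 bits of t_ctid.ip_posid can be reclaimed for flags.
 */
#define OFFSET_NUMBER_MASK      0x1FFF  /* low 13 bits: the offset number */
#define ITEMPOINTER_FLAG_MASK   0xE000  /* high 3 bits: available for flags */

#define ItemPointerGetOffsetNumberNoFlags(tid) \
    ((OffsetNumber) (ItemPointerGetOffsetNumber(tid) & OFFSET_NUMBER_MASK))

#define ItemPointerGetFlags(tid) \
    ((uint16) (ItemPointerGetOffsetNumber(tid) & ITEMPOINTER_FLAG_MASK))

#define ItemPointerSetFlags(tid, flags) \
    ItemPointerSetOffsetNumber((tid), \
        ItemPointerGetOffsetNumber(tid) | ((flags) & ITEMPOINTER_FLAG_MASK))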
0004_warm_updates_v15.patch:
This implements the main WARM logic, except for chain conversion (which is
implemented in the last patch of the series). It uses another free bit in
the heap tuple header to identify the WARM tuples. When the first WARM
update happens, the old and new versions of the tuple are marked with this
flag. All subsequent HOT tuples in the chain are also marked with this flag
so we never lose information about WARM updates, irrespective of whether it
commits or aborts. We then implement recheck logic to decide which index
pointer should return a tuple from the HOT chain.
WARM is currently supported for hash and btree indexes. If a table has an
index of any other type, WARM is disabled.
0005_warm_chain_conversion_v15.patch:
This patch implements the WARM chain conversion as discussed upthread and
also noted in the README.WARM. This patch requires yet another bit in the
heap tuple header. But since the bit is only set along with the
HEAP_WARM_TUPLE bit, we can safely reuse HEAP_MOVED_OFF bit for this
purpose. We also need a bit to distinguish two copies of index pointers to
know which pointer points to the pre-WARM-update HOT chain (Blue chain) and
which pointer points to post-WARM-update HOT chain (Red chain). We steal
this bit from t_tid.ip_posid field in the index tuple headers. As part of
this patch, I moved XLOG_HEAP2_MULTI_INSERT to RM_HEAP_ID (and renamed it
to XLOG_HEAP_MULTI_INSERT). While it's not necessary, I thought it will
allow us to restrict XLOG_HEAP_INIT_PAGE to RM_HEAP_ID and make that bit
available to define additional opcodes in RM_HEAP2_ID.
I've done some elaborate tests with these patches applied. I've primarily
used make-world, pgbench with additional indexes and the WARM stress test
(which was useful in catching the CIC bug) to test the feature. While it does
not mean there are no additional bugs, all bugs that were known to me are
fixed in this version. I'll continue to run more tests, especially around
crash recovery, when indexes are dropped and recreated and also do more
performance tests.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0000_interesting_attrs_v15.patch (application/octet-stream)
commit 8b8bf7805d661c7450f87e237bb9b68eeab465bc
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sun Jan 1 16:29:10 2017 +0530
Alvaro's patch on interesting attrs
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index af25836..74fb09c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,11 +96,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3455,6 +3452,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3472,9 +3471,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3501,21 +3497,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3536,7 +3541,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3562,6 +3567,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3573,10 +3582,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitiously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3815,6 +3821,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4119,7 +4127,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4134,7 +4142,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4282,13 +4292,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4322,7 +4334,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4367,114 +4379,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
- *
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
-
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
+ int attnum;
+ Bitmapset *modified = NULL;
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
-
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
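To make the effect of this refactoring concrete, here is a minimal standalone sketch of the new decision logic: the set of modified columns is computed once, and each of the three former satisfies_* answers becomes a simple overlap test against it. The bitset below is a plain unsigned int stand-in for PostgreSQL's Bitmapset, so this only illustrates the logic; it is not the heapam code itself.

#include <stdio.h>

/* Toy stand-in for Bitmapset: one bit per attribute number. */
typedef unsigned int attrset;

static int overlaps(attrset a, attrset b) { return (a & b) != 0; }

int main(void)
{
    attrset hot_attrs = 0x0F;      /* all indexed columns        */
    attrset key_attrs = 0x03;      /* columns usable as a "key"  */
    attrset id_attrs  = 0x01;      /* REPLICA IDENTITY columns   */

    /* Pretend the update changed column 3 only (bit 2). */
    attrset modified_attrs = 0x04;

    /* The same three tests heap_update now performs with bms_overlap(). */
    int weaker_lock_ok = !overlaps(modified_attrs, key_attrs);
    int hot_possible   = !overlaps(modified_attrs, hot_attrs);
    int log_old_key    =  overlaps(modified_attrs, id_attrs);

    printf("weaker lock: %d, HOT possible: %d, log old key: %d\n",
           weaker_lock_ok, hot_possible, log_old_key);
    return 0;
}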
Attachment: 0005_warm_chain_conversion_v15.patch (application/octet-stream)
commit fb4bc555adc078f4c0fb6b808d1046d8212a90ee
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Tue Feb 28 10:39:01 2017 +0530
Warm chain conversion - v15
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 04abd0f..ff50361 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -88,7 +88,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
while (itup < itupEnd)
{
/* Do we have to delete this tuple? */
- if (callback(&itup->heapPtr, callback_state))
+ if (callback(&itup->heapPtr, false, callback_state) == IBDCR_DELETE)
{
/* Yes; adjust count of tuples that will be left on page */
BloomPageGetOpaque(page)->maxoff--;
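This hunk, and the gin/gist/hash ones below, all follow from a single interface change: the bulk-delete callback now receives an is_red flag and returns a three-valued result instead of a bool. The genam.h change itself is not part of this excerpt, so the sketch below of the assumed contract is inferred from the call sites and uses the IBDCR_* names as they appear in the patch.

#include <stdbool.h>

/* Assumed shape of the revised callback contract (the real declaration
 * would live in src/include/access/genam.h). */
typedef struct ItemPointerData ItemPointerData;    /* opaque stand-in */
typedef ItemPointerData *ItemPointer;

typedef enum
{
    IBDCR_KEEP,         /* leave the index tuple alone              */
    IBDCR_DELETE,       /* remove the index tuple                   */
    IBDCR_COLOR_BLUE    /* keep the tuple, but clear its red flag   */
} IndexBulkDeleteCallbackResult;

typedef IndexBulkDeleteCallbackResult (*IndexBulkDeleteCallback)
            (ItemPointer itemptr, bool is_red, void *state);

/* A callback that keeps everything, in the spirit of dummy_callback
 * in spgvacuum.c further down. */
IndexBulkDeleteCallbackResult
keep_everything(ItemPointer itemptr, bool is_red, void *state)
{
    (void) itemptr; (void) is_red; (void) state;
    return IBDCR_KEEP;
}

AMs that do not know about red pointers (bloom, gin, gist, spgist) simply pass is_red = false and treat anything other than IBDCR_DELETE as "keep".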
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index c9ccfee..8ed71c5 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -56,7 +56,8 @@ ginVacuumItemPointers(GinVacuumState *gvs, ItemPointerData *items,
*/
for (i = 0; i < nitem; i++)
{
- if (gvs->callback(items + i, gvs->callback_state))
+ if (gvs->callback(items + i, false, gvs->callback_state) ==
+ IBDCR_DELETE)
{
gvs->result->tuples_removed += 1;
if (!tmpitems)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..0955db6 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -202,7 +202,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
iid = PageGetItemId(page, i);
idxtuple = (IndexTuple) PageGetItem(page, iid);
- if (callback(&(idxtuple->t_tid), callback_state))
+ if (callback(&(idxtuple->t_tid), false, callback_state) ==
+ IBDCR_DELETE)
todelete[ntodelete++] = i;
else
stats->num_index_tuples += 1;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6645160..c8a1f43 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -73,6 +73,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambuild = hashbuild;
amroutine->ambuildempty = hashbuildempty;
amroutine->aminsert = hashinsert;
+ amroutine->amwarminsert = hashwarminsert;
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
@@ -231,11 +232,11 @@ hashbuildCallback(Relation index,
* Hash on the heap tuple's key, form an index tuple with hash code.
* Find the appropriate location for the new tuple, and put it there.
*/
-bool
-hashinsert(Relation rel, Datum *values, bool *isnull,
+static bool
+hashinsert_internal(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo, bool warm_update)
{
Datum index_values[1];
bool index_isnull[1];
@@ -251,6 +252,11 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
+ if (warm_update)
+ ItemPointerSetFlags(&itup->t_tid, HASH_INDEX_RED_POINTER);
+ else
+ ItemPointerClearFlags(&itup->t_tid);
+
_hash_doinsert(rel, itup);
pfree(itup);
@@ -258,6 +264,26 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
return false;
}
+bool
+hashinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return hashinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, false);
+}
+
+bool
+hashwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return hashinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, true);
+
+}
/*
* hashgettuple() -- Get the next tuple in the scan.
@@ -738,6 +764,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
Page page;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
+ OffsetNumber colorblue[MaxOffsetNumber];
+ int ncolorblue = 0;
bool retain_pin = false;
vacuum_delay_point();
@@ -755,20 +783,35 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
IndexTuple itup;
Bucket bucket;
bool kill_tuple = false;
+ bool color_tuple = false;
+ int flags;
+ bool is_red;
+ IndexBulkDeleteCallbackResult result;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
+ flags = ItemPointerGetFlags(&itup->t_tid);
+ is_red = ((flags & HASH_INDEX_RED_POINTER) != 0);
+
/*
* To remove the dead tuples, we strictly want to rely on results
* of callback function. refer btvacuumpage for detailed reason.
*/
- if (callback && callback(htup, callback_state))
+ if (callback)
{
- kill_tuple = true;
- if (tuples_removed)
- *tuples_removed += 1;
+ result = callback(htup, is_red, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ kill_tuple = true;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (result == IBDCR_COLOR_BLUE)
+ {
+ color_tuple = true;
+ }
}
else if (split_cleanup)
{
@@ -791,6 +834,12 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
}
}
+ if (color_tuple)
+ {
+ /* color the pointer blue */
+ colorblue[ncolorblue++] = offno;
+ }
+
if (kill_tuple)
{
/* mark the item for deletion */
@@ -815,9 +864,24 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
/*
* Apply deletions, advance to next page and write page if needed.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || ncolorblue > 0)
{
- PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the
+ * array and hence offnums recorded earlier won't make any sense
+ * after PageIndexMultiDelete is called.
+ */
+ if (ncolorblue > 0)
+ _hash_color_items(page, colorblue, ncolorblue);
+
+ /*
+ * And delete the deletable items
+ */
+ if (ndeletable > 0)
+ PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
MarkBufferDirty(buf);
}
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 00f3ea8..6af1fb9 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1333,3 +1333,17 @@ _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey, int access,
return buf;
}
+
+void _hash_color_items(Page page, OffsetNumber *coloritemnos,
+ uint16 ncoloritems)
+{
+ int i;
+ IndexTuple itup;
+
+ for (i = 0; i < ncoloritems; i++)
+ {
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, coloritemnos[i]));
+ ItemPointerClearFlags(&itup->t_tid);
+ }
+}
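The red/blue coloring lives entirely in spare flag bits of the index tuple's t_tid: a WARM insert sets HASH_INDEX_RED_POINTER (or BTREE_INDEX_RED_POINTER below) and vacuum later clears it via _hash_color_items / _bt_color_items. The ItemPointerSetFlags/ItemPointerGetFlags/ItemPointerClearFlags helpers come from the earlier patches in this series and are not shown here, so the following is only a toy model of the insert-red / color-blue cycle, using an explicit flags field in place of the real ItemPointerData bits.

#include <stdio.h>
#include <stdint.h>

#define INDEX_RED_POINTER 0x01   /* stand-in for HASH/BTREE_INDEX_RED_POINTER */

typedef struct
{
    uint32_t heap_tid;           /* placeholder for the real ItemPointerData */
    uint16_t flags;              /* stand-in for the ItemPointer flag bits   */
} toy_index_tuple;

static void warm_insert(toy_index_tuple *itup) { itup->flags |= INDEX_RED_POINTER; }
static void color_blue(toy_index_tuple *itup)  { itup->flags &= ~INDEX_RED_POINTER; }
static int  pointer_is_red(const toy_index_tuple *itup)
{
    return (itup->flags & INDEX_RED_POINTER) != 0;
}

int main(void)
{
    toy_index_tuple itup = { 42, 0 };

    warm_insert(&itup);          /* amwarminsert path: new pointer starts red */
    printf("after WARM insert: red=%d\n", pointer_is_red(&itup));

    color_blue(&itup);           /* vacuum: _hash_color_items/_bt_color_items */
    printf("after vacuum:      red=%d\n", pointer_is_red(&itup));
    return 0;
}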
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9c4522a..efdd439 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1958,17 +1958,32 @@ heap_fetch(Relation relation,
}
/*
- * Check if the HOT chain containing this tid is actually a WARM chain.
- * Note that even if the WARM update ultimately aborted, we still must do a
- * recheck because the failing UPDATE when have inserted created index entries
- * which are now stale, but still referencing this chain.
+ * Check status of a (possibly) WARM chain.
+ *
+ * This function looks at a HOT/WARM chain starting at tid and returns a
+ * bitmask of information. We only follow the chain as long as it's known to
+ * be a valid HOT chain. Information returned by the function consists of:
+ *
+ * HCWC_WARM_TUPLE - a warm tuple is found somewhere in the chain. Note that
+ * when a tuple is WARM updated, both old and new versions
+ * of the tuple are treated as WARM tuples
+ *
+ * HCWC_RED_TUPLE - a warm tuple part of the Red chain is found somewhere in
+ * the chain.
+ *
+ * HCWC_BLUE_TUPLE - a warm tuple part of the Blue chain is found somewhere in
+ * the chain.
+ *
+ * If stop_at_warm is true, we stop when the first WARM tuple is found and
+ * return information collected so far.
*/
-static bool
-hot_check_warm_chain(Page dp, ItemPointer tid)
+HeapCheckWarmChainStatus
+heap_check_warm_chain(Page dp, ItemPointer tid, bool stop_at_warm)
{
- TransactionId prev_xmax = InvalidTransactionId;
- OffsetNumber offnum;
- HeapTupleData heapTuple;
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ HeapCheckWarmChainStatus status = 0;
offnum = ItemPointerGetOffsetNumber(tid);
heapTuple.t_self = *tid;
@@ -1985,7 +2000,16 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
/* check for unused, dead, or redirected items */
if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
break;
+ }
heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
@@ -2000,13 +2024,118 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
break;
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ /* We found a WARM tuple */
+ status |= HCWC_WARM_TUPLE;
+
+ /*
+ * If we've been told to stop at the first WARM tuple, just return
+ * whatever information collected so far.
+ */
+ if (stop_at_warm)
+ return status;
+
+ /*
+ * If it's not a Red tuple, then it must be a Blue tuple. Set the
+ * appropriate bit.
+ */
+ if (HeapTupleHeaderIsWarmRed(heapTuple.t_data))
+ status |= HCWC_RED_TUPLE;
+ else
+ status |= HCWC_BLUE_TUPLE;
+ }
+ else
+ /* Must be a tuple belonging to the Blue chain */
+ status |= HCWC_BLUE_TUPLE;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * The chain can't continue if this tuple stores the root line pointer
+ * offset.
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* Reached the end of the chain; return whatever we collected */
+ return status;
+}
+
+/*
+ * Scan through the WARM chain starting at tid and reset all WARM related
+ * flags. At the end, the chain will have all characteristics of a regular HOT
+ * chain.
+ *
+ * Return the number of cleared offnums. Cleared offnums are returned in the
+ * passed-in cleared_offnums array. The caller must ensure that the array is
+ * large enough to hold the maximum number of offnums that can be cleared by
+ * this invocation of heap_clear_warm_chain().
+ */
+int
+heap_clear_warm_chain(Page dp, ItemPointer tid, OffsetNumber *cleared_offnums)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ int num_cleared = 0;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
+ break;
+ }
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
/*
- * Presence of either WARM or WARM updated tuple signals possible
- * breakage and the caller must recheck tuple returned from this chain
- * for index satisfaction
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Clear WARM and Red flags
*/
if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
- return true;
+ {
+ HeapTupleHeaderClearHeapWarmTuple(heapTuple.t_data);
+ HeapTupleHeaderClearWarmRed(heapTuple.t_data);
+ cleared_offnums[num_cleared++] = offnum;
+ }
/*
* Check to see if HOT chain continues past this tuple; if so fetch
@@ -2025,8 +2154,7 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
}
- /* All OK. No need to recheck */
- return false;
+ return num_cleared;
}
/*
@@ -2135,7 +2263,11 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* possible improvements here
*/
if (recheck && *recheck == false)
- *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+ {
+ HeapCheckWarmChainStatus status;
+ status = heap_check_warm_chain(dp, &heapTuple->t_self, true);
+ *recheck = HCWC_IS_WARM(status);
+ }
/*
* When first_call is true (and thus, skip is initially false) we'll
@@ -2888,7 +3020,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
XLogRecPtr recptr;
xl_heap_multi_insert *xlrec;
- uint8 info = XLOG_HEAP2_MULTI_INSERT;
+ uint8 info = XLOG_HEAP_MULTI_INSERT;
char *tupledata;
int totaldatalen;
char *scratchptr = scratch;
@@ -2985,7 +3117,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* filtering by origin on a row level is much more efficient */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- recptr = XLogInsert(RM_HEAP2_ID, info);
+ recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
}
@@ -3409,7 +3541,9 @@ l1:
}
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tp.t_data->t_infomask |= new_infomask;
tp.t_data->t_infomask2 |= new_infomask2;
@@ -4172,7 +4306,9 @@ l2:
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleClearHotUpdated(&oldtup);
/* ... and store info about transaction updating this tuple */
@@ -4419,6 +4555,16 @@ l2:
}
/*
+ * If the old tuple is already a member of the Red chain, mark the new
+ * tuple with the same flag
+ */
+ if (HeapTupleIsHeapWarmTupleRed(&oldtup))
+ {
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+ }
+
+ /*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
* Usually this information will be available in the corresponding
@@ -4435,12 +4581,20 @@ l2:
/* Mark the old tuple as HOT-updated */
HeapTupleSetHotUpdated(&oldtup);
HeapTupleSetHeapWarmTuple(&oldtup);
+
/* And mark the new tuple as heap-only */
HeapTupleSetHeapOnly(heaptup);
+ /* Mark the new tuple as WARM tuple */
HeapTupleSetHeapWarmTuple(heaptup);
+ /* This update also starts a Red chain */
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ Assert(!HeapTupleIsHeapWarmTupleRed(&oldtup));
+
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
HeapTupleSetHeapWarmTuple(newtup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
else
@@ -4459,6 +4613,8 @@ l2:
HeapTupleClearHeapOnly(newtup);
HeapTupleClearHeapWarmTuple(heaptup);
HeapTupleClearHeapWarmTuple(newtup);
+ HeapTupleClearHeapWarmTupleRed(heaptup);
+ HeapTupleClearHeapWarmTupleRed(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4477,7 +4633,9 @@ l2:
HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/* ... and store info about transaction updating this tuple */
Assert(TransactionIdIsValid(xmax_old_tuple));
@@ -6398,7 +6556,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
PageSetPrunable(page, RecentGlobalXmin);
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/*
@@ -6972,7 +7132,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* Old-style VACUUM FULL is gone, but we have to keep this code as long as
* we support having MOVED_OFF/MOVED_IN tuples in the database.
*/
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
@@ -6991,7 +7151,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* have failed; whereas a non-dead MOVED_IN tuple must mean the
* xvac transaction succeeded.
*/
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
frz->frzflags |= XLH_INVALID_XVAC;
else
frz->frzflags |= XLH_FREEZE_XVAC;
@@ -7461,7 +7621,7 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid))
@@ -7544,7 +7704,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid) &&
@@ -7570,7 +7730,7 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId xmax = HeapTupleHeaderGetUpdateXid(tuple);
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
if (TransactionIdPrecedes(*latestRemovedXid, xvac))
*latestRemovedXid = xvac;
@@ -7619,6 +7779,36 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
}
/*
+ * Perform XLogInsert for a heap-warm-clear operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_warmclear(Relation reln, Buffer buffer,
+ OffsetNumber *cleared, int ncleared)
+{
+ xl_heap_warmclear xlrec;
+ XLogRecPtr recptr;
+
+ /* Caller should not call me on a non-WAL-logged relation */
+ Assert(RelationNeedsWAL(reln));
+
+ xlrec.ncleared = ncleared;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapWarmClear);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ if (ncleared > 0)
+ XLogRegisterBufData(0, (char *) cleared,
+ ncleared * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_WARMCLEAR);
+
+ return recptr;
+}
+
+/*
* Perform XLogInsert for a heap-clean operation. Caller must already
* have modified the buffer and marked it dirty.
*
@@ -8277,6 +8467,60 @@ heap_xlog_clean(XLogReaderState *record)
XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
}
+
+/*
+ * Handles HEAP2_WARMCLEAR record type
+ */
+static void
+heap_xlog_warmclear(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_heap_warmclear *xlrec = (xl_heap_warmclear *) XLogRecGetData(record);
+ Buffer buffer;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ XLogRedoAction action;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+ /*
+ * If we have a full-page image, restore it (using a cleanup lock) and
+ * we're done.
+ */
+ action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
+ &buffer);
+ if (action == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ OffsetNumber *cleared;
+ int ncleared;
+ Size datalen;
+ int i;
+
+ cleared = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+ ncleared = xlrec->ncleared;
+
+ for (i = 0; i < ncleared; i++)
+ {
+ ItemId lp;
+ OffsetNumber offnum = cleared[i];
+ HeapTupleData heapTuple;
+
+ lp = PageGetItemId(page, offnum);
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
+
+ HeapTupleHeaderClearHeapWarmTuple(heapTuple.t_data);
+ HeapTupleHeaderClearWarmRed(heapTuple.t_data);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+}
+
/*
* Replay XLOG_HEAP2_VISIBLE record.
*
@@ -8523,7 +8767,9 @@ heap_xlog_delete(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleHeaderClearHotUpdated(htup);
fix_infomask_from_infobits(xlrec->infobits_set,
@@ -9186,7 +9432,9 @@ heap_xlog_lock(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
@@ -9265,7 +9513,9 @@ heap_xlog_lock_updated(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
@@ -9334,6 +9584,9 @@ heap_redo(XLogReaderState *record)
case XLOG_HEAP_INSERT:
heap_xlog_insert(record);
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ heap_xlog_multi_insert(record);
+ break;
case XLOG_HEAP_DELETE:
heap_xlog_delete(record);
break;
@@ -9362,7 +9615,7 @@ heap2_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- switch (info & XLOG_HEAP_OPMASK)
+ switch (info & XLOG_HEAP2_OPMASK)
{
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(record);
@@ -9376,9 +9629,6 @@ heap2_redo(XLogReaderState *record)
case XLOG_HEAP2_VISIBLE:
heap_xlog_visible(record);
break;
- case XLOG_HEAP2_MULTI_INSERT:
- heap_xlog_multi_insert(record);
- break;
case XLOG_HEAP2_LOCK_UPDATED:
heap_xlog_lock_updated(record);
break;
@@ -9392,6 +9642,9 @@ heap2_redo(XLogReaderState *record)
case XLOG_HEAP2_REWRITE:
heap_xlog_logical_rewrite(record);
break;
+ case XLOG_HEAP2_WARMCLEAR:
+ heap_xlog_warmclear(record);
+ break;
default:
elog(PANIC, "heap2_redo: unknown op code %u", info);
}
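heap_check_warm_chain() above returns a bitmask that callers inspect through HCWC_IS_WARM/HCWC_IS_ALL_RED/HCWC_IS_ALL_BLUE; those macros are defined in a header that is not part of this excerpt. The sketch below shows definitions consistent with how the bits are set and tested in this patch; the bit names come from the code above, while the values and macro bodies are assumed.

typedef int HeapCheckWarmChainStatus;

#define HCWC_WARM_TUPLE   0x0001   /* some tuple in the chain is WARM         */
#define HCWC_RED_TUPLE    0x0002   /* a tuple of the Red chain was seen       */
#define HCWC_BLUE_TUPLE   0x0004   /* a tuple of the Blue chain was seen      */

/* Assumed test macros, matching how lazy_scan_heap classifies chains:
 * all-Red and all-Blue chains are convertible, mixed chains are not. */
#define HCWC_IS_WARM(s)      (((s) & HCWC_WARM_TUPLE) != 0)
#define HCWC_IS_ALL_RED(s)   (HCWC_IS_WARM(s) && ((s) & HCWC_BLUE_TUPLE) == 0)
#define HCWC_IS_ALL_BLUE(s)  (HCWC_IS_WARM(s) && ((s) & HCWC_RED_TUPLE) == 0)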
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 19e7048..47b01eb 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1620,7 +1620,8 @@ toast_save_datum(Relation rel, Datum value,
toastrel,
toastidxs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- NULL);
+ NULL,
+ false);
}
/*
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f56c58f..e8027f8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -199,7 +199,8 @@ index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
RELATION_CHECKS;
CHECK_REL_PROCEDURE(aminsert);
@@ -209,6 +210,12 @@ index_insert(Relation indexRelation,
(HeapTuple) NULL,
InvalidBuffer);
+ if (warm_update)
+ {
+ Assert(indexRelation->rd_amroutine->amwarminsert != NULL);
+ return indexRelation->rd_amroutine->amwarminsert(indexRelation, values,
+ isnull, heap_t_ctid, heapRelation, checkUnique, indexInfo);
+ }
return indexRelation->rd_amroutine->aminsert(indexRelation, values, isnull,
heap_t_ctid, heapRelation,
checkUnique, indexInfo);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f815fd4..7959155 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -766,11 +766,12 @@ _bt_page_recyclable(Page page)
}
/*
- * Delete item(s) from a btree page during VACUUM.
+ * Delete item(s) and color item(s) blue on a btree page during VACUUM.
*
* This must only be used for deleting leaf items. Deleting an item on a
* non-leaf page has to be done as part of an atomic action that includes
- * deleting the page it points to.
+ * deleting the page it points to. We don't ever color pointers on a non-leaf
+ * page.
*
* This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array.
@@ -786,9 +787,9 @@ _bt_page_recyclable(Page page)
* ensure correct locking.
*/
void
-_bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed)
+_bt_handleitems_vacuum(Relation rel, Buffer buf,
+ OffsetNumber *delitemnos, int ndelitems,
+ OffsetNumber *coloritemnos, int ncoloritems)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
@@ -796,9 +797,20 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the array and
+ * hence offnums recorded earlier won't make any sense after
+ * PageIndexMultiDelete is called..
+ */
+ if (ncoloritems > 0)
+ _bt_color_items(page, coloritemnos, ncoloritems);
+
/* Fix the page */
- if (nitems > 0)
- PageIndexMultiDelete(page, itemnos, nitems);
+ if (ndelitems > 0)
+ PageIndexMultiDelete(page, delitemnos, ndelitems);
/*
* We can clear the vacuum cycle ID since this page has certainly been
@@ -824,7 +836,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_vacuum xlrec_vacuum;
- xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.ndelitems = ndelitems;
+ xlrec_vacuum.ncoloritems = ncoloritems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -835,8 +848,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* is. When XLogInsert stores the whole buffer, the offsets array
* need not be stored too.
*/
- if (nitems > 0)
- XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ if (ndelitems > 0)
+ XLogRegisterBufData(0, (char *) delitemnos, ndelitems * sizeof(OffsetNumber));
+
+ if (ncoloritems > 0)
+ XLogRegisterBufData(0, (char *) coloritemnos, ncoloritems * sizeof(OffsetNumber));
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
@@ -1882,3 +1898,18 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
return true;
}
+
+void
+_bt_color_items(Page page, OffsetNumber *coloritemnos, uint16 ncoloritems)
+{
+ int i;
+ ItemId itemid;
+ IndexTuple itup;
+
+ for (i = 0; i < ncoloritems; i++)
+ {
+ itemid = PageGetItemId(page, coloritemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ ItemPointerClearFlags(&itup->t_tid);
+ }
+}
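Both the hash and btree vacuum paths color red pointers blue before deleting anything on the page, because PageIndexMultiDelete compacts the item array and the offsets recorded during the scan stop pointing at the items they were recorded for. A toy demonstration of that ordering hazard, using a plain array in place of a page:

#include <stdio.h>

int main(void)
{
    int items[6] = { 10, 11, 12, 13, 14, 15 };
    int nitems = 6;
    int to_color = 4;        /* offset recorded during the page scan    */
    int to_delete = 1;       /* another offset recorded during the scan */

    /* Correct order: use the recorded offset while it is still valid. */
    printf("coloring offset %d (value %d)\n", to_color, items[to_color]);

    /* Now delete: the survivors after the hole shift left by one slot. */
    for (int i = to_delete; i < nitems - 1; i++)
        items[i] = items[i + 1];
    nitems--;

    /* The recorded offset 4 now names a different item (15, not 14). */
    printf("offset %d now holds value %d\n", to_color, items[to_color]);
    return 0;
}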
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 952ed8f..92f490e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
amroutine->aminsert = btinsert;
+ amroutine->amwarminsert = btwarminsert;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -317,11 +318,12 @@ btbuildempty(Relation index)
* Descend the tree recursively, find the appropriate location for our
* new tuple, and put it there.
*/
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
+static bool
+btinsert_internal(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
bool result;
IndexTuple itup;
@@ -330,6 +332,11 @@ btinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
+ if (warm_update)
+ ItemPointerSetFlags(&itup->t_tid, BTREE_INDEX_RED_POINTER);
+ else
+ ItemPointerClearFlags(&itup->t_tid);
+
result = _bt_doinsert(rel, itup, checkUnique, heapRel);
pfree(itup);
@@ -337,6 +344,26 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, false);
+}
+
+bool
+btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, true);
+}
+
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -1106,7 +1133,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_handleitems_vacuum(rel, buf, NULL, 0, NULL, 0);
_bt_relbuf(rel, buf);
}
@@ -1204,6 +1231,8 @@ restart:
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ OffsetNumber colorblue[MaxOffsetNumber];
+ int ncolorblue;
OffsetNumber offnum,
minoff,
maxoff;
@@ -1242,7 +1271,7 @@ restart:
* Scan over all items to see which ones need deleted according to the
* callback function.
*/
- ndeletable = 0;
+ ndeletable = ncolorblue = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1253,6 +1282,9 @@ restart:
{
IndexTuple itup;
ItemPointer htup;
+ int flags;
+ bool is_red = false;
+ IndexBulkDeleteCallbackResult result;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
@@ -1279,16 +1311,36 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
+ flags = ItemPointerGetFlags(&itup->t_tid);
+ is_red = ((flags & BTREE_INDEX_RED_POINTER) != 0);
+
+ if (is_red)
+ stats->num_red_pointers++;
+ else
+ stats->num_blue_pointers++;
+
+ result = callback(htup, is_red, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ if (is_red)
+ stats->red_pointers_removed++;
+ else
+ stats->blue_pointers_removed++;
deletable[ndeletable++] = offnum;
+ }
+ else if (result == IBDCR_COLOR_BLUE)
+ {
+ colorblue[ncolorblue++] = offnum;
+ }
}
}
/*
- * Apply any needed deletes. We issue just one _bt_delitems_vacuum()
- * call per page, so as to minimize WAL traffic.
+ * Apply any needed deletes and coloring. We issue just one
+ * _bt_handleitems_vacuum() call per page, so as to minimize WAL
+ * traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || ncolorblue > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1304,8 +1356,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
- vstate->lastBlockVacuumed);
+ _bt_handleitems_vacuum(rel, buf, deletable, ndeletable,
+ colorblue, ncolorblue);
/*
* Remember highest leaf page number we've issued a
@@ -1315,6 +1367,7 @@ restart:
vstate->lastBlockVacuumed = blkno;
stats->tuples_removed += ndeletable;
+ stats->pointers_colored += ncolorblue;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index ac60db0..916c76e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -390,83 +390,9 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
/*
- * This section of code is thought to be no longer needed, after analysis
- * of the calling paths. It is retained to allow the code to be reinstated
- * if a flaw is revealed in that thinking.
- *
- * If we are running non-MVCC scans using this index we need to do some
- * additional work to ensure correctness, which is known as a "pin scan"
- * described in more detail in next paragraphs. We used to do the extra
- * work in all cases, whereas we now avoid that work in most cases. If
- * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
- * additional work required for the pin scan.
- *
- * Avoiding this extra work is important since it requires us to touch
- * every page in the index, so is an O(N) operation. Worse, it is an
- * operation performed in the foreground during redo, so it delays
- * replication directly.
- *
- * If queries might be active then we need to ensure every leaf page is
- * unpinned between the lastBlockVacuumed and the current block, if there
- * are any. This prevents replay of the VACUUM from reaching the stage of
- * removing heap tuples while there could still be indexscans "in flight"
- * to those particular tuples for those scans which could be confused by
- * finding new tuples at the old TID locations (see nbtree/README).
- *
- * It might be worth checking if there are actually any backends running;
- * if not, we could just skip this.
- *
- * Since VACUUM can visit leaf pages out-of-order, it might issue records
- * with lastBlockVacuumed >= block; that's not an error, it just means
- * nothing to do now.
- *
- * Note: since we touch all pages in the range, we will lock non-leaf
- * pages, and also any empty (all-zero) pages that may be in the index. It
- * doesn't seem worth the complexity to avoid that. But it's important
- * that HotStandbyActiveInReplay() will not return true if the database
- * isn't yet consistent; so we need not fear reading still-corrupt blocks
- * here during crash recovery.
- */
- if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
- {
- RelFileNode thisrnode;
- BlockNumber thisblkno;
- BlockNumber blkno;
-
- XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
- for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
- {
- /*
- * We use RBM_NORMAL_NO_LOG mode because it's not an error
- * condition to see all-zero pages. The original btvacuumpage
- * scan would have skipped over all-zero pages, noting them in FSM
- * but not bothering to initialize them just yet; so we mustn't
- * throw an error here. (We could skip acquiring the cleanup lock
- * if PageIsNew, but it's probably not worth the cycles to test.)
- *
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned. Optimizing
- * this is now moot, since in most cases we avoid the scan.
- */
- buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
- RBM_NORMAL_NO_LOG);
- if (BufferIsValid(buffer))
- {
- LockBufferForCleanup(buffer);
- UnlockReleaseBuffer(buffer);
- }
- }
- }
-#endif
-
- /*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
* page. See nbtree/README for details.
*/
@@ -482,19 +408,30 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ OffsetNumber *offnums = (OffsetNumber *) ptr;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the
+ * array and hence offnums recorded earlier won't make any sense
+ * after PageIndexMultiDelete is called.
+ */
+ if (xlrec->ncoloritems > 0)
+ _bt_color_items(page, offnums + xlrec->ndelitems,
+ xlrec->ncoloritems);
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /*
+ * And handle the deleted items too
+ */
+ if (xlrec->ndelitems > 0)
+ PageIndexMultiDelete(page, offnums, xlrec->ndelitems);
}
/*
* Mark the page as not containing any LP_DEAD items --- see comments
- * in _bt_delitems_vacuum().
+ * in _bt_handleitems_vacuum().
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
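With lastBlockVacuumed gone, the xl_btree_vacuum record carries the two counts instead, and the delete and color offsets are registered as one concatenated array: the first ndelitems entries are deletions, the following ncoloritems entries are colorings, which is exactly how the redo code above splits them. A hedged sketch of that layout (the real struct change is in nbtxlog.h, which is not shown here; the field types are assumed):

#include <stdint.h>

typedef uint16_t OffsetNumber;          /* stand-in for the real typedef */

typedef struct xl_btree_vacuum_sketch
{
    uint16_t ndelitems;                 /* offsets to delete                 */
    uint16_t ncoloritems;               /* offsets whose red flag is cleared */
    /* OffsetNumber offsets[ndelitems + ncoloritems] follows as block data */
} xl_btree_vacuum_sketch;

/* How redo partitions the single registered offset array. */
void
redo_split(const xl_btree_vacuum_sketch *xlrec, const OffsetNumber *offnums,
           const OffsetNumber **delitems, const OffsetNumber **coloritems)
{
    *delitems   = offnums;                        /* first ndelitems entries  */
    *coloritems = offnums + xlrec->ndelitems;     /* then ncoloritems entries */
}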
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 44d2d63..d373e61 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -44,6 +44,12 @@ heap_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "off %u", xlrec->offnum);
}
+ else if (info == XLOG_HEAP_MULTI_INSERT)
+ {
+ xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
+
+ appendStringInfo(buf, "%d tuples", xlrec->ntuples);
+ }
else if (info == XLOG_HEAP_DELETE)
{
xl_heap_delete *xlrec = (xl_heap_delete *) rec;
@@ -102,7 +108,7 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
char *rec = XLogRecGetData(record);
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- info &= XLOG_HEAP_OPMASK;
+ info &= XLOG_HEAP2_OPMASK;
if (info == XLOG_HEAP2_CLEAN)
{
xl_heap_clean *xlrec = (xl_heap_clean *) rec;
@@ -129,12 +135,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "cutoff xid %u flags %d",
xlrec->cutoff_xid, xlrec->flags);
}
- else if (info == XLOG_HEAP2_MULTI_INSERT)
- {
- xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
-
- appendStringInfo(buf, "%d tuples", xlrec->ntuples);
- }
else if (info == XLOG_HEAP2_LOCK_UPDATED)
{
xl_heap_lock_updated *xlrec = (xl_heap_lock_updated *) rec;
@@ -171,6 +171,12 @@ heap_identify(uint8 info)
case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE:
id = "INSERT+INIT";
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ id = "MULTI_INSERT";
+ break;
+ case XLOG_HEAP_MULTI_INSERT | XLOG_HEAP_INIT_PAGE:
+ id = "MULTI_INSERT+INIT";
+ break;
case XLOG_HEAP_DELETE:
id = "DELETE";
break;
@@ -219,12 +225,6 @@ heap2_identify(uint8 info)
case XLOG_HEAP2_VISIBLE:
id = "VISIBLE";
break;
- case XLOG_HEAP2_MULTI_INSERT:
- id = "MULTI_INSERT";
- break;
- case XLOG_HEAP2_MULTI_INSERT | XLOG_HEAP_INIT_PAGE:
- id = "MULTI_INSERT+INIT";
- break;
case XLOG_HEAP2_LOCK_UPDATED:
id = "LOCK_UPDATED";
break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index fbde9d6..0e9a2eb 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -48,8 +48,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "ndelitems %u, ncoloritems %u",
+ xlrec->ndelitems, xlrec->ncoloritems);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cce9b3f..5343b10 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -155,7 +155,8 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
{
Assert(ItemPointerIsValid(&lt->heapPtr));
- if (bds->callback(&lt->heapPtr, bds->callback_state))
+ if (bds->callback(&lt->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
deletable[i] = true;
@@ -425,7 +426,8 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
{
Assert(ItemPointerIsValid(&lt->heapPtr));
- if (bds->callback(&lt->heapPtr, bds->callback_state))
+ if (bds->callback(&lt->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
toDelete[xlrec.nDelete] = i;
@@ -902,10 +904,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/* Dummy callback to delete no tuples during spgvacuumcleanup */
-static bool
-dummy_callback(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+dummy_callback(ItemPointer itemptr, bool is_red, void *state)
{
- return false;
+ return IBDCR_KEEP;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bba52ec..ab37b43 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -115,7 +115,7 @@ static void IndexCheckExclusion(Relation heapRelation,
IndexInfo *indexInfo);
static inline int64 itemptr_encode(ItemPointer itemptr);
static inline void itemptr_decode(ItemPointer itemptr, int64 encoded);
-static bool validate_index_callback(ItemPointer itemptr, void *opaque);
+static IndexBulkDeleteCallbackResult validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque);
static void validate_index_heapscan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
@@ -2949,15 +2949,15 @@ itemptr_decode(ItemPointer itemptr, int64 encoded)
/*
* validate_index_callback - bulkdelete callback to collect the index TIDs
*/
-static bool
-validate_index_callback(ItemPointer itemptr, void *opaque)
+static IndexBulkDeleteCallbackResult
+validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque)
{
v_i_state *state = (v_i_state *) opaque;
int64 encoded = itemptr_encode(itemptr);
tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
state->itups += 1;
- return false; /* never actually delete anything */
+ return IBDCR_KEEP; /* never actually delete anything */
}
/*
@@ -3178,7 +3178,8 @@ validate_index_heapscan(Relation heapRelation,
heapRelation,
indexInfo->ii_Unique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ false);
state->tups_inserted += 1;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index e5355a8..5b6efcf 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -172,7 +172,8 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ warm_update);
}
ExecDropSingleTupleTableSlot(slot);
@@ -222,7 +223,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup, false, NULL);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index d9c0fe7..330b661 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -168,7 +168,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
*/
index_insert(indexRel, values, isnull, &(new_row->t_self),
trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
- indexInfo);
+ indexInfo,
+ false);
}
else
{
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 1388be1..e5d5ca0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -104,6 +104,25 @@
*/
#define PREFETCH_SIZE ((BlockNumber) 32)
+/*
+ * Structure to track WARM chains that can be converted into HOT chains during
+ * this run.
+ *
+ * To reduce space requirement, we're using bitfields. But the way things are
+ * laid down, we're still wasting 1-byte per candidate chain.
+ */
+typedef struct LVRedBlueChain
+{
+ ItemPointerData chain_tid; /* root of the chain */
+ uint8 is_red_chain:2; /* is the WARM chain complete red ? */
+ uint8 keep_warm_chain:2; /* this chain can't be cleared of WARM
+ * tuples */
+ uint8 num_blue_pointers:2;/* number of blue pointers found so
+ * far */
+ uint8 num_red_pointers:2; /* number of red pointers found so far
+ * in the current index */
+} LVRedBlueChain;
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -121,6 +140,16 @@ typedef struct LVRelStats
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ double num_warm_chains; /* number of warm chains seen so far */
+
+ /* List of WARM chains that can be converted into HOT chains */
+ /* NB: this list is ordered by TID of the root pointers */
+ int num_redblue_chains; /* current # of entries */
+ int max_redblue_chains; /* # slots allocated in array */
+ LVRedBlueChain *redblue_chains; /* array of LVRedBlueChain */
+ double num_non_convertible_warm_chains;
+
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
@@ -149,6 +178,7 @@ static void lazy_scan_heap(Relation onerel, int options,
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
static void lazy_cleanup_index(Relation indrel,
@@ -156,6 +186,10 @@ static void lazy_cleanup_index(Relation indrel,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static int lazy_warmclear_page(Relation onerel, BlockNumber blkno,
+ Buffer buffer, int chainindex, LVRelStats *vacrelstats,
+ Buffer *vmbuffer, bool check_all_visible);
+static void lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats);
static bool should_attempt_truncation(LVRelStats *vacrelstats);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -163,8 +197,15 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
-static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static void lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static IndexBulkDeleteCallbackResult lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
+static int vac_cmp_redblue_chain(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -683,8 +724,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* If we are close to overrunning the available space for dead-tuple
* TIDs, pause and do a cycle of vacuuming before we tackle this page.
*/
- if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
- vacrelstats->num_dead_tuples > 0)
+ if (((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_dead_tuples > 0) ||
+ ((vacrelstats->max_redblue_chains - vacrelstats->num_redblue_chains) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_redblue_chains > 0))
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -714,6 +757,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -736,6 +780,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* valid.
*/
vacrelstats->num_dead_tuples = 0;
+ vacrelstats->num_redblue_chains = 0;
+ memset(vacrelstats->redblue_chains, 0,
+ vacrelstats->max_redblue_chains * sizeof (LVRedBlueChain));
vacrelstats->num_index_scans++;
/* Report that we are once again scanning the heap */
@@ -939,15 +986,33 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
continue;
}
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
/* Redirect items mustn't be touched */
if (ItemIdIsRedirected(itemid))
{
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either completely Red or completely Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
hastup = true; /* this page won't be truncatable */
continue;
}
- ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
/*
* DEAD item pointers are to be vacuumed normally; but we don't
* count them in tups_vacuumed, else we'd be double-counting (at
@@ -967,6 +1032,28 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
tuple.t_len = ItemIdGetLength(itemid);
tuple.t_tableOid = RelationGetRelid(onerel);
+ if (!HeapTupleIsHeapOnly(&tuple))
+ {
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either completely Red or completely Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
+ }
+
tupgone = false;
switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
@@ -1287,7 +1374,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* If any tuples need to be deleted, perform final vacuum cycle */
/* XXX put a threshold on min number of tuples here? */
- if (vacrelstats->num_dead_tuples > 0)
+ if (vacrelstats->num_dead_tuples > 0 || vacrelstats->num_redblue_chains > 0)
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -1305,6 +1392,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -1372,7 +1460,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
*
* This routine marks dead tuples as unused and compacts out free
* space on their pages. Pages not having dead tuples recorded from
- * lazy_scan_heap are not visited at all.
+ * lazy_scan_heap are not visited at all. This routine also converts
+ * candidate WARM chains to HOT chains by clearing WARM related flags. The
+ * candidate chains are determined by the preceding index scans after
+ * looking at the data collected by the first heap scan.
*
* Note: the reason for doing this as a second pass is we cannot remove
* the tuples until we've removed their index entries, and we want to
@@ -1381,7 +1472,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
static void
lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
{
- int tupindex;
+ int tupindex, chainindex;
int npages;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
@@ -1390,33 +1481,69 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
npages = 0;
tupindex = 0;
- while (tupindex < vacrelstats->num_dead_tuples)
+ chainindex = 0;
+ while (tupindex < vacrelstats->num_dead_tuples ||
+ chainindex < vacrelstats->num_redblue_chains)
{
- BlockNumber tblk;
+ BlockNumber tblk, chainblk, vacblk;
Buffer buf;
Page page;
Size freespace;
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
- buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+ tblk = chainblk = InvalidBlockNumber;
+ if (chainindex < vacrelstats->num_redblue_chains)
+ chainblk =
+ ItemPointerGetBlockNumber(&(vacrelstats->redblue_chains[chainindex].chain_tid));
+
+ if (tupindex < vacrelstats->num_dead_tuples)
+ tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
+
+ if (tblk == InvalidBlockNumber)
+ vacblk = chainblk;
+ else if (chainblk == InvalidBlockNumber)
+ vacblk = tblk;
+ else
+ vacblk = Min(chainblk, tblk);
+
+ Assert(vacblk != InvalidBlockNumber);
+
+ buf = ReadBufferExtended(onerel, MAIN_FORKNUM, vacblk, RBM_NORMAL,
vac_strategy);
- if (!ConditionalLockBufferForCleanup(buf))
+
+
+ if (vacblk == chainblk)
+ LockBufferForCleanup(buf);
+ else if (!ConditionalLockBufferForCleanup(buf))
{
ReleaseBuffer(buf);
++tupindex;
continue;
}
- tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
- &vmbuffer);
+
+ /*
+ * Convert WARM chains on this page. This should be done before
+ * vacuuming the page to ensure that we can correctly set visibility
+ * bits after clearing WARM chains.
+ *
+ * If we are going to vacuum this page then don't check for
+ * all-visibility just yet.
+ */
+ if (vacblk == chainblk)
+ chainindex = lazy_warmclear_page(onerel, chainblk, buf, chainindex,
+ vacrelstats, &vmbuffer, chainblk != tblk);
+
+ if (vacblk == tblk)
+ tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+ &vmbuffer);
/* Now that we've compacted the page, record its available space */
page = BufferGetPage(buf);
freespace = PageGetHeapFreeSpace(page);
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, tblk, freespace);
+ RecordPageWithFreeSpace(onerel, vacblk, freespace);
npages++;
}
@@ -1435,6 +1562,107 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
}
/*
+ * lazy_warmclear_page() -- clear WARM flag and mark chains blue when possible
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * chainindex is the index in vacrelstats->redblue_chains of the first
+ * candidate chain for this page. We assume the rest follow sequentially.
+ * The return value is the first chainindex past the chains of this page.
+ *
+ * If check_all_visible is set then we also check if the page has now become
+ * all visible and update visibility map.
+ */
+static int
+lazy_warmclear_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+ int chainindex, LVRelStats *vacrelstats, Buffer *vmbuffer,
+ bool check_all_visible)
+{
+ Page page = BufferGetPage(buffer);
+ OffsetNumber cleared_offnums[MaxHeapTuplesPerPage];
+ int num_cleared = 0;
+ TransactionId visibility_cutoff_xid;
+ bool all_frozen;
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED, blkno);
+
+ START_CRIT_SECTION();
+
+ for (; chainindex < vacrelstats->num_redblue_chains ; chainindex++)
+ {
+ BlockNumber tblk;
+ LVRedBlueChain *chain;
+
+ chain = &vacrelstats->redblue_chains[chainindex];
+
+ tblk = ItemPointerGetBlockNumber(&chain->chain_tid);
+ if (tblk != blkno)
+ break; /* past end of tuples for this block */
+
+ /*
+ * Since a heap page can have no more than MaxHeapTuplesPerPage
+ * offnums and we process each offnum only once, a MaxHeapTuplesPerPage-sized
+ * array is enough to hold all offsets cleared on this page.
+ */
+ if (!chain->keep_warm_chain)
+ num_cleared += heap_clear_warm_chain(page, &chain->chain_tid,
+ cleared_offnums + num_cleared);
+ }
+
+ /*
+ * Mark buffer dirty before we write WAL.
+ */
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(onerel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_heap_warmclear(onerel, buffer,
+ cleared_offnums, num_cleared);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* If not checking for all-visibility then we're done */
+ if (!check_all_visible)
+ return chainindex;
+
+ /*
+ * The following code should match the corresponding code in
+ * lazy_vacuum_page
+ **/
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid,
+ &all_frozen))
+ PageSetAllVisible(page);
+
+ /*
+ * All the changes to the heap page have been done. If the all-visible
+ * flag is now set, also set the VM all-visible bit (and, if possible, the
+ * all-frozen bit) unless this has already been done previously.
+ */
+ if (PageIsAllVisible(page))
+ {
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+
+ Assert(BufferIsValid(*vmbuffer));
+ if (flags != 0)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
+ *vmbuffer, visibility_cutoff_xid, flags);
+ }
+ return chainindex;
+}
+
+/*
* lazy_vacuum_page() -- free dead tuples on a page
* and repair its fragmentation.
*
@@ -1587,6 +1815,16 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
return false;
}
+static void
+lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats)
+{
+ int i;
+ for (i = 0; i < vacrelstats->num_redblue_chains; i++)
+ {
+ LVRedBlueChain *chain = &vacrelstats->redblue_chains[i];
+ chain->num_blue_pointers = chain->num_red_pointers = 0;
+ }
+}
/*
* lazy_vacuum_index() -- vacuum one index relation.
@@ -1596,6 +1834,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
*/
static void
lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats)
{
@@ -1611,15 +1850,81 @@ lazy_vacuum_index(Relation indrel,
ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
ivinfo.strategy = vac_strategy;
- /* Do bulk deletion */
- *stats = index_bulk_delete(&ivinfo, *stats,
- lazy_tid_reaped, (void *) vacrelstats);
+ /*
+ * If told, convert WARM chains into HOT chains.
+ *
+ * We must have already collected the candidate WARM chains, i.e. chains
+ * which have either only Red or only Blue tuples, but not a mix of both.
+ *
+ * This works in two phases. In the first phase, we do a complete index
+ * scan and collect information about index pointers to the candidate
+ * chains, but we don't do any conversion. To be precise, we count the
+ * number of Blue and Red index pointers to each candidate chain and use
+ * that knowledge to arrive at a decision; the actual conversion happens
+ * during the second phase (we do kill known dead pointers in this phase
+ * though).
+ *
+ * In the second phase, for each Red chain we check whether we have seen
+ * a Red index pointer. For such chains, we kill the Blue pointer and
+ * color the Red pointer Blue. The heap tuples are marked Blue in the
+ * second heap scan. If we did not find any Red pointer to a Red chain,
+ * the chain is reachable only via the Blue pointer (because, say, the
+ * WARM update did not add a new entry for this index). In that case, we
+ * do nothing. There is a third case where we find more than one Blue
+ * pointer to a Red chain. This can happen because of aborted vacuums. We
+ * don't handle that case yet, but it should be possible to apply the
+ * same recheck logic and find which of the Blue pointers is redundant
+ * and should be removed.
+ *
+ * For Blue chains, we just kill the Red pointer, if any, and keep the
+ * Blue pointer.
+ */
+ if (clear_warm)
+ {
+ lazy_reset_redblue_pointer_count(vacrelstats);
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase1, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row version, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples,
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed)));
+
+ (*stats)->num_red_pointers = 0;
+ (*stats)->num_blue_pointers = 0;
+ (*stats)->red_pointers_removed = 0;
+ (*stats)->blue_pointers_removed = 0;
+ (*stats)->pointers_colored = 0;
+
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase2, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to convert red pointers, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers, "
+ "colored %0.f red pointers blue",
+ RelationGetRelationName(indrel),
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed,
+ (*stats)->pointers_colored)));
+ }
+ else
+ {
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_tid_reaped, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row versions",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples),
+ errdetail("%s.", pg_rusage_show(&ru0))));
+ }
- ereport(elevel,
- (errmsg("scanned index \"%s\" to remove %d row versions",
- RelationGetRelationName(indrel),
- vacrelstats->num_dead_tuples),
- errdetail("%s.", pg_rusage_show(&ru0))));
}
/*
@@ -1993,9 +2298,11 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
if (vacrelstats->hasindex)
{
- maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
+ maxtuples = (vac_work_mem * 1024L) / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain));
maxtuples = Min(maxtuples, INT_MAX);
- maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+ maxtuples = Min(maxtuples, MaxAllocSize / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain)));
/* curious coding here to ensure the multiplication can't overflow */
if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -2013,6 +2320,57 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
vacrelstats->max_dead_tuples = (int) maxtuples;
vacrelstats->dead_tuples = (ItemPointer)
palloc(maxtuples * sizeof(ItemPointerData));
+
+ /*
+ * XXX Cheat for now and allocate the same size array for tracking blue and
+ * red chains. maxtuples must have been already adjusted above to ensure we
+ * don't cross vac_work_mem.
+ */
+ vacrelstats->num_redblue_chains = 0;
+ vacrelstats->max_redblue_chains = (int) maxtuples;
+ vacrelstats->redblue_chains = (LVRedBlueChain *)
+ palloc0(maxtuples * sizeof(LVRedBlueChain));
+
+}
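+
+/*
+ * A rough worked example of the sizing above (numbers are assumptions, not
+ * measurements): with maintenance_work_mem = 64MB, sizeof(ItemPointerData)
+ * being 6 bytes and an LVRedBlueChain of roughly 16 bytes, maxtuples comes
+ * to about 64 * 1024 * 1024 / 22, i.e. around 3 million entries, each entry
+ * paying for one dead-tuple slot plus one red/blue chain slot.
+ */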
+
+/*
+ * lazy_record_blue_chain - remember one blue chain
+ */
+static void
+lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 0;
+ vacrelstats->num_redblue_chains++;
+ }
+}
+
+/*
+ * lazy_record_red_chain - remember one red chain
+ */
+static void
+lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 1;
+ vacrelstats->num_redblue_chains++;
+ }
}
/*
@@ -2043,8 +2401,8 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
*
* Assumes dead_tuples array is in sorted order.
*/
-static bool
-lazy_tid_reaped(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state)
{
LVRelStats *vacrelstats = (LVRelStats *) state;
ItemPointer res;
@@ -2055,7 +2413,193 @@ lazy_tid_reaped(ItemPointer itemptr, void *state)
sizeof(ItemPointerData),
vac_cmp_itemptr);
- return (res != NULL);
+ return (res != NULL) ? IBDCR_DELETE : IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase1() -- run first pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ ItemPointer res;
+ LVRedBlueChain *chain;
+
+ res = (ItemPointer) bsearch((void *) itemptr,
+ (void *) vacrelstats->dead_tuples,
+ vacrelstats->num_dead_tuples,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+
+ if (res != NULL)
+ return IBDCR_DELETE;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+ if (chain != NULL)
+ {
+ if (is_red)
+ chain->num_red_pointers++;
+ else
+ chain->num_blue_pointers++;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase2() -- run second pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ LVRedBlueChain *chain;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+
+ if (chain != NULL && (chain->keep_warm_chain != 1))
+ {
+ /*
+ * At no point can we have more than 1 Red pointer to any chain, nor
+ * more than 2 Blue pointers.
+ */
+ Assert(chain->num_red_pointers <= 1);
+ Assert(chain->num_blue_pointers <= 2);
+
+ if (chain->is_red_chain == 1)
+ {
+ if (is_red)
+ {
+ /*
+ * A Red pointer, pointing to a Red chain.
+ *
+ * Color the Red pointer Blue (and delete the Blue pointer). We
+ * may have already seen the Blue pointer in the scan and
+ * deleted that or we may see it later in the scan. It doesn't
+ * matter if we fail at any point because we won't clear up
+ * WARM bits on the heap tuples until we have dealt with the
+ * index pointers cleanly.
+ */
+ return IBDCR_COLOR_BLUE;
+ }
+ else
+ {
+ /*
+ * Blue pointer to a Red chain.
+ */
+ if (chain->num_red_pointers > 0)
+ {
+ /*
+ * If there exists a Red pointer to the chain, we can
+ * delete the Blue pointer and clear the WARM bits on the
+ * heap tuples.
+ */
+ return IBDCR_DELETE;
+ }
+ else if (chain->num_blue_pointers == 1)
+ {
+ /*
+ * If this is the only pointer to a Red chain, we must keep the
+ * Blue pointer.
+ *
+ * The presence of a Red chain indicates that the WARM update
+ * must have committed. But during the update
+ * this index was probably not updated and hence it
+ * contains just the one original Blue pointer to the chain.
+ * We should be able to clear the WARM bits on heap tuples
+ * unless we later find another index which prevents the
+ * cleanup.
+ */
+ return IBDCR_KEEP;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * This is a Blue chain.
+ */
+ if (is_red)
+ {
+ /*
+ * A Red pointer to a Blue chain.
+ *
+ * This can happen when a WARM update is aborted. Later the HOT
+ * chain is pruned leaving behind only Blue tuples in the
+ * chain. But the Red index pointer inserted in the index
+ * remains and it must now be deleted before we clear WARM bits
+ * from the heap tuple.
+ */
+ return IBDCR_DELETE;
+ }
+
+ /*
+ * Blue pointer to a Blue chain.
+ *
+ * If this is the only surviving Blue pointer, keep it and clear
+ * the WARM bits from the heap tuples.
+ */
+ if (chain->num_blue_pointers == 1)
+ return IBDCR_KEEP;
+
+ /*
+ * If there is more than one Blue pointer to this chain, we could
+ * apply the recheck logic, kill the redundant Blue pointer and
+ * convert the chain. But that's not done yet.
+ */
+ }
+
+ /*
+ * For everything else, we must keep the WARM bits and also keep the
+ * index pointers.
+ */
+ chain->keep_warm_chain = 1;
+ return IBDCR_KEEP;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * Comparator routines for use with qsort() and bsearch(). Similar to
+ * vac_cmp_itemptr, but right hand argument is LVRedBlueChain struct pointer.
+ */
+static int
+vac_cmp_redblue_chain(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
}
/*
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index d62d2de..3e49a8f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -405,7 +405,8 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
- indexInfo); /* index AM may need this */
+ indexInfo, /* index AM may need this */
+ (modified_attrs != NULL)); /* is this a WARM update? */
/*
* If the index has an associated exclusion constraint, check that.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..7a9b48a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -347,7 +347,7 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
static void
DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
- uint8 info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+ uint8 info = XLogRecGetInfo(buf->record) & XLOG_HEAP2_OPMASK;
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
@@ -359,10 +359,6 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
switch (info)
{
- case XLOG_HEAP2_MULTI_INSERT:
- if (SnapBuildProcessChange(builder, xid, buf->origptr))
- DecodeMultiInsert(ctx, buf);
- break;
case XLOG_HEAP2_NEW_CID:
{
xl_heap_new_cid *xlrec;
@@ -390,6 +386,7 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
case XLOG_HEAP2_CLEANUP_INFO:
case XLOG_HEAP2_VISIBLE:
case XLOG_HEAP2_LOCK_UPDATED:
+ case XLOG_HEAP2_WARMCLEAR:
break;
default:
elog(ERROR, "unexpected RM_HEAP2_ID record type: %u", info);
@@ -418,6 +415,10 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (SnapBuildProcessChange(builder, xid, buf->origptr))
DecodeInsert(ctx, buf);
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ if (SnapBuildProcessChange(builder, xid, buf->origptr))
+ DecodeMultiInsert(ctx, buf);
+ break;
/*
* Treat HOT update as normal updates. There is no useful
@@ -809,7 +810,7 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
/*
- * Decode XLOG_HEAP2_MULTI_INSERT_insert record into multiple tuplebufs.
+ * Decode XLOG_HEAP_MULTI_INSERT record into multiple tuplebufs.
*
* Currently MULTI_INSERT will always contain the full tuples.
*/
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index baff998..6a2e2f2 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -106,7 +106,7 @@ HeapTupleHeaderGetCmin(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)));
if (tup->t_infomask & HEAP_COMBOCID)
@@ -120,7 +120,7 @@ HeapTupleHeaderGetCmax(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
/*
* Because GetUpdateXid() performs memory allocations if xmax is a
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 703bdce..0df5a44 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -186,7 +186,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -205,7 +205,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -377,7 +377,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -396,7 +396,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -471,7 +471,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return HeapTupleInvisible;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -490,7 +490,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -753,7 +753,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -772,7 +772,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -974,7 +974,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -993,7 +993,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1180,7 +1180,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
if (HeapTupleHeaderXminInvalid(tuple))
return HEAPTUPLE_DEAD;
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_OFF)
+ else if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1198,7 +1198,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
InvalidTransactionId);
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d7702e5..68859f2 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -75,6 +75,14 @@ typedef bool (*aminsert_function) (Relation indexRelation,
Relation heapRelation,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+/* insert this WARM tuple */
+typedef bool (*amwarminsert_function) (Relation indexRelation,
+ Datum *values,
+ bool *isnull,
+ ItemPointer heap_tid,
+ Relation heapRelation,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
/* bulk delete */
typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -203,6 +211,7 @@ typedef struct IndexAmRoutine
ambuild_function ambuild;
ambuildempty_function ambuildempty;
aminsert_function aminsert;
+ amwarminsert_function amwarminsert;
ambulkdelete_function ambulkdelete;
amvacuumcleanup_function amvacuumcleanup;
amcanreturn_function amcanreturn; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index f467b18..bf1e6bd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,12 +75,29 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
+ double num_red_pointers; /* # red pointers found */
+ double num_blue_pointers; /* # blue pointers found */
+ double pointers_colored; /* # red pointers colored blue */
+ double red_pointers_removed; /* # red pointers removed */
+ double blue_pointers_removed; /* # blue pointers removed */
BlockNumber pages_deleted; /* # unused pages in index */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
+/*
+ * IndexBulkDeleteCallback should return one of the following
+ */
+typedef enum IndexBulkDeleteCallbackResult
+{
+ IBDCR_KEEP, /* index tuple should be preserved */
+ IBDCR_DELETE, /* index tuple should be deleted */
+ IBDCR_COLOR_BLUE /* index tuple should be colored blue */
+} IndexBulkDeleteCallbackResult;
+
/* Typedef for callback function to determine if a tuple is bulk-deletable */
-typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
+typedef IndexBulkDeleteCallbackResult (*IndexBulkDeleteCallback) (
+ ItemPointer itemptr,
+ bool is_red, void *state);
/* struct definitions appear in relscan.h */
typedef struct IndexScanDescData *IndexScanDesc;
@@ -135,7 +152,8 @@ extern bool index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- struct IndexInfo *indexInfo);
+ struct IndexInfo *indexInfo,
+ bool warm_update);
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index e76a7aa..a2720c1 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -269,6 +269,11 @@ typedef HashMetaPageData *HashMetaPage;
#define HASHPROC 1
#define HASHNProcs 1
+/*
+ * Flags overloaded on t_tid.ip_posid field. They are managed by
+ * ItemPointerSetFlags and corresponding routines.
+ */
+#define HASH_INDEX_RED_POINTER 0x01
/* public routines */
@@ -279,6 +284,10 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+extern bool hashwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
@@ -346,6 +355,8 @@ extern void _hash_expandtable(Relation rel, Buffer metabuf);
extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
Bucket obucket, uint32 maxbucket, uint32 highmask,
uint32 lowmask);
+extern void _hash_color_items(Page page, OffsetNumber *coloritemsno,
+ uint16 ncoloritems);
/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9412c3a..719a725 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,20 @@ typedef struct HeapUpdateFailureData
CommandId cmax;
} HeapUpdateFailureData;
+typedef int HeapCheckWarmChainStatus;
+
+#define HCWC_BLUE_TUPLE 0x0001
+#define HCWC_RED_TUPLE 0x0002
+#define HCWC_WARM_TUPLE 0x0004
+
+#define HCWC_IS_MIXED(status) \
+ (((status) & (HCWC_BLUE_TUPLE | HCWC_RED_TUPLE)) != 0)
+#define HCWC_IS_ALL_RED(status) \
+ (((status) & HCWC_BLUE_TUPLE) == 0)
+#define HCWC_IS_ALL_BLUE(status) \
+ (((status) & HCWC_RED_TUPLE) == 0)
+#define HCWC_IS_WARM(status) \
+ (((status) & HCWC_WARM_TUPLE) != 0)
/* ----------------
* function prototypes for heap access method
@@ -183,6 +197,10 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
bool *warm_update);
extern void heap_sync(Relation relation);
+extern HeapCheckWarmChainStatus heap_check_warm_chain(Page dp,
+ ItemPointer tid, bool stop_at_warm);
+extern int heap_clear_warm_chain(Page dp, ItemPointer tid,
+ OffsetNumber *cleared_offnums);
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 9b081bf..66fd0ea 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -32,7 +32,7 @@
#define XLOG_HEAP_INSERT 0x00
#define XLOG_HEAP_DELETE 0x10
#define XLOG_HEAP_UPDATE 0x20
-/* 0x030 is free, was XLOG_HEAP_MOVE */
+#define XLOG_HEAP_MULTI_INSERT 0x30
#define XLOG_HEAP_HOT_UPDATE 0x40
#define XLOG_HEAP_CONFIRM 0x50
#define XLOG_HEAP_LOCK 0x60
@@ -47,18 +47,23 @@
/*
* We ran out of opcodes, so heapam.c now has a second RmgrId. These opcodes
* are associated with RM_HEAP2_ID, but are not logically different from
- * the ones above associated with RM_HEAP_ID. XLOG_HEAP_OPMASK applies to
- * these, too.
+ * the ones above associated with RM_HEAP_ID.
+ *
+ * In PG 10, we moved XLOG_HEAP2_MULTI_INSERT to RM_HEAP_ID. That allows us to
+ * use the 0x80 bit in RM_HEAP2_ID, thus potentially creating another 8 possible
+ * opcodes in RM_HEAP2_ID.
*/
#define XLOG_HEAP2_REWRITE 0x00
#define XLOG_HEAP2_CLEAN 0x10
#define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
-#define XLOG_HEAP2_MULTI_INSERT 0x50
+#define XLOG_HEAP2_WARMCLEAR 0x50
#define XLOG_HEAP2_LOCK_UPDATED 0x60
#define XLOG_HEAP2_NEW_CID 0x70
+#define XLOG_HEAP2_OPMASK 0x70
+
/*
* xl_heap_insert/xl_heap_multi_insert flag values, 8 bits are available.
*/
@@ -226,6 +231,14 @@ typedef struct xl_heap_clean
#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+typedef struct xl_heap_warmclear
+{
+ uint16 ncleared;
+ /* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_warmclear;
+
+#define SizeOfHeapWarmClear (offsetof(xl_heap_warmclear, ncleared) + sizeof(uint16))
+
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
* Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
@@ -389,6 +402,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
+extern XLogRecPtr log_heap_warmclear(Relation reln, Buffer buffer,
+ OffsetNumber *cleared, int ncleared);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index b5891ca..1f6ab0d 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -201,6 +201,21 @@ struct HeapTupleHeaderData
* upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
+/*
+ * A WARM chain usually consists of two parts. Each of these parts is a HOT
+ * chain in itself, i.e. all indexed columns have the same value, but a WARM
+ * update separates the two parts. We call these two parts the Blue chain and
+ * the Red chain. We need a mechanism to identify which part a tuple belongs
+ * to. We can't just look at HeapTupleHeaderIsHeapWarmTuple() because during a
+ * WARM update, both the old and the new tuple are marked as WARM tuples.
+ *
+ * We need another infomask bit for this, and we reuse the infomask bit that
+ * was earlier used by old-style VACUUM FULL. This is safe because the
+ * HEAP_WARM_TUPLE flag will always be set along with HEAP_WARM_RED. So if
+ * HEAP_WARM_TUPLE and HEAP_WARM_RED are both set, we know the bit refers to
+ * the Red part of the WARM chain.
+ */
+#define HEAP_WARM_RED 0x4000
#define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */
/*
@@ -397,7 +412,7 @@ struct HeapTupleHeaderData
/* SetCmin is reasonably simple since we never need a combo CID */
#define HeapTupleHeaderSetCmin(tup, cid) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
(tup)->t_infomask &= ~HEAP_COMBOCID; \
} while (0)
@@ -405,7 +420,7 @@ do { \
/* SetCmax must be used after HeapTupleHeaderAdjustCmax; see combocid.c */
#define HeapTupleHeaderSetCmax(tup, cid, iscombo) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
if (iscombo) \
(tup)->t_infomask |= HEAP_COMBOCID; \
@@ -415,7 +430,7 @@ do { \
#define HeapTupleHeaderGetXvac(tup) \
( \
- ((tup)->t_infomask & HEAP_MOVED) ? \
+ HeapTupleHeaderIsMoved(tup) ? \
(tup)->t_choice.t_heap.t_field3.t_xvac \
: \
InvalidTransactionId \
@@ -423,7 +438,7 @@ do { \
#define HeapTupleHeaderSetXvac(tup, xid) \
do { \
- Assert((tup)->t_infomask & HEAP_MOVED); \
+ Assert(HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_xvac = (xid); \
} while (0)
@@ -651,6 +666,58 @@ do { \
)
/*
+ * Macros to check if a tuple was moved off/in by old-style VACUUM FULL from
+ * the pre-9.0 era. Such a tuple must not have the HEAP_WARM_TUPLE flag set.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsMovedOff(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_OFF) \
+)
+
+#define HeapTupleHeaderIsMovedIn(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_IN) \
+)
+
+#define HeapTupleHeaderIsMoved(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED) \
+)
+
+/*
+ * Check if tuple belongs to the Red part of the WARM chain.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsWarmRed(tuple) \
+( \
+ HeapTupleHeaderIsHeapWarmTuple(tuple) && \
+ (((tuple)->t_infomask & HEAP_WARM_RED) != 0) \
+)
+
+/*
+ * Mark tuple as a member of the Red chain. Must only be done on a tuple which
+ * is already marked a WARM-tuple.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderSetWarmRed(tuple) \
+( \
+ AssertMacro(HeapTupleHeaderIsHeapWarmTuple(tuple)), \
+ (tuple)->t_infomask |= HEAP_WARM_RED \
+)
+
+#define HeapTupleHeaderClearWarmRed(tuple) \
+( \
+ (tuple)->t_infomask &= ~HEAP_WARM_RED \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
@@ -810,6 +877,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapWarmTuple(tuple) \
HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+#define HeapTupleIsHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderIsWarmRed((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderSetWarmRed((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderClearWarmRed((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4b35ca..1f4f0bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -427,6 +427,12 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
/*
+ * Flags overloaded on t_tid.ip_posid field. They are managed by
+ * ItemPointerSetFlags and corresponding routines.
+ */
+#define BTREE_INDEX_RED_POINTER 0x01
+
+/*
* external entry points for btree, in nbtree.c
*/
extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -436,6 +442,10 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+extern bool btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -487,10 +497,12 @@ extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
-extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed);
+extern void _bt_handleitems_vacuum(Relation rel, Buffer buf,
+ OffsetNumber *delitemnos, int ndelitems,
+ OffsetNumber *coloritemnos, int ncoloritems);
extern int _bt_pagedel(Relation rel, Buffer buf);
+extern void _bt_color_items(Page page, OffsetNumber *coloritemnos,
+ uint16 ncoloritems);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index d6a3085..5555742 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -142,34 +142,20 @@ typedef struct xl_btree_reuse_page
/*
* This is what we need to know about vacuum of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
- * For a non-MVCC index scans there is an additional correctness requirement
- * for applying these changes during recovery, which is that we must do one
- * of these two things for every block in the index:
- * * lock the block for cleanup and apply any required changes
- * * EnsureBlockUnpinned()
- * The purpose of this is to ensure that no index scans started before we
- * finish scanning the index are still running by the time we begin to remove
- * heap tuples.
- *
- * Any changes to any one block are registered on just one WAL record. All
- * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
- * starting from the last block vacuumed through until this one. Individual
- * block numbers aren't given.
+ * single index page when executed by VACUUM. It also includes tuples whose
+ * color is changed from red to blue by VACUUM.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
* have a zero length array of offsets. Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
- BlockNumber lastBlockVacuumed;
-
- /* TARGET OFFSET NUMBERS FOLLOW */
+ uint16 ndelitems;
+ uint16 ncoloritems;
+ /* ndelitems + ncoloritems TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ncoloritems) + sizeof(uint16))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9472ecc..b355b61 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -25,6 +25,7 @@
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED 7
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
0004_warm_updates_v15.patchapplication/octet-stream; name=0004_warm_updates_v15.patchDownload
commit d1dd6d5fdab8c4d6d2dbb574ac3f8a339ba7cde0
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Tue Feb 28 10:37:15 2017 +0530
Main warm patch - v15
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f2eda67..b356e2b 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -142,6 +142,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b22563b..b4a1465 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -116,6 +116,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 6593771..843389b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -94,6 +94,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24510e7..6645160 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -90,6 +90,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -271,6 +272,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -308,8 +311,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..60e941d 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -363,6 +365,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
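+
+/*
+ * A sketch of the intended caller-side usage (the exact call site and
+ * variable names here are assumptions, not part of this file):
+ *
+ *    if (indexRel->rd_amroutine->amrecheck != NULL &&
+ *        !indexRel->rd_amroutine->amrecheck(indexRel, indexTuple,
+ *                                           heapRel, heapTuple))
+ *        skip the tuple, since its index key no longer matches.
+ *
+ * Index AMs that cannot support such a recheck leave amrecheck set to NULL,
+ * which in turn disables WARM updates on tables using them.
+ */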
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..7b9a712
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,306 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature eliminated many redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block must have enough
+space to store the new version of the tuple. This is the same requirement
+as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index for
+the updated tuple during a WARM update, the new entry is made to point
+to the root of the WARM chain.
+
+For example, consider a table with two columns and an index on each of
+them. When a tuple is first inserted into the table, each index has
+exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. In that case, Index1 does not get any new
+entry, and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple that does not match the scan key may be
+returned because it was reached via a stale index entry. In the above example,
+tuple [1111, bbbb] is reachable from both (aaaa) and (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, so the recheck routine
+for the hash AM must first compute the hash value of the heap attribute
+and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If the table has an index whose AM does not provide a recheck
+routine, WARM updates are disabled on that table.
+
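+As a rough sketch (not the exact code), a btree-style recheck simply
+recomputes the index column values from the heap tuple and compares them,
+attribute by attribute, with the index tuple:
+
+    FormIndexDatum(indexInfo, slot, estate, values, isnull);
+    for (i = 1; i <= natts; i++)
+    {
+        att = indexRel->rd_att->attrs[i - 1];
+        indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+        if (isnull[i - 1] != indxisnull ||
+            (!indxisnull && !datumIsEqual(values[i - 1], indxvalue,
+                                          att->attbyval, att->attlen)))
+            return false;       /* key has changed; do not return this tuple */
+    }
+    return true;
+
+The hash AM follows the same pattern, except that the recomputed values must
+first be passed through _hash_convert_tuple() before the comparison, since
+the index stores only the hash codes.
+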
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as no two index entries with the
+same key point to the same WARM chain. Otherwise, the same valid tuple will
+be reachable via multiple index entries, each of which satisfies the index
+key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. it does not do WARM updates
+to a tuple that is already part of a WARM chain. HOT updates are fine
+because they do not add a new index entry.
+
+Even with the restriction, this is a significant improvement because the
+number of regular (non-HOT, non-WARM) updates is roughly cut in half.
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)", which
+will return the same value if the new heap value differs only in case.
+So we cannot rely solely on the heap column check to decide whether or
+not to insert a new index entry for expression indexes. Similarly, for
+partial indexes, the predicate expression must be evaluated to decide
+whether or not a new index entry is needed when columns referenced in
+the predicate change.
+
+(None of this is currently implemented; we simply disallow a WARM update
+if a column used in an expression index or in an index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of the
+tuple being updated. The t_ctid field in the heap tuple header is normally
+used to find the next tuple in the update chain. But the tuple that we are
+updating must be the last tuple in the update chain, and in that case the
+t_ctid field usually points to the tuple itself. So, in theory, we could use
+t_ctid to store additional information in the last tuple of the update
+chain, provided the fact that this is the last tuple is recorded elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which remains the last valid tuple in the chain
+no longer carries the root line pointer information. In such rare cases,
+the root line pointer must be found the hard way, by scanning the entire
+heap page.
+
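+As an illustration (the helper names below are hypothetical; the patch's
+actual macros may differ), a reader can recover the root line pointer
+roughly as follows:
+
+    if (HeapTupleHeaderHasLatest(tup))          /* HEAP_LATEST_TUPLE set */
+        root_offnum = ItemPointerGetOffsetNumber(&tup->t_ctid);
+    else
+        root_offnum = FindRootOffsetByPageScan(page, offnum);  /* slow path */
+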
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for an index key match (the case where the old tuple is
+reached via the new index key). So we must follow the update chain to
+the end every time to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about the WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
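
A toy model of the recheck obligation described above; the names are
illustrative and the keys are plain strings here, whereas the patch
recomputes and compares the actual index datums.

#include <stdbool.h>
#include <string.h>

typedef struct ToyIndexEntry
{
    const char *key;            /* key stored in the index tuple */
} ToyIndexEntry;

typedef struct ToyHeapTuple
{
    const char *indexed_col;    /* current value of the indexed column */
    bool        warm_chain;     /* tuple came from a WARM chain */
} ToyHeapTuple;

/*
 * A tuple fetched through an index entry is accepted only if either the
 * chain is a plain HOT chain, or the key recomputed from the heap tuple
 * matches the key stored in the index entry.
 */
bool
toy_recheck_ok(const ToyIndexEntry *entry, const ToyHeapTuple *htup)
{
    if (!htup->warm_chain)
        return true;            /* no WARM update ever happened: no recheck */
    return strcmp(entry->key, htup->indexed_col) == 0;
}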
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans, but it also caps the benefit of WARM at roughly 50%.
+That is still significant, but if we could return WARM chains back to
+normal status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or
+conclusively prove that there are no duplicate index entries for the
+root line pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples within each part have matching index keys, but
+certain index keys may differ between the two parts. Let's say we mark
+the heap tuples in each part with a special Red-Blue flag. The same
+flag is replicated in the index tuples. For example, when new rows are
+inserted into a table, they are marked with the Blue flag and the index
+entries associated with those rows are also marked Blue. When a row is
+WARM updated, the new version is marked with the Red flag and the new
+index entry created by the update is also marked Red.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with both Red and Blue pointers, a heap
+tuple with the Blue flag is reachable from the Blue pointer and one
+with the Red flag from the Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples are reachable from the
+Blue pointer (there is no Red pointer in such indexes). So, as a side
+note, merely matching Red and Blue flags is not enough from an index
+scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
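
A minimal sketch of this first-pass candidacy test, with simplified
stand-in types; the real code would walk the HOT chain on the heap page
rather than an array, and would also record the root line pointer.

#include <stdbool.h>

typedef enum { TOY_BLUE, TOY_RED } ToyColor;

typedef struct ToyChainTuple
{
    bool        live;
    ToyColor    color;
} ToyChainTuple;

/*
 * A WARM chain is a candidate for conversion back to HOT only if all of its
 * live tuples carry the same Red/Blue flag. Returns true and sets
 * *chain_color in that case.
 */
bool
toy_chain_is_hot_candidate(const ToyChainTuple *tuples, int ntuples,
                           ToyColor *chain_color)
{
    bool        seen_live = false;
    ToyColor    color = TOY_BLUE;
    int         i;

    for (i = 0; i < ntuples; i++)
    {
        if (!tuples[i].live)
            continue;
        if (!seen_live)
        {
            color = tuples[i].color;
            seen_live = true;
        }
        else if (tuples[i].color != color)
            return false;       /* mixed colors: leave this chain alone */
    }
    if (seen_live)
        *chain_color = color;
    return seen_live;
}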
+
+If we have a Red WARM chain, then our goal is to remove the Blue
+pointers, and vice versa. But there is a catch: for Index2 above, there
+is only a Blue pointer and it must not be removed. In other words, we
+should remove a Blue pointer iff a Red pointer exists. Since index
+vacuum may visit Red and Blue pointers in any order, we will need
+another index pass to remove the dead index pointers. So in the first
+index pass we check which WARM candidates have two index pointers; in
+the second pass, we remove the dead pointer and reset the Red flag if
+the surviving index pointer is Red.
+
+During the second heap scan, we fix the WARM chain by clearing the
+HEAP_WARM_TUPLE flag and resetting the Red flag back to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing a Red index flag to Blue but before
+removing the other Blue pointer, we will end up with two Blue pointers
+to a Red WARM chain. But since the HEAP_WARM_TUPLE flag on the heap
+tuple is still set, further WARM updates to the chain will be blocked.
+We will need some special handling for the case of multiple Blue
+pointers: we can either leave these WARM chains alone and let them die
+with a subsequent non-WARM update, or apply heap-recheck logic during
+index vacuum to find the dead pointer. Given that vacuum aborts are
+not common, I am inclined to leave this case unhandled. We must still
+detect the presence of multiple Blue pointers and ensure that we
+neither accidentally remove either of the Blue pointers nor clear the
+WARM flag on such chains.
+
+CREATE INDEX CONCURRENTLY
+-------------------------
+
+Currently CREATE INDEX CONCURRENTLY (CIC) is implemented as a 3-phase
+process. In the first phase, we create the catalog entry for the new
+index so that the index is visible to all other backends, but we still
+don't use it for either reads or writes; we do, however, ensure that no
+new broken HOT chains are created by new transactions. In the second
+phase, we build the new index using an MVCC snapshot and then make the
+index available for inserts. We then do another pass over the index and
+insert any missing tuples, each time indexing only the tuple's root
+line pointer. See README.HOT for details about how HOT impacts CIC and
+how the various challenges are tackled.
+
+WARM poses another challenge because it allows creation of HOT chains
+even when an index key is changed. Since the new index is not ready
+for insertion until the second phase is over, we might end up with a
+situation where the HOT chain has tuples with different values for the
+indexed columns, yet only one of those values is indexed by the new
+index. Note that during the third phase, we only index tuples whose
+root line pointer is missing from the index, but we can't easily check
+whether the existing index tuple actually indexes the heap tuple
+visible to the new MVCC snapshot. Finding that out would require
+querying the index again for every tuple in the chain, especially for
+WARM tuples, causing repeated index access. Another option would be to
+return the index keys along with the heap TIDs when the index is
+scanned to collect all indexed TIDs during the third phase; we could
+then compare the heap tuple against the already-indexed key and decide
+whether or not to index the new tuple.
+
+We solve this problem more simply by disallowing WARM updates until the
+index is ready for insertion. We don't need to disallow WARM wholesale;
+only those updates that change a column used by the new index are
+prevented from being WARM updates.
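
A minimal sketch of that restriction, again with simplified stand-in
types; the patch implements it by testing the update's modified-columns
bitmap against the attributes of any index that is not yet ready for
inserts.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyAttrSet;        /* bit i set => attribute i is in the set */

typedef struct ToyIndex
{
    ToyAttrSet  attrs;              /* columns used by the index */
    bool        ready_for_inserts;  /* CIC insert phase finished? */
} ToyIndex;

/*
 * While a concurrently-built index is not yet ready for inserts, any UPDATE
 * that touches one of its columns must not be done as a WARM update.
 */
bool
toy_warm_allowed_during_cic(ToyAttrSet modified_attrs,
                            const ToyIndex *indexes, int nindexes)
{
    int         i;

    for (i = 0; i < nindexes; i++)
    {
        if (!indexes[i].ready_for_inserts &&
            (modified_attrs & indexes[i].attrs) != 0)
            return false;
    }
    return true;
}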
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 064909a..9c4522a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1958,6 +1958,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update ultimately aborted, we must still do a
+ * recheck because the failed UPDATE may have inserted new index entries
+ * which are now stale, but still reference this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either WARM or WARM updated tuple signals possible
+ * breakage and the caller must recheck tuple returned from this chain
+ * for index satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * The chain cannot continue past a tuple that stores the root line pointer
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1977,11 +2049,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2035,9 +2110,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2050,6 +2128,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2098,7 +2186,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2122,18 +2211,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested for "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller-supplied tid to the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3492,15 +3604,18 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
+ Bitmapset *notready_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3521,6 +3636,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3545,6 +3661,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3566,10 +3686,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+ notready_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_NOTREADY);
+
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, notready_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3621,6 +3748,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3876,6 +4006,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4194,6 +4325,37 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until the duplicate (key, CTID)
+ * index entry issue is sorted out.
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably best done after a first
+ * round of tests of the remaining functionality.
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by update. This will be
+ * fixed once basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !IsSystemRelation(relation) &&
+ !bms_overlap(notready_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4240,6 +4402,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID> entries
+ * in indexes, we could look at relaxing this restriction and allow even
+ * more WARM updates
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4252,12 +4430,35 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4367,7 +4568,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4507,7 +4711,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ Bitmapset **modified_attrs, bool *warm_update)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4516,7 +4721,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs, warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7568,6 +7773,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7579,6 +7785,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7652,6 +7861,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8629,16 +8840,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8698,6 +8915,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8833,6 +9055,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f54337c..c2bd7d6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -834,6 +834,13 @@ heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
continue;
+ /*
+ * If the tuple has root line pointer, it must be the end of the
+ * chain
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
/* Set up to scan the HOT-chain */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4e7eca7..f56c58f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -75,10 +75,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if the function was called earlier in the session when we're
+ * here. We can't call it now because there exists a risk of causing
+ * deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -535,7 +552,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
+ * scan->xs_tuple_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -574,7 +591,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -601,6 +618,12 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * Reset the per-tuple recheck flag to the scan-wide setting; if the
+ * scan itself is lossy, every tuple must be rechecked.
+ */
+ scan->xs_tuple_recheck = scan->xs_recheck;
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -610,32 +633,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * OK, we got a tuple which satisfies the snapshot, but if it's part of
+ * a WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional
+ * checks, since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck
+ * is added to the table? Do we need to handle this case, or are CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
- }
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
+ if (res)
+ return &scan->xs_ctup;
+ }
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6dca810..b5cb619 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,11 +20,14 @@
#include "access/nbtxlog.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -250,6 +253,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -309,6 +315,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -326,112 +334,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
}
-
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
+ else if (recheck)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
+
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
-
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
+
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
+
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff..952ed8f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/indexfsm.h"
@@ -163,6 +164,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = btestimateparallelscan;
amroutine->aminitparallelscan = btinitparallelscan;
amroutine->amparallelrescan = btparallelrescan;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -344,8 +346,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5b259a3..c376c1b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2069,3 +2073,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index e57ac49..59ef7f3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -72,6 +72,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f8d9214..bba52ec 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_AmCache = NULL;
ii->ii_Context = CurrentMemoryContext;
+ /* build a bitmap of all table attributes referred by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index abc344a..e5355a8 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -66,10 +66,15 @@ CatalogCloseIndexes(CatalogIndexState indstate)
*
* This should be called for each inserted or updated catalog tuple.
*
+ * If the tuple was WARM updated, modified_attrs contains the set of
+ * columns changed by the update. We must not insert new index entries for
+ * indexes which do not refer to any of the modified columns.
+ *
* This is effectively a cut-down version of ExecInsertIndexTuples.
*/
static void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
int i;
int numIndexes;
@@ -79,12 +84,28 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * still need them for some indexes.
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +133,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +168,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
@@ -168,7 +200,7 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
CatalogCloseIndexes(indstate);
return oid;
@@ -190,7 +222,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
@@ -210,12 +242,14 @@ void
CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
{
CatalogIndexState indstate;
+ bool warm_update;
+ Bitmapset *modified_attrs;
indstate = CatalogOpenIndexes(heapRel);
- simple_heap_update(heapRel, otid, tup);
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
CatalogCloseIndexes(indstate);
}
@@ -231,9 +265,12 @@ void
CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
CatalogIndexState indstate)
{
- simple_heap_update(heapRel, otid, tup);
+ Bitmapset *modified_attrs;
+ bool warm_update;
+
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..7fb1295 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -498,6 +498,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -528,7 +529,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index e2544e5..d9c0fe7 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = castNode(TriggerData, fcinfo->context);
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 01a63c8..f078a5d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2680,6 +2680,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2834,6 +2836,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..d8f033d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -699,7 +699,14 @@ DefineIndex(Oid relationId,
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * tries to either insert into it or use it for queries. In addition to
+ * that, WARM updates will be disallowed if an update is modifying one of
+ * the columns used by this new index. This is necessary to ensure that we
+ * don't create WARM tuples which do not have a corresponding entry in this
+ * index. It must be noted that during the second phase, we will index only
+ * those heap tuples whose root line pointer is not already in the index,
+ * hence it's important that all tuples in a given chain have the same
+ * value for any indexed column (including the columns of this new index).
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
@@ -747,7 +754,10 @@ DefineIndex(Oid relationId,
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
- * chain have different index keys.
+ * chain have different index keys. Also, the new index is consulted for
+ * deciding whether a WARM update is possible, and a WARM update is not done
+ * if a column used by this index is being updated. This ensures that we
+ * don't create WARM tuples which are not indexed by this index.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
@@ -782,7 +792,8 @@ DefineIndex(Oid relationId,
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
- * insert new entries into the index for insertions and non-HOT updates.
+ * insert new entries into the index for insertions and non-HOT updates, or
+ * for WARM updates where this index needs a new entry.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..1388be1 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1032,6 +1032,19 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM
+ * tuple, there could be multiple index entries
+ * pointing to the root of this chain. We can't do
+ * index-only scans for such tuples without verifying
+ * the index keys. So mark the page as !all_visible
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ break;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, visibility_cutoff_xid))
visibility_cutoff_xid = xmin;
@@ -2158,6 +2171,18 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without verifying the index keys. So mark
+ * the page as !all_visible
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 2142273..d62d2de 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
indexInfo); /* index AM may need this */
@@ -791,6 +804,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index ebf3f6b..1fa13a5 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -399,6 +399,8 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate, false, NULL,
NIL);
@@ -445,6 +447,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +459,30 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple, &modified_attrs, &warm_update);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self,
+ &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c871aa0..eb98b2d 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -39,6 +39,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -364,11 +365,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update,
+ * it's a lossy case
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 0a9dfdb..38c7827 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -118,10 +118,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..a1f3440 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ada374c..308ae8c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4088,6 +4090,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5197,6 +5200,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5224,6 +5228,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..c85898c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -4352,6 +4353,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
@@ -4759,15 +4767,19 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+	Bitmapset  *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
+ Bitmapset *indxnotreadyattrs; /* columns in not ready indexes */
List *indexoidlist;
List *newindexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true;/* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4782,6 +4794,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return bms_copy(relation->rd_indxnotreadyattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4822,9 +4838,11 @@ restart:
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
+ indxnotreadyattrs = NULL;
foreach(l, indexoidlist)
{
Oid indexOid = lfirst_oid(l);
@@ -4861,6 +4879,10 @@ restart:
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_member(indxnotreadyattrs,
+ attrnum - FirstLowInvalidHeapAttributeNumber);
+
if (isKey)
uindexattrs = bms_add_member(uindexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
@@ -4876,10 +4898,29 @@ restart:
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_members(indxnotreadyattrs,
+ exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
+
index_close(indexDesc, AccessShareLock);
}
@@ -4912,15 +4953,22 @@ restart:
goto restart;
}
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
relation->rd_pkattr = NULL;
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
+ bms_free(relation->rd_indxnotreadyattr);
+ relation->rd_indxnotreadyattr = NULL;
/*
* Now save copies of the bitmaps in the relcache entry. We intentionally
@@ -4933,7 +4981,9 @@ restart:
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
+ relation->rd_indxnotreadyattr = bms_copy(indxnotreadyattrs);
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4947,6 +4997,10 @@ restart:
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return indxnotreadyattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
@@ -5559,6 +5613,7 @@ load_relcache_init_file(bool shared)
rel->rd_keyattr = NULL;
rel->rd_pkattr = NULL;
rel->rd_idattr = NULL;
+ rel->rd_indxnotreadyattr = NULL;
rel->rd_pubactions = NULL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index f919cf8..d7702e5 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -152,6 +153,10 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -217,6 +222,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WARM */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9c0b79f..e76a7aa 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -389,4 +389,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 95aa976..9412c3a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -161,7 +162,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -176,7 +178,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ Bitmapset **modified_attrs,
+ bool *warm_update);
extern void heap_sync(Relation relation);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index e6019d5..9b081bf 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 4d614b7..b5891ca 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
/*
* Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
* the TID of the tuple itself in t_ctid field to mark the end of the chain.
@@ -785,6 +801,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f9304db..d4b35ca 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -537,6 +537,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce3ca8d..12d3b0c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -112,7 +112,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a4cc86d..aec9c89 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index ea3f3a5..ebeec74 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -41,5 +41,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6332ea0..41e270b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -64,6 +64,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8b710ec..2ee690b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1178,7 +1180,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7c..fbac7c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -138,9 +138,14 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+	Bitmapset  *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
+ Bitmapset *rd_indxnotreadyattr; /* columns used by indexes not yet
+ ready */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm;/* True if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index da36b67..d18bd09 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -50,7 +50,9 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE,
+ INDEX_ATTR_BITMAP_NOTREADY
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c661f1d..561d9579 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1732,6 +1732,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1875,6 +1876,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,6 +1920,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1955,7 +1958,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1971,7 +1975,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1993,7 +1998,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..6391891
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-------------------------
+ Seq Scan on updtst_tab3
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index edeb2d6..2268705 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..f31127c
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,172 @@
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
+
+
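For anyone trying the patch set outside the regression suite, here is a minimal, hypothetical psql session (the table and index names below are made up and are not part of the patches) that exercises a WARM update and then reads the new n_tup_warm_upd counter added to the pg_stat_all_tables family of views. The stats collector is asynchronous, so the counter may take a moment to reflect the update:

CREATE TABLE warm_demo (a integer PRIMARY KEY, b integer, c text);
CREATE INDEX warm_demo_b_idx ON warm_demo (b);
INSERT INTO warm_demo SELECT g, g, 'payload' FROM generate_series(1, 1000) g;
-- Updating the indexed column b should, with these patches applied and when
-- there is room on the page, result in a WARM update: a new entry goes into
-- warm_demo_b_idx only, while the primary key index is left untouched.
UPDATE warm_demo SET b = b + 1000 WHERE a = 1;
-- The new per-table counter should typically show the WARM update.
SELECT n_tup_upd, n_tup_hot_upd, n_tup_warm_upd
  FROM pg_stat_user_tables
 WHERE relname = 'warm_demo';
DROP TABLE warm_demo;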
Attachment: 0003_freeup_3bits_ip_posid_v15.patch
commit a5838065cefe12c84c97eaaa1f1d9b571641273a
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Fri Feb 24 10:41:31 2017 +0530
Free up 3 bits from ip_posid
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index b4e9fec..9c7e6ea 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -928,7 +928,7 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
* Find the minimum item > advancePast among the active entry streams.
*
* Note: a lossy-page entry is encoded by a ItemPointer with max value for
- * offset (0xffff), so that it will sort after any exact entries for the
+ * offset (0x1fff), so that it will sort after any exact entries for the
* same page. So we'll prefer to return exact pointers not lossy
* pointers, which is good.
*/
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 8d2d31a..b22b9f5 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -253,7 +253,7 @@ ginCompressPostingList(const ItemPointer ipd, int nipd, int maxsize,
Assert(ndecoded == totalpacked);
for (i = 0; i < ndecoded; i++)
- Assert(memcmp(&tmp[i], &ipd[i], sizeof(ItemPointerData)) == 0);
+ Assert(ItemPointerEquals(&tmp[i], &ipd[i]));
pfree(tmp);
}
#endif
diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 438912c..3f7a3f0 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -160,14 +160,14 @@ typedef struct GinMetaPageData
(GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0 && \
GinItemPointerGetBlockNumber(p) == (BlockNumber)0)
#define ItemPointerSetMax(p) \
- ItemPointerSet((p), InvalidBlockNumber, (OffsetNumber)0xffff)
+ ItemPointerSet((p), InvalidBlockNumber, (OffsetNumber)OffsetNumberMask)
#define ItemPointerIsMax(p) \
- (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0xffff && \
+ (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)OffsetNumberMask && \
GinItemPointerGetBlockNumber(p) == InvalidBlockNumber)
#define ItemPointerSetLossyPage(p, b) \
- ItemPointerSet((p), (b), (OffsetNumber)0xffff)
+ ItemPointerSet((p), (b), (OffsetNumber)OffsetNumberMask)
#define ItemPointerIsLossyPage(p) \
- (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0xffff && \
+ (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)OffsetNumberMask && \
GinItemPointerGetBlockNumber(p) != InvalidBlockNumber)
/*
@@ -218,7 +218,7 @@ typedef signed char GinNullCategory;
*/
#define GinGetNPosting(itup) GinItemPointerGetOffsetNumber(&(itup)->t_tid)
#define GinSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
-#define GIN_TREE_POSTING ((OffsetNumber)0xffff)
+#define GIN_TREE_POSTING ((OffsetNumber)OffsetNumberMask)
#define GinIsPostingTree(itup) (GinGetNPosting(itup) == GIN_TREE_POSTING)
#define GinSetPostingTree(itup, blkno) ( GinSetNPosting((itup),GIN_TREE_POSTING), ItemPointerSetBlockNumber(&(itup)->t_tid, blkno) )
#define GinGetPostingTree(itup) GinItemPointerGetBlockNumber(&(itup)->t_tid)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 5b33030..c532dc3 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -269,8 +269,8 @@ typedef struct
* invalid tuples in an index, so throwing an error is as far as we go with
* supporting that.
*/
-#define TUPLE_IS_VALID 0xffff
-#define TUPLE_IS_INVALID 0xfffe
+#define TUPLE_IS_VALID OffsetNumberMask
+#define TUPLE_IS_INVALID OffsetNumberPrev(OffsetNumberMask)
#define GistTupleIsInvalid(itup) ( ItemPointerGetOffsetNumber( &((itup)->t_tid) ) == TUPLE_IS_INVALID )
#define GistTupleSetValid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_VALID )
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 24433c7..4d614b7 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -288,7 +288,7 @@ struct HeapTupleHeaderData
* than MaxOffsetNumber, so that it can be distinguished from a valid
* offset number in a regular item pointer.
*/
-#define SpecTokenOffsetNumber 0xfffe
+#define SpecTokenOffsetNumber OffsetNumberPrev(OffsetNumberMask)
/*
* HeapTupleHeader accessor macros
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 60d0070..3144bdd 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -57,7 +57,7 @@ typedef ItemPointerData *ItemPointer;
* True iff the disk item pointer is not NULL.
*/
#define ItemPointerIsValid(pointer) \
- ((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
+ ((bool) (PointerIsValid(pointer) && (((pointer)->ip_posid & OffsetNumberMask) != 0)))
/*
* ItemPointerGetBlockNumber
@@ -82,13 +82,37 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerGetOffsetNumber(pointer) \
( \
AssertMacro(ItemPointerIsValid(pointer)), \
- (pointer)->ip_posid \
+ ((pointer)->ip_posid & OffsetNumberMask) \
)
/* Same as ItemPointerGetOffsetNumber but without any assert-checks */
#define ItemPointerGetOffsetNumberNoCheck(pointer) \
( \
- (pointer)->ip_posid \
+ ((pointer)->ip_posid & OffsetNumberMask) \
+)
+
+/*
+ * Get the flags stored in high order bits in the OffsetNumber.
+ */
+#define ItemPointerGetFlags(pointer) \
+( \
+ ((pointer)->ip_posid & ~OffsetNumberMask) >> OffsetNumberBits \
+)
+
+/*
+ * Set the flag bits. We first left-shift since flags are defined starting at 0x01
+ */
+#define ItemPointerSetFlags(pointer, flags) \
+( \
+ ((pointer)->ip_posid |= ((flags) << OffsetNumberBits)) \
+)
+
+/*
+ * Clear all flags.
+ */
+#define ItemPointerClearFlags(pointer) \
+( \
+ ((pointer)->ip_posid &= OffsetNumberMask) \
)
/*
@@ -99,7 +123,7 @@ typedef ItemPointerData *ItemPointer;
( \
AssertMacro(PointerIsValid(pointer)), \
BlockIdSet(&((pointer)->ip_blkid), blockNumber), \
- (pointer)->ip_posid = offNum \
+ (pointer)->ip_posid = (offNum) \
)
/*
diff --git a/src/include/storage/off.h b/src/include/storage/off.h
index fe8638f..fe1834c 100644
--- a/src/include/storage/off.h
+++ b/src/include/storage/off.h
@@ -26,8 +26,15 @@ typedef uint16 OffsetNumber;
#define InvalidOffsetNumber ((OffsetNumber) 0)
#define FirstOffsetNumber ((OffsetNumber) 1)
#define MaxOffsetNumber ((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))
-#define OffsetNumberMask (0xffff) /* valid uint16 bits */
+/*
+ * Currently we support blocks of at most 32kB and each ItemId takes 6 bytes. That
+ * limits the number of line pointers to (32kB/6 = 5461). 13 bits are enough to
+ * represent all line pointers. Hence we can reuse the high order bits in
+ * OffsetNumber for other purposes.
+ */
+#define OffsetNumberMask	(0x1fff)	/* valid OffsetNumber bits */
+#define OffsetNumberBits 13 /* number of valid bits in OffsetNumber */
/* ----------------
* support macros
* ----------------
Attachment: 0002_clear_ip_posid_blkid_refs_v15.patch
commit 93e9160e3f85159c4f58d57ef6c3ba30c421db4b
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Thu Feb 23 10:12:17 2017 +0530
Remove direct references to ip_posid and ip_blkid - same as
remove_ip_posid_blkid_ref_v3 submitted to hackers
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index d50ec3a..2ec265e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -363,8 +363,8 @@ bt_page_items(PG_FUNCTION_ARGS)
j = 0;
values[j++] = psprintf("%d", uargs->offset);
values[j++] = psprintf("(%u,%u)",
- BlockIdGetBlockNumber(&(itup->t_tid.ip_blkid)),
- itup->t_tid.ip_posid);
+ ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
+ ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 06a1992..e65040d 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -353,7 +353,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
* heap_getnext may find no tuples on a given page, so we cannot
* simply examine the pages returned by the heap scan.
*/
- tupblock = BlockIdGetBlockNumber(&tuple->t_self.ip_blkid);
+ tupblock = ItemPointerGetBlockNumber(&tuple->t_self);
while (block <= tupblock)
{
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 60f005c..b4e9fec 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -626,8 +626,9 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
}
else
{
- entry->btree.itemptr = advancePast;
- entry->btree.itemptr.ip_posid++;
+ ItemPointerSet(&entry->btree.itemptr,
+ GinItemPointerGetBlockNumber(&advancePast),
+ OffsetNumberNext(GinItemPointerGetOffsetNumber(&advancePast)));
}
entry->btree.fullScan = false;
stack = ginFindLeafPage(&entry->btree, true, snapshot);
@@ -979,15 +980,17 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
if (GinItemPointerGetBlockNumber(&advancePast) <
GinItemPointerGetBlockNumber(&minItem))
{
- advancePast.ip_blkid = minItem.ip_blkid;
- advancePast.ip_posid = 0;
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&minItem),
+ InvalidOffsetNumber);
}
}
else
{
- Assert(minItem.ip_posid > 0);
- advancePast = minItem;
- advancePast.ip_posid--;
+ Assert(GinItemPointerGetOffsetNumber(&minItem) > 0);
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&minItem),
+ OffsetNumberPrev(GinItemPointerGetOffsetNumber(&minItem)));
}
/*
@@ -1245,15 +1248,17 @@ scanGetItem(IndexScanDesc scan, ItemPointerData advancePast,
if (GinItemPointerGetBlockNumber(&advancePast) <
GinItemPointerGetBlockNumber(&key->curItem))
{
- advancePast.ip_blkid = key->curItem.ip_blkid;
- advancePast.ip_posid = 0;
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&key->curItem),
+ InvalidOffsetNumber);
}
}
else
{
- Assert(key->curItem.ip_posid > 0);
- advancePast = key->curItem;
- advancePast.ip_posid--;
+ Assert(GinItemPointerGetOffsetNumber(&key->curItem) > 0);
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&key->curItem),
+ OffsetNumberPrev(GinItemPointerGetOffsetNumber(&key->curItem)));
}
/*
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 598069d..8d2d31a 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -79,13 +79,11 @@ itemptr_to_uint64(const ItemPointer iptr)
uint64 val;
Assert(ItemPointerIsValid(iptr));
- Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits));
+ Assert(GinItemPointerGetOffsetNumber(iptr) < (1 << MaxHeapTuplesPerPageBits));
- val = iptr->ip_blkid.bi_hi;
- val <<= 16;
- val |= iptr->ip_blkid.bi_lo;
+ val = GinItemPointerGetBlockNumber(iptr);
val <<= MaxHeapTuplesPerPageBits;
- val |= iptr->ip_posid;
+ val |= GinItemPointerGetOffsetNumber(iptr);
return val;
}
@@ -93,11 +91,9 @@ itemptr_to_uint64(const ItemPointer iptr)
static inline void
uint64_to_itemptr(uint64 val, ItemPointer iptr)
{
- iptr->ip_posid = val & ((1 << MaxHeapTuplesPerPageBits) - 1);
+ GinItemPointerSetOffsetNumber(iptr, val & ((1 << MaxHeapTuplesPerPageBits) - 1));
val = val >> MaxHeapTuplesPerPageBits;
- iptr->ip_blkid.bi_lo = val & 0xFFFF;
- val = val >> 16;
- iptr->ip_blkid.bi_hi = val & 0xFFFF;
+ GinItemPointerSetBlockNumber(iptr, val);
Assert(ItemPointerIsValid(iptr));
}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8aac670..b6f8f5a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3006,8 +3006,8 @@ DisplayMapping(HTAB *tuplecid_data)
ent->key.relnode.dbNode,
ent->key.relnode.spcNode,
ent->key.relnode.relNode,
- BlockIdGetBlockNumber(&ent->key.tid.ip_blkid),
- ent->key.tid.ip_posid,
+ ItemPointerGetBlockNumber(&ent->key.tid),
+ ItemPointerGetOffsetNumber(&ent->key.tid),
ent->cmin,
ent->cmax
);
diff --git a/src/backend/storage/page/itemptr.c b/src/backend/storage/page/itemptr.c
index 703cbb9..28ac885 100644
--- a/src/backend/storage/page/itemptr.c
+++ b/src/backend/storage/page/itemptr.c
@@ -54,18 +54,21 @@ ItemPointerCompare(ItemPointer arg1, ItemPointer arg2)
/*
* Don't use ItemPointerGetBlockNumber or ItemPointerGetOffsetNumber here,
* because they assert ip_posid != 0 which might not be true for a
- * user-supplied TID.
+ * user-supplied TID. Instead we use ItemPointerGetBlockNumberNoCheck and
+ * ItemPointerGetOffsetNumberNoCheck which do not do any validation.
*/
- BlockNumber b1 = BlockIdGetBlockNumber(&(arg1->ip_blkid));
- BlockNumber b2 = BlockIdGetBlockNumber(&(arg2->ip_blkid));
+ BlockNumber b1 = ItemPointerGetBlockNumberNoCheck(arg1);
+ BlockNumber b2 = ItemPointerGetBlockNumberNoCheck(arg2);
if (b1 < b2)
return -1;
else if (b1 > b2)
return 1;
- else if (arg1->ip_posid < arg2->ip_posid)
+ else if (ItemPointerGetOffsetNumberNoCheck(arg1) <
+ ItemPointerGetOffsetNumberNoCheck(arg2))
return -1;
- else if (arg1->ip_posid > arg2->ip_posid)
+ else if (ItemPointerGetOffsetNumberNoCheck(arg1) >
+ ItemPointerGetOffsetNumberNoCheck(arg2))
return 1;
else
return 0;
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index a3b372f..735c006 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -109,8 +109,8 @@ tidout(PG_FUNCTION_ARGS)
OffsetNumber offsetNumber;
char buf[32];
- blockNumber = BlockIdGetBlockNumber(&(itemPtr->ip_blkid));
- offsetNumber = itemPtr->ip_posid;
+ blockNumber = ItemPointerGetBlockNumberNoCheck(itemPtr);
+ offsetNumber = ItemPointerGetOffsetNumberNoCheck(itemPtr);
/* Perhaps someday we should output this as a record. */
snprintf(buf, sizeof(buf), "(%u,%u)", blockNumber, offsetNumber);
@@ -146,14 +146,12 @@ Datum
tidsend(PG_FUNCTION_ARGS)
{
ItemPointer itemPtr = PG_GETARG_ITEMPOINTER(0);
- BlockId blockId;
BlockNumber blockNumber;
OffsetNumber offsetNumber;
StringInfoData buf;
- blockId = &(itemPtr->ip_blkid);
- blockNumber = BlockIdGetBlockNumber(blockId);
- offsetNumber = itemPtr->ip_posid;
+ blockNumber = ItemPointerGetBlockNumberNoCheck(itemPtr);
+ offsetNumber = ItemPointerGetOffsetNumberNoCheck(itemPtr);
pq_begintypsend(&buf);
pq_sendint(&buf, blockNumber, sizeof(blockNumber));
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 34e7339..2fd4479 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -460,8 +460,8 @@ extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
static inline int
ginCompareItemPointers(ItemPointer a, ItemPointer b)
{
- uint64 ia = (uint64) a->ip_blkid.bi_hi << 32 | (uint64) a->ip_blkid.bi_lo << 16 | a->ip_posid;
- uint64 ib = (uint64) b->ip_blkid.bi_hi << 32 | (uint64) b->ip_blkid.bi_lo << 16 | b->ip_posid;
+ uint64 ia = (uint64) GinItemPointerGetBlockNumber(a) << 32 | GinItemPointerGetOffsetNumber(a);
+ uint64 ib = (uint64) GinItemPointerGetBlockNumber(b) << 32 | GinItemPointerGetOffsetNumber(b);
if (ia == ib)
return 0;
diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index a3fb056..438912c 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -132,10 +132,17 @@ typedef struct GinMetaPageData
* to avoid Asserts, since sometimes the ip_posid isn't "valid"
*/
#define GinItemPointerGetBlockNumber(pointer) \
- BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+ (ItemPointerGetBlockNumberNoCheck(pointer))
#define GinItemPointerGetOffsetNumber(pointer) \
- ((pointer)->ip_posid)
+ (ItemPointerGetOffsetNumberNoCheck(pointer))
+
+#define GinItemPointerSetBlockNumber(pointer, blkno) \
+ (ItemPointerSetBlockNumber((pointer), (blkno)))
+
+#define GinItemPointerSetOffsetNumber(pointer, offnum) \
+ (ItemPointerSetOffsetNumber((pointer), (offnum)))
+
/*
* Special-case item pointer values needed by the GIN search logic.
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7552186..24433c7 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -428,7 +428,7 @@ do { \
#define HeapTupleHeaderIsSpeculative(tup) \
( \
- (tup)->t_ctid.ip_posid == SpecTokenOffsetNumber \
+ (ItemPointerGetOffsetNumberNoCheck(&(tup)->t_ctid) == SpecTokenOffsetNumber) \
)
#define HeapTupleHeaderGetSpeculativeToken(tup) \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6289ffa..f9304db 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -151,9 +151,8 @@ typedef struct BTMetaPageData
* within a level). - vadim 04/09/97
*/
#define BTTidSame(i1, i2) \
- ( (i1).ip_blkid.bi_hi == (i2).ip_blkid.bi_hi && \
- (i1).ip_blkid.bi_lo == (i2).ip_blkid.bi_lo && \
- (i1).ip_posid == (i2).ip_posid )
+ ((ItemPointerGetBlockNumber(&(i1)) == ItemPointerGetBlockNumber(&(i2))) && \
+ (ItemPointerGetOffsetNumber(&(i1)) == ItemPointerGetOffsetNumber(&(i2))))
#define BTEntrySame(i1, i2) \
BTTidSame((i1)->t_tid, (i2)->t_tid)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 576aaa8..60d0070 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -69,6 +69,12 @@ typedef ItemPointerData *ItemPointer;
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
)
+/* Same as ItemPointerGetBlockNumber but without any assert-checks */
+#define ItemPointerGetBlockNumberNoCheck(pointer) \
+( \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
+)
+
/*
* ItemPointerGetOffsetNumber
* Returns the offset number of a disk item pointer.
@@ -79,6 +85,12 @@ typedef ItemPointerData *ItemPointer;
(pointer)->ip_posid \
)
+/* Same as ItemPointerGetOffsetNumber but without any assert-checks */
+#define ItemPointerGetOffsetNumberNoCheck(pointer) \
+( \
+ (pointer)->ip_posid \
+)
+
/*
* ItemPointerSet
* Sets a disk item pointer to the specified block and offset.
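The reason the NoCheck variants exist is spelled out in the ItemPointerCompare hunk above: the checked accessors assert ip_posid != 0, but a user-supplied TID such as '(0,0)' can legitimately carry a zero offset, so comparison code has to read the raw fields. A minimal standalone sketch of that comparator logic (a simplified struct standing in for ItemPointerData):

    #include <stdint.h>

    typedef struct
    {
        uint32_t    blkno;      /* stands in for ip_blkid */
        uint16_t    offnum;     /* stands in for ip_posid; may legitimately be 0 */
    } TidSketch;

    static int
    tid_compare(const TidSketch *a, const TidSketch *b)
    {
        /* Compare block numbers first, then offsets, with no validity checks. */
        if (a->blkno != b->blkno)
            return (a->blkno < b->blkno) ? -1 : 1;
        if (a->offnum != b->offnum)
            return (a->offnum < b->offnum) ? -1 : 1;
        return 0;
    }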
Attachment: 0001_track_root_lp_v15.patch (application/octet-stream)
commit 98408805eac6736d7a0e7850d34c75fc866dfaff
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Sun Jan 1 16:29:53 2017 +0530
Track root line pointer in t_ctid->ip_posid field.
This patch is the same as v12 submitted to hackers
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74fb09c..064909a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -94,7 +94,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2248,13 +2249,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2385,6 +2386,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2423,8 +2425,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
- RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup,
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
+
+ /* We must not overwrite the speculative insertion token. */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2652,6 +2659,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2722,7 +2730,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
+
+ /* Mark this tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2730,7 +2743,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
+ /* Mark each tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -3002,6 +3018,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3012,6 +3029,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3053,7 +3071,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3183,7 +3202,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID to the
+ * caller, who uses that as a hint that the end of the chain has been
+ * reached.
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3232,6 +3261,22 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+ * heap_get_root_tuple() may call palloc, which is disallowed once we
+ * enter the critical section. So check if the root offset is cached in the
+ * tuple and, if not, fetch that information the hard way before entering
+ * the critical section.
+ *
+ * Unless we are dealing with a pg_upgraded cluster, the root offset
+ * information should usually be cached, so fetching it adds little
+ * overhead. Also, once a tuple is updated, the information is copied to
+ * the new version, so we do not pay this price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tp.t_self));
+
START_CRIT_SECTION();
/*
@@ -3259,8 +3304,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain. */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3461,6 +3508,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3523,6 +3572,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3807,7 +3857,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3947,6 +4002,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3974,6 +4030,14 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available.
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self));
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3988,7 +4052,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4146,6 +4211,10 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4171,6 +4240,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple.
+ * Usually this information will be available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing the old-style CTID chains and hence
+ * the information must be obtained by hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4178,10 +4258,22 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
+ root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, then we just copy the information from the
+ * old tuple if available, or use the value computed above. For regular
+ * updates, RelationPutHeapTuple must have returned the actual offset number
+ * where the new version was inserted, and we store that value since the
+ * update started a new HOT chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4194,7 +4286,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4233,6 +4325,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4513,7 +4606,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4522,9 +4616,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4544,6 +4640,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4571,7 +4668,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5009,7 +5110,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5057,6 +5163,10 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self));
+
START_CRIT_SECTION();
/*
@@ -5085,7 +5195,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5599,6 +5712,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5607,6 +5721,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5836,7 +5952,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5845,7 +5961,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5962,7 +6078,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6088,8 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7437,6 +7552,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7557,6 +7673,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8211,7 +8330,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, xlrec->offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8301,7 +8426,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8436,8 +8562,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8573,7 +8699,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8706,13 +8832,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and root offset
+ * information is restored.
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8775,6 +8905,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself. */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8838,11 +8971,17 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..8052519 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,20 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once
+ * it is known. The former is used while updating an existing tuple, where the
+ * caller tells us the root line pointer of the chain. The latter is used when
+ * inserting a new row, in which case the root line pointer is set to the
+ * offset where this tuple is inserted.
*/
-void
+OffsetNumber
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,17 +68,24 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set block number and the root offset into CTID of the stored tuple, too
+ * (unless this is a speculative insertion, in which case the token is held
+ * in CTID field instead).
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number. */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(root_offnum))
+ root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, root_offnum);
}
+
+ return root_offnum;
}
/*
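As a usage note, the new fifth argument of RelationPutHeapTuple is used in two ways by the heapam.c hunks above; the fragment below is only a sketch of those call sites and assumes the patched headers:

    /*
     * Plain insert: there is no existing chain, so let RelationPutHeapTuple
     * pick the root offset (the tuple becomes its own chain root) and
     * remember what it chose.
     */
    root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
                                       InvalidOffsetNumber);
    HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);

    /*
     * HOT/WARM update: pass the root offset of the existing chain so that
     * the new version keeps pointing at the same root line pointer.
     */
    root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
                                       root_offnum);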
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..f54337c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple as the last tuple
+ * in the chain without clearing the HOT-updated flag. So we must
+ * check if this is the last tuple in the chain and stop following the
+ * CTID, else we risk getting into an infinite loop (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,47 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Either for all items in this page or for the given item, find their
+ * respective root line pointers.
+ *
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is InvalidOffsetNumber, the caller wants to know
+ * the root line pointers of all the items in this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line pointers
+ * and tuples on the page in order to find the root line pointers. To minimize
+ * the cost, we break early if target_offnum is specified and the root line
+ * pointer for target_offnum is found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +807,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return.
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful not to write past the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping. */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +869,65 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return.
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember mapping for any other item. The
+ * root_offsets array may not even have space for them. So be
+ * careful not to write past the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item. */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple as the last tuple in the chain
+ * and store the root offset in CTID, without clearing the HOT-updated
+ * flag. So we must check whether CTID actually stores the root offset
+ * and break to avoid an infinite loop.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple.
+ */
+OffsetNumber
+heap_get_root_tuple(Page page, OffsetNumber target_offnum)
+{
+ OffsetNumber offnum = InvalidOffsetNumber;
+ heap_get_root_tuples_internal(page, target_offnum, &offnum);
+ return offnum;
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber,
+ root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c7b283c..6ced1e7 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain.
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest.
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 5242dee..2142273 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -789,7 +789,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3f76a40..1705799 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2587,7 +2587,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2595,7 +2595,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a864f78..95aa976 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -189,6 +189,7 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern OffsetNumber heap_get_root_tuple(Page page, OffsetNumber target_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index b285f17..e6019d5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 2824f23..921cb37 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -35,8 +35,8 @@ typedef struct BulkInsertStateData
} BulkInsertStateData;
-extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index a6c7e31..7552186 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in chain and
+ * ip_posid points to the root line
+ * pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,43 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+/*
+ * Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
+ * the TID of the tuple itself in t_ctid field to mark the end of the chain.
+ * But starting PG v10, we use a special flag HEAP_LATEST_TUPLE to identify the
+ * last tuple and store the root line pointer of the HOT chain in t_ctid field
+ * instead.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * Starting from PostgreSQL 10, the latest tuple in an update chain has
+ * HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do not.
+ * For those, we determine whether a tuple is latest by testing that its t_ctid
+ * points to itself.
+ *
+ * Note: beware of multiple evaluations of "tup" and "tid" arguments.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +585,56 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * now have a new tuple in the chain and this is no longer the last tuple of
+ * the chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller must have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+/*
+ * Get the root line pointer of the HOT chain. The caller should have confirmed
+ * that the root offset is cached before calling this macro.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * Return whether the tuple has a cached root offset. We don't use
+ * HeapTupleHeaderIsHeapLatest because that one also considers the case of
+ * t_ctid pointing to itself, for tuples migrated from pre v10 clusters. Here
+ * we are only interested in the tuples which are marked with HEAP_LATEST_TUPLE
+ * flag.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
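Taken together, the intended pattern for a chain walker under the new rules looks roughly like the sketch below (mirroring the heap_get_latest_tid and heap_lock_updated_tuple_rec hunks; htup, tid, next_tid, page and root_offnum are assumed to be local variables of the obvious types):

    if (!HeapTupleHeaderIsHeapLatest(htup, &tid))
    {
        /* Not the last version: t_ctid still points to the next tuple. */
        HeapTupleHeaderGetNextTid(htup, &next_tid);
    }
    else if (HeapTupleHeaderHasRootOffset(htup))
    {
        /* Last version with HEAP_LATEST_TUPLE set: t_ctid carries the root offset. */
        root_offnum = HeapTupleHeaderGetRootOffset(htup);
    }
    else
    {
        /*
         * Last version written before this patch (or pg_upgraded): the root
         * offset is not cached, so it must be found by scanning the page.
         */
        root_offnum = heap_get_root_tuple(page, ItemPointerGetOffsetNumber(&tid));
    }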
Here's a rebased set of patches. This is the same Pavan posted; I only
fixed some whitespace and a trivial conflict in indexam.c, per 9b88f27cb42f.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 8, 2017 at 12:00 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's a rebased set of patches. This is the same Pavan posted; I only
fixed some whitespace and a trivial conflict in indexam.c, per 9b88f27cb42f.
No attachments.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alvaro Herrera wrote:
Here's a rebased set of patches. This is the same Pavan posted; I only
fixed some whitespace and a trivial conflict in indexam.c, per 9b88f27cb42f.
Jaime noted that I forgot the attachments. Here they are
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-interesting-attrs-v16.patch (text/plain; charset=us-ascii)
From ba96dd9053eaf326fb6fa28cf80dcc28daa5551d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:48:13 -0300
Subject: [PATCH 1/6] interesting attrs v16
---
src/backend/access/heap/heapam.c | 178 ++++++++++++---------------------------
1 file changed, 53 insertions(+), 125 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index af25836..74fb09c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,11 +96,8 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
-static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
- Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
+ Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup);
static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
LockTupleMode mode, LockWaitPolicy wait_policy,
@@ -3455,6 +3452,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3472,9 +3471,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
pagefree;
bool have_tuple_lock = false;
bool iscombo;
- bool satisfies_hot;
- bool satisfies_key;
- bool satisfies_id;
bool use_hot_update = false;
bool key_intact;
bool all_visible_cleared = false;
@@ -3501,21 +3497,30 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
errmsg("cannot update tuples during a parallel operation")));
/*
- * Fetch the list of attributes to be checked for HOT update. This is
- * wasted effort if we fail to update or have to put the new tuple on a
- * different page. But we must compute the list before obtaining buffer
- * lock --- in the worst case, if we are doing an update on one of the
- * relevant system catalogs, we could deadlock if we try to fetch the list
- * later. In any case, the relcache caches the data so this is usually
- * pretty cheap.
+ * Fetch the list of attributes to be checked for various operations.
*
- * Note that we get a copy here, so we need not worry about relcache flush
- * happening midway through.
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are doing
+ * an update on one of the relevant system catalogs, we could deadlock if
+ * we try to fetch the list later. In any case, the relcache caches the
+ * data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity, the columns that
+ * are considered the "key" of rows in the table, and columns that are
+ * part of indirect indexes.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ interesting_attrs = bms_add_members(NULL, hot_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
+
block = ItemPointerGetBlockNumber(otid);
buffer = ReadBuffer(relation, block);
@@ -3536,7 +3541,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(ItemIdIsNormal(lp));
/*
- * Fill in enough data in oldtup for HeapSatisfiesHOTandKeyUpdate to work
+ * Fill in enough data in oldtup for HeapDetermineModifiedColumns to work
* properly.
*/
oldtup.t_tableOid = RelationGetRelid(relation);
@@ -3562,6 +3567,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
}
+ /* Determine columns modified by the update. */
+ modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+ &oldtup, newtup);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3573,10 +3582,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* is updates that don't manipulate key columns, not those that
* serendipitously arrive at the same key values.
*/
- HeapSatisfiesHOTandKeyUpdate(relation, hot_attrs, key_attrs, id_attrs,
- &satisfies_hot, &satisfies_key,
- &satisfies_id, &oldtup, newtup);
- if (satisfies_key)
+ if (!bms_overlap(modified_attrs, key_attrs))
{
*lockmode = LockTupleNoKeyExclusive;
mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3815,6 +3821,8 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return result;
}
@@ -4119,7 +4127,7 @@ l2:
* to do a HOT update. Check if any of the index columns have been
* changed. If not, then HOT update is possible.
*/
- if (satisfies_hot)
+ if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
}
else
@@ -4134,7 +4142,9 @@ l2:
* ExtractReplicaIdentity() will return NULL if nothing needs to be
* logged.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, !satisfies_id, &old_key_copied);
+ old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
+ bms_overlap(modified_attrs, id_attrs),
+ &old_key_copied);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4282,13 +4292,15 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(modified_attrs);
+ bms_free(interesting_attrs);
return HeapTupleMayBeUpdated;
}
/*
* Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTandKeyUpdate.
+ * Subroutine for HeapDetermineModifiedColumns.
*/
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
@@ -4322,7 +4334,7 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
* a single heap_deform_tuple call on each tuple, instead? But that
* doesn't work for system columns ...
*/
@@ -4367,114 +4379,30 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
/*
* Check which columns are being updated.
*
- * This simultaneously checks conditions for HOT updates, for FOR KEY
- * SHARE updates, and REPLICA IDENTITY concerns. Since much of the time they
- * will be checking very similar sets of columns, and doing the same tests on
- * them, it makes sense to optimize and do them together.
+ * Given an updated tuple, determine (and return into the output bitmapset),
+ * from those listed as interesting, the set of columns that changed.
*
- * We receive three bitmapsets comprising the three sets of columns we're
- * interested in. Note these are destructively modified; that is OK since
- * this is invoked at most once in heap_update.
- *
- * hot_result is set to TRUE if it's okay to do a HOT update (i.e. it does not
- * modified indexed columns); key_result is set to TRUE if the update does not
- * modify columns used in the key; id_result is set to TRUE if the update does
- * not modify columns in any index marked as the REPLICA IDENTITY.
+ * The input bitmapset is destructively modified; that is OK since this is
+ * invoked at most once in heap_update.
*/
-static void
-HeapSatisfiesHOTandKeyUpdate(Relation relation, Bitmapset *hot_attrs,
- Bitmapset *key_attrs, Bitmapset *id_attrs,
- bool *satisfies_hot, bool *satisfies_key,
- bool *satisfies_id,
+static Bitmapset *
+HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
HeapTuple oldtup, HeapTuple newtup)
{
- int next_hot_attnum;
- int next_key_attnum;
- int next_id_attnum;
- bool hot_result = true;
- bool key_result = true;
- bool id_result = true;
+ int attnum;
+ Bitmapset *modified = NULL;
- /* If REPLICA IDENTITY is set to FULL, id_attrs will be empty. */
- Assert(bms_is_subset(id_attrs, key_attrs));
- Assert(bms_is_subset(key_attrs, hot_attrs));
-
- /*
- * If one of these sets contains no remaining bits, bms_first_member will
- * return -1, and after adding FirstLowInvalidHeapAttributeNumber (which
- * is negative!) we'll get an attribute number that can't possibly be
- * real, and thus won't match any actual attribute number.
- */
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
-
- for (;;)
+ while ((attnum = bms_first_member(interesting_cols)) >= 0)
{
- bool changed;
- int check_now;
+ attnum += FirstLowInvalidHeapAttributeNumber;
- /*
- * Since the HOT attributes are a superset of the key attributes and
- * the key attributes are a superset of the id attributes, this logic
- * is guaranteed to identify the next column that needs to be checked.
- */
- if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_hot_attnum;
- else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_key_attnum;
- else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
- check_now = next_id_attnum;
- else
- break;
-
- /* See whether it changed. */
- changed = !heap_tuple_attr_equals(RelationGetDescr(relation),
- check_now, oldtup, newtup);
- if (changed)
- {
- if (check_now == next_hot_attnum)
- hot_result = false;
- if (check_now == next_key_attnum)
- key_result = false;
- if (check_now == next_id_attnum)
- id_result = false;
-
- /* if all are false now, we can stop checking */
- if (!hot_result && !key_result && !id_result)
- break;
- }
-
- /*
- * Advance the next attribute numbers for the sets that contain the
- * attribute we just checked. As we work our way through the columns,
- * the next_attnum values will rise; but when each set becomes empty,
- * bms_first_member() will return -1 and the attribute number will end
- * up with a value less than FirstLowInvalidHeapAttributeNumber.
- */
- if (hot_result && check_now == next_hot_attnum)
- {
- next_hot_attnum = bms_first_member(hot_attrs);
- next_hot_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (key_result && check_now == next_key_attnum)
- {
- next_key_attnum = bms_first_member(key_attrs);
- next_key_attnum += FirstLowInvalidHeapAttributeNumber;
- }
- if (id_result && check_now == next_id_attnum)
- {
- next_id_attnum = bms_first_member(id_attrs);
- next_id_attnum += FirstLowInvalidHeapAttributeNumber;
- }
+ if (!heap_tuple_attr_equals(RelationGetDescr(relation),
+ attnum, oldtup, newtup))
+ modified = bms_add_member(modified,
+ attnum - FirstLowInvalidHeapAttributeNumber);
}
- *satisfies_hot = hot_result;
- *satisfies_key = key_result;
- *satisfies_id = id_result;
+ return modified;
}
/*
--
2.1.4
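The net effect of this refactoring, as used by the heap_update hunks, is that the three boolean outputs become simple bitmapset overlap tests against the returned set. A caller-side sketch (fragment only, assuming the patched heapam.c):

    /* Which of the interesting columns did this update actually change? */
    modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
                                                  &oldtup, newtup);

    /* No key column changed: a weaker tuple lock is sufficient. */
    if (!bms_overlap(modified_attrs, key_attrs))
        *lockmode = LockTupleNoKeyExclusive;

    /* No indexed column changed: the update can be done as HOT. */
    if (!bms_overlap(modified_attrs, hot_attrs))
        use_hot_update = true;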
0002-track-root-lp-v16.patch (text/plain; charset=us-ascii)
From 6c4c004a2a7f5f269dc33942f7c397fe962c8685 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:48:33 -0300
Subject: [PATCH 2/6] track root lp v16
---
src/backend/access/heap/heapam.c | 209 ++++++++++++++++++++++++++++------
src/backend/access/heap/hio.c | 25 +++-
src/backend/access/heap/pruneheap.c | 126 ++++++++++++++++++--
src/backend/access/heap/rewriteheap.c | 21 +++-
src/backend/executor/execIndexing.c | 3 +-
src/backend/executor/execMain.c | 4 +-
src/include/access/heapam.h | 1 +
src/include/access/heapam_xlog.h | 4 +-
src/include/access/hio.h | 4 +-
src/include/access/htup_details.h | 97 +++++++++++++++-
10 files changed, 428 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74fb09c..93cde9a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -94,7 +94,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
- HeapTuple newtup, HeapTuple old_key_tup,
+ HeapTuple newtup, OffsetNumber root_offnum,
+ HeapTuple old_key_tup,
bool all_visible_cleared, bool new_all_visible_cleared);
static Bitmapset *HeapDetermineModifiedColumns(Relation relation,
Bitmapset *interesting_cols,
@@ -2248,13 +2249,13 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ HeapTupleHeaderIsHeapLatest(tp.t_data, &ctid))
{
UnlockReleaseBuffer(buffer);
break;
}
- ctid = tp.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tp.t_data, &ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
@@ -2385,6 +2386,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ OffsetNumber root_offnum;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2423,8 +2425,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
- RelationPutHeapTuple(relation, buffer, heaptup,
- (options & HEAP_INSERT_SPECULATIVE) != 0);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup,
+ (options & HEAP_INSERT_SPECULATIVE) != 0,
+ InvalidOffsetNumber);
+
+ /* We must not overwrite the speculative insertion token. */
+ if ((options & HEAP_INSERT_SPECULATIVE) == 0)
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
@@ -2652,6 +2659,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ OffsetNumber root_offnum;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2722,7 +2730,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* RelationGetBufferForTuple has ensured that the first tuple fits.
* Put that on the page, and then as many other tuples as fit.
*/
- RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false,
+ InvalidOffsetNumber);
+
+ /* Mark this tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptuples[ndone]->t_data, root_offnum);
+
for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
{
HeapTuple heaptup = heaptuples[ndone + nthispage];
@@ -2730,7 +2743,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
break;
- RelationPutHeapTuple(relation, buffer, heaptup, false);
+ root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
+ InvalidOffsetNumber);
+ /* Mark each tuple as the latest and also set root offset. */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
/*
* We don't use heap_multi_insert for catalog tuples yet, but
@@ -3002,6 +3018,7 @@ heap_delete(Relation relation, ItemPointer tid,
HeapTupleData tp;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
TransactionId new_xmax;
@@ -3012,6 +3029,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool all_visible_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
+ OffsetNumber root_offnum;
Assert(ItemPointerIsValid(tid));
@@ -3053,7 +3071,8 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
- lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
+ offnum = ItemPointerGetOffsetNumber(tid);
+ lp = PageGetItemId(page, offnum);
Assert(ItemIdIsNormal(lp));
tp.t_tableOid = RelationGetRelid(relation);
@@ -3183,7 +3202,17 @@ l1:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tp.t_data->t_ctid;
+
+ /*
+ * If we're at the end of the chain, then just return the same TID back
+ * to the caller. The caller uses that as a hint to know if we have hit
+ * the end of the chain.
+ */
+ if (!HeapTupleHeaderIsHeapLatest(tp.t_data, &tp.t_self))
+ HeapTupleHeaderGetNextTid(tp.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&tp.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
@@ -3232,6 +3261,22 @@ l1:
xid, LockTupleExclusive, true,
&new_xmax, &new_infomask, &new_infomask2);
+ /*
+ * heap_get_root_tuple() may call palloc, which is disallowed once we
+ * enter the critical section. So check whether the root offset is cached
+ * in the tuple and, if not, fetch that information the hard way before
+ * entering the critical section.
+ *
+ * Unless we are dealing with a pg_upgraded cluster, the root offset
+ * information should usually be cached, so fetching it adds little
+ * overhead. Also, once a tuple is updated, the information is copied to
+ * the new version, so it's not as if we pay this price forever.
+ */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tp.t_self));
+
START_CRIT_SECTION();
/*
@@ -3259,8 +3304,10 @@ l1:
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
- /* Make sure there is no forward chain link in t_ctid */
- tp.t_data->t_ctid = tp.t_self;
+
+ /* Mark this tuple as the latest tuple in the update chain. */
+ if (!HeapTupleHeaderHasRootOffset(tp.t_data))
+ HeapTupleHeaderSetHeapLatest(tp.t_data, root_offnum);
MarkBufferDirty(buffer);
@@ -3461,6 +3508,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool old_key_copied = false;
Page page;
BlockNumber block;
+ OffsetNumber offnum;
+ OffsetNumber root_offnum;
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
@@ -3523,6 +3572,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
block = ItemPointerGetBlockNumber(otid);
+ offnum = ItemPointerGetOffsetNumber(otid);
buffer = ReadBuffer(relation, block);
page = BufferGetPage(buffer);
@@ -3807,7 +3857,12 @@ l2:
result == HeapTupleUpdated ||
result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = oldtup.t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(oldtup.t_data, &oldtup.t_self))
+ HeapTupleHeaderGetNextTid(oldtup.t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(&oldtup.t_self, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
@@ -3947,6 +4002,7 @@ l2:
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3974,6 +4030,14 @@ l2:
Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
+ /*
+ * Fetch root offset before entering the critical section. We do this
+ * only if the information is not already available.
+ */
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&oldtup.t_self));
+
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
@@ -3988,7 +4052,8 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* temporarily make it look not-updated, but locked */
- oldtup.t_data->t_ctid = oldtup.t_self;
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ HeapTupleHeaderSetHeapLatest(oldtup.t_data, root_offnum);
/*
* Clear all-frozen bit on visibility map if needed. We could
@@ -4146,6 +4211,10 @@ l2:
bms_overlap(modified_attrs, id_attrs),
&old_key_copied);
+ if (!HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -4171,6 +4240,17 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+ /*
+ * For HOT (or WARM) updated tuples, we store the offset of the root
+ * line pointer of this chain in the ip_posid field of the new tuple's
+ * t_ctid. Usually this information is available in the corresponding
+ * field of the old tuple. But for aborted updates or pg_upgraded
+ * databases, we might be seeing old-style CTID chains, and then the
+ * information must be obtained the hard way (we should have done
+ * that before entering the critical section above).
+ */
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
else
{
@@ -4178,10 +4258,22 @@ l2:
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ root_offnum = InvalidOffsetNumber;
}
- RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
-
+ /* insert new tuple */
+ root_offnum = RelationPutHeapTuple(relation, newbuf, heaptup, false,
+ root_offnum);
+ /*
+ * Also mark both copies as latest and set the root offset information. If
+ * we're doing a HOT/WARM update, we copy the information from the old
+ * tuple, either cached there or computed above. For regular updates,
+ * RelationPutHeapTuple has returned the offset number at which the new
+ * version was inserted, and we store that value since the update starts a
+ * new HOT chain.
+ */
+ HeapTupleHeaderSetHeapLatest(heaptup->t_data, root_offnum);
+ HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
@@ -4194,7 +4286,7 @@ l2:
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
/* record address of new tuple in t_ctid of old one */
- oldtup.t_data->t_ctid = heaptup->t_self;
+ HeapTupleHeaderSetNextTid(oldtup.t_data, &(heaptup->t_self));
/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
@@ -4233,6 +4325,7 @@ l2:
recptr = log_heap_update(relation, buffer,
newbuf, &oldtup, heaptup,
+ root_offnum,
old_key_tuple,
all_visible_cleared,
all_visible_cleared_new);
@@ -4513,7 +4606,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemId lp;
Page page;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber block;
+ BlockNumber block;
+ OffsetNumber offnum;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4522,9 +4616,11 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool first_time = true;
bool have_tuple_lock = false;
bool cleared_all_frozen = false;
+ OffsetNumber root_offnum;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
block = ItemPointerGetBlockNumber(tid);
+ offnum = ItemPointerGetOffsetNumber(tid);
/*
* Before locking the buffer, pin the visibility map page if it appears to
@@ -4544,6 +4640,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
tuple->t_len = ItemIdGetLength(lp);
tuple->t_tableOid = RelationGetRelid(relation);
+ tuple->t_self = *tid;
l3:
result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
@@ -4571,7 +4668,11 @@ l3:
xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
infomask = tuple->t_data->t_infomask;
infomask2 = tuple->t_data->t_infomask2;
- ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &t_ctid);
+ else
+ ItemPointerCopy(tid, &t_ctid);
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
@@ -5009,7 +5110,12 @@ failed:
Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated ||
result == HeapTupleWouldBlock);
Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
- hufd->ctid = tuple->t_data->t_ctid;
+
+ if (!HeapTupleHeaderIsHeapLatest(tuple->t_data, tid))
+ HeapTupleHeaderGetNextTid(tuple->t_data, &hufd->ctid);
+ else
+ ItemPointerCopy(tid, &hufd->ctid);
+
hufd->xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
if (result == HeapTupleSelfUpdated)
hufd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
@@ -5057,6 +5163,10 @@ failed:
GetCurrentTransactionId(), mode, false,
&xid, &new_infomask, &new_infomask2);
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&tuple->t_self));
+
START_CRIT_SECTION();
/*
@@ -5085,7 +5195,10 @@ failed:
* the tuple as well.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
- tuple->t_data->t_ctid = *tid;
+ {
+ if (!HeapTupleHeaderHasRootOffset(tuple->t_data))
+ HeapTupleHeaderSetHeapLatest(tuple->t_data, root_offnum);
+ }
/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
@@ -5599,6 +5712,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
bool cleared_all_frozen = false;
Buffer vmbuffer = InvalidBuffer;
BlockNumber block;
+ OffsetNumber offnum;
ItemPointerCopy(tid, &tupid);
@@ -5607,6 +5721,8 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
new_infomask = 0;
new_xmax = InvalidTransactionId;
block = ItemPointerGetBlockNumber(&tupid);
+ offnum = ItemPointerGetOffsetNumber(&tupid);
+
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5836,7 +5952,7 @@ l4:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
- ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ HeapTupleHeaderIsHeapLatest(mytup.t_data, &mytup.t_self) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -5845,7 +5961,7 @@ l4:
/* tail recursion */
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
- ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
+ HeapTupleHeaderGetNextTid(mytup.t_data, &tupid);
UnlockReleaseBuffer(buf);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
@@ -5962,7 +6078,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
* Replace the speculative insertion token with a real t_ctid, pointing to
* itself like it does on regular tuples.
*/
- htup->t_ctid = tuple->t_self;
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
/* XLOG stuff */
if (RelationNeedsWAL(relation))
@@ -6088,8 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
HeapTupleHeaderSetXmin(tp.t_data, InvalidTransactionId);
/* Clear the speculative insertion token too */
- tp.t_data->t_ctid = tp.t_self;
-
+ HeapTupleHeaderSetHeapLatest(tp.t_data, ItemPointerGetOffsetNumber(tid));
MarkBufferDirty(buffer);
/*
@@ -7437,6 +7552,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
+ OffsetNumber root_offnum,
HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared)
{
@@ -7557,6 +7673,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ Assert(OffsetNumberIsValid(root_offnum));
+ xlrec.root_offnum = root_offnum;
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
@@ -8211,7 +8330,13 @@ heap_xlog_delete(XLogReaderState *record)
PageClearAllVisible(page);
/* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, xlrec->offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8301,7 +8426,8 @@ heap_xlog_insert(XLogReaderState *record)
htup->t_hoff = xlhdr.t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- htup->t_ctid = target_tid;
+
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->offnum);
if (PageAddItem(page, (Item) htup, newlen, xlrec->offnum,
true, true) == InvalidOffsetNumber)
@@ -8436,8 +8562,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
htup->t_hoff = xlhdr->t_hoff;
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
- ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
- ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
@@ -8573,7 +8699,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
/* Set forward chain link in t_ctid */
- htup->t_ctid = newtid;
+ HeapTupleHeaderSetNextTid(htup, &newtid);
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, XLogRecGetXid(record));
@@ -8706,13 +8832,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetXmin(htup, XLogRecGetXid(record));
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = newtid;
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
+ /*
+ * Make sure the tuple is marked as the latest and that the root offset
+ * information is restored.
+ */
+ HeapTupleHeaderSetHeapLatest(htup, xlrec->root_offnum);
+
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
@@ -8775,6 +8905,9 @@ heap_xlog_confirm(XLogReaderState *record)
*/
ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum);
+ /* For newly inserted tuple, set root offset to itself. */
+ HeapTupleHeaderSetHeapLatest(htup, offnum);
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -8838,11 +8971,17 @@ heap_xlog_lock(XLogReaderState *record)
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
+ ItemPointerData target_tid;
+
+ ItemPointerSet(&target_tid, BufferGetBlockNumber(buffer), offnum);
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
- ItemPointerSet(&htup->t_ctid,
- BufferGetBlockNumber(buffer),
- offnum);
+ if (!HeapTupleHeaderHasRootOffset(htup))
+ {
+ OffsetNumber root_offnum;
+ root_offnum = heap_get_root_tuple(page, offnum);
+ HeapTupleHeaderSetHeapLatest(htup, root_offnum);
+ }
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..8052519 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -31,12 +31,20 @@
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
+ *
+ * The caller can optionally tell us to set the root offset to the given value.
+ * Otherwise, the root offset is set to the offset of the new location once it
+ * is known. The former is used while updating an existing tuple, where the
+ * caller tells us the root line pointer of the chain. The latter is used
+ * during insertion of a new row, hence the root line pointer is set to the
+ * offset at which this tuple is inserted.
*/
-void
+OffsetNumber
RelationPutHeapTuple(Relation relation,
Buffer buffer,
HeapTuple tuple,
- bool token)
+ bool token,
+ OffsetNumber root_offnum)
{
Page pageHeader;
OffsetNumber offnum;
@@ -60,17 +68,24 @@ RelationPutHeapTuple(Relation relation,
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
/*
- * Insert the correct position into CTID of the stored tuple, too (unless
- * this is a speculative insertion, in which case the token is held in
- * CTID field instead)
+ * Set the block number and the root offset in the CTID of the stored tuple,
+ * too (unless this is a speculative insertion, in which case the token is
+ * held in the CTID field instead).
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
Item item = PageGetItem(pageHeader, itemId);
+ /* Copy t_ctid to set the correct block number. */
((HeapTupleHeader) item)->t_ctid = tuple->t_self;
+
+ if (!OffsetNumberIsValid(root_offnum))
+ root_offnum = offnum;
+ HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item, root_offnum);
}
+
+ return root_offnum;
}
/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..f54337c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -55,6 +55,8 @@ static void heap_prune_record_redirect(PruneState *prstate,
static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum);
static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
+static void heap_get_root_tuples_internal(Page page,
+ OffsetNumber target_offnum, OffsetNumber *root_offsets);
/*
* Optionally prune and repair fragmentation in the specified page.
@@ -553,6 +555,17 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+
+ /*
+ * If the tuple was HOT-updated and the update was later
+ * aborted, someone could mark this tuple as the last tuple
+ * in the chain without clearing the HOT-updated flag. So we must
+ * check whether this is the last tuple in the chain and stop following
+ * the CTID, else we risk falling into an infinite loop (though
+ * prstate->marked[] currently protects against that).
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
/*
* Advance to next chain member.
*/
@@ -726,27 +739,47 @@ heap_page_prune_execute(Buffer buffer,
/*
- * For all items in this page, find their respective root line pointers.
- * If item k is part of a HOT-chain with root at item j, then we set
- * root_offsets[k - 1] = j.
+ * Find the root line pointers, either for all items on this page or just for
+ * the given item.
*
- * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * When target_offnum is a valid offset number, the caller is interested in
+ * just one item. In that case, the root line pointer is returned in
+ * root_offsets.
+ *
+ * When target_offnum is InvalidOffsetNumber, the caller wants to know the
+ * root line pointers of all the items on this page. The root_offsets array
+ * must have MaxHeapTuplesPerPage entries in that case. If item k is part of a
+ * HOT-chain with root at item j, then we set root_offsets[k - 1] = j. We zero
+ * out all unused entries.
*
* The function must be called with at least share lock on the buffer, to
* prevent concurrent prune operations.
*
+ * This is not a cheap function since it must scan through all line pointers
+ * and tuples on the page in order to find the root line pointers. To minimize
+ * the cost, we break early if target_offnum is specified and the root line
+ * pointer for target_offnum has been found.
+ *
* Note: The information collected here is valid only as long as the caller
* holds a pin on the buffer. Once pin is released, a tuple might be pruned
* and reused by a completely unrelated tuple.
+ *
+ * Note: This function must not be called inside a critical section because it
+ * internally calls HeapTupleHeaderGetUpdateXid which somewhere down the stack
+ * may try to allocate heap memory. Memory allocation is disallowed in a
+ * critical section.
*/
-void
-heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+static void
+heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
+ OffsetNumber *root_offsets)
{
OffsetNumber offnum,
maxoff;
- MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+ if (OffsetNumberIsValid(target_offnum))
+ *root_offsets = InvalidOffsetNumber;
+ else
+ MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
@@ -774,9 +807,28 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
/*
* This is either a plain tuple or the root of a HOT-chain.
- * Remember it in the mapping.
+ *
+ * If the target_offnum is specified and if we found its mapping,
+ * return.
*/
- root_offsets[offnum - 1] = offnum;
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (target_offnum == offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember the mapping for any other item. The
+ * root_offsets array may not even have space for them, so be
+ * careful not to write past the end of the array.
+ */
+ }
+ else
+ {
+ /* Remember it in the mapping. */
+ root_offsets[offnum - 1] = offnum;
+ }
/* If it's not the start of a HOT-chain, we're done with it */
if (!HeapTupleHeaderIsHotUpdated(htup))
@@ -817,15 +869,65 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;
- /* Remember the root line pointer for this item */
- root_offsets[nextoffnum - 1] = offnum;
+ /*
+ * If target_offnum is specified and we found its mapping, return.
+ */
+ if (OffsetNumberIsValid(target_offnum))
+ {
+ if (nextoffnum == target_offnum)
+ {
+ root_offsets[0] = offnum;
+ return;
+ }
+ /*
+ * No need to remember the mapping for any other item. The
+ * root_offsets array may not even have space for them, so be
+ * careful not to write past the end of the array.
+ */
+ }
+ else
+ {
+ /* Remember the root line pointer for this item. */
+ root_offsets[nextoffnum - 1] = offnum;
+ }
/* Advance to next chain member, if any */
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /*
+ * If the tuple was HOT-updated and the update was later aborted,
+ * someone could mark this tuple as the last tuple in the chain
+ * and store the root offset in its CTID, without clearing the
+ * HOT-updated flag. So we must check whether the CTID actually holds
+ * the root offset, and break to avoid an infinite loop.
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
}
}
+
+/*
+ * Get root line pointer for the given tuple.
+ */
+OffsetNumber
+heap_get_root_tuple(Page page, OffsetNumber target_offnum)
+{
+ OffsetNumber offnum = InvalidOffsetNumber;
+ heap_get_root_tuples_internal(page, target_offnum, &offnum);
+ return offnum;
+}
+
+/*
+ * Get root line pointers for all tuples in the page
+ */
+void
+heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
+{
+ heap_get_root_tuples_internal(page, InvalidOffsetNumber, root_offsets);
+}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c7b283c..0792971 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -419,14 +419,18 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
- !(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ !(HeapTupleHeaderIsHeapLatest(old_tuple->t_data, &old_tuple->t_self)))
{
OldToNewMapping mapping;
memset(&hashkey, 0, sizeof(hashkey));
hashkey.xmin = HeapTupleHeaderGetUpdateXid(old_tuple->t_data);
- hashkey.tid = old_tuple->t_data->t_ctid;
+
+ /*
+ * We've already checked that this is not the last tuple in the chain,
+ * so fetch the next TID in the chain.
+ */
+ HeapTupleHeaderGetNextTid(old_tuple->t_data, &hashkey.tid);
mapping = (OldToNewMapping)
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -439,7 +443,7 @@ rewrite_heap_tuple(RewriteState state,
* set the ctid of this tuple to point to the new location, and
* insert it right away.
*/
- new_tuple->t_data->t_ctid = mapping->new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);
/* We don't need the mapping entry anymore */
hash_search(state->rs_old_new_tid_map, &hashkey,
@@ -525,7 +529,7 @@ rewrite_heap_tuple(RewriteState state,
new_tuple = unresolved->tuple;
free_new = true;
old_tid = unresolved->old_tid;
- new_tuple->t_data->t_ctid = new_tid;
+ HeapTupleHeaderSetNextTid(new_tuple->t_data, &new_tid);
/*
* We don't need the hash entry anymore, but don't free its
@@ -731,7 +735,12 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
newitemid = PageGetItemId(page, newoff);
onpage_tup = (HeapTupleHeader) PageGetItem(page, newitemid);
- onpage_tup->t_ctid = tup->t_self;
+ /*
+ * Set t_ctid just to ensure that block number is copied correctly, but
+ * then immediately mark the tuple as the latest.
+ */
+ HeapTupleHeaderSetNextTid(onpage_tup, &tup->t_self);
+ HeapTupleHeaderSetHeapLatest(onpage_tup, newoff);
}
/* If heaptup is a private copy, release it. */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 5242dee..2142273 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -789,7 +789,8 @@ retry:
DirtySnapshot.speculativeToken &&
TransactionIdPrecedes(GetCurrentTransactionId(), xwait))))
{
- ctid_wait = tup->t_data->t_ctid;
+ if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
+ HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f5cd65d..44a501f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2592,7 +2592,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIsHeapLatest(tuple.t_data, &tuple.t_self))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
@@ -2600,7 +2600,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
}
/* updated, so look at the updated row */
- tuple.t_self = tuple.t_data->t_ctid;
+ HeapTupleHeaderGetNextTid(tuple.t_data, &tuple.t_self);
/* updated row should have xmin matching this xmax */
priorXmax = HeapTupleHeaderGetUpdateXid(tuple.t_data);
ReleaseBuffer(buffer);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a864f78..95aa976 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -189,6 +189,7 @@ extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused);
+extern OffsetNumber heap_get_root_tuple(Page page, OffsetNumber target_offnum);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index b285f17..e6019d5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -193,6 +193,8 @@ typedef struct xl_heap_update
uint8 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
+ OffsetNumber root_offnum; /* offset of the root line pointer in case of
+ HOT or WARM update */
/*
* If XLOG_HEAP_CONTAINS_OLD_TUPLE or XLOG_HEAP_CONTAINS_OLD_KEY flags are
@@ -200,7 +202,7 @@ typedef struct xl_heap_update
*/
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, root_offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 2824f23..921cb37 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -35,8 +35,8 @@ typedef struct BulkInsertStateData
} BulkInsertStateData;
-extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
- HeapTuple tuple, bool token);
+extern OffsetNumber RelationPutHeapTuple(Relation relation, Buffer buffer,
+ HeapTuple tuple, bool token, OffsetNumber root_offnum);
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index a6c7e31..7552186 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,13 +260,19 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+/* bit 0x0800 is available */
+#define HEAP_LATEST_TUPLE 0x1000 /*
+ * This is the last tuple in the chain
+ * and its t_ctid's ip_posid stores the
+ * offset of the root line pointer
+ */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
@@ -504,6 +510,43 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+/*
+ * Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
+ * the TID of the tuple itself in the t_ctid field to mark the end of the
+ * chain. Starting with PG v10, we instead use the special flag
+ * HEAP_LATEST_TUPLE to identify the last tuple and store the root line pointer
+ * of the HOT chain in the t_ctid field.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)
+
+#define HeapTupleHeaderClearHeapLatest(tup) \
+( \
+ (tup)->t_infomask2 &= ~HEAP_LATEST_TUPLE \
+)
+
+/*
+ * Starting from PostgreSQL 10, the latest tuple in an update chain has
+ * HEAP_LATEST_TUPLE set; but tuples upgraded from earlier versions do not.
+ * For those, we determine whether a tuple is latest by testing that its t_ctid
+ * points to itself.
+ *
+ * Note: beware of multiple evaluations of "tup" and "tid" arguments.
+ */
+#define HeapTupleHeaderIsHeapLatest(tup, tid) \
+( \
+ (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
+ ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(tid)) && \
+ (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(tid))) \
+)
+
+
#define HeapTupleHeaderSetHeapOnly(tup) \
( \
(tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
@@ -542,6 +585,56 @@ do { \
/*
+ * Set the t_ctid chain and also clear the HEAP_LATEST_TUPLE flag since we
+ * now have a new tuple in the chain and this is no longer the last tuple of
+ * the chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderSetNextTid(tup, tid) \
+do { \
+ ItemPointerCopy((tid), &((tup)->t_ctid)); \
+ HeapTupleHeaderClearHeapLatest((tup)); \
+} while (0)
+
+/*
+ * Get TID of next tuple in the update chain. Caller must have checked that
+ * we are not already at the end of the chain because in that case t_ctid may
+ * actually store the root line pointer of the HOT chain.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
+do { \
+ AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
+ ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
+} while (0)
+
+/*
+ * Get the root line pointer of the HOT chain. The caller should have confirmed
+ * that the root offset is cached before calling this macro.
+ *
+ * Note: beware of multiple evaluations of "tup" argument.
+ */
+#define HeapTupleHeaderGetRootOffset(tup) \
+( \
+ AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \
+ ItemPointerGetOffsetNumber(&(tup)->t_ctid) \
+)
+
+/*
+ * Return whether the tuple has a cached root offset. We don't use
+ * HeapTupleHeaderIsHeapLatest because that one also considers the case of
+ * t_ctid pointing to itself, for tuples migrated from pre-v10 clusters. Here
+ * we are only interested in tuples marked with the HEAP_LATEST_TUPLE flag.
+ */
+#define HeapTupleHeaderHasRootOffset(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
--
2.1.4
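
A short sketch (not part of the patch) of the pattern 0002 uses in
heap_delete(), heap_update() and heap_lock_tuple(): when the tuple already
carries HEAP_LATEST_TUPLE, the root line pointer is read straight from
t_ctid; otherwise (e.g. a tuple from a pg_upgraded cluster) it has to be
found by scanning the page with heap_get_root_tuple(), which may palloc and
therefore must run before entering the critical section. The helper name
below is made up purely for illustration:

    static OffsetNumber
    fetch_root_offnum_sketch(Page page, HeapTupleHeader htup, ItemPointer self)
    {
        /* Fast path: the root offset is cached in t_ctid's ip_posid. */
        if (HeapTupleHeaderHasRootOffset(htup))
            return HeapTupleHeaderGetRootOffset(htup);

        /*
         * Old-style tuple: walk the page's HOT chains to find the root.
         * Must not be called from inside a critical section.
         */
        return heap_get_root_tuple(page, ItemPointerGetOffsetNumber(self));
    }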
0003-clear-ip_posid-blkid-refs-v16.patch (text/plain; charset=us-ascii)
From 2621b6c0eea72452341994f1f7b5dce9ca17652a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:48:58 -0300
Subject: [PATCH 3/6] clear ip_posid/blkid refs v16
---
contrib/pageinspect/btreefuncs.c | 4 ++--
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/gin/ginget.c | 29 +++++++++++++++----------
src/backend/access/gin/ginpostinglist.c | 14 +++++-------
src/backend/replication/logical/reorderbuffer.c | 4 ++--
src/backend/storage/page/itemptr.c | 13 ++++++-----
src/backend/utils/adt/tid.c | 10 ++++-----
src/include/access/gin_private.h | 4 ++--
src/include/access/ginblock.h | 11 ++++++++--
src/include/access/htup_details.h | 2 +-
src/include/access/nbtree.h | 5 ++---
src/include/storage/itemptr.h | 12 ++++++++++
12 files changed, 65 insertions(+), 45 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index d50ec3a..2ec265e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -363,8 +363,8 @@ bt_page_items(PG_FUNCTION_ARGS)
j = 0;
values[j++] = psprintf("%d", uargs->offset);
values[j++] = psprintf("(%u,%u)",
- BlockIdGetBlockNumber(&(itup->t_tid.ip_blkid)),
- itup->t_tid.ip_posid);
+ ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
+ ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 06a1992..e65040d 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -353,7 +353,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
* heap_getnext may find no tuples on a given page, so we cannot
* simply examine the pages returned by the heap scan.
*/
- tupblock = BlockIdGetBlockNumber(&tuple->t_self.ip_blkid);
+ tupblock = ItemPointerGetBlockNumber(&tuple->t_self);
while (block <= tupblock)
{
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 87cd9ea..aa0b02f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -626,8 +626,9 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
}
else
{
- entry->btree.itemptr = advancePast;
- entry->btree.itemptr.ip_posid++;
+ ItemPointerSet(&entry->btree.itemptr,
+ GinItemPointerGetBlockNumber(&advancePast),
+ OffsetNumberNext(GinItemPointerGetOffsetNumber(&advancePast)));
}
entry->btree.fullScan = false;
stack = ginFindLeafPage(&entry->btree, true, snapshot);
@@ -979,15 +980,17 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
if (GinItemPointerGetBlockNumber(&advancePast) <
GinItemPointerGetBlockNumber(&minItem))
{
- advancePast.ip_blkid = minItem.ip_blkid;
- advancePast.ip_posid = 0;
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&minItem),
+ InvalidOffsetNumber);
}
}
else
{
- Assert(minItem.ip_posid > 0);
- advancePast = minItem;
- advancePast.ip_posid--;
+ Assert(GinItemPointerGetOffsetNumber(&minItem) > 0);
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&minItem),
+ OffsetNumberPrev(GinItemPointerGetOffsetNumber(&minItem)));
}
/*
@@ -1245,15 +1248,17 @@ scanGetItem(IndexScanDesc scan, ItemPointerData advancePast,
if (GinItemPointerGetBlockNumber(&advancePast) <
GinItemPointerGetBlockNumber(&key->curItem))
{
- advancePast.ip_blkid = key->curItem.ip_blkid;
- advancePast.ip_posid = 0;
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&key->curItem),
+ InvalidOffsetNumber);
}
}
else
{
- Assert(key->curItem.ip_posid > 0);
- advancePast = key->curItem;
- advancePast.ip_posid--;
+ Assert(GinItemPointerGetOffsetNumber(&key->curItem) > 0);
+ ItemPointerSet(&advancePast,
+ GinItemPointerGetBlockNumber(&key->curItem),
+ OffsetNumberPrev(GinItemPointerGetOffsetNumber(&key->curItem)));
}
/*
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 598069d..8d2d31a 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -79,13 +79,11 @@ itemptr_to_uint64(const ItemPointer iptr)
uint64 val;
Assert(ItemPointerIsValid(iptr));
- Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits));
+ Assert(GinItemPointerGetOffsetNumber(iptr) < (1 << MaxHeapTuplesPerPageBits));
- val = iptr->ip_blkid.bi_hi;
- val <<= 16;
- val |= iptr->ip_blkid.bi_lo;
+ val = GinItemPointerGetBlockNumber(iptr);
val <<= MaxHeapTuplesPerPageBits;
- val |= iptr->ip_posid;
+ val |= GinItemPointerGetOffsetNumber(iptr);
return val;
}
@@ -93,11 +91,9 @@ itemptr_to_uint64(const ItemPointer iptr)
static inline void
uint64_to_itemptr(uint64 val, ItemPointer iptr)
{
- iptr->ip_posid = val & ((1 << MaxHeapTuplesPerPageBits) - 1);
+ GinItemPointerSetOffsetNumber(iptr, val & ((1 << MaxHeapTuplesPerPageBits) - 1));
val = val >> MaxHeapTuplesPerPageBits;
- iptr->ip_blkid.bi_lo = val & 0xFFFF;
- val = val >> 16;
- iptr->ip_blkid.bi_hi = val & 0xFFFF;
+ GinItemPointerSetBlockNumber(iptr, val);
Assert(ItemPointerIsValid(iptr));
}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8aac670..b6f8f5a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3006,8 +3006,8 @@ DisplayMapping(HTAB *tuplecid_data)
ent->key.relnode.dbNode,
ent->key.relnode.spcNode,
ent->key.relnode.relNode,
- BlockIdGetBlockNumber(&ent->key.tid.ip_blkid),
- ent->key.tid.ip_posid,
+ ItemPointerGetBlockNumber(&ent->key.tid),
+ ItemPointerGetOffsetNumber(&ent->key.tid),
ent->cmin,
ent->cmax
);
diff --git a/src/backend/storage/page/itemptr.c b/src/backend/storage/page/itemptr.c
index 703cbb9..28ac885 100644
--- a/src/backend/storage/page/itemptr.c
+++ b/src/backend/storage/page/itemptr.c
@@ -54,18 +54,21 @@ ItemPointerCompare(ItemPointer arg1, ItemPointer arg2)
/*
* Don't use ItemPointerGetBlockNumber or ItemPointerGetOffsetNumber here,
* because they assert ip_posid != 0 which might not be true for a
- * user-supplied TID.
+ * user-supplied TID. Instead we use ItemPointerGetBlockNumberNoCheck and
+ * ItemPointerGetOffsetNumberNoCheck which do not do any validation.
*/
- BlockNumber b1 = BlockIdGetBlockNumber(&(arg1->ip_blkid));
- BlockNumber b2 = BlockIdGetBlockNumber(&(arg2->ip_blkid));
+ BlockNumber b1 = ItemPointerGetBlockNumberNoCheck(arg1);
+ BlockNumber b2 = ItemPointerGetBlockNumberNoCheck(arg2);
if (b1 < b2)
return -1;
else if (b1 > b2)
return 1;
- else if (arg1->ip_posid < arg2->ip_posid)
+ else if (ItemPointerGetOffsetNumberNoCheck(arg1) <
+ ItemPointerGetOffsetNumberNoCheck(arg2))
return -1;
- else if (arg1->ip_posid > arg2->ip_posid)
+ else if (ItemPointerGetOffsetNumberNoCheck(arg1) >
+ ItemPointerGetOffsetNumberNoCheck(arg2))
return 1;
else
return 0;
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index a3b372f..735c006 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -109,8 +109,8 @@ tidout(PG_FUNCTION_ARGS)
OffsetNumber offsetNumber;
char buf[32];
- blockNumber = BlockIdGetBlockNumber(&(itemPtr->ip_blkid));
- offsetNumber = itemPtr->ip_posid;
+ blockNumber = ItemPointerGetBlockNumberNoCheck(itemPtr);
+ offsetNumber = ItemPointerGetOffsetNumberNoCheck(itemPtr);
/* Perhaps someday we should output this as a record. */
snprintf(buf, sizeof(buf), "(%u,%u)", blockNumber, offsetNumber);
@@ -146,14 +146,12 @@ Datum
tidsend(PG_FUNCTION_ARGS)
{
ItemPointer itemPtr = PG_GETARG_ITEMPOINTER(0);
- BlockId blockId;
BlockNumber blockNumber;
OffsetNumber offsetNumber;
StringInfoData buf;
- blockId = &(itemPtr->ip_blkid);
- blockNumber = BlockIdGetBlockNumber(blockId);
- offsetNumber = itemPtr->ip_posid;
+ blockNumber = ItemPointerGetBlockNumberNoCheck(itemPtr);
+ offsetNumber = ItemPointerGetOffsetNumberNoCheck(itemPtr);
pq_begintypsend(&buf);
pq_sendint(&buf, blockNumber, sizeof(blockNumber));
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 34e7339..2fd4479 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -460,8 +460,8 @@ extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
static inline int
ginCompareItemPointers(ItemPointer a, ItemPointer b)
{
- uint64 ia = (uint64) a->ip_blkid.bi_hi << 32 | (uint64) a->ip_blkid.bi_lo << 16 | a->ip_posid;
- uint64 ib = (uint64) b->ip_blkid.bi_hi << 32 | (uint64) b->ip_blkid.bi_lo << 16 | b->ip_posid;
+ uint64 ia = (uint64) GinItemPointerGetBlockNumber(a) << 32 | GinItemPointerGetOffsetNumber(a);
+ uint64 ib = (uint64) GinItemPointerGetBlockNumber(b) << 32 | GinItemPointerGetOffsetNumber(b);
if (ia == ib)
return 0;
diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index a3fb056..438912c 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -132,10 +132,17 @@ typedef struct GinMetaPageData
* to avoid Asserts, since sometimes the ip_posid isn't "valid"
*/
#define GinItemPointerGetBlockNumber(pointer) \
- BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+ (ItemPointerGetBlockNumberNoCheck(pointer))
#define GinItemPointerGetOffsetNumber(pointer) \
- ((pointer)->ip_posid)
+ (ItemPointerGetOffsetNumberNoCheck(pointer))
+
+#define GinItemPointerSetBlockNumber(pointer, blkno) \
+ (ItemPointerSetBlockNumber((pointer), (blkno)))
+
+#define GinItemPointerSetOffsetNumber(pointer, offnum) \
+ (ItemPointerSetOffsetNumber((pointer), (offnum)))
+
/*
* Special-case item pointer values needed by the GIN search logic.
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7552186..24433c7 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -428,7 +428,7 @@ do { \
#define HeapTupleHeaderIsSpeculative(tup) \
( \
- (tup)->t_ctid.ip_posid == SpecTokenOffsetNumber \
+ (ItemPointerGetOffsetNumberNoCheck(&(tup)->t_ctid) == SpecTokenOffsetNumber) \
)
#define HeapTupleHeaderGetSpeculativeToken(tup) \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6289ffa..f9304db 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -151,9 +151,8 @@ typedef struct BTMetaPageData
* within a level). - vadim 04/09/97
*/
#define BTTidSame(i1, i2) \
- ( (i1).ip_blkid.bi_hi == (i2).ip_blkid.bi_hi && \
- (i1).ip_blkid.bi_lo == (i2).ip_blkid.bi_lo && \
- (i1).ip_posid == (i2).ip_posid )
+ ((ItemPointerGetBlockNumber(&(i1)) == ItemPointerGetBlockNumber(&(i2))) && \
+ (ItemPointerGetOffsetNumber(&(i1)) == ItemPointerGetOffsetNumber(&(i2))))
#define BTEntrySame(i1, i2) \
BTTidSame((i1)->t_tid, (i2)->t_tid)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 576aaa8..60d0070 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -69,6 +69,12 @@ typedef ItemPointerData *ItemPointer;
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
)
+/* Same as ItemPointerGetBlockNumber but without any assert-checks */
+#define ItemPointerGetBlockNumberNoCheck(pointer) \
+( \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
+)
+
/*
* ItemPointerGetOffsetNumber
* Returns the offset number of a disk item pointer.
@@ -79,6 +85,12 @@ typedef ItemPointerData *ItemPointer;
(pointer)->ip_posid \
)
+/* Same as ItemPointerGetOffsetNumber but without any assert-checks */
+#define ItemPointerGetOffsetNumberNoCheck(pointer) \
+( \
+ (pointer)->ip_posid \
+)
+
/*
* ItemPointerSet
* Sets a disk item pointer to the specified block and offset.
--
2.1.4
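
Patch 0003 is mechanical, but it is what makes 0004 possible: once every
reader goes through the ItemPointer accessor macros, the interpretation of
ip_posid can be changed in one place. A tiny illustrative sketch (not patch
code; the flag bits themselves are only introduced by the later patches):

    ItemPointerData tid;
    BlockNumber     blkno  = 42;   /* example values */
    OffsetNumber    offnum = 7;
    OffsetNumber    direct;
    OffsetNumber    masked;

    ItemPointerSet(&tid, blkno, offnum);

    direct = tid.ip_posid;                            /* would also see any flag
                                                       * bits in the high bits */
    masked = ItemPointerGetOffsetNumberNoCheck(&tid); /* ip_posid & OffsetNumberMask
                                                       * once 0004 is applied */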
0004-freeup-3bits-ip_posid-v16.patch (text/plain; charset=us-ascii)
From b9bd5c82336decc3124b10a6d34d8222d0dd487f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:49:22 -0300
Subject: [PATCH 4/6] freeup 3bits ip_posid v16
---
src/backend/access/gin/ginget.c | 2 +-
src/backend/access/gin/ginpostinglist.c | 2 +-
src/include/access/ginblock.h | 10 +++++-----
src/include/access/gist_private.h | 4 ++--
src/include/access/htup_details.h | 2 +-
src/include/storage/itemptr.h | 32 ++++++++++++++++++++++++++++----
src/include/storage/off.h | 9 ++++++++-
7 files changed, 46 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index aa0b02f..1e1c978 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -928,7 +928,7 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
* Find the minimum item > advancePast among the active entry streams.
*
* Note: a lossy-page entry is encoded by a ItemPointer with max value for
- * offset (0xffff), so that it will sort after any exact entries for the
+ * offset (0x1fff), so that it will sort after any exact entries for the
* same page. So we'll prefer to return exact pointers not lossy
* pointers, which is good.
*/
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 8d2d31a..b22b9f5 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -253,7 +253,7 @@ ginCompressPostingList(const ItemPointer ipd, int nipd, int maxsize,
Assert(ndecoded == totalpacked);
for (i = 0; i < ndecoded; i++)
- Assert(memcmp(&tmp[i], &ipd[i], sizeof(ItemPointerData)) == 0);
+ Assert(ItemPointerEquals(&tmp[i], &ipd[i]));
pfree(tmp);
}
#endif
diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 438912c..3f7a3f0 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -160,14 +160,14 @@ typedef struct GinMetaPageData
(GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0 && \
GinItemPointerGetBlockNumber(p) == (BlockNumber)0)
#define ItemPointerSetMax(p) \
- ItemPointerSet((p), InvalidBlockNumber, (OffsetNumber)0xffff)
+ ItemPointerSet((p), InvalidBlockNumber, (OffsetNumber)OffsetNumberMask)
#define ItemPointerIsMax(p) \
- (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0xffff && \
+ (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)OffsetNumberMask && \
GinItemPointerGetBlockNumber(p) == InvalidBlockNumber)
#define ItemPointerSetLossyPage(p, b) \
- ItemPointerSet((p), (b), (OffsetNumber)0xffff)
+ ItemPointerSet((p), (b), (OffsetNumber)OffsetNumberMask)
#define ItemPointerIsLossyPage(p) \
- (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)0xffff && \
+ (GinItemPointerGetOffsetNumber(p) == (OffsetNumber)OffsetNumberMask && \
GinItemPointerGetBlockNumber(p) != InvalidBlockNumber)
/*
@@ -218,7 +218,7 @@ typedef signed char GinNullCategory;
*/
#define GinGetNPosting(itup) GinItemPointerGetOffsetNumber(&(itup)->t_tid)
#define GinSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
-#define GIN_TREE_POSTING ((OffsetNumber)0xffff)
+#define GIN_TREE_POSTING ((OffsetNumber)OffsetNumberMask)
#define GinIsPostingTree(itup) (GinGetNPosting(itup) == GIN_TREE_POSTING)
#define GinSetPostingTree(itup, blkno) ( GinSetNPosting((itup),GIN_TREE_POSTING), ItemPointerSetBlockNumber(&(itup)->t_tid, blkno) )
#define GinGetPostingTree(itup) GinItemPointerGetBlockNumber(&(itup)->t_tid)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 1ad4ed6..0ad11f1 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -269,8 +269,8 @@ typedef struct
* invalid tuples in an index, so throwing an error is as far as we go with
* supporting that.
*/
-#define TUPLE_IS_VALID 0xffff
-#define TUPLE_IS_INVALID 0xfffe
+#define TUPLE_IS_VALID OffsetNumberMask
+#define TUPLE_IS_INVALID OffsetNumberPrev(OffsetNumberMask)
#define GistTupleIsInvalid(itup) ( ItemPointerGetOffsetNumber( &((itup)->t_tid) ) == TUPLE_IS_INVALID )
#define GistTupleSetValid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_VALID )
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 24433c7..4d614b7 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -288,7 +288,7 @@ struct HeapTupleHeaderData
* than MaxOffsetNumber, so that it can be distinguished from a valid
* offset number in a regular item pointer.
*/
-#define SpecTokenOffsetNumber 0xfffe
+#define SpecTokenOffsetNumber OffsetNumberPrev(OffsetNumberMask)
/*
* HeapTupleHeader accessor macros
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 60d0070..3144bdd 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -57,7 +57,7 @@ typedef ItemPointerData *ItemPointer;
* True iff the disk item pointer is not NULL.
*/
#define ItemPointerIsValid(pointer) \
- ((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
+ ((bool) (PointerIsValid(pointer) && (((pointer)->ip_posid & OffsetNumberMask) != 0)))
/*
* ItemPointerGetBlockNumber
@@ -82,13 +82,37 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerGetOffsetNumber(pointer) \
( \
AssertMacro(ItemPointerIsValid(pointer)), \
- (pointer)->ip_posid \
+ ((pointer)->ip_posid & OffsetNumberMask) \
)
/* Same as ItemPointerGetOffsetNumber but without any assert-checks */
#define ItemPointerGetOffsetNumberNoCheck(pointer) \
( \
- (pointer)->ip_posid \
+ ((pointer)->ip_posid & OffsetNumberMask) \
+)
+
+/*
+ * Get the flags stored in high order bits in the OffsetNumber.
+ */
+#define ItemPointerGetFlags(pointer) \
+( \
+ ((pointer)->ip_posid & ~OffsetNumberMask) >> OffsetNumberBits \
+)
+
+/*
+ * Set the flag bits. The value is left-shifted first since flags are defined
+ * starting at 0x01.
+ */
+#define ItemPointerSetFlags(pointer, flags) \
+( \
+ ((pointer)->ip_posid |= ((flags) << OffsetNumberBits)) \
+)
+
+/*
+ * Clear all flags.
+ */
+#define ItemPointerClearFlags(pointer) \
+( \
+ ((pointer)->ip_posid &= OffsetNumberMask) \
)
/*
@@ -99,7 +123,7 @@ typedef ItemPointerData *ItemPointer;
( \
AssertMacro(PointerIsValid(pointer)), \
BlockIdSet(&((pointer)->ip_blkid), blockNumber), \
- (pointer)->ip_posid = offNum \
+ (pointer)->ip_posid = (offNum) \
)
/*
diff --git a/src/include/storage/off.h b/src/include/storage/off.h
index fe8638f..fe1834c 100644
--- a/src/include/storage/off.h
+++ b/src/include/storage/off.h
@@ -26,8 +26,15 @@ typedef uint16 OffsetNumber;
#define InvalidOffsetNumber ((OffsetNumber) 0)
#define FirstOffsetNumber ((OffsetNumber) 1)
#define MaxOffsetNumber ((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))
-#define OffsetNumberMask (0xffff) /* valid uint16 bits */
+/*
+ * Currently we support a maximum block size of 32kB and each ItemId takes 6
+ * bytes, which limits the number of line pointers to 32kB/6 = 5461. 13 bits
+ * are therefore enough to represent all line pointers, so we can reuse the
+ * high-order bits of OffsetNumber for other purposes.
+ */
+#define OffsetNumberMask (0x1fff) /* valid offset-number bits */
+#define OffsetNumberBits 13 /* number of valid bits in OffsetNumber */
/* ----------------
* support macros
* ----------------
--
2.1.4
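
To illustrate the layout 0004 establishes (a sketch, not patch code): offsets
fit in the low 13 bits of ip_posid (OffsetNumberMask = 0x1fff), leaving the
top three bits for flags, which the ItemPointer flag macros shift into place.
The flag value 0x01 below is just an example; the real flag definitions only
arrive with the WARM patch:

    ItemPointerData tid;
    uint16          flags;
    OffsetNumber    offnum;

    ItemPointerSet(&tid, (BlockNumber) 42, (OffsetNumber) 7);

    ItemPointerSetFlags(&tid, 0x01);            /* stored as 0x01 << OffsetNumberBits,
                                                 * i.e. bit 0x2000 of ip_posid */
    flags  = ItemPointerGetFlags(&tid);         /* 0x01 again */
    offnum = ItemPointerGetOffsetNumber(&tid);  /* 7: flag bits are masked out */

    ItemPointerClearFlags(&tid);                /* back to a plain offset number */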
0005-warm-updates-v16.patch (text/plain; charset=us-ascii)
From ec58bc56548045d34bd92d2042432f7d5eaee5d4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:49:43 -0300
Subject: [PATCH 5/6] warm updates v16
---
contrib/bloom/blutils.c | 1 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 5 +-
src/backend/access/hash/hashsearch.c | 5 +
src/backend/access/hash/hashutil.c | 110 +++++++++
src/backend/access/heap/README.WARM | 306 +++++++++++++++++++++++++
src/backend/access/heap/heapam.c | 256 +++++++++++++++++++--
src/backend/access/heap/pruneheap.c | 7 +
src/backend/access/index/indexam.c | 89 ++++++--
src/backend/access/nbtree/nbtinsert.c | 229 +++++++++++--------
src/backend/access/nbtree/nbtree.c | 5 +-
src/backend/access/nbtree/nbtutils.c | 104 +++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/catalog/index.c | 15 ++
src/backend/catalog/indexing.c | 57 ++++-
src/backend/catalog/system_views.sql | 4 +-
src/backend/commands/constraint.c | 4 +-
src/backend/commands/copy.c | 3 +
src/backend/commands/indexcmds.c | 17 +-
src/backend/commands/vacuumlazy.c | 25 ++
src/backend/executor/execIndexing.c | 18 +-
src/backend/executor/execReplication.c | 25 +-
src/backend/executor/nodeBitmapHeapscan.c | 21 +-
src/backend/executor/nodeIndexscan.c | 6 +-
src/backend/executor/nodeModifyTable.c | 27 ++-
src/backend/postmaster/pgstat.c | 7 +-
src/backend/utils/adt/pgstatfuncs.c | 31 +++
src/backend/utils/cache/relcache.c | 61 ++++-
src/include/access/amapi.h | 8 +
src/include/access/hash.h | 4 +
src/include/access/heapam.h | 12 +-
src/include/access/heapam_xlog.h | 1 +
src/include/access/htup_details.h | 29 ++-
src/include/access/nbtree.h | 2 +
src/include/access/relscan.h | 3 +-
src/include/catalog/pg_proc.h | 4 +
src/include/executor/executor.h | 1 +
src/include/executor/nodeIndexscan.h | 1 -
src/include/nodes/execnodes.h | 1 +
src/include/pgstat.h | 4 +-
src/include/utils/rel.h | 5 +
src/include/utils/relcache.h | 4 +-
src/test/regress/expected/rules.out | 12 +-
src/test/regress/expected/warm.out | 367 ++++++++++++++++++++++++++++++
src/test/regress/parallel_schedule | 2 +
src/test/regress/sql/warm.sql | 171 ++++++++++++++
47 files changed, 1905 insertions(+), 167 deletions(-)
create mode 100644 src/backend/access/heap/README.WARM
create mode 100644 src/test/regress/expected/warm.out
create mode 100644 src/test/regress/sql/warm.sql
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f2eda67..b356e2b 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -142,6 +142,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b22563b..b4a1465 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -116,6 +116,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 6593771..843389b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -94,6 +94,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1f8a7f6..9b20ae6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -90,6 +90,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = hashrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -271,6 +272,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
ItemPointer current;
bool res;
+ IndexTuple itup;
+
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
@@ -308,8 +311,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
offnum <= maxoffnum;
offnum = OffsetNumberNext(offnum))
{
- IndexTuple itup;
-
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
break;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..60e941d 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -59,6 +59,8 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
return true;
}
@@ -363,6 +365,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
so->hashso_heappos = itup->t_tid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = itup;
+
return true;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..dcba734 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -17,8 +17,12 @@
#include "access/hash.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
+#include "nodes/execnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/datum.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +450,109 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * Recheck if the heap tuple satisfies the key stored in the index tuple
+ */
+bool
+hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values2[INDEX_MAX_KEYS];
+ bool isnull2[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ /*
+ * HASH indexes compute a hash value of the key and store that in the
+ * index. So we must first obtain the hash of the value obtained from the
+ * heap and then do a comparison
+ */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL then they are equal
+ */
+ if (isnull2[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If either is NULL then they are not equal
+ */
+ if (isnull2[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now do a raw memory comparison
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/heap/README.WARM b/src/backend/access/heap/README.WARM
new file mode 100644
index 0000000..7b9a712
--- /dev/null
+++ b/src/backend/access/heap/README.WARM
@@ -0,0 +1,306 @@
+src/backend/access/heap/README.WARM
+
+Write Amplification Reduction Method (WARM)
+===========================================
+
+The Heap Only Tuple (HOT) feature greatly reduced redundant index
+entries and allowed re-use of the dead space occupied by previously
+updated or deleted tuples (see src/backend/access/heap/README.HOT).
+
+One of the necessary conditions for a HOT update is that the
+update must not change a column used in any of the indexes on the table.
+The condition is sometimes hard to meet, especially for complex
+workloads with several indexes on large yet frequently updated tables.
+Worse, sometimes only one or two index columns may be updated, but the
+regular non-HOT update will still insert a new index entry in every
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.
+
+WARM is a technique devised to address these problems.
+
+
+Update Chains With Multiple Index Entries Pointing to the Root
+--------------------------------------------------------------
+
+When a non-HOT update is caused by an index key change, a new index
+entry must be inserted for the changed index. But if the index key
+hasn't changed for other indexes, we don't really need to insert a new
+entry. Even though the existing index entry is pointing to the old
+tuple, the new tuple is reachable via the t_ctid chain. To keep things
+simple, a WARM update requires that the heap block have enough
+space to store the new version of the tuple. This is the same
+requirement as for HOT updates.
+
+In WARM, we ensure that every index entry always points to the root of
+the WARM chain. In fact, a WARM chain looks exactly like a HOT chain
+except for the fact that there could be multiple index entries pointing
+to the root of the chain. So when a new entry is inserted in an index for
+the updated tuple during a WARM update, the new entry is made to
+point to the root of the WARM chain.
+
+For example, consider a table with two columns and an index on each of
+the columns. When a tuple is first inserted into the table, each index
+has exactly one entry pointing to the tuple.
+
+ lp [1]
+ [1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's entry (aaaa) also points to 1
+
+Now if the tuple's second column is updated and if there is room on the
+page, we perform a WARM update. In this case, Index1 does not get any new
+entry, and Index2's new entry will still point to the root tuple of the
+chain.
+
+ lp [1] [2]
+ [1111, aaaa]->[1111, bbbb]
+
+ Index1's entry (1111) points to 1
+ Index2's old entry (aaaa) points to 1
+ Index2's new entry (bbbb) also points to 1
+
+"A update chain which has more than one index entries pointing to its
+root line pointer is called WARM chain and the action that creates a
+WARM chain is called WARM update."
+
+Since all indexes always point to the root of the WARM chain, even when
+there is more than one index entry, WARM chains can be pruned and
+dead tuples can be removed without a need to do corresponding index
+cleanup.
+
+While this solves the problem of pruning dead tuples from a HOT/WARM
+chain, it also opens up a new technical challenge because now we have a
+situation where a heap tuple is reachable from multiple index entries,
+each having a different index key. While MVCC still ensures that only
+valid tuples are returned, a tuple with a wrong index key may be
+returned because of wrong index entries. In the above example, tuple
+[1111, bbbb] is reachable from both keys (aaaa) as well as (bbbb). For
+this reason, tuples returned from a WARM chain must always be rechecked
+for index key-match.
+
+Recheck Index Key Against Heap Tuple
+------------------------------------
+
+Since every Index AM has its own notion of index tuples, each Index AM
+must implement its own method to recheck heap tuples. For example, a
+hash index stores the hash value of the column, and hence the recheck
+routine for the hash AM must first compute the hash value of the heap
+attribute and then compare it against the value stored in the index tuple.
+
+The patch currently implements recheck routines for hash and btree
+indexes. If a table has an index which doesn't provide a recheck
+routine, WARM updates are disabled on that table.
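+
+As an illustration only (a sketch using the amapi hook this patch adds;
+"examplerecheck" and "examplehandler" are made-up names), an index AM
+advertises recheck support by filling the new amrecheck field in its
+handler routine, and AMs which leave it NULL effectively disable WARM for
+the tables they index:
+
+    /* signature expected of the recheck callback */
+    bool examplerecheck(Relation indexRel, IndexTuple indexTuple,
+                        Relation heapRel, HeapTuple heapTuple);
+
+    Datum
+    examplehandler(PG_FUNCTION_ARGS)
+    {
+        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+        /* ... all the usual callbacks ... */
+        amroutine->amrecheck = examplerecheck;    /* or NULL: no recheck */
+
+        PG_RETURN_POINTER(amroutine);
+    }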
+
+Problem With Duplicate (key, ctid) Index Entries
+------------------------------------------------
+
+The index-key recheck logic works as long as no two index entries with
+the same key point to the same WARM chain. If they did, the same valid
+tuple would be reachable via multiple index entries, each satisfying
+the index key check. In the above example, if the tuple [1111, bbbb] is
+again updated to [1111, aaaa] and if we insert a new index entry (aaaa)
+pointing to the root line pointer, we will end up with the following
+structure:
+
+ lp [1] [2] [3]
+ [1111, aaaa]->[1111, bbbb]->[1111, aaaa]
+
+ Index1's entry (1111) points to 1
+ Index2's oldest entry (aaaa) points to 1
+ Index2's old entry (bbbb) also points to 1
+ Index2's new entry (aaaa) also points to 1
+
+We must solve this problem to ensure that the same tuple is not
+reachable via multiple index pointers. There are a couple of ways to
+address this issue:
+
+1. Do not allow WARM update to a tuple from a WARM chain. This
+guarantees that there can never be duplicate index entries to the same
+root line pointer because we must have checked for old and new index
+keys while doing the first WARM update.
+
+2. Do not allow duplicate (key, ctid) index pointers. In the above
+example, since (aaaa, 1) already exists in the index, we must not insert
+a duplicate index entry.
+
+The patch currently implements option 1, i.e. do not do WARM updates to
+a tuple from a WARM chain. HOT updates are fine because they do not add
+a new index entry.
+
+Even with this restriction, this is a significant improvement because the
+number of regular updates is cut roughly in half.
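+
+A simplified sketch of the resulting gate in heap_update (the variable
+names are the ones used by this patch; several additional conditions,
+e.g. for expression indexes and system catalogs, are omitted here):
+
+    if (!bms_overlap(modified_attrs, hot_attrs))
+        use_hot_update = true;
+    else if (relation->rd_supportswarm &&
+             !HeapTupleIsHeapWarmTuple(&oldtup) &&
+             !bms_is_subset(hot_attrs, modified_attrs))
+        /* first index-key-changing update of this chain: do it the WARM way */
+        use_warm_update = true;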
+
+Expression and Partial Indexes
+------------------------------
+
+Expressions may evaluate to the same value even if the underlying column
+values have changed. A simple example is an index on "lower(col)" which
+will return the same value if the new heap value differs only in
+case. So we cannot rely solely on the heap column check to
+decide whether or not to insert a new index entry for expression
+indexes. Similarly, for partial indexes, the predicate expression must
+be evaluated to decide whether or not a new index entry is needed when
+columns referred to in the predicate expression change.
+
+(None of this is currently implemented; we simply disallow a WARM
+update if a column used in an expression index or index predicate has
+changed.)
+
+
+Efficiently Finding the Root Line Pointer
+-----------------------------------------
+
+During a WARM update, we must be able to find the root line pointer of the
+tuple being updated. It must be noted that the t_ctid field in the heap
+tuple header is usually used to find the next tuple in the update chain.
+But the tuple that we are updating must be the last tuple in the update
+chain, and in that case the t_ctid field usually points to the tuple itself.
+So in theory, we could use the t_ctid to store additional information in
+the last tuple of the update chain, if the information about the tuple
+being the last tuple is stored elsewhere.
+
+We now utilize another bit from t_infomask2 to explicitly identify that
+this is the last tuple in the update chain.
+
+HEAP_LATEST_TUPLE - When this bit is set, the tuple is the last tuple in
+the update chain. The OffsetNumber part of t_ctid points to the root
+line pointer of the chain when HEAP_LATEST_TUPLE flag is set.
+
+If the UPDATE operation is aborted, the last tuple in the update chain
+becomes dead, and the tuple which remains the last valid tuple in the
+chain no longer carries the root line pointer information. In such
+rare cases, the root line pointer must be found the hard way, by
+scanning the entire heap page.
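+
+A minimal sketch of the lookup, using the macros and the fallback helper
+added by this patch:
+
+    OffsetNumber root_offnum;
+
+    if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+        /* HEAP_LATEST_TUPLE is set: t_ctid's offset part is the root */
+        root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+    else
+        /* rare aborted-update case: walk the page to find the root */
+        root_offnum = heap_get_root_tuple(page,
+                        ItemPointerGetOffsetNumber(&(oldtup.t_self)));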
+
+Tracking WARM Chains
+--------------------
+
+The old tuple and every subsequent tuple in the chain are marked with a
+special HEAP_WARM_TUPLE flag. We use the last remaining bit in t_infomask2 to
+store this information.
+
+When a tuple is returned from a WARM chain, the caller must do
+additional checks to ensure that the tuple matches the index key. Even
+if the tuple precedes the WARM update in the chain, it must still
+be rechecked for an index key match (the case when the old tuple is returned
+via the new index key). So we must follow the update chain every time to the
+end to check whether this is a WARM chain.
+
+When the old updated tuple is retired and the root line pointer is
+converted into a redirected line pointer, we can copy the information
+about WARM chain to the redirected line pointer by storing a special
+value in the lp_len field of the line pointer. This will handle the most
+common case where a WARM chain is replaced by a redirect line pointer
+and a single tuple in the chain.
+
+Converting WARM chains back to HOT chains (VACUUM ?)
+----------------------------------------------------
+
+The current implementation of WARM allows only one WARM update per
+chain. This simplifies the design and addresses certain issues around
+duplicate scans. But it also implies that the benefit of WARM will be
+no more than 50%. That is still significant, but if we could return
+WARM chains back to normal HOT status, we could do far more WARM updates.
+
+A distinct property of a WARM chain is that at least one index has more
+than one live index entry pointing to the root of the chain. In other
+words, if we can remove the duplicate entry from every index, or conclusively
+prove that there are no duplicate index entries for the root line
+pointer, the chain can again be marked as HOT.
+
+Here is one idea:
+
+A WARM chain has two parts, separated by the tuple that caused the WARM
+update. All tuples in each part have matching index keys, but certain
+index keys may not match between these two parts. Let's say we mark heap
+tuples in each part with a special Red-Blue flag. The same flag is
+replicated in the index tuples. For example, when new rows are inserted
+in a table, they are marked with Blue flag and the index entries
+associated with those rows are also marked with Blue flag. When a row is
+WARM updated, the new version is marked with Red flag and the new index
+entry created by the update is also marked with Red flag.
+
+
+Heap chain: [1] [2] [3] [4]
+ [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R
+
+Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
+ (bbbb)R points to 1 (satisfies only tuples marked with R)
+
+Index2: (1111)B points to 1 (satisfies both B and R tuples)
+
+
+It's clear that for indexes with Red and Blue pointers, a heap tuple
+with Blue flag will be reachable from Blue pointer and that with Red
+flag will be reachable from Red pointer. But for indexes which did not
+create a new entry, both Blue and Red tuples will be reachable from Blue
+pointer (there is no Red pointer in such indexes). So, as a side note,
+matching Red and Blue flags is not enough from an index scan perspective.
+
+During first heap scan of VACUUM, we look for tuples with
+HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
+with Blue flag or Red flag (but no mix of Red and Blue), then the chain
+is a candidate for HOT conversion. We remember the root line pointer
+and Red-Blue flag of the WARM chain in a separate array.
+
+If we have a Red WARM chain, then our goal is to remove Blue pointers
+and vice versa. But there is a catch. For Index2 above, there is only
+Blue pointer and that must not be removed. IOW we should remove Blue
+pointer iff a Red pointer exists. Since index vacuum may visit Red and
+Blue pointers in any order, I think we will need another index pass to
+remove dead index pointers. So in the first index pass we check which
+WARM candidates have 2 index pointers. In the second pass, we remove the
+dead pointer and reset the Red flag if the surviving index pointer is Red.
+
+During the second heap scan, we fix WARM chain by clearing
+HEAP_WARM_TUPLE flag and also reset Red flag to Blue.
+
+There are some more problems around aborted vacuums. For example, if
+vacuum aborts after changing Red index flag to Blue but before removing
+the other Blue pointer, we will end up with two Blue pointers to a Red
+WARM chain. But since the HEAP_WARM_TUPLE flag on the heap tuple is
+still set, further WARM updates to the chain will be blocked. I guess we
+will need some special handling for the case with multiple Blue pointers. We
+can either leave these WARM chains alone and let them die with a
+subsequent non-WARM update, or apply the heap-recheck logic during index
+vacuum to find the dead pointer. Given that vacuum aborts are not
+common, I am inclined to leave this case unhandled. We must still check
+for the presence of multiple Blue pointers and ensure that we neither
+accidentally remove either of the Blue pointers nor clear such WARM
+chains.
+
+CREATE INDEX CONCURRENTLY
+-------------------------
+
+Currently CREATE INDEX CONCURRENTLY (CIC) is implemented as a 3-phase
+process. In the first phase, we create the catalog entry for the new index
+so that the index is visible to all other backends, but still don't use
+it for either read or write. But we ensure that no new broken HOT
+chains are created by new transactions. In the second phase, we build
+the new index using an MVCC snapshot and then make the index available
+for inserts. We then do another pass over the index and insert any
+missing tuples, each time indexing only the root line pointer. See
+README.HOT for details about how HOT impacts CIC and how various
+challenges are tackled.
+
+WARM poses another challenge because it allows creation of HOT chains
+even when an index key is changed. But since the index is not ready for
+insertion until the second phase is over, we might end up with a
+situation where the HOT chain has tuples with different index column
+values, yet only one of these values is indexed by the new index. Note that
+during the third phase, we only index tuples whose root line pointer is
+missing from the index. But we can't easily check if the existing index
+tuple is actually indexing the heap tuple visible to the new MVCC
+snapshot. Finding that information will require us to query the index
+again for every tuple in the chain, especially if it's a WARM tuple.
+This would require repeated access to the index. Another option would be
+to return index keys along with the heap TIDs when the index is scanned for
+collecting all indexed TIDs during the third phase. We can then compare the
+heap tuple against the already indexed key and decide whether or not to
+index the new tuple.
+
+We solve this problem more simply by disallowing WARM updates until the
+index is ready for insertion. We don't need to disallow WARM on a
+wholesale basis; only those updates that change the columns of the
+new index are prevented from being WARM updates.
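+
+A sketch of how this is enforced (simplified from the heap_update changes
+in this patch, which track the attributes of not-yet-ready indexes in a
+separate relcache bitmap):
+
+    notready_attrs = RelationGetIndexAttrBitmap(relation,
+                                                INDEX_ATTR_BITMAP_NOTREADY);
+
+    /* an update touching a column of a not-yet-ready index is never WARM */
+    if (bms_overlap(notready_attrs, modified_attrs))
+        use_warm_update = false;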
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 93cde9a..b9ff94d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1958,6 +1958,78 @@ heap_fetch(Relation relation,
}
/*
+ * Check if the HOT chain containing this tid is actually a WARM chain.
+ * Note that even if the WARM update ultimately aborted, we must still do a
+ * recheck because the failed UPDATE may have inserted index entries
+ * which are now stale, but still reference this chain.
+ */
+static bool
+hot_check_warm_chain(Page dp, ItemPointer tid)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ break;
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Presence of either a WARM or a WARM-updated tuple signals possible
+ * breakage and the caller must recheck any tuple returned from this chain
+ * for index key satisfaction
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ return true;
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * If the tuple stores the root line pointer offset, it must be the end of the chain
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ /* All OK. No need to recheck */
+ return false;
+}
+
+/*
* heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot
*
* On entry, *tid is the TID of a tuple (either a simple tuple, or the root
@@ -1977,11 +2049,14 @@ heap_fetch(Relation relation,
* Unlike heap_fetch, the caller must already have pin and (at least) share
* lock on the buffer; it is still pinned/locked at exit. Also unlike
* heap_fetch, we do not report any pgstats count; caller may do so if wanted.
+ *
+ * recheck should be set false on entry by caller, will be set true on exit
+ * if a WARM tuple is encountered.
*/
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, bool *recheck)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -2035,9 +2110,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
/*
- * Shouldn't see a HEAP_ONLY tuple at chain start.
+ * Shouldn't see a HEAP_ONLY tuple at chain start, unless we are
+ * dealing with a WARM updated tuple, in which case deferred triggers
+ * may request to fetch a WARM tuple from the middle of a chain.
*/
- if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+ if (at_chain_start && HeapTupleIsHeapOnly(heapTuple) &&
+ !HeapTupleIsHeapWarmTuple(heapTuple))
break;
/*
@@ -2050,6 +2128,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
break;
/*
+ * Check if there exists a WARM tuple somewhere down the chain and set
+ * recheck to TRUE.
+ *
+ * XXX This is not very efficient right now, and we should look for
+ * possible improvements here
+ */
+ if (recheck && *recheck == false)
+ *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+
+ /*
* When first_call is true (and thus, skip is initially false) we'll
* return the first tuple we find. But on later passes, heapTuple
* will initially be pointing to the tuple we returned last time.
@@ -2098,7 +2186,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Check to see if HOT chain continues past this tuple; if so fetch
* the next offnum and loop around.
*/
- if (HeapTupleIsHotUpdated(heapTuple))
+ if (HeapTupleIsHotUpdated(heapTuple) &&
+ !HeapTupleHeaderHasRootOffset(heapTuple->t_data))
{
Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
ItemPointerGetBlockNumber(tid));
@@ -2122,18 +2211,41 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
bool
heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead, bool *recheck, Buffer *cbuffer,
+ HeapTuple heapTuple)
{
bool result;
Buffer buffer;
- HeapTupleData heapTuple;
+ ItemPointerData ret_tid = *tid;
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
- result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
- &heapTuple, all_dead, true);
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ result = heap_hot_search_buffer(&ret_tid, relation, buffer, snapshot,
+ heapTuple, all_dead, true, recheck);
+
+ /*
+ * If we are returning a potential candidate tuple from this chain and the
+ * caller has requested the "recheck" hint, keep the buffer locked and
+ * pinned. The caller must release the lock and pin on the buffer in all
+ * such cases
+ */
+ if (!result || !recheck || !(*recheck))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buffer);
+ }
+
+ /*
+ * Set the caller supplied tid with the actual location of the tuple being
+ * returned
+ */
+ if (result)
+ {
+ *tid = ret_tid;
+ if (cbuffer)
+ *cbuffer = buffer;
+ }
+
return result;
}
@@ -3492,15 +3604,18 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode)
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *key_attrs;
Bitmapset *id_attrs;
+ Bitmapset *exprindx_attrs;
Bitmapset *interesting_attrs;
Bitmapset *modified_attrs;
+ Bitmapset *notready_attrs;
ItemId lp;
HeapTupleData oldtup;
HeapTuple heaptup;
@@ -3521,6 +3636,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool have_tuple_lock = false;
bool iscombo;
bool use_hot_update = false;
+ bool use_warm_update = false;
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
@@ -3545,6 +3661,10 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
+ /* Assume no-warm update */
+ if (warm_update)
+ *warm_update = false;
+
/*
* Fetch the list of attributes to be checked for various operations.
*
@@ -3566,10 +3686,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
+ exprindx_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE);
+ notready_attrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_NOTREADY);
+
+
interesting_attrs = bms_add_members(NULL, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
-
+ interesting_attrs = bms_add_members(interesting_attrs, exprindx_attrs);
+ interesting_attrs = bms_add_members(interesting_attrs, notready_attrs);
block = ItemPointerGetBlockNumber(otid);
offnum = ItemPointerGetOffsetNumber(otid);
@@ -3621,6 +3748,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
&oldtup, newtup);
+ if (modified_attrsp)
+ *modified_attrsp = bms_copy(modified_attrs);
+
/*
* If we're not updating any "key" column, we can grab a weaker lock type.
* This allows for more concurrency when we are running simultaneously
@@ -3876,6 +4006,7 @@ l2:
bms_free(hot_attrs);
bms_free(key_attrs);
bms_free(id_attrs);
+ bms_free(exprindx_attrs);
bms_free(modified_attrs);
bms_free(interesting_attrs);
return result;
@@ -4194,6 +4325,37 @@ l2:
*/
if (!bms_overlap(modified_attrs, hot_attrs))
use_hot_update = true;
+ else
+ {
+ /*
+ * If no WARM updates yet on this chain, let this update be a WARM
+ * update.
+ *
+ * We check for both warm and warm updated tuples since if the
+ * previous WARM update aborted, we may still have added
+ * another index entry for this HOT chain. In such situations, we
+ * must not attempt a WARM update until duplicate (key, CTID) index
+ * entry issue is sorted out
+ *
+ * XXX Later we'll add more checks to ensure WARM chains can
+ * further be WARM updated. This is probably good enough for a first
+ * round of tests of the remaining functionality
+ *
+ * XXX Disable WARM updates on system tables. There is nothing in
+ * principle that stops us from supporting this. But it would
+ * require an API change to propagate the changed columns back to the
+ * caller so that CatalogUpdateIndexes() can avoid adding new
+ * entries to indexes that are not changed by update. This will be
+ * fixed once basic patch is tested. !!FIXME
+ */
+ if (relation->rd_supportswarm &&
+ !bms_overlap(modified_attrs, exprindx_attrs) &&
+ !bms_is_subset(hot_attrs, modified_attrs) &&
+ !IsSystemRelation(relation) &&
+ !bms_overlap(notready_attrs, modified_attrs) &&
+ !HeapTupleIsHeapWarmTuple(&oldtup))
+ use_warm_update = true;
+ }
}
else
{
@@ -4240,6 +4402,22 @@ l2:
HeapTupleSetHeapOnly(heaptup);
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
+
+ /*
+ * Even if we are doing a HOT update, we must carry forward the WARM
+ * flag because we may have already inserted another index entry
+ * pointing to our root and a third entry may create duplicates
+ *
+ * Note: If we ever have a mechanism to avoid duplicate <key, TID> in
+ * indexes, we could look at relaxing this restriction and allow even
+ * more WARM updates
+ */
+ if (HeapTupleIsHeapWarmTuple(&oldtup))
+ {
+ HeapTupleSetHeapWarmTuple(heaptup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ }
+
/*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
@@ -4252,12 +4430,35 @@ l2:
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
}
+ else if (use_warm_update)
+ {
+ /* Mark the old tuple as HOT-updated */
+ HeapTupleSetHotUpdated(&oldtup);
+ HeapTupleSetHeapWarmTuple(&oldtup);
+ /* And mark the new tuple as heap-only */
+ HeapTupleSetHeapOnly(heaptup);
+ HeapTupleSetHeapWarmTuple(heaptup);
+ /* Mark the caller's copy too, in case different from heaptup */
+ HeapTupleSetHeapOnly(newtup);
+ HeapTupleSetHeapWarmTuple(newtup);
+ if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
+ root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
+ else
+ root_offnum = heap_get_root_tuple(page,
+ ItemPointerGetOffsetNumber(&(oldtup.t_self)));
+
+ /* Let the caller know we did a WARM update */
+ if (warm_update)
+ *warm_update = true;
+ }
else
{
/* Make sure tuples are correctly marked as not-HOT */
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
+ HeapTupleClearHeapWarmTuple(heaptup);
+ HeapTupleClearHeapWarmTuple(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4367,7 +4568,10 @@ l2:
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- pgstat_count_heap_update(relation, use_hot_update);
+ /*
+ * Count HOT and WARM updates separately
+ */
+ pgstat_count_heap_update(relation, use_hot_update, use_warm_update);
/*
* If heaptup is a private copy, release it. Don't forget to copy t_self
@@ -4507,7 +4711,8 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
* via ereport().
*/
void
-simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
+simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
+ Bitmapset **modified_attrs, bool *warm_update)
{
HTSU_Result result;
HeapUpdateFailureData hufd;
@@ -4516,7 +4721,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, modified_attrs, warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -7568,6 +7773,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool need_tuple_data = RelationIsLogicallyLogged(reln);
bool init;
int bufflags;
+ bool warm_update = false;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -7579,6 +7785,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ if (HeapTupleIsHeapWarmTuple(newtup))
+ warm_update = true;
+
/*
* If the old and new tuple are on the same page, we only need to log the
* parts of the new tuple that were changed. That saves on the amount of
@@ -7652,6 +7861,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_CONTAINS_OLD_KEY;
}
}
+ if (warm_update)
+ xlrec.flags |= XLH_UPDATE_WARM_UPDATE;
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
@@ -8629,16 +8840,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Size freespace = 0;
XLogRedoAction oldaction;
XLogRedoAction newaction;
+ bool warm_update = false;
/* initialize to keep the compiler quiet */
oldtup.t_data = NULL;
oldtup.t_len = 0;
+ if (xlrec->flags & XLH_UPDATE_WARM_UPDATE)
+ warm_update = true;
+
XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk);
if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
{
/* HOT updates are never done across pages */
Assert(!hot_update);
+ /* WARM updates are never done across pages */
+ Assert(!warm_update);
}
else
oldblk = newblk;
@@ -8698,6 +8915,11 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
&htup->t_infomask2);
HeapTupleHeaderSetXmax(htup, xlrec->old_xmax);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Mark the old tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
/* Set forward chain link in t_ctid */
HeapTupleHeaderSetNextTid(htup, &newtid);
@@ -8833,6 +9055,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
HeapTupleHeaderSetCmin(htup, FirstCommandId);
HeapTupleHeaderSetXmax(htup, xlrec->new_xmax);
+ /* Mark the new tuple as a WARM tuple */
+ if (warm_update)
+ HeapTupleHeaderSetHeapWarmTuple(htup);
+
offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple");
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f54337c..4e8ed79 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -834,6 +834,13 @@ heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
continue;
+ /*
+ * If the tuple has root line pointer, it must be the end of the
+ * chain
+ */
+ if (HeapTupleHeaderHasRootOffset(htup))
+ break;
+
/* Set up to scan the HOT-chain */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index cc5ac8b..da6c252 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -75,10 +75,12 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
+#include "executor/executor.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/datum.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"
@@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+ /*
+ * If the index supports recheck, make sure that index tuple is saved
+ * during index scans.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if the function was called earlier in the session when we're
+ * here. We can't call it now because there exists a risk of causing
+ * deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ scan->xs_want_itup = true;
+
return scan;
}
@@ -535,8 +552,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
- * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
- * pay no attention to those fields here.
+ * scan->xs_tuple_recheck and possibly scan->xs_itup/scan->xs_hitup,
+ * though we pay no attention to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
@@ -574,7 +591,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
* call).
*
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * Note: caller must check scan->xs_tuple_recheck, and perform rechecking of the
* scan keys if required. We do not do that here because we don't have
* enough information to do it efficiently in the general case.
* ----------------
@@ -601,6 +618,12 @@ index_fetch_heap(IndexScanDesc scan)
*/
if (prev_buf != scan->xs_cbuf)
heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+
+ /*
+ * If we're not always re-checking, reset recheck for this tuple.
+ * Otherwise we must recheck every tuple.
+ */
+ scan->xs_tuple_recheck = scan->xs_recheck;
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -610,32 +633,64 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_snapshot,
&scan->xs_ctup,
&all_dead,
- !scan->xs_continue_hot);
+ !scan->xs_continue_hot,
+ &scan->xs_tuple_recheck);
LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
if (got_heap_tuple)
{
+ bool res = true;
+
+ /*
+ * Ok, we got a tuple which satisfies the snapshot, but if it's part of a
+ * WARM chain, we must do additional checks to ensure that we are
+ * indeed returning a correct tuple. Note that if the index AM does not
+ * implement the amrecheck method, then we don't do any additional checks
+ * since WARM must have been disabled on such tables.
+ *
+ * XXX What happens when a new index which does not support amrecheck is
+ * added to the table? Do we need to handle this case or is CREATE
+ * INDEX and CREATE INDEX CONCURRENTLY smart enough to handle this
+ * issue?
+ */
+ if (scan->xs_tuple_recheck &&
+ scan->xs_itup &&
+ scan->indexRelation->rd_amroutine->amrecheck)
+ {
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
+ res = scan->indexRelation->rd_amroutine->amrecheck(
+ scan->indexRelation,
+ scan->xs_itup,
+ scan->heapRelation,
+ &scan->xs_ctup);
+ LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+ }
+
/*
* Only in a non-MVCC snapshot can more than one member of the HOT
* chain be visible.
*/
scan->xs_continue_hot = !IsMVCCSnapshot(scan->xs_snapshot);
pgstat_count_heap_fetch(scan->indexRelation);
- return &scan->xs_ctup;
+
+ if (res)
+ return &scan->xs_ctup;
}
+ else
+ {
+ /* We've reached the end of the HOT chain. */
+ scan->xs_continue_hot = false;
- /* We've reached the end of the HOT chain. */
- scan->xs_continue_hot = false;
-
- /*
- * If we scanned a whole HOT chain and found only dead tuples, tell index
- * AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
- */
- if (!scan->xactStartedInRecovery)
- scan->kill_prior_tuple = all_dead;
+ /*
+ * If we scanned a whole HOT chain and found only dead tuples, tell index
+ * AM to kill its entry for that TID (this will take effect in the next
+ * amgettuple call, in index_getnext_tid). We do not do this when in
+ * recovery because it may violate MVCC to do so. See comments in
+ * RelationGetIndexScan().
+ */
+ if (!scan->xactStartedInRecovery)
+ scan->kill_prior_tuple = all_dead;
+ }
return NULL;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6dca810..b5cb619 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,11 +20,14 @@
#include "access/nbtxlog.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/tqual.h"
-
+#include "utils/datum.h"
typedef struct
{
@@ -250,6 +253,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ Buffer buffer;
+ HeapTupleData heapTuple;
+ bool recheck = false;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -309,6 +315,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
curitup = (IndexTuple) PageGetItem(page, curitemid);
htid = curitup->t_tid;
+ recheck = false;
+
/*
* If we are doing a recheck, we expect to find the tuple we
* are rechecking. It's not a duplicate, but we have to keep
@@ -326,112 +334,153 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* have just a single index entry for the entire chain.
*/
else if (heap_hot_search(&htid, heapRel, &SnapshotDirty,
- &all_dead))
+ &all_dead, &recheck, &buffer,
+ &heapTuple))
{
TransactionId xwait;
+ bool result = true;
/*
- * It is a duplicate. If we are only doing a partial
- * check, then don't bother checking if the tuple is being
- * updated in another transaction. Just return the fact
- * that it is a potential conflict and leave the full
- * check till later.
+ * If the tuple was WARM updated, we may again see our own
+ * tuple. Since WARM updates don't create new index
+ * entries, our own tuple is only reachable via the old
+ * index pointer
*/
- if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ if (checkUnique == UNIQUE_CHECK_EXISTING &&
+ ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- *is_unique = false;
- return InvalidTransactionId;
+ found = true;
+ result = false;
+ if (recheck)
+ UnlockReleaseBuffer(buffer);
+ }
+ else if (recheck)
+ {
+ result = btrecheck(rel, curitup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
}
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
- SnapshotDirty.xmin : SnapshotDirty.xmax;
-
- if (TransactionIdIsValid(xwait))
- {
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- /* Tell _bt_doinsert to wait... */
- *speculativeToken = SnapshotDirty.speculativeToken;
- return xwait;
- }
-
- /*
- * Otherwise we have a definite conflict. But before
- * complaining, look to see if the tuple we want to insert
- * is itself now committed dead --- if so, don't complain.
- * This is a waste of time in normal scenarios but we must
- * do it to support CREATE INDEX CONCURRENTLY.
- *
- * We must follow HOT-chains here because during
- * concurrent index build, we insert the root TID though
- * the actual tuple may be somewhere in the HOT-chain.
- * While following the chain we might not stop at the
- * exact tuple which triggered the insert, but that's OK
- * because if we find a live tuple anywhere in this chain,
- * we have a unique key conflict. The other live tuple is
- * not part of this chain because it had a different index
- * entry.
- */
- htid = itup->t_tid;
- if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL))
- {
- /* Normal case --- it's still live */
- }
- else
+ if (result)
{
/*
- * It's been deleted, so no error, and no need to
- * continue searching
+ * It is a duplicate. If we are only doing a partial
+ * check, then don't bother checking if the tuple is being
+ * updated in another transaction. Just return the fact
+ * that it is a potential conflict and leave the full
+ * check till later.
*/
- break;
- }
+ if (checkUnique == UNIQUE_CHECK_PARTIAL)
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ *is_unique = false;
+ return InvalidTransactionId;
+ }
- /*
- * Check for a conflict-in as we would if we were going to
- * write to this page. We aren't actually going to write,
- * but we want a chance to report SSI conflicts that would
- * otherwise be masked by this unique constraint
- * violation.
- */
- CheckForSerializableConflictIn(rel, NULL, buf);
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ xwait = (TransactionIdIsValid(SnapshotDirty.xmin)) ?
+ SnapshotDirty.xmin : SnapshotDirty.xmax;
- /*
- * This is a definite conflict. Break the tuple down into
- * datums and report the error. But first, make sure we
- * release the buffer locks we're holding ---
- * BuildIndexValueDescription could make catalog accesses,
- * which in the worst case might touch this same index and
- * cause deadlocks.
- */
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf);
- _bt_relbuf(rel, buf);
+ if (TransactionIdIsValid(xwait))
+ {
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ /* Tell _bt_doinsert to wait... */
+ *speculativeToken = SnapshotDirty.speculativeToken;
+ return xwait;
+ }
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
+ /*
+ * Otherwise we have a definite conflict. But before
+ * complaining, look to see if the tuple we want to insert
+ * is itself now committed dead --- if so, don't complain.
+ * This is a waste of time in normal scenarios but we must
+ * do it to support CREATE INDEX CONCURRENTLY.
+ *
+ * We must follow HOT-chains here because during
+ * concurrent index build, we insert the root TID though
+ * the actual tuple may be somewhere in the HOT-chain.
+ * While following the chain we might not stop at the
+ * exact tuple which triggered the insert, but that's OK
+ * because if we find a live tuple anywhere in this chain,
+ * we have a unique key conflict. The other live tuple is
+ * not part of this chain because it had a different index
+ * entry.
+ */
+ recheck = false;
+ ItemPointerCopy(&itup->t_tid, &htid);
+ if (heap_hot_search(&htid, heapRel, SnapshotSelf, NULL,
+ &recheck, &buffer, &heapTuple))
+ {
+ bool result = true;
+ if (recheck)
+ {
+ /*
+ * Recheck if the tuple actually satisfies the
+ * index key. Otherwise, we might be following
+ * a wrong index pointer and mustn't entertain
+ * this tuple
+ */
+ result = btrecheck(rel, itup, heapRel, &heapTuple);
+ UnlockReleaseBuffer(buffer);
+ }
+ if (!result)
+ break;
+ /* Normal case --- it's still live */
+ }
+ else
+ {
+ /*
+ * It's been deleted, so no error, and no need to
+ * continue searching
+ */
+ break;
+ }
- index_deform_tuple(itup, RelationGetDescr(rel),
- values, isnull);
+ /*
+ * Check for a conflict-in as we would if we were going to
+ * write to this page. We aren't actually going to write,
+ * but we want a chance to report SSI conflicts that would
+ * otherwise be masked by this unique constraint
+ * violation.
+ */
+ CheckForSerializableConflictIn(rel, NULL, buf);
- key_desc = BuildIndexValueDescription(rel, values,
- isnull);
+ /*
+ * This is a definite conflict. Break the tuple down into
+ * datums and report the error. But first, make sure we
+ * release the buffer locks we're holding ---
+ * BuildIndexValueDescription could make catalog accesses,
+ * which in the worst case might touch this same index and
+ * cause deadlocks.
+ */
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf);
+ _bt_relbuf(rel, buf);
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("duplicate key value violates unique constraint \"%s\"",
- RelationGetRelationName(rel)),
- key_desc ? errdetail("Key %s already exists.",
- key_desc) : 0,
- errtableconstraint(heapRel,
- RelationGetRelationName(rel))));
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ index_deform_tuple(itup, RelationGetDescr(rel),
+ values, isnull);
+
+ key_desc = BuildIndexValueDescription(rel, values,
+ isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("duplicate key value violates unique constraint \"%s\"",
+ RelationGetRelationName(rel)),
+ key_desc ? errdetail("Key %s already exists.",
+ key_desc) : 0,
+ errtableconstraint(heapRel,
+ RelationGetRelationName(rel))));
+ }
}
}
else if (all_dead)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff..952ed8f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "executor/nodeIndexscan.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/indexfsm.h"
@@ -163,6 +164,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = btestimateparallelscan;
amroutine->aminitparallelscan = btinitparallelscan;
amroutine->amparallelrescan = btparallelrescan;
+ amroutine->amrecheck = btrecheck;
PG_RETURN_POINTER(amroutine);
}
@@ -344,8 +346,9 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
- /* btree indexes are never lossy */
+ /* btree indexes are never lossy, except for WARM tuples */
scan->xs_recheck = false;
+ scan->xs_tuple_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5b259a3..c376c1b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,11 +20,15 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/index.h"
+#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/datum.h"
typedef struct BTSortArrayContext
@@ -2069,3 +2073,103 @@ btproperty(Oid index_oid, int attno,
return false; /* punt to generic code */
}
}
+
+/*
+ * Check if the index tuple's key matches the one computed from the given heap
+ * tuple's attributes
+ */
+bool
+btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
+{
+ IndexInfo *indexInfo;
+ EState *estate;
+ ExprContext *econtext;
+ TupleTableSlot *slot;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int i;
+ bool equal;
+ int natts = indexRel->rd_rel->relnatts;
+ Form_pg_attribute att;
+
+ /* Get IndexInfo for this index */
+ indexInfo = BuildIndexInfo(indexRel);
+
+ /*
+ * The heap tuple must be put into a slot for FormIndexDatum.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel));
+
+ ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+ /*
+ * Typically the index won't have expressions, but if it does we need an
+ * EState to evaluate them. We need it for exclusion constraints too,
+ * even if they are just on simple columns.
+ */
+ if (indexInfo->ii_Expressions != NIL ||
+ indexInfo->ii_ExclusionOps != NULL)
+ {
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ econtext->ecxt_scantuple = slot;
+ }
+ else
+ estate = NULL;
+
+ /*
+ * Form the index values and isnull flags for the index entry that we need
+ * to check.
+ *
+ * Note: if the index uses functions that are not as immutable as they are
+ * supposed to be, this could produce an index tuple different from the
+ * original. The index AM can catch such errors by verifying that it
+ * finds a matching index entry with the tuple's TID. For exclusion
+ * constraints we check this in check_exclusion_constraint().
+ */
+ FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+ equal = true;
+ for (i = 1; i <= natts; i++)
+ {
+ Datum indxvalue;
+ bool indxisnull;
+
+ indxvalue = index_getattr(indexTuple, i, indexRel->rd_att, &indxisnull);
+
+ /*
+ * If both are NULL, then they are equal
+ */
+ if (isnull[i - 1] && indxisnull)
+ continue;
+
+ /*
+ * If just one is NULL, then they are not equal
+ */
+ if (isnull[i - 1] || indxisnull)
+ {
+ equal = false;
+ break;
+ }
+
+ /*
+ * Now just do a raw memory comparison. If the index tuple was formed
+ * using this heap tuple, the computed index values must match
+ */
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
+ }
+
+ if (estate != NULL)
+ FreeExecutorState(estate);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ return equal;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index e57ac49..59ef7f3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -72,6 +72,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amestimateparallelscan = NULL;
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amrecheck = NULL;
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8d42a34..049eb28 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/var.h"
#include "parser/parser.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -1691,6 +1692,20 @@ BuildIndexInfo(Relation index)
ii->ii_AmCache = NULL;
ii->ii_Context = CurrentMemoryContext;
+ /* build a bitmap of all table attributes referenced by this index */
+ for (i = 0; i < ii->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attr = ii->ii_KeyAttrNumbers[i];
+ ii->ii_indxattrs = bms_add_member(ii->ii_indxattrs, attr -
+ FirstLowInvalidHeapAttributeNumber);
+ }
+
+ /* Collect all attributes used in expressions, too */
+ pull_varattnos((Node *) ii->ii_Expressions, 1, &ii->ii_indxattrs);
+
+ /* Collect all attributes in the index predicate, too */
+ pull_varattnos((Node *) ii->ii_Predicate, 1, &ii->ii_indxattrs);
+
return ii;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index abc344a..970254f 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -66,10 +66,15 @@ CatalogCloseIndexes(CatalogIndexState indstate)
*
* This should be called for each inserted or updated catalog tuple.
*
+ * If the tuple was WARM updated, modified_attrs contains the set of
+ * columns changed by the update. We must not insert new index entries
+ * for indexes that do not refer to any of the modified columns.
+ *
* This is effectively a cut-down version of ExecInsertIndexTuples.
*/
static void
-CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
+CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
+ Bitmapset *modified_attrs, bool warm_update)
{
int i;
int numIndexes;
@@ -79,12 +84,28 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
IndexInfo **indexInfoArray;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData root_tid;
- /* HOT update does not require index inserts */
- if (HeapTupleIsHeapOnly(heapTuple))
+ /*
+ * A HOT update does not require index inserts, but a WARM update may
+ * still need inserts into some indexes.
+ */
+ if (HeapTupleIsHeapOnly(heapTuple) && !warm_update)
return;
/*
+ * If we've done a WARM update, then we must index the TID of the root line
+ * pointer and not the actual TID of the new tuple.
+ */
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(heapTuple->t_self)),
+ HeapTupleHeaderGetRootOffset(heapTuple->t_data));
+ else
+ ItemPointerCopy(&heapTuple->t_self, &root_tid);
+
+
+ /*
* Get information from the state structure. Fall out if nothing to do.
*/
numIndexes = indstate->ri_NumIndices;
@@ -112,6 +133,17 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
continue;
/*
+ * If we've done a WARM update, then we must not insert a new index tuple
+ * if none of the index keys have changed. This is not just an
+ * optimization, but a requirement for WARM to work correctly.
+ */
+ if (warm_update)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
+ /*
* Expressional and partial indexes on system catalogs are not
* supported, nor exclusion constraints, nor deferred uniqueness
*/
@@ -136,7 +168,7 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple)
index_insert(relationDescs[i], /* index relation */
values, /* array of index Datums */
isnull, /* is-null flags */
- &(heapTuple->t_self), /* tid of heap tuple */
+ &root_tid,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
@@ -168,7 +200,7 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
CatalogCloseIndexes(indstate);
return oid;
@@ -190,7 +222,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
@@ -210,12 +242,14 @@ void
CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
{
CatalogIndexState indstate;
+ bool warm_update;
+ Bitmapset *modified_attrs;
indstate = CatalogOpenIndexes(heapRel);
- simple_heap_update(heapRel, otid, tup);
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
- CatalogIndexInsert(indstate, tup);
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
CatalogCloseIndexes(indstate);
}
@@ -231,9 +265,12 @@ void
CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
CatalogIndexState indstate)
{
- simple_heap_update(heapRel, otid, tup);
+ Bitmapset *modified_attrs;
+ bool warm_update;
- CatalogIndexInsert(indstate, tup);
+ simple_heap_update(heapRel, otid, tup, &modified_attrs, &warm_update);
+
+ CatalogIndexInsert(indstate, tup, modified_attrs, warm_update);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ba980de..410ccd3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -498,6 +498,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(C.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -528,7 +529,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(C.oid) AS n_tup_warm_upd
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
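
As a rough illustration (assuming the patch is applied and statistics have
been collected for a user table), the new counter can be compared against the
existing HOT counter through the same monitoring views:

    SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_warm_upd
      FROM pg_stat_user_tables
     ORDER BY n_tup_warm_upd DESC;
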
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index e2544e5..d9c0fe7 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -40,6 +40,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
TriggerData *trigdata = castNode(TriggerData, fcinfo->context);
const char *funcname = "unique_key_recheck";
HeapTuple new_row;
+ HeapTupleData heapTuple;
ItemPointerData tmptid;
Relation indexRel;
IndexInfo *indexInfo;
@@ -102,7 +103,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
* removed.
*/
tmptid = new_row->t_self;
- if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL))
+ if (!heap_hot_search(&tmptid, trigdata->tg_relation, SnapshotSelf, NULL,
+ NULL, NULL, &heapTuple))
{
/*
* All rows in the HOT chain are dead, so skip the check.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3102ab1..428fc65 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2681,6 +2681,8 @@ CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot,
&(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate,
false,
NULL,
@@ -2835,6 +2837,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ &(bufferedTuples[i]->t_self), NULL,
estate, false, NULL, NIL);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..d8f033d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -699,7 +699,14 @@ DefineIndex(Oid relationId,
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * tries to either insert into it or use it for queries. In addition to
+ * that, WARM updates will be disallowed if an update is modifying one of
+ * the columns used by this new index. This is necessary to ensure that we
+ * don't create WARM tuples which do not have a corresponding entry in this
+ * index. Note that during the second phase we will index only those heap
+ * tuples whose root line pointer is not already in the index, hence it's
+ * important that all tuples in a given chain have the same value for any
+ * indexed column (including this new index).
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
@@ -747,7 +754,10 @@ DefineIndex(Oid relationId,
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
- * chain have different index keys.
+ * chain have different index keys. Also, the new index is consulted for
+ * deciding whether a WARM update is possible, and a WARM update is skipped
+ * if a column used by this index is being updated. This ensures that we
+ * don't create WARM tuples which are not indexed by this index.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
@@ -782,7 +792,8 @@ DefineIndex(Oid relationId,
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
- * insert new entries into the index for insertions and non-HOT updates.
+ * insert new entries into the index for insertions and non-HOT updates,
+ * and for WARM updates where this index needs a new entry.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5d47f16..7376099 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1033,6 +1033,19 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM
+ * tuple, there could be multiple index entries
+ * pointing to the root of this chain. We can't do
+ * index-only scans for such tuples without rechecking
+ * the index keys, so mark the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ break;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, visibility_cutoff_xid))
visibility_cutoff_xid = xmin;
@@ -2159,6 +2172,18 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
break;
}
+ /*
+ * If this tuple was ever WARM updated or is a WARM tuple,
+ * there could be multiple index entries pointing to the
+ * root of this chain. We can't do index-only scans for
+ * such tuples without rechecking the index keys, so mark
+ * the page as !all_visible.
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(tuple.t_data))
+ {
+ all_visible = false;
+ }
+
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 2142273..d62d2de 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -270,6 +270,8 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
+ ItemPointer root_tid,
+ Bitmapset *modified_attrs,
EState *estate,
bool noDupErr,
bool *specConflict,
@@ -324,6 +326,17 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
if (!indexInfo->ii_ReadyForInserts)
continue;
+ /*
+ * If modified_attrs is set, we only insert index entries for those
+ * indexes whose columns have changed. All other indexes can use their
+ * existing index pointers to look up the new tuple.
+ */
+ if (modified_attrs)
+ {
+ if (!bms_overlap(modified_attrs, indexInfo->ii_indxattrs))
+ continue;
+ }
+
/* Check for partial index */
if (indexInfo->ii_Predicate != NIL)
{
@@ -389,7 +402,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
index_insert(indexRelation, /* index relation */
values, /* array of index Datums */
isnull, /* null flags */
- tupleid, /* tid of heap tuple */
+ root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
indexInfo); /* index AM may need this */
@@ -791,6 +804,9 @@ retry:
{
if (!HeapTupleHeaderIsHeapLatest(tup->t_data, &tup->t_self))
HeapTupleHeaderGetNextTid(tup->t_data, &ctid_wait);
+ else
+ ItemPointerCopy(&tup->t_self, &ctid_wait);
+
reason_wait = indexInfo->ii_ExclusionOps ?
XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
index_endscan(index_scan);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f20d728..943a30c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -399,6 +399,8 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self),
+ NULL,
estate, false, NULL,
NIL);
@@ -445,6 +447,8 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
if (!skip_tuple)
{
List *recheckIndexes = NIL;
+ bool warm_update;
+ Bitmapset *modified_attrs;
/* Check the constraints of the tuple */
if (rel->rd_att->constr)
@@ -455,13 +459,30 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
/* OK, update the tuple and index entries for it */
simple_heap_update(rel, &searchslot->tts_tuple->t_self,
- slot->tts_tuple);
+ slot->tts_tuple, &modified_attrs, &warm_update);
if (resultRelInfo->ri_NumIndices > 0 &&
- !HeapTupleIsHeapOnly(slot->tts_tuple))
+ (!HeapTupleIsHeapOnly(slot->tts_tuple) || warm_update))
+ {
+ ItemPointerData root_tid;
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self,
+ &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
+
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL,
NIL);
+ }
/* AFTER ROW UPDATE Triggers */
ExecARUpdateTriggers(estate, resultRelInfo,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1aa9f1..35b0b83 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -39,6 +39,7 @@
#include "access/relscan.h"
#include "access/transam.h"
+#include "access/valid.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapHeapscan.h"
#include "pgstat.h"
@@ -314,11 +315,27 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
+ bool recheck = false;
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
- scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ &heapTuple, NULL, true, &recheck))
+ {
+ bool valid = true;
+
+ if (scan->rs_key)
+ HeapKeyTest(&heapTuple, RelationGetDescr(scan->rs_rd),
+ scan->rs_nkeys, scan->rs_key, valid);
+ if (valid)
+ scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+
+ /*
+ * If the heap tuple needs a recheck because of a WARM update,
+ * treat the page as lossy.
+ */
+ if (recheck)
+ tbmres->recheck = true;
+ }
}
}
else
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index cb6aff9..355a2d8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -142,10 +142,10 @@ IndexNext(IndexScanState *node)
false); /* don't pfree */
/*
- * If the index was lossy, we have to recheck the index quals using
- * the fetched tuple.
+ * If the index was lossy or the tuple was WARM, we have to recheck
+ * the index quals using the fetched tuple.
*/
- if (scandesc->xs_recheck)
+ if (scandesc->xs_recheck || scandesc->xs_tuple_recheck)
{
econtext->ecxt_scantuple = slot;
ResetExprContext(econtext);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..a1f3440 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -512,6 +512,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, true, &specConflict,
arbiterIndexes);
@@ -558,6 +559,7 @@ ExecInsert(ModifyTableState *mtstate,
/* insert index entries for tuple */
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &(tuple->t_self), NULL,
estate, false, NULL,
arbiterIndexes);
}
@@ -891,6 +893,9 @@ ExecUpdate(ItemPointer tupleid,
HTSU_Result result;
HeapUpdateFailureData hufd;
List *recheckIndexes = NIL;
+ Bitmapset *modified_attrs = NULL;
+ ItemPointerData root_tid;
+ bool warm_update;
/*
* abort the operation if not running transactions
@@ -1007,7 +1012,7 @@ lreplace:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd, &lockmode);
+ &hufd, &lockmode, &modified_attrs, &warm_update);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -1094,10 +1099,28 @@ lreplace:;
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
+ *
+ * If it's a WARM update, then we must insert new entries with TID
+ * pointing to the root of the WARM chain.
*/
- if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
+ if (resultRelInfo->ri_NumIndices > 0 &&
+ (!HeapTupleIsHeapOnly(tuple) || warm_update))
+ {
+ if (warm_update)
+ ItemPointerSet(&root_tid,
+ ItemPointerGetBlockNumber(&(tuple->t_self)),
+ HeapTupleHeaderGetRootOffset(tuple->t_data));
+ else
+ {
+ ItemPointerCopy(&tuple->t_self, &root_tid);
+ bms_free(modified_attrs);
+ modified_attrs = NULL;
+ }
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+ &root_tid,
+ modified_attrs,
estate, false, NULL, NIL);
+ }
}
if (canSetTag)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2fb9a8b..35cc6c5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1823,7 +1823,7 @@ pgstat_count_heap_insert(Relation rel, int n)
* pgstat_count_heap_update - count a tuple update
*/
void
-pgstat_count_heap_update(Relation rel, bool hot)
+pgstat_count_heap_update(Relation rel, bool hot, bool warm)
{
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
@@ -1841,6 +1841,8 @@ pgstat_count_heap_update(Relation rel, bool hot)
/* t_tuples_hot_updated is nontransactional, so just advance it */
if (hot)
pgstat_info->t_counts.t_tuples_hot_updated++;
+ else if (warm)
+ pgstat_info->t_counts.t_tuples_warm_updated++;
}
}
@@ -4088,6 +4090,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->tuples_updated = 0;
result->tuples_deleted = 0;
result->tuples_hot_updated = 0;
+ result->tuples_warm_updated = 0;
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
@@ -5197,6 +5200,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated = tabmsg->t_counts.t_tuples_warm_updated;
tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
@@ -5224,6 +5228,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
+ tabentry->tuples_warm_updated += tabmsg->t_counts.t_tuples_warm_updated;
/* If table was truncated, first reset the live/dead counters */
if (tabmsg->t_counts.t_truncated)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a987d0d..b8677f3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -145,6 +145,22 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
Datum
+pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+
+Datum
pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
@@ -1644,6 +1660,21 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_xact_tuples_warm_updated(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_TableStatus *tabentry;
+
+ if ((tabentry = find_tabstat_entry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (tabentry->t_counts.t_tuples_warm_updated);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_xact_blocks_fetched(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..c85898c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2338,6 +2338,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
list_free_deep(relation->rd_fkeylist);
list_free(relation->rd_indexlist);
bms_free(relation->rd_indexattr);
+ bms_free(relation->rd_exprindexattr);
bms_free(relation->rd_keyattr);
bms_free(relation->rd_pkattr);
bms_free(relation->rd_idattr);
@@ -4352,6 +4353,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
@@ -4759,15 +4767,19 @@ Bitmapset *
RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
Bitmapset *indexattrs; /* indexed columns */
+ Bitmapset *exprindexattrs; /* indexed columns in expression/predicate
+ indexes */
Bitmapset *uindexattrs; /* columns in unique indexes */
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
+ Bitmapset *indxnotreadyattrs; /* columns in not ready indexes */
List *indexoidlist;
List *newindexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
+ bool supportswarm = true; /* True if the table can be WARM updated */
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4782,6 +4794,10 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return bms_copy(relation->rd_idattr);
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return bms_copy(relation->rd_exprindexattr);
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return bms_copy(relation->rd_indxnotreadyattr);
default:
elog(ERROR, "unknown attrKind %u", attrKind);
}
@@ -4822,9 +4838,11 @@ restart:
* won't be returned at all by RelationGetIndexList.
*/
indexattrs = NULL;
+ exprindexattrs = NULL;
uindexattrs = NULL;
pkindexattrs = NULL;
idindexattrs = NULL;
+ indxnotreadyattrs = NULL;
foreach(l, indexoidlist)
{
Oid indexOid = lfirst_oid(l);
@@ -4861,6 +4879,10 @@ restart:
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_member(indxnotreadyattrs,
+ attrnum - FirstLowInvalidHeapAttributeNumber);
+
if (isKey)
uindexattrs = bms_add_member(uindexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
@@ -4876,10 +4898,29 @@ restart:
}
/* Collect all attributes used in expressions, too */
- pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &exprindexattrs);
/* Collect all attributes in the index predicate, too */
- pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);
+ pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &exprindexattrs);
+
+ /*
+ * indexattrs should include attributes referenced in index expressions
+ * and predicates too
+ */
+ indexattrs = bms_add_members(indexattrs, exprindexattrs);
+
+ if (!indexInfo->ii_ReadyForInserts)
+ indxnotreadyattrs = bms_add_members(indxnotreadyattrs,
+ exprindexattrs);
+
+ /*
+ * Check if the index has amrecheck method defined. If the method is
+ * not defined, the index does not support WARM update. Completely
+ * disable WARM updates on such tables
+ */
+ if (!indexDesc->rd_amroutine->amrecheck)
+ supportswarm = false;
+
index_close(indexDesc, AccessShareLock);
}
@@ -4912,15 +4953,22 @@ restart:
goto restart;
}
+ /* Remember if the table can do WARM updates */
+ relation->rd_supportswarm = supportswarm;
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
+ bms_free(relation->rd_exprindexattr);
+ relation->rd_exprindexattr = NULL;
bms_free(relation->rd_keyattr);
relation->rd_keyattr = NULL;
bms_free(relation->rd_pkattr);
relation->rd_pkattr = NULL;
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
+ bms_free(relation->rd_indxnotreadyattr);
+ relation->rd_indxnotreadyattr = NULL;
/*
* Now save copies of the bitmaps in the relcache entry. We intentionally
@@ -4933,7 +4981,9 @@ restart:
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
+ relation->rd_exprindexattr = bms_copy(exprindexattrs);
+ relation->rd_indexattr = bms_copy(bms_union(indexattrs, exprindexattrs));
+ relation->rd_indxnotreadyattr = bms_copy(indxnotreadyattrs);
MemoryContextSwitchTo(oldcxt);
/* We return our original working copy for caller to play with */
@@ -4947,6 +4997,10 @@ restart:
return bms_copy(relation->rd_pkattr);
case INDEX_ATTR_BITMAP_IDENTITY_KEY:
return idindexattrs;
+ case INDEX_ATTR_BITMAP_EXPR_PREDICATE:
+ return exprindexattrs;
+ case INDEX_ATTR_BITMAP_NOTREADY:
+ return indxnotreadyattrs;
default:
elog(ERROR, "unknown attrKind %u", attrKind);
return NULL;
@@ -5559,6 +5613,7 @@ load_relcache_init_file(bool shared)
rel->rd_keyattr = NULL;
rel->rd_pkattr = NULL;
rel->rd_idattr = NULL;
+ rel->rd_indxnotreadyattr = NULL;
rel->rd_pubactions = NULL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index f919cf8..d7702e5 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -13,6 +13,7 @@
#define AMAPI_H
#include "access/genam.h"
+#include "access/itup.h"
/*
* We don't wish to include planner header files here, since most of an index
@@ -152,6 +153,10 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* recheck index tuple and heap tuple match */
+typedef bool (*amrecheck_function) (Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -217,6 +222,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WARM */
+ amrecheck_function amrecheck; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bfdfed8..0af6b4e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -391,4 +391,8 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
+/* hash.c */
+extern bool hashrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
+
#endif /* HASH_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 95aa976..9412c3a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -137,9 +137,10 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
Relation stats_relation);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, bool *recheck);
extern bool heap_hot_search(ItemPointer tid, Relation relation,
- Snapshot snapshot, bool *all_dead);
+ Snapshot snapshot, bool *all_dead,
+ bool *recheck, Buffer *buffer, HeapTuple heapTuple);
extern void heap_get_latest_tid(Relation relation, Snapshot snapshot,
ItemPointer tid);
@@ -161,7 +162,8 @@ extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd, LockTupleMode *lockmode);
+ HeapUpdateFailureData *hufd, LockTupleMode *lockmode,
+ Bitmapset **modified_attrsp, bool *warm_update);
extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_update,
@@ -176,7 +178,9 @@ extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
- HeapTuple tup);
+ HeapTuple tup,
+ Bitmapset **modified_attrs,
+ bool *warm_update);
extern void heap_sync(Relation relation);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index e6019d5..9b081bf 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -80,6 +80,7 @@
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+#define XLH_UPDATE_WARM_UPDATE (1<<7)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 4d614b7..b5891ca 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -260,7 +260,8 @@ struct HeapTupleHeaderData
* information stored in t_infomask2:
*/
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
-/* bits 0x0800 are available */
+#define HEAP_WARM_TUPLE 0x0800 /* This tuple is a part of a WARM chain
+ */
#define HEAP_LATEST_TUPLE 0x1000 /*
* This is the last tuple in chain and
* ip_posid points to the root line
@@ -271,7 +272,7 @@ struct HeapTupleHeaderData
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
-#define HEAP2_XACT_MASK 0xF000 /* visibility-related bits */
+#define HEAP2_XACT_MASK 0xF800 /* visibility-related bits */
/*
@@ -510,6 +511,21 @@ do { \
((tup)->t_infomask2 & HEAP_ONLY_TUPLE) != 0 \
)
+#define HeapTupleHeaderSetHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 |= HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderClearHeapWarmTuple(tup) \
+do { \
+ (tup)->t_infomask2 &= ~HEAP_WARM_TUPLE; \
+} while (0)
+
+#define HeapTupleHeaderIsHeapWarmTuple(tup) \
+( \
+ ((tup)->t_infomask2 & HEAP_WARM_TUPLE) != 0 \
+)
+
/*
* Mark this as the last tuple in the HOT chain. Before PG v10 we used to store
* the TID of the tuple itself in t_ctid field to mark the end of the chain.
@@ -785,6 +801,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapOnly(tuple) \
HeapTupleHeaderClearHeapOnly((tuple)->t_data)
+#define HeapTupleIsHeapWarmTuple(tuple) \
+ HeapTupleHeaderIsHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTuple(tuple) \
+ HeapTupleHeaderSetHeapWarmTuple((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTuple(tuple) \
+ HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f9304db..d4b35ca 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -537,6 +537,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
+extern bool btrecheck(Relation indexRel, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3fc726d..f971b43 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -119,7 +119,8 @@ typedef struct IndexScanDescData
HeapTupleData xs_ctup; /* current heap tuple, if any */
Buffer xs_cbuf; /* current heap buffer in scan, if any */
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
- bool xs_recheck; /* T means scan keys must be rechecked */
+ bool xs_recheck; /* T means scan keys must be rechecked for each tuple */
+ bool xs_tuple_recheck; /* T means scan keys must be rechecked for current tuple */
/*
* When fetching with an ordering operator, the values of the ORDER BY
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ec4aedb..ec42c30 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2740,6 +2740,8 @@ DATA(insert OID = 1933 ( pg_stat_get_tuples_deleted PGNSP PGUID 12 1 0 0 0 f f
DESCR("statistics: number of tuples deleted");
DATA(insert OID = 1972 ( pg_stat_get_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated");
+DATA(insert OID = 3353 ( pg_stat_get_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated");
DATA(insert OID = 2878 ( pg_stat_get_live_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_live_tuples _null_ _null_ _null_ ));
DESCR("statistics: number of live tuples");
DATA(insert OID = 2879 ( pg_stat_get_dead_tuples PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_dead_tuples _null_ _null_ _null_ ));
@@ -2892,6 +2894,8 @@ DATA(insert OID = 3042 ( pg_stat_get_xact_tuples_deleted PGNSP PGUID 12 1 0 0
DESCR("statistics: number of tuples deleted in current transaction");
DATA(insert OID = 3043 ( pg_stat_get_xact_tuples_hot_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_hot_updated _null_ _null_ _null_ ));
DESCR("statistics: number of tuples hot updated in current transaction");
+DATA(insert OID = 3354 ( pg_stat_get_xact_tuples_warm_updated PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_tuples_warm_updated _null_ _null_ _null_ ));
+DESCR("statistics: number of tuples warm updated in current transaction");
DATA(insert OID = 3044 ( pg_stat_get_xact_blocks_fetched PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_fetched _null_ _null_ _null_ ));
DESCR("statistics: number of blocks fetched in current transaction");
DATA(insert OID = 3045 ( pg_stat_get_xact_blocks_hit PGNSP PGUID 12 1 0 0 0 f f f f t f v r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_xact_blocks_hit _null_ _null_ _null_ ));
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..c4495a3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -382,6 +382,7 @@ extern void UnregisterExprContextCallback(ExprContext *econtext,
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
+ ItemPointer root_tid, Bitmapset *modified_attrs,
EState *estate, bool noDupErr, bool *specConflict,
List *arbiterIndexes);
extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index ea3f3a5..ebeec74 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -41,5 +41,4 @@ extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-
#endif /* NODEINDEXSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2fde67a..0b16157 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -64,6 +64,7 @@ typedef struct IndexInfo
NodeTag type;
int ii_NumIndexAttrs;
AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];
+ Bitmapset *ii_indxattrs; /* bitmap of all columns used in this index */
List *ii_Expressions; /* list of Expr */
List *ii_ExpressionsState; /* list of ExprState */
List *ii_Predicate; /* list of Expr */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0062fb8..70a7c8d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -105,6 +105,7 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_tuples_updated;
PgStat_Counter t_tuples_deleted;
PgStat_Counter t_tuples_hot_updated;
+ PgStat_Counter t_tuples_warm_updated;
bool t_truncated;
PgStat_Counter t_delta_live_tuples;
@@ -625,6 +626,7 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter tuples_updated;
PgStat_Counter tuples_deleted;
PgStat_Counter tuples_hot_updated;
+ PgStat_Counter tuples_warm_updated;
PgStat_Counter n_live_tuples;
PgStat_Counter n_dead_tuples;
@@ -1178,7 +1180,7 @@ pgstat_report_wait_end(void)
(pgStatBlockWriteTime += (n))
extern void pgstat_count_heap_insert(Relation rel, int n);
-extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_heap_update(Relation rel, bool hot, bool warm);
extern void pgstat_count_heap_delete(Relation rel);
extern void pgstat_count_truncate(Relation rel);
extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7c..fbac7c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -138,9 +138,14 @@ typedef struct RelationData
/* data managed by RelationGetIndexAttrBitmap: */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
+ Bitmapset *rd_exprindexattr; /* identifies columns used in expression or
+ predicate indexes */
+ Bitmapset *rd_indxnotreadyattr; /* columns used by indexes not yet
+ ready */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
Bitmapset *rd_idattr; /* included in replica identity index */
+ bool rd_supportswarm; /* True if the table can be WARM updated */
PublicationActions *rd_pubactions; /* publication actions */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index da36b67..d18bd09 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -50,7 +50,9 @@ typedef enum IndexAttrBitmapKind
INDEX_ATTR_BITMAP_ALL,
INDEX_ATTR_BITMAP_KEY,
INDEX_ATTR_BITMAP_PRIMARY_KEY,
- INDEX_ATTR_BITMAP_IDENTITY_KEY
+ INDEX_ATTR_BITMAP_IDENTITY_KEY,
+ INDEX_ATTR_BITMAP_EXPR_PREDICATE,
+ INDEX_ATTR_BITMAP_NOTREADY
} IndexAttrBitmapKind;
extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c661f1d..561d9579 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1732,6 +1732,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_tuples_warm_updated(c.oid) AS n_tup_warm_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
@@ -1875,6 +1876,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1918,6 +1920,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,
pg_stat_all_tables.n_tup_hot_upd,
+ pg_stat_all_tables.n_tup_warm_upd,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
@@ -1955,7 +1958,8 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
pg_stat_get_xact_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_xact_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_xact_tuples_deleted(c.oid) AS n_tup_del,
- pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd
+ pg_stat_get_xact_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
+ pg_stat_get_xact_tuples_warm_updated(c.oid) AS n_tup_warm_upd
FROM ((pg_class c
LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
@@ -1971,7 +1975,8 @@ pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_xact_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_xact_user_functions| SELECT p.oid AS funcid,
@@ -1993,7 +1998,8 @@ pg_stat_xact_user_tables| SELECT pg_stat_xact_all_tables.relid,
pg_stat_xact_all_tables.n_tup_ins,
pg_stat_xact_all_tables.n_tup_upd,
pg_stat_xact_all_tables.n_tup_del,
- pg_stat_xact_all_tables.n_tup_hot_upd
+ pg_stat_xact_all_tables.n_tup_hot_upd,
+ pg_stat_xact_all_tables.n_tup_warm_upd
FROM pg_stat_xact_all_tables
WHERE ((pg_stat_xact_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_xact_all_tables.schemaname !~ '^pg_toast'::text));
pg_statio_all_indexes| SELECT c.oid AS relid,
diff --git a/src/test/regress/expected/warm.out b/src/test/regress/expected/warm.out
new file mode 100644
index 0000000..6391891
--- /dev/null
+++ b/src/test/regress/expected/warm.out
@@ -0,0 +1,367 @@
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+ a | b | c | d
+---+--------+------+-----
+ 1 | 140001 | foo3 | bar
+(1 row)
+
+-- Check if index only scan works correctly
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab1
+ Recheck Cond: (b = 140001)
+ -> Bitmap Index Scan on updtst_indx1
+ Index Cond: (b = 140001)
+(4 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx1 on updtst_tab1
+ Index Cond: (b = 140001)
+(2 rows)
+
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+ b
+--------
+ 140001
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab1;
+------------------
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE a = 1;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+-----------------------------------------
+ Bitmap Heap Scan on updtst_tab2
+ Recheck Cond: (b = 701)
+ -> Bitmap Index Scan on updtst_indx2
+ Index Cond: (b = 701)
+(4 rows)
+
+SELECT * FROM updtst_tab2 WHERE b = 701;
+ a | b | c | d
+---+-----+------+-----
+ 1 | 701 | foo6 | bar
+(1 row)
+
+VACUUM updtst_tab2;
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx2 on updtst_tab2
+ Index Cond: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab2 WHERE b = 701;
+ b
+-----
+ 701
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab2;
+------------------
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 99
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 1;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+ a | b | c | d
+---+------+-------+-----
+ 1 | 1421 | foo12 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
+ QUERY PLAN
+-------------------------
+ Seq Scan on updtst_tab3
+ Filter: (b = 701)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 701;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+ b
+------
+ 1421
+(1 row)
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+SET enable_seqscan = false;
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+ count
+-------
+ 98
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+SELECT * FROM updtst_tab3 WHERE a = 2;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+-- Try fetching both old and new value using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+ a | b | c | d
+---+------+-------+-----
+ 2 | 1422 | foo22 | bar
+(1 row)
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
+ QUERY PLAN
+---------------------------------------------------
+ Index Only Scan using updtst_indx3 on updtst_tab3
+ Index Cond: (b = 702)
+(2 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 702;
+ b
+---
+(0 rows)
+
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+ b
+------
+ 1422
+(1 row)
+
+SET enable_seqscan = true;
+DROP TABLE updtst_tab3;
+------------------
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Bitmap Heap Scan on test_warm (cost=4.18..12.65 rows=4 width=64)
+ Recheck Cond: (lower(a) = 'test'::text)
+ -> Bitmap Index Scan on test_warmindx (cost=0.00..4.18 rows=4 width=0)
+ Index Cond: (lower(a) = 'test'::text)
+(4 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+select *, ctid from test_warm where a = 'test';
+ a | b | ctid
+---+---+------
+(0 rows)
+
+select *, ctid from test_warm where a = 'TEST';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Index Scan using test_warmindx on test_warm (cost=0.15..20.22 rows=4 width=64)
+ Index Cond: (lower(a) = 'test'::text)
+(2 rows)
+
+select *, ctid from test_warm where lower(a) = 'test';
+ a | b | ctid
+------+-----+-------
+ TEST | foo | (0,2)
+(1 row)
+
+DROP TABLE test_warm;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 13bf494..0b6193b 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -42,6 +42,8 @@ test: create_type
test: create_table
test: create_function_2
+test: warm
+
# ----------
# Load huge amounts of data
# We should split the data files into single files and then
diff --git a/src/test/regress/sql/warm.sql b/src/test/regress/sql/warm.sql
new file mode 100644
index 0000000..c025087
--- /dev/null
+++ b/src/test/regress/sql/warm.sql
@@ -0,0 +1,171 @@
+-- WARM update tests
+
+CREATE TABLE updtst_tab1 (a integer unique, b int, c text, d text);
+CREATE INDEX updtst_indx1 ON updtst_tab1 (b);
+INSERT INTO updtst_tab1
+ SELECT generate_series(1,10000), generate_series(70001, 80000), 'foo', 'bar';
+
+-- This should be a HOT update as non-index key is updated, but the
+-- page won't have any free space, so probably a non-HOT update
+UPDATE updtst_tab1 SET c = 'foo1' WHERE a = 1;
+
+-- Next update should be a HOT update as dead space is recycled
+UPDATE updtst_tab1 SET c = 'foo2' WHERE a = 1;
+
+-- And next too
+UPDATE updtst_tab1 SET c = 'foo3' WHERE a = 1;
+
+-- Now update one of the index key columns
+UPDATE updtst_tab1 SET b = b + 70000 WHERE a = 1;
+
+-- Ensure that the correct row is fetched
+SELECT * FROM updtst_tab1 WHERE a = 1;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Even when seqscan is disabled and indexscan is forced
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT * FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Check if index only scan works correctly
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+-- Table must be vacuumed to force index-only scan
+VACUUM updtst_tab1;
+EXPLAIN (costs off) SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+SELECT b FROM updtst_tab1 WHERE b = 70001 + 70000;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab1;
+
+------------------
+
+CREATE TABLE updtst_tab2 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx2 ON updtst_tab2 (b);
+INSERT INTO updtst_tab2
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+UPDATE updtst_tab2 SET b = b + 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo1' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab2 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab2 SET c = 'foo6' WHERE a = 1;
+
+SELECT count(*) FROM updtst_tab2 WHERE c = 'foo';
+SELECT * FROM updtst_tab2 WHERE c = 'foo6';
+
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE a = 1;
+
+SET enable_seqscan = false;
+EXPLAIN (costs off) SELECT * FROM updtst_tab2 WHERE b = 701;
+SELECT * FROM updtst_tab2 WHERE b = 701;
+
+VACUUM updtst_tab2;
+EXPLAIN (costs off) SELECT b FROM updtst_tab2 WHERE b = 701;
+SELECT b FROM updtst_tab2 WHERE b = 701;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab2;
+------------------
+
+CREATE TABLE updtst_tab3 (a integer unique, b int, c text, d text) WITH (fillfactor = 80);
+CREATE INDEX updtst_indx3 ON updtst_tab3 (b);
+INSERT INTO updtst_tab3
+ SELECT generate_series(1,100), generate_series(701, 800), 'foo', 'bar';
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo1', b = b + 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo2' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo3' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo4' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo5' WHERE a = 1;
+UPDATE updtst_tab3 SET c = 'foo6' WHERE a = 1;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo11', b = b + 750 WHERE b = 701;
+UPDATE updtst_tab3 SET c = 'foo12' WHERE a = 1;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 1;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo6';
+SELECT * FROM updtst_tab3 WHERE c = 'foo12';
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+SELECT * FROM updtst_tab3 WHERE a = 1;
+
+SELECT * FROM updtst_tab3 WHERE b = 701;
+SELECT * FROM updtst_tab3 WHERE b = 1421;
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 701;
+SELECT b FROM updtst_tab3 WHERE b = 1421;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo23' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 700 WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo24' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo25' WHERE a = 2;
+UPDATE updtst_tab3 SET c = 'foo26' WHERE a = 2;
+
+-- Abort the transaction and ensure the original tuple is visible correctly
+ROLLBACK;
+
+SET enable_seqscan = false;
+
+BEGIN;
+UPDATE updtst_tab3 SET c = 'foo21', b = b + 750 WHERE b = 702;
+UPDATE updtst_tab3 SET c = 'foo22' WHERE a = 2;
+UPDATE updtst_tab3 SET b = b - 30 WHERE a = 2;
+COMMIT;
+
+SELECT count(*) FROM updtst_tab3 WHERE c = 'foo';
+SELECT * FROM updtst_tab3 WHERE c = 'foo26';
+SELECT * FROM updtst_tab3 WHERE c = 'foo22';
+
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+SELECT * FROM updtst_tab3 WHERE a = 2;
+
+-- Try fetching both the old and new values using updtst_indx3
+SELECT * FROM updtst_tab3 WHERE b = 702;
+SELECT * FROM updtst_tab3 WHERE b = 1422;
+
+VACUUM updtst_tab3;
+EXPLAIN (costs off) SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 702;
+SELECT b FROM updtst_tab3 WHERE b = 1422;
+
+SET enable_seqscan = true;
+
+DROP TABLE updtst_tab3;
+------------------
+
+CREATE TABLE test_warm (a text unique, b text);
+CREATE INDEX test_warmindx ON test_warm (lower(a));
+INSERT INTO test_warm values ('test', 'foo');
+UPDATE test_warm SET a = 'TEST';
+select *, ctid from test_warm where lower(a) = 'test';
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where a = 'test';
+select *, ctid from test_warm where a = 'TEST';
+set enable_bitmapscan TO false;
+explain select * from test_warm where lower(a) = 'test';
+select *, ctid from test_warm where lower(a) = 'test';
+DROP TABLE test_warm;
--
2.1.4
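Not part of the attached patches, just a rough psql session that should exercise the
chain conversion path added by the patch below, assuming autovacuum is disabled and
no old snapshots hold back OldestXmin when the manual VACUUM runs; the table and
index names are invented for the illustration.

CREATE TABLE warm_demo (a int PRIMARY KEY, b int, c text);
CREATE INDEX warm_demo_b ON warm_demo (b);
INSERT INTO warm_demo SELECT g, g, 'foo' FROM generate_series(1, 1000) g;

-- Updating the indexed column b makes this a WARM update; the chain turns Red
UPDATE warm_demo SET b = b + 1000 WHERE a = 1;
SET enable_seqscan = false;
SELECT * FROM warm_demo WHERE b = 1001;    -- exactly one visible row expected

-- VACUUM should prune the old version, color the surviving Red index pointer
-- Blue and clear the WARM flags on the heap tuple, so the chain becomes
-- eligible for another WARM update
VACUUM warm_demo;
UPDATE warm_demo SET b = b + 1000 WHERE a = 1;
SELECT * FROM warm_demo WHERE b = 2001;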
Attachment: 0006-warm-chain-conversion-v16.patch (text/plain; charset=us-ascii)
From 2c901fe7c1829d21e3630070750c12d4415fb40c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 8 Mar 2017 13:51:12 -0300
Subject: [PATCH 6/6] warm chain conversion v16
---
contrib/bloom/blvacuum.c | 2 +-
src/backend/access/gin/ginvacuum.c | 3 +-
src/backend/access/gist/gistvacuum.c | 3 +-
src/backend/access/hash/hash.c | 82 ++++-
src/backend/access/hash/hashpage.c | 14 +
src/backend/access/heap/heapam.c | 323 +++++++++++++++--
src/backend/access/heap/tuptoaster.c | 3 +-
src/backend/access/index/indexam.c | 9 +-
src/backend/access/nbtree/nbtpage.c | 51 ++-
src/backend/access/nbtree/nbtree.c | 75 +++-
src/backend/access/nbtree/nbtxlog.c | 99 +----
src/backend/access/rmgrdesc/heapdesc.c | 26 +-
src/backend/access/rmgrdesc/nbtdesc.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 12 +-
src/backend/catalog/index.c | 11 +-
src/backend/catalog/indexing.c | 5 +-
src/backend/commands/constraint.c | 3 +-
src/backend/commands/vacuumlazy.c | 602 +++++++++++++++++++++++++++++--
src/backend/executor/execIndexing.c | 3 +-
src/backend/replication/logical/decode.c | 13 +-
src/backend/utils/time/combocid.c | 4 +-
src/backend/utils/time/tqual.c | 24 +-
src/include/access/amapi.h | 9 +
src/include/access/genam.h | 22 +-
src/include/access/hash.h | 11 +
src/include/access/heapam.h | 18 +
src/include/access/heapam_xlog.h | 23 +-
src/include/access/htup_details.h | 84 ++++-
src/include/access/nbtree.h | 18 +-
src/include/access/nbtxlog.h | 26 +-
src/include/commands/progress.h | 1 +
31 files changed, 1321 insertions(+), 262 deletions(-)
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 04abd0f..ff50361 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -88,7 +88,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
while (itup < itupEnd)
{
/* Do we have to delete this tuple? */
- if (callback(&itup->heapPtr, callback_state))
+ if (callback(&itup->heapPtr, false, callback_state) == IBDCR_DELETE)
{
/* Yes; adjust count of tuples that will be left on page */
BloomPageGetOpaque(page)->maxoff--;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index c9ccfee..8ed71c5 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -56,7 +56,8 @@ ginVacuumItemPointers(GinVacuumState *gvs, ItemPointerData *items,
*/
for (i = 0; i < nitem; i++)
{
- if (gvs->callback(items + i, gvs->callback_state))
+ if (gvs->callback(items + i, false, gvs->callback_state) ==
+ IBDCR_DELETE)
{
gvs->result->tuples_removed += 1;
if (!tmpitems)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..0955db6 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -202,7 +202,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
iid = PageGetItemId(page, i);
idxtuple = (IndexTuple) PageGetItem(page, iid);
- if (callback(&(idxtuple->t_tid), callback_state))
+ if (callback(&(idxtuple->t_tid), false, callback_state) ==
+ IBDCR_DELETE)
todelete[ntodelete++] = i;
else
stats->num_index_tuples += 1;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 9b20ae6..5310c67 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -73,6 +73,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambuild = hashbuild;
amroutine->ambuildempty = hashbuildempty;
amroutine->aminsert = hashinsert;
+ amroutine->amwarminsert = hashwarminsert;
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
@@ -231,11 +232,11 @@ hashbuildCallback(Relation index,
* Hash on the heap tuple's key, form an index tuple with hash code.
* Find the appropriate location for the new tuple, and put it there.
*/
-bool
-hashinsert(Relation rel, Datum *values, bool *isnull,
+static bool
+hashinsert_internal(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo, bool warm_update)
{
Datum index_values[1];
bool index_isnull[1];
@@ -251,6 +252,11 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
+ if (warm_update)
+ ItemPointerSetFlags(&itup->t_tid, HASH_INDEX_RED_POINTER);
+ else
+ ItemPointerClearFlags(&itup->t_tid);
+
_hash_doinsert(rel, itup);
pfree(itup);
@@ -258,6 +264,26 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
return false;
}
+bool
+hashinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return hashinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, false);
+}
+
+bool
+hashwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return hashinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, true);
+
+}
/*
* hashgettuple() -- Get the next tuple in the scan.
@@ -738,6 +764,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
Page page;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
+ OffsetNumber colorblue[MaxOffsetNumber];
+ int ncolorblue = 0;
bool retain_pin = false;
vacuum_delay_point();
@@ -755,20 +783,35 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
IndexTuple itup;
Bucket bucket;
bool kill_tuple = false;
+ bool color_tuple = false;
+ int flags;
+ bool is_red;
+ IndexBulkDeleteCallbackResult result;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
+ flags = ItemPointerGetFlags(&itup->t_tid);
+ is_red = ((flags & HASH_INDEX_RED_POINTER) != 0);
+
/*
* To remove the dead tuples, we strictly want to rely on results
* of callback function. refer btvacuumpage for detailed reason.
*/
- if (callback && callback(htup, callback_state))
+ if (callback)
{
- kill_tuple = true;
- if (tuples_removed)
- *tuples_removed += 1;
+ result = callback(htup, is_red, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ kill_tuple = true;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (result == IBDCR_COLOR_BLUE)
+ {
+ color_tuple = true;
+ }
}
else if (split_cleanup)
{
@@ -791,6 +834,12 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
}
}
+ if (color_tuple)
+ {
+ /* color the pointer blue */
+ colorblue[ncolorblue++] = offno;
+ }
+
if (kill_tuple)
{
/* mark the item for deletion */
@@ -815,9 +864,24 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
/*
* Apply deletions, advance to next page and write page if needed.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || ncolorblue > 0)
{
- PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the
+ * array and hence offnums recorded earlier won't make any sense
+ * after PageIndexMultiDelete is called.
+ */
+ if (ncolorblue > 0)
+ _hash_color_items(page, colorblue, ncolorblue);
+
+ /*
+ * And delete the deletable items
+ */
+ if (ndeletable > 0)
+ PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
MarkBufferDirty(buf);
}
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index c73929c..7df3e12 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1376,3 +1376,17 @@ _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey, int access,
return buf;
}
+
+void _hash_color_items(Page page, OffsetNumber *coloritemnos,
+ uint16 ncoloritems)
+{
+ int i;
+ IndexTuple itup;
+
+ for (i = 0; i < ncoloritems; i++)
+ {
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, coloritemnos[i]));
+ ItemPointerClearFlags(&itup->t_tid);
+ }
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b9ff94d..0ffb9a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1958,17 +1958,32 @@ heap_fetch(Relation relation,
}
/*
- * Check if the HOT chain containing this tid is actually a WARM chain.
- * Note that even if the WARM update ultimately aborted, we still must do a
- * recheck because the failing UPDATE when have inserted created index entries
- * which are now stale, but still referencing this chain.
+ * Check status of a (possibly) WARM chain.
+ *
+ * This function looks at a HOT/WARM chain starting at tid and returns a bitmask
+ * of information. We only follow the chain as long as it's known to be a valid
+ * HOT chain. The information returned by the function consists of:
+ *
+ * HCWC_WARM_TUPLE - a warm tuple is found somewhere in the chain. Note that
+ * when a tuple is WARM updated, both old and new versions
+ * of the tuple are treated as WARM tuples
+ *
+ * HCWC_RED_TUPLE - a warm tuple that is part of the Red chain is found
+ * somewhere in the chain.
+ *
+ * HCWC_BLUE_TUPLE - a warm tuple that is part of the Blue chain is found
+ * somewhere in the chain.
+ *
+ * If stop_at_warm is true, we stop when the first WARM tuple is found and
+ * return the information collected so far.
*/
-static bool
-hot_check_warm_chain(Page dp, ItemPointer tid)
+HeapCheckWarmChainStatus
+heap_check_warm_chain(Page dp, ItemPointer tid, bool stop_at_warm)
{
- TransactionId prev_xmax = InvalidTransactionId;
- OffsetNumber offnum;
- HeapTupleData heapTuple;
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ HeapCheckWarmChainStatus status = 0;
offnum = ItemPointerGetOffsetNumber(tid);
heapTuple.t_self = *tid;
@@ -1985,7 +2000,16 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
/* check for unused, dead, or redirected items */
if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
break;
+ }
heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
@@ -2000,13 +2024,30 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
break;
- /*
- * Presence of either WARM or WARM updated tuple signals possible
- * breakage and the caller must recheck tuple returned from this chain
- * for index satisfaction
- */
if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
- return true;
+ {
+ /* We found a WARM tuple */
+ status |= HCWC_WARM_TUPLE;
+
+ /*
+ * If we've been told to stop at the first WARM tuple, just return
+ * whatever information we have collected so far.
+ */
+ if (stop_at_warm)
+ return status;
+
+ /*
+ * If it's not a Red tuple, then it's definitely a Blue tuple. Set
+ * the appropriate bit.
+ */
+ if (HeapTupleHeaderIsWarmRed(heapTuple.t_data))
+ status |= HCWC_RED_TUPLE;
+ else
+ status |= HCWC_BLUE_TUPLE;
+ }
+ else
+ /* Must be a tuple belonging to the Blue chain */
+ status |= HCWC_BLUE_TUPLE;
/*
* Check to see if HOT chain continues past this tuple; if so fetch
@@ -2026,7 +2067,94 @@ hot_check_warm_chain(Page dp, ItemPointer tid)
}
/* All OK. No need to recheck */
- return false;
+ return status;
+}
+
+/*
+ * Scan through the WARM chain starting at tid and reset all WARM related
+ * flags. At the end, the chain will have all characteristics of a regular HOT
+ * chain.
+ *
+ * Return the number of cleared offnums. Cleared offnums are returned in the
+ * passed-in cleared_offnums array. The caller must ensure that the array is
+ * large enough to hold the maximum number of offnums that can be cleared by
+ * this invocation of heap_clear_warm_chain().
+ */
+int
+heap_clear_warm_chain(Page dp, ItemPointer tid, OffsetNumber *cleared_offnums)
+{
+ TransactionId prev_xmax = InvalidTransactionId;
+ OffsetNumber offnum;
+ HeapTupleData heapTuple;
+ int num_cleared = 0;
+
+ offnum = ItemPointerGetOffsetNumber(tid);
+ heapTuple.t_self = *tid;
+ /* Scan through possible multiple members of HOT-chain */
+ for (;;)
+ {
+ ItemId lp;
+
+ /* check for bogus TID */
+ if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
+ break;
+
+ lp = PageGetItemId(dp, offnum);
+
+ /* check for unused, dead, or redirected items */
+ if (!ItemIdIsNormal(lp))
+ {
+ if (ItemIdIsRedirected(lp))
+ {
+ /* Follow the redirect */
+ offnum = ItemIdGetRedirect(lp);
+ continue;
+ }
+ /* else must be end of chain */
+ break;
+ }
+
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
+ ItemPointerSetOffsetNumber(&heapTuple.t_self, offnum);
+
+ /*
+ * The xmin should match the previous xmax value, else chain is
+ * broken.
+ */
+ if (TransactionIdIsValid(prev_xmax) &&
+ !TransactionIdEquals(prev_xmax,
+ HeapTupleHeaderGetXmin(heapTuple.t_data)))
+ break;
+
+
+ /*
+ * Clear WARM and Red flags
+ */
+ if (HeapTupleHeaderIsHeapWarmTuple(heapTuple.t_data))
+ {
+ HeapTupleHeaderClearHeapWarmTuple(heapTuple.t_data);
+ HeapTupleHeaderClearWarmRed(heapTuple.t_data);
+ cleared_offnums[num_cleared++] = offnum;
+ }
+
+ /*
+ * Check to see if HOT chain continues past this tuple; if so fetch
+ * the next offnum and loop around.
+ */
+ if (!HeapTupleIsHotUpdated(&heapTuple))
+ break;
+
+ /*
+ * It can't be a HOT chain if the tuple contains the root line pointer
+ */
+ if (HeapTupleHeaderHasRootOffset(heapTuple.t_data))
+ break;
+
+ offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
+ prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple.t_data);
+ }
+
+ return num_cleared;
}
/*
@@ -2135,7 +2263,11 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* possible improvements here
*/
if (recheck && *recheck == false)
- *recheck = hot_check_warm_chain(dp, &heapTuple->t_self);
+ {
+ HeapCheckWarmChainStatus status;
+ status = heap_check_warm_chain(dp, &heapTuple->t_self, true);
+ *recheck = HCWC_IS_WARM(status);
+ }
/*
* When first_call is true (and thus, skip is initially false) we'll
@@ -2888,7 +3020,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
XLogRecPtr recptr;
xl_heap_multi_insert *xlrec;
- uint8 info = XLOG_HEAP2_MULTI_INSERT;
+ uint8 info = XLOG_HEAP_MULTI_INSERT;
char *tupledata;
int totaldatalen;
char *scratchptr = scratch;
@@ -2985,7 +3117,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* filtering by origin on a row level is much more efficient */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- recptr = XLogInsert(RM_HEAP2_ID, info);
+ recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
}
@@ -3409,7 +3541,9 @@ l1:
}
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tp.t_data->t_infomask |= new_infomask;
tp.t_data->t_infomask2 |= new_infomask2;
@@ -4172,7 +4306,9 @@ l2:
START_CRIT_SECTION();
/* Clear obsolete visibility flags ... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleClearHotUpdated(&oldtup);
/* ... and store info about transaction updating this tuple */
@@ -4419,6 +4555,16 @@ l2:
}
/*
+ * If the old tuple is already a member of the Red chain, mark the new
+ * tuple with the same flag
+ */
+ if (HeapTupleIsHeapWarmTupleRed(&oldtup))
+ {
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+ }
+
+ /*
* For HOT (or WARM) updated tuples, we store the offset of the root
* line pointer of this chain in the ip_posid field of the new tuple.
* Usually this information will be available in the corresponding
@@ -4435,12 +4581,20 @@ l2:
/* Mark the old tuple as HOT-updated */
HeapTupleSetHotUpdated(&oldtup);
HeapTupleSetHeapWarmTuple(&oldtup);
+
/* And mark the new tuple as heap-only */
HeapTupleSetHeapOnly(heaptup);
+ /* Mark the new tuple as WARM tuple */
HeapTupleSetHeapWarmTuple(heaptup);
+ /* This update also starts a Red chain */
+ HeapTupleSetHeapWarmTupleRed(heaptup);
+ Assert(!HeapTupleIsHeapWarmTupleRed(&oldtup));
+
/* Mark the caller's copy too, in case different from heaptup */
HeapTupleSetHeapOnly(newtup);
HeapTupleSetHeapWarmTuple(newtup);
+ HeapTupleSetHeapWarmTupleRed(newtup);
+
if (HeapTupleHeaderHasRootOffset(oldtup.t_data))
root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);
else
@@ -4459,6 +4613,8 @@ l2:
HeapTupleClearHeapOnly(newtup);
HeapTupleClearHeapWarmTuple(heaptup);
HeapTupleClearHeapWarmTuple(newtup);
+ HeapTupleClearHeapWarmTupleRed(heaptup);
+ HeapTupleClearHeapWarmTupleRed(newtup);
root_offnum = InvalidOffsetNumber;
}
@@ -4477,7 +4633,9 @@ l2:
HeapTupleHeaderSetHeapLatest(newtup->t_data, root_offnum);
/* Clear obsolete visibility flags, possibly set by ourselves above... */
- oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ oldtup.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(oldtup.t_data))
+ oldtup.t_data->t_infomask &= ~HEAP_MOVED;
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/* ... and store info about transaction updating this tuple */
Assert(TransactionIdIsValid(xmax_old_tuple));
@@ -6398,7 +6556,9 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
PageSetPrunable(page, RecentGlobalXmin);
/* store transaction information of xact deleting the tuple */
- tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ tp.t_data->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(tp.t_data))
+ tp.t_data->t_infomask &= ~HEAP_MOVED;
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
/*
@@ -6972,7 +7132,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* Old-style VACUUM FULL is gone, but we have to keep this code as long as
* we support having MOVED_OFF/MOVED_IN tuples in the database.
*/
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
@@ -6991,7 +7151,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* have failed; whereas a non-dead MOVED_IN tuple must mean the
* xvac transaction succeeded.
*/
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
frz->frzflags |= XLH_INVALID_XVAC;
else
frz->frzflags |= XLH_FREEZE_XVAC;
@@ -7461,7 +7621,7 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid))
@@ -7544,7 +7704,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
return true;
}
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
xid = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsNormal(xid) &&
@@ -7570,7 +7730,7 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId xmax = HeapTupleHeaderGetUpdateXid(tuple);
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
- if (tuple->t_infomask & HEAP_MOVED)
+ if (HeapTupleHeaderIsMoved(tuple))
{
if (TransactionIdPrecedes(*latestRemovedXid, xvac))
*latestRemovedXid = xvac;
@@ -7619,6 +7779,36 @@ log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
}
/*
+ * Perform XLogInsert for a heap-warm-clear operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_warmclear(Relation reln, Buffer buffer,
+ OffsetNumber *cleared, int ncleared)
+{
+ xl_heap_warmclear xlrec;
+ XLogRecPtr recptr;
+
+ /* Caller should not call me on a non-WAL-logged relation */
+ Assert(RelationNeedsWAL(reln));
+
+ xlrec.ncleared = ncleared;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapWarmClear);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ if (ncleared > 0)
+ XLogRegisterBufData(0, (char *) cleared,
+ ncleared * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_WARMCLEAR);
+
+ return recptr;
+}
+
+/*
* Perform XLogInsert for a heap-clean operation. Caller must already
* have modified the buffer and marked it dirty.
*
@@ -8277,6 +8467,60 @@ heap_xlog_clean(XLogReaderState *record)
XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
}
+
+/*
+ * Handles HEAP2_WARMCLEAR record type
+ */
+static void
+heap_xlog_warmclear(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_heap_warmclear *xlrec = (xl_heap_warmclear *) XLogRecGetData(record);
+ Buffer buffer;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ XLogRedoAction action;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+ /*
+ * If we have a full-page image, restore it (using a cleanup lock) and
+ * we're done.
+ */
+ action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
+ &buffer);
+ if (action == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ OffsetNumber *cleared;
+ int ncleared;
+ Size datalen;
+ int i;
+
+ cleared = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+ ncleared = xlrec->ncleared;
+
+ for (i = 0; i < ncleared; i++)
+ {
+ ItemId lp;
+ OffsetNumber offnum = cleared[i];
+ HeapTupleData heapTuple;
+
+ lp = PageGetItemId(page, offnum);
+ heapTuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
+
+ HeapTupleHeaderClearHeapWarmTuple(heapTuple.t_data);
+ HeapTupleHeaderClearWarmRed(heapTuple.t_data);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+}
+
/*
* Replay XLOG_HEAP2_VISIBLE record.
*
@@ -8523,7 +8767,9 @@ heap_xlog_delete(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
HeapTupleHeaderClearHotUpdated(htup);
fix_infomask_from_infobits(xlrec->infobits_set,
@@ -9186,7 +9432,9 @@ heap_xlog_lock(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
@@ -9265,7 +9513,9 @@ heap_xlog_lock_updated(XLogReaderState *record)
htup = (HeapTupleHeader) PageGetItem(page, lp);
- htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
+ htup->t_infomask &= ~HEAP_XMAX_BITS;
+ if (HeapTupleHeaderIsMoved(htup))
+ htup->t_infomask &= ~HEAP_MOVED;
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
fix_infomask_from_infobits(xlrec->infobits_set, &htup->t_infomask,
&htup->t_infomask2);
@@ -9334,6 +9584,9 @@ heap_redo(XLogReaderState *record)
case XLOG_HEAP_INSERT:
heap_xlog_insert(record);
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ heap_xlog_multi_insert(record);
+ break;
case XLOG_HEAP_DELETE:
heap_xlog_delete(record);
break;
@@ -9362,7 +9615,7 @@ heap2_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- switch (info & XLOG_HEAP_OPMASK)
+ switch (info & XLOG_HEAP2_OPMASK)
{
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(record);
@@ -9376,9 +9629,6 @@ heap2_redo(XLogReaderState *record)
case XLOG_HEAP2_VISIBLE:
heap_xlog_visible(record);
break;
- case XLOG_HEAP2_MULTI_INSERT:
- heap_xlog_multi_insert(record);
- break;
case XLOG_HEAP2_LOCK_UPDATED:
heap_xlog_lock_updated(record);
break;
@@ -9392,6 +9642,9 @@ heap2_redo(XLogReaderState *record)
case XLOG_HEAP2_REWRITE:
heap_xlog_logical_rewrite(record);
break;
+ case XLOG_HEAP2_WARMCLEAR:
+ heap_xlog_warmclear(record);
+ break;
default:
elog(PANIC, "heap2_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 19e7048..47b01eb 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1620,7 +1620,8 @@ toast_save_datum(Relation rel, Datum value,
toastrel,
toastidxs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- NULL);
+ NULL,
+ false);
}
/*
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index da6c252..e0553d0 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -199,7 +199,8 @@ index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
RELATION_CHECKS;
CHECK_REL_PROCEDURE(aminsert);
@@ -209,6 +210,12 @@ index_insert(Relation indexRelation,
(HeapTuple) NULL,
InvalidBuffer);
+ if (warm_update)
+ {
+ Assert(indexRelation->rd_amroutine->amwarminsert != NULL);
+ return indexRelation->rd_amroutine->amwarminsert(indexRelation, values,
+ isnull, heap_t_ctid, heapRelation, checkUnique, indexInfo);
+ }
return indexRelation->rd_amroutine->aminsert(indexRelation, values, isnull,
heap_t_ctid, heapRelation,
checkUnique, indexInfo);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f815fd4..7959155 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -766,11 +766,12 @@ _bt_page_recyclable(Page page)
}
/*
- * Delete item(s) from a btree page during VACUUM.
+ * Delete item(s) and color item(s) blue on a btree page during VACUUM.
*
* This must only be used for deleting leaf items. Deleting an item on a
* non-leaf page has to be done as part of an atomic action that includes
- * deleting the page it points to.
+ * deleting the page it points to. We don't ever color pointers on a non-leaf
+ * page.
*
* This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array.
@@ -786,9 +787,9 @@ _bt_page_recyclable(Page page)
* ensure correct locking.
*/
void
-_bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed)
+_bt_handleitems_vacuum(Relation rel, Buffer buf,
+ OffsetNumber *delitemnos, int ndelitems,
+ OffsetNumber *coloritemnos, int ncoloritems)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
@@ -796,9 +797,20 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the array and
+ * hence offnums recorded earlier won't make any sense after
+ * PageIndexMultiDelete is called.
+ */
+ if (ncoloritems > 0)
+ _bt_color_items(page, coloritemnos, ncoloritems);
+
/* Fix the page */
- if (nitems > 0)
- PageIndexMultiDelete(page, itemnos, nitems);
+ if (ndelitems > 0)
+ PageIndexMultiDelete(page, delitemnos, ndelitems);
/*
* We can clear the vacuum cycle ID since this page has certainly been
@@ -824,7 +836,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_vacuum xlrec_vacuum;
- xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.ndelitems = ndelitems;
+ xlrec_vacuum.ncoloritems = ncoloritems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -835,8 +848,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* is. When XLogInsert stores the whole buffer, the offsets array
* need not be stored too.
*/
- if (nitems > 0)
- XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ if (ndelitems > 0)
+ XLogRegisterBufData(0, (char *) delitemnos, ndelitems * sizeof(OffsetNumber));
+
+ if (ncoloritems > 0)
+ XLogRegisterBufData(0, (char *) coloritemnos, ncoloritems * sizeof(OffsetNumber));
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
@@ -1882,3 +1898,18 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
return true;
}
+
+void
+_bt_color_items(Page page, OffsetNumber *coloritemnos, uint16 ncoloritems)
+{
+ int i;
+ ItemId itemid;
+ IndexTuple itup;
+
+ for (i = 0; i < ncoloritems; i++)
+ {
+ itemid = PageGetItemId(page, coloritemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ ItemPointerClearFlags(&itup->t_tid);
+ }
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 952ed8f..92f490e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
amroutine->aminsert = btinsert;
+ amroutine->amwarminsert = btwarminsert;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -317,11 +318,12 @@ btbuildempty(Relation index)
* Descend the tree recursively, find the appropriate location for our
* new tuple, and put it there.
*/
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
+static bool
+btinsert_internal(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
- IndexInfo *indexInfo)
+ IndexInfo *indexInfo,
+ bool warm_update)
{
bool result;
IndexTuple itup;
@@ -330,6 +332,11 @@ btinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
+ if (warm_update)
+ ItemPointerSetFlags(&itup->t_tid, BTREE_INDEX_RED_POINTER);
+ else
+ ItemPointerClearFlags(&itup->t_tid);
+
result = _bt_doinsert(rel, itup, checkUnique, heapRel);
pfree(itup);
@@ -337,6 +344,26 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, false);
+}
+
+bool
+btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ IndexInfo *indexInfo)
+{
+ return btinsert_internal(rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexInfo, true);
+}
+
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -1106,7 +1133,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_handleitems_vacuum(rel, buf, NULL, 0, NULL, 0);
_bt_relbuf(rel, buf);
}
@@ -1204,6 +1231,8 @@ restart:
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ OffsetNumber colorblue[MaxOffsetNumber];
+ int ncolorblue;
OffsetNumber offnum,
minoff,
maxoff;
@@ -1242,7 +1271,7 @@ restart:
* Scan over all items to see which ones need deleted according to the
* callback function.
*/
- ndeletable = 0;
+ ndeletable = ncolorblue = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1253,6 +1282,9 @@ restart:
{
IndexTuple itup;
ItemPointer htup;
+ int flags;
+ bool is_red = false;
+ IndexBulkDeleteCallbackResult result;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
@@ -1279,16 +1311,36 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
+ flags = ItemPointerGetFlags(&itup->t_tid);
+ is_red = ((flags & BTREE_INDEX_RED_POINTER) != 0);
+
+ if (is_red)
+ stats->num_red_pointers++;
+ else
+ stats->num_blue_pointers++;
+
+ result = callback(htup, is_red, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ if (is_red)
+ stats->red_pointers_removed++;
+ else
+ stats->blue_pointers_removed++;
deletable[ndeletable++] = offnum;
+ }
+ else if (result == IBDCR_COLOR_BLUE)
+ {
+ colorblue[ncolorblue++] = offnum;
+ }
}
}
/*
- * Apply any needed deletes. We issue just one _bt_delitems_vacuum()
- * call per page, so as to minimize WAL traffic.
+ * Apply any needed deletes and coloring. We issue just one
+ * _bt_handleitems_vacuum() call per page, so as to minimize WAL
+ * traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || ncolorblue > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1304,8 +1356,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
- vstate->lastBlockVacuumed);
+ _bt_handleitems_vacuum(rel, buf, deletable, ndeletable,
+ colorblue, ncolorblue);
/*
* Remember highest leaf page number we've issued a
@@ -1315,6 +1367,7 @@ restart:
vstate->lastBlockVacuumed = blkno;
stats->tuples_removed += ndeletable;
+ stats->pointers_colored += ncolorblue;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index ac60db0..916c76e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -390,83 +390,9 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
/*
- * This section of code is thought to be no longer needed, after analysis
- * of the calling paths. It is retained to allow the code to be reinstated
- * if a flaw is revealed in that thinking.
- *
- * If we are running non-MVCC scans using this index we need to do some
- * additional work to ensure correctness, which is known as a "pin scan"
- * described in more detail in next paragraphs. We used to do the extra
- * work in all cases, whereas we now avoid that work in most cases. If
- * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
- * additional work required for the pin scan.
- *
- * Avoiding this extra work is important since it requires us to touch
- * every page in the index, so is an O(N) operation. Worse, it is an
- * operation performed in the foreground during redo, so it delays
- * replication directly.
- *
- * If queries might be active then we need to ensure every leaf page is
- * unpinned between the lastBlockVacuumed and the current block, if there
- * are any. This prevents replay of the VACUUM from reaching the stage of
- * removing heap tuples while there could still be indexscans "in flight"
- * to those particular tuples for those scans which could be confused by
- * finding new tuples at the old TID locations (see nbtree/README).
- *
- * It might be worth checking if there are actually any backends running;
- * if not, we could just skip this.
- *
- * Since VACUUM can visit leaf pages out-of-order, it might issue records
- * with lastBlockVacuumed >= block; that's not an error, it just means
- * nothing to do now.
- *
- * Note: since we touch all pages in the range, we will lock non-leaf
- * pages, and also any empty (all-zero) pages that may be in the index. It
- * doesn't seem worth the complexity to avoid that. But it's important
- * that HotStandbyActiveInReplay() will not return true if the database
- * isn't yet consistent; so we need not fear reading still-corrupt blocks
- * here during crash recovery.
- */
- if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
- {
- RelFileNode thisrnode;
- BlockNumber thisblkno;
- BlockNumber blkno;
-
- XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
- for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
- {
- /*
- * We use RBM_NORMAL_NO_LOG mode because it's not an error
- * condition to see all-zero pages. The original btvacuumpage
- * scan would have skipped over all-zero pages, noting them in FSM
- * but not bothering to initialize them just yet; so we mustn't
- * throw an error here. (We could skip acquiring the cleanup lock
- * if PageIsNew, but it's probably not worth the cycles to test.)
- *
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned. Optimizing
- * this is now moot, since in most cases we avoid the scan.
- */
- buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
- RBM_NORMAL_NO_LOG);
- if (BufferIsValid(buffer))
- {
- LockBufferForCleanup(buffer);
- UnlockReleaseBuffer(buffer);
- }
- }
- }
-#endif
-
- /*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
* page. See nbtree/README for details.
*/
@@ -482,19 +408,30 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ OffsetNumber *offnums = (OffsetNumber *) ptr;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /*
+ * Color the Red pointers Blue.
+ *
+ * We must do this before dealing with the dead items because
+ * PageIndexMultiDelete may move items around to compactify the
+ * array and hence offnums recorded earlier won't make any sense
+ * after PageIndexMultiDelete is called.
+ */
+ if (xlrec->ncoloritems > 0)
+ _bt_color_items(page, offnums + xlrec->ndelitems,
+ xlrec->ncoloritems);
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /*
+ * And handle the deleted items too
+ */
+ if (xlrec->ndelitems > 0)
+ PageIndexMultiDelete(page, offnums, xlrec->ndelitems);
}
/*
* Mark the page as not containing any LP_DEAD items --- see comments
- * in _bt_delitems_vacuum().
+ * in _bt_handleitems_vacuum().
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 44d2d63..d373e61 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -44,6 +44,12 @@ heap_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "off %u", xlrec->offnum);
}
+ else if (info == XLOG_HEAP_MULTI_INSERT)
+ {
+ xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
+
+ appendStringInfo(buf, "%d tuples", xlrec->ntuples);
+ }
else if (info == XLOG_HEAP_DELETE)
{
xl_heap_delete *xlrec = (xl_heap_delete *) rec;
@@ -102,7 +108,7 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
char *rec = XLogRecGetData(record);
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- info &= XLOG_HEAP_OPMASK;
+ info &= XLOG_HEAP2_OPMASK;
if (info == XLOG_HEAP2_CLEAN)
{
xl_heap_clean *xlrec = (xl_heap_clean *) rec;
@@ -129,12 +135,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "cutoff xid %u flags %d",
xlrec->cutoff_xid, xlrec->flags);
}
- else if (info == XLOG_HEAP2_MULTI_INSERT)
- {
- xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
-
- appendStringInfo(buf, "%d tuples", xlrec->ntuples);
- }
else if (info == XLOG_HEAP2_LOCK_UPDATED)
{
xl_heap_lock_updated *xlrec = (xl_heap_lock_updated *) rec;
@@ -171,6 +171,12 @@ heap_identify(uint8 info)
case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE:
id = "INSERT+INIT";
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ id = "MULTI_INSERT";
+ break;
+ case XLOG_HEAP_MULTI_INSERT | XLOG_HEAP_INIT_PAGE:
+ id = "MULTI_INSERT+INIT";
+ break;
case XLOG_HEAP_DELETE:
id = "DELETE";
break;
@@ -219,12 +225,6 @@ heap2_identify(uint8 info)
case XLOG_HEAP2_VISIBLE:
id = "VISIBLE";
break;
- case XLOG_HEAP2_MULTI_INSERT:
- id = "MULTI_INSERT";
- break;
- case XLOG_HEAP2_MULTI_INSERT | XLOG_HEAP_INIT_PAGE:
- id = "MULTI_INSERT+INIT";
- break;
case XLOG_HEAP2_LOCK_UPDATED:
id = "LOCK_UPDATED";
break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index fbde9d6..0e9a2eb 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -48,8 +48,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "ndelitems %u, ncoloritems %u",
+ xlrec->ndelitems, xlrec->ncoloritems);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cce9b3f..5343b10 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -155,7 +155,8 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
{
Assert(ItemPointerIsValid(<->heapPtr));
- if (bds->callback(<->heapPtr, bds->callback_state))
+ if (bds->callback(<->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
deletable[i] = true;
@@ -425,7 +426,8 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
{
Assert(ItemPointerIsValid(<->heapPtr));
- if (bds->callback(<->heapPtr, bds->callback_state))
+ if (bds->callback(<->heapPtr, false, bds->callback_state) ==
+ IBDCR_DELETE)
{
bds->stats->tuples_removed += 1;
toDelete[xlrec.nDelete] = i;
@@ -902,10 +904,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/* Dummy callback to delete no tuples during spgvacuumcleanup */
-static bool
-dummy_callback(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+dummy_callback(ItemPointer itemptr, bool is_red, void *state)
{
- return false;
+ return IBDCR_KEEP;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 049eb28..166efd8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -115,7 +115,7 @@ static void IndexCheckExclusion(Relation heapRelation,
IndexInfo *indexInfo);
static inline int64 itemptr_encode(ItemPointer itemptr);
static inline void itemptr_decode(ItemPointer itemptr, int64 encoded);
-static bool validate_index_callback(ItemPointer itemptr, void *opaque);
+static IndexBulkDeleteCallbackResult validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque);
static void validate_index_heapscan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
@@ -2949,15 +2949,15 @@ itemptr_decode(ItemPointer itemptr, int64 encoded)
/*
* validate_index_callback - bulkdelete callback to collect the index TIDs
*/
-static bool
-validate_index_callback(ItemPointer itemptr, void *opaque)
+static IndexBulkDeleteCallbackResult
+validate_index_callback(ItemPointer itemptr, bool is_red, void *opaque)
{
v_i_state *state = (v_i_state *) opaque;
int64 encoded = itemptr_encode(itemptr);
tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
state->itups += 1;
- return false; /* never actually delete anything */
+ return IBDCR_KEEP; /* never actually delete anything */
}
/*
@@ -3178,7 +3178,8 @@ validate_index_heapscan(Relation heapRelation,
heapRelation,
indexInfo->ii_Unique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ false);
state->tups_inserted += 1;
}
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 970254f..6392f33 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -172,7 +172,8 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
heapRelation,
relationDescs[i]->rd_index->indisunique ?
UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- indexInfo);
+ indexInfo,
+ warm_update);
}
ExecDropSingleTupleTableSlot(slot);
@@ -222,7 +223,7 @@ CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
oid = simple_heap_insert(heapRel, tup);
- CatalogIndexInsert(indstate, tup, false, NULL);
+ CatalogIndexInsert(indstate, tup, NULL, false);
return oid;
}
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index d9c0fe7..330b661 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -168,7 +168,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
*/
index_insert(indexRel, values, isnull, &(new_row->t_self),
trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
- indexInfo);
+ indexInfo,
+ false);
}
else
{
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 7376099..deb76cb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -104,6 +104,25 @@
*/
#define PREFETCH_SIZE ((BlockNumber) 32)
+/*
+ * Structure to track WARM chains that can be converted into HOT chains during
+ * this run.
+ *
+ * To reduce the space requirement, we're using bitfields. But the way things
+ * are laid out, we're still wasting 1 byte per candidate chain.
+ */
+typedef struct LVRedBlueChain
+{
+ ItemPointerData chain_tid; /* root of the chain */
+ uint8 is_red_chain:2; /* is the WARM chain completely red? */
+ uint8 keep_warm_chain:2; /* this chain can't be cleared of WARM
+ * tuples */
+ uint8 num_blue_pointers:2;/* number of blue pointers found so
+ * far */
+ uint8 num_red_pointers:2; /* number of red pointers found so far
+ * in the current index */
+} LVRedBlueChain;
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -121,6 +140,16 @@ typedef struct LVRelStats
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ double num_warm_chains; /* number of warm chains seen so far */
+
+ /* List of WARM chains that can be converted into HOT chains */
+ /* NB: this list is ordered by TID of the root pointers */
+ int num_redblue_chains; /* current # of entries */
+ int max_redblue_chains; /* # slots allocated in array */
+ LVRedBlueChain *redblue_chains; /* array of LVRedBlueChain */
+ double num_non_convertible_warm_chains;
+
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
@@ -149,6 +178,7 @@ static void lazy_scan_heap(Relation onerel, int options,
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
static void lazy_cleanup_index(Relation indrel,
@@ -156,6 +186,10 @@ static void lazy_cleanup_index(Relation indrel,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static int lazy_warmclear_page(Relation onerel, BlockNumber blkno,
+ Buffer buffer, int chainindex, LVRelStats *vacrelstats,
+ Buffer *vmbuffer, bool check_all_visible);
+static void lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats);
static bool should_attempt_truncation(LVRelStats *vacrelstats);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -163,8 +197,15 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
-static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static void lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr);
+static IndexBulkDeleteCallbackResult lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state);
+static IndexBulkDeleteCallbackResult lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
+static int vac_cmp_redblue_chain(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -684,8 +725,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* If we are close to overrunning the available space for dead-tuple
* TIDs, pause and do a cycle of vacuuming before we tackle this page.
*/
- if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
- vacrelstats->num_dead_tuples > 0)
+ if (((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_dead_tuples > 0) ||
+ ((vacrelstats->max_redblue_chains - vacrelstats->num_redblue_chains) < MaxHeapTuplesPerPage &&
+ vacrelstats->num_redblue_chains > 0))
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -715,6 +758,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -737,6 +781,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* valid.
*/
vacrelstats->num_dead_tuples = 0;
+ vacrelstats->num_redblue_chains = 0;
+ memset(vacrelstats->redblue_chains, 0,
+ vacrelstats->max_redblue_chains * sizeof (LVRedBlueChain));
vacrelstats->num_index_scans++;
/* Report that we are once again scanning the heap */
@@ -940,15 +987,33 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
continue;
}
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
/* Redirect items mustn't be touched */
if (ItemIdIsRedirected(itemid))
{
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either complete Red or Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
hastup = true; /* this page won't be truncatable */
continue;
}
- ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
/*
* DEAD item pointers are to be vacuumed normally; but we don't
* count them in tups_vacuumed, else we'd be double-counting (at
@@ -968,6 +1033,28 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
tuple.t_len = ItemIdGetLength(itemid);
tuple.t_tableOid = RelationGetRelid(onerel);
+ if (!HeapTupleIsHeapOnly(&tuple))
+ {
+ HeapCheckWarmChainStatus status = heap_check_warm_chain(page,
+ &tuple.t_self, false);
+ if (HCWC_IS_WARM(status))
+ {
+ vacrelstats->num_warm_chains++;
+
+ /*
+ * A chain which is either complete Red or Blue is a
+ * candidate for chain conversion. Remember the chain and
+ * its color.
+ */
+ if (HCWC_IS_ALL_RED(status))
+ lazy_record_red_chain(vacrelstats, &tuple.t_self);
+ else if (HCWC_IS_ALL_BLUE(status))
+ lazy_record_blue_chain(vacrelstats, &tuple.t_self);
+ else
+ vacrelstats->num_non_convertible_warm_chains++;
+ }
+ }
+
tupgone = false;
switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
@@ -1288,7 +1375,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* If any tuples need to be deleted, perform final vacuum cycle */
/* XXX put a threshold on min number of tuples here? */
- if (vacrelstats->num_dead_tuples > 0)
+ if (vacrelstats->num_dead_tuples > 0 || vacrelstats->num_redblue_chains > 0)
{
const int hvp_index[] = {
PROGRESS_VACUUM_PHASE,
@@ -1306,6 +1393,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
/* Remove index entries */
for (i = 0; i < nindexes; i++)
lazy_vacuum_index(Irel[i],
+ (vacrelstats->num_redblue_chains > 0),
&indstats[i],
vacrelstats);
@@ -1373,7 +1461,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
*
* This routine marks dead tuples as unused and compacts out free
* space on their pages. Pages not having dead tuples recorded from
- * lazy_scan_heap are not visited at all.
+ * lazy_scan_heap are not visited at all. This routine also converts
+ * candidate WARM chains to HOT chains by clearing WARM-related flags. The
+ * candidate chains are determined by the preceding index scans after
+ * looking at the data collected by the first heap scan.
*
* Note: the reason for doing this as a second pass is we cannot remove
* the tuples until we've removed their index entries, and we want to
@@ -1382,7 +1473,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
static void
lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
{
- int tupindex;
+ int tupindex, chainindex;
int npages;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
@@ -1391,33 +1482,69 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
npages = 0;
tupindex = 0;
- while (tupindex < vacrelstats->num_dead_tuples)
+ chainindex = 0;
+ while (tupindex < vacrelstats->num_dead_tuples ||
+ chainindex < vacrelstats->num_redblue_chains)
{
- BlockNumber tblk;
+ BlockNumber tblk, chainblk, vacblk;
Buffer buf;
Page page;
Size freespace;
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
- buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+ tblk = chainblk = InvalidBlockNumber;
+ if (chainindex < vacrelstats->num_redblue_chains)
+ chainblk =
+ ItemPointerGetBlockNumber(&(vacrelstats->redblue_chains[chainindex].chain_tid));
+
+ if (tupindex < vacrelstats->num_dead_tuples)
+ tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
+
+ if (tblk == InvalidBlockNumber)
+ vacblk = chainblk;
+ else if (chainblk == InvalidBlockNumber)
+ vacblk = tblk;
+ else
+ vacblk = Min(chainblk, tblk);
+
+ Assert(vacblk != InvalidBlockNumber);
+
+ buf = ReadBufferExtended(onerel, MAIN_FORKNUM, vacblk, RBM_NORMAL,
vac_strategy);
- if (!ConditionalLockBufferForCleanup(buf))
+
+
+ if (vacblk == chainblk)
+ LockBufferForCleanup(buf);
+ else if (!ConditionalLockBufferForCleanup(buf))
{
ReleaseBuffer(buf);
++tupindex;
continue;
}
- tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
- &vmbuffer);
+
+ /*
+ * Convert WARM chains on this page. This should be done before
+ * vacuuming the page to ensure that we can correctly set visibility
+ * bits after clearing WARM chains.
+ *
+ * If we are going to vacuum this page then don't check for
+ * all-visibility just yet.
+ */
+ if (vacblk == chainblk)
+ chainindex = lazy_warmclear_page(onerel, chainblk, buf, chainindex,
+ vacrelstats, &vmbuffer, chainblk != tblk);
+
+ if (vacblk == tblk)
+ tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+ &vmbuffer);
/* Now that we've compacted the page, record its available space */
page = BufferGetPage(buf);
freespace = PageGetHeapFreeSpace(page);
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, tblk, freespace);
+ RecordPageWithFreeSpace(onerel, vacblk, freespace);
npages++;
}
@@ -1436,6 +1563,107 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
}
/*
+ * lazy_warmclear_page() -- clear WARM flag and mark chains blue when possible
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * chainindex is the index in vacrelstats->redblue_chains of the first
+ * candidate chain on this page. We assume the rest follow sequentially.
+ * The return value is the first chainindex past the chains of this page.
+ *
+ * If check_all_visible is set then we also check if the page has now become
+ * all visible and update visibility map.
+ */
+static int
+lazy_warmclear_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+ int chainindex, LVRelStats *vacrelstats, Buffer *vmbuffer,
+ bool check_all_visible)
+{
+ Page page = BufferGetPage(buffer);
+ OffsetNumber cleared_offnums[MaxHeapTuplesPerPage];
+ int num_cleared = 0;
+ TransactionId visibility_cutoff_xid;
+ bool all_frozen;
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED, blkno);
+
+ START_CRIT_SECTION();
+
+ for (; chainindex < vacrelstats->num_redblue_chains ; chainindex++)
+ {
+ BlockNumber tblk;
+ LVRedBlueChain *chain;
+
+ chain = &vacrelstats->redblue_chains[chainindex];
+
+ tblk = ItemPointerGetBlockNumber(&chain->chain_tid);
+ if (tblk != blkno)
+ break; /* past end of tuples for this block */
+
+ /*
+ * Since a heap page can have no more than MaxHeapTuplesPerPage
+ * offnums and we process each offnum only once, a MaxHeapTuplesPerPage-
+ * sized array should be enough to hold all tuples cleared on this page.
+ */
+ if (!chain->keep_warm_chain)
+ num_cleared += heap_clear_warm_chain(page, &chain->chain_tid,
+ cleared_offnums + num_cleared);
+ }
+
+ /*
+ * Mark buffer dirty before we write WAL.
+ */
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(onerel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_heap_warmclear(onerel, buffer,
+ cleared_offnums, num_cleared);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* If not checking for all-visibility then we're done */
+ if (!check_all_visible)
+ return chainindex;
+
+ /*
+ * The following code should match the corresponding code in
+ * lazy_vacuum_page.
+ */
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid,
+ &all_frozen))
+ PageSetAllVisible(page);
+
+ /*
+ * All the changes to the heap page have been done. If the all-visible
+ * flag is now set, also set the VM all-visible bit (and, if possible, the
+ * all-frozen bit) unless this has already been done previously.
+ */
+ if (PageIsAllVisible(page))
+ {
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+
+ Assert(BufferIsValid(*vmbuffer));
+ if (flags != 0)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
+ *vmbuffer, visibility_cutoff_xid, flags);
+ }
+ return chainindex;
+}
+
+/*
* lazy_vacuum_page() -- free dead tuples on a page
* and repair its fragmentation.
*
@@ -1588,6 +1816,16 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
return false;
}
+static void
+lazy_reset_redblue_pointer_count(LVRelStats *vacrelstats)
+{
+ int i;
+ for (i = 0; i < vacrelstats->num_redblue_chains; i++)
+ {
+ LVRedBlueChain *chain = &vacrelstats->redblue_chains[i];
+ chain->num_blue_pointers = chain->num_red_pointers = 0;
+ }
+}
/*
* lazy_vacuum_index() -- vacuum one index relation.
@@ -1597,6 +1835,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
*/
static void
lazy_vacuum_index(Relation indrel,
+ bool clear_warm,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats)
{
@@ -1612,15 +1851,81 @@ lazy_vacuum_index(Relation indrel,
ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
ivinfo.strategy = vac_strategy;
- /* Do bulk deletion */
- *stats = index_bulk_delete(&ivinfo, *stats,
- lazy_tid_reaped, (void *) vacrelstats);
+ /*
+ * If told, convert WARM chains into HOT chains.
+ *
+ * We must have already collected candidate WARM chains, i.e. chains which
+ * have either only Red or only Blue tuples, but not a mix of both.
+ *
+ * This works in two phases. In the first phase, we do a complete index
+ * scan and collect information about index pointers to the candidate
+ * chains, but we don't do conversion. To be precise, we count the number
+ * of Blue and Red index pointers to each candidate chain and use that
+ * knowledge to arrive at a decision and do the actual conversion during
+ * the second phase (we kill known dead pointers though in this phase).
+ *
+ * In the second phase, for each Red chain we check if we have seen a Red
+ * index pointer. For such chains, we kill the Blue pointer and color the
+ * Red pointer Blue. The heap tuples are marked Blue in the second heap
+ * scan. If we did not find any Red pointer to a Red chain, that means that
+ * the chain is reachable from the Blue pointer (because, say, the WARM update
+ * did not add a new entry for this index). In that case, we do nothing.
+ * There is a third case where we find more than one Blue pointer to a Red
+ * chain. This can happen because of aborted vacuums. We don't handle that
+ * case yet, but it should be possible to apply the same recheck logic and
+ * find which of the Blue pointers is redundant and should be removed.
+ *
+ * For Blue chains, we just kill the Red pointer, if it exists, and keep the
+ * Blue pointer.
+ */
+ if (clear_warm)
+ {
+ lazy_reset_redblue_pointer_count(vacrelstats);
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase1, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row version, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples,
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed)));
+
+ (*stats)->num_red_pointers = 0;
+ (*stats)->num_blue_pointers = 0;
+ (*stats)->red_pointers_removed = 0;
+ (*stats)->blue_pointers_removed = 0;
+ (*stats)->pointers_colored = 0;
+
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase2, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to convert red pointers, found "
+ "%0.f red pointers, %0.f blue pointers, removed "
+ "%0.f red pointers, removed %0.f blue pointers, "
+ "colored %0.f red pointers blue",
+ RelationGetRelationName(indrel),
+ (*stats)->num_red_pointers,
+ (*stats)->num_blue_pointers,
+ (*stats)->red_pointers_removed,
+ (*stats)->blue_pointers_removed,
+ (*stats)->pointers_colored)));
+ }
+ else
+ {
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_tid_reaped, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row versions",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples),
+ errdetail("%s.", pg_rusage_show(&ru0))));
+ }
- ereport(elevel,
- (errmsg("scanned index \"%s\" to remove %d row versions",
- RelationGetRelationName(indrel),
- vacrelstats->num_dead_tuples),
- errdetail("%s.", pg_rusage_show(&ru0))));
}
/*
@@ -1994,9 +2299,11 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
if (vacrelstats->hasindex)
{
- maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
+ maxtuples = (vac_work_mem * 1024L) / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain));
maxtuples = Min(maxtuples, INT_MAX);
- maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+ maxtuples = Min(maxtuples, MaxAllocSize / (sizeof(ItemPointerData) +
+ sizeof(LVRedBlueChain)));
/* curious coding here to ensure the multiplication can't overflow */
if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -2014,6 +2321,57 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
vacrelstats->max_dead_tuples = (int) maxtuples;
vacrelstats->dead_tuples = (ItemPointer)
palloc(maxtuples * sizeof(ItemPointerData));
+
+ /*
+ * XXX Cheat for now and allocate the same size array for tracking blue and
+ * red chains. maxtuples must have been already adjusted above to ensure we
+ * don't cross vac_work_mem.
+ */
+ vacrelstats->num_redblue_chains = 0;
+ vacrelstats->max_redblue_chains = (int) maxtuples;
+ vacrelstats->redblue_chains = (LVRedBlueChain *)
+ palloc0(maxtuples * sizeof(LVRedBlueChain));
+
+}
+
+/*
+ * lazy_record_blue_chain - remember one blue chain
+ */
+static void
+lazy_record_blue_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 0;
+ vacrelstats->num_redblue_chains++;
+ }
+}
+
+/*
+ * lazy_record_red_chain - remember one red chain
+ */
+static void
+lazy_record_red_chain(LVRelStats *vacrelstats,
+ ItemPointer itemptr)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (vacrelstats->num_redblue_chains < vacrelstats->max_redblue_chains)
+ {
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].chain_tid = *itemptr;
+ vacrelstats->redblue_chains[vacrelstats->num_redblue_chains].is_red_chain = 1;
+ vacrelstats->num_redblue_chains++;
+ }
}
/*
@@ -2044,8 +2402,8 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
*
* Assumes dead_tuples array is in sorted order.
*/
-static bool
-lazy_tid_reaped(ItemPointer itemptr, void *state)
+static IndexBulkDeleteCallbackResult
+lazy_tid_reaped(ItemPointer itemptr, bool is_red, void *state)
{
LVRelStats *vacrelstats = (LVRelStats *) state;
ItemPointer res;
@@ -2056,7 +2414,193 @@ lazy_tid_reaped(ItemPointer itemptr, void *state)
sizeof(ItemPointerData),
vac_cmp_itemptr);
- return (res != NULL);
+ return (res != NULL) ? IBDCR_DELETE : IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase1() -- run first pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase1(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ ItemPointer res;
+ LVRedBlueChain *chain;
+
+ res = (ItemPointer) bsearch((void *) itemptr,
+ (void *) vacrelstats->dead_tuples,
+ vacrelstats->num_dead_tuples,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+
+ if (res != NULL)
+ return IBDCR_DELETE;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+ if (chain != NULL)
+ {
+ if (is_red)
+ chain->num_red_pointers++;
+ else
+ chain->num_blue_pointers++;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * lazy_indexvac_phase2() -- run second pass of index vacuum
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback.
+ */
+static IndexBulkDeleteCallbackResult
+lazy_indexvac_phase2(ItemPointer itemptr, bool is_red, void *state)
+{
+ LVRelStats *vacrelstats = (LVRelStats *) state;
+ LVRedBlueChain *chain;
+
+ chain = (LVRedBlueChain *) bsearch((void *) itemptr,
+ (void *) vacrelstats->redblue_chains,
+ vacrelstats->num_redblue_chains,
+ sizeof(LVRedBlueChain),
+ vac_cmp_redblue_chain);
+
+ if (chain != NULL && (chain->keep_warm_chain != 1))
+ {
+ /*
+ * At no point can we have more than 1 Red pointer to any chain, nor more
+ * than 2 Blue pointers.
+ */
+ Assert(chain->num_red_pointers <= 1);
+ Assert(chain->num_blue_pointers <= 2);
+
+ if (chain->is_red_chain == 1)
+ {
+ if (is_red)
+ {
+ /*
+ * A Red pointer, pointing to a Red chain.
+ *
+ * Color the Red pointer Blue (and delete the Blue pointer). We
+ * may have already seen the Blue pointer in the scan and
+ * deleted that or we may see it later in the scan. It doesn't
+ * matter if we fail at any point because we won't clear up
+ * WARM bits on the heap tuples until we have dealt with the
+ * index pointers cleanly.
+ */
+ return IBDCR_COLOR_BLUE;
+ }
+ else
+ {
+ /*
+ * Blue pointer to a Red chain.
+ */
+ if (chain->num_red_pointers > 0)
+ {
+ /*
+ * If there exists a Red pointer to the chain, we can
+ * delete the Blue pointer and clear the WARM bits on the
+ * heap tuples.
+ */
+ return IBDCR_DELETE;
+ }
+ else if (chain->num_blue_pointers == 1)
+ {
+ /*
+ * If this is the only pointer to a Red chain, we must keep the
+ * Blue pointer.
+ *
+ * The presence of a Red chain indicates that the WARM update
+ * must have committed. But during that update
+ * this index was probably not updated and hence it
+ * contains just the one original Blue pointer to the chain.
+ * We should be able to clear the WARM bits on heap tuples
+ * unless we later find another index which prevents the
+ * cleanup.
+ */
+ return IBDCR_KEEP;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * This is a Blue chain.
+ */
+ if (is_red)
+ {
+ /*
+ * A Red pointer to a Blue chain.
+ *
+ * This can happen when a WARM update is aborted. Later the HOT
+ * chain is pruned leaving behind only Blue tuples in the
+ * chain. But the Red index pointer inserted in the index
+ * remains and it must now be deleted before we clear WARM bits
+ * from the heap tuple.
+ */
+ return IBDCR_DELETE;
+ }
+
+ /*
+ * Blue pointer to a Blue chain.
+ *
+ * If this is the only surviving Blue pointer, keep it and clear
+ * the WARM bits from the heap tuples.
+ */
+ if (chain->num_blue_pointers == 1)
+ return IBDCR_KEEP;
+
+ /*
+ * If there is more than one Blue pointer to this chain, we can
+ * apply the recheck logic, kill the redundant Blue pointer and
+ * convert the chain. But that's not yet done.
+ */
+ }
+
+ /*
+ * For everything else, we must keep the WARM bits and also keep the
+ * index pointers.
+ */
+ chain->keep_warm_chain = 1;
+ return IBDCR_KEEP;
+ }
+ return IBDCR_KEEP;
+}
+
+/*
+ * Comparator routines for use with qsort() and bsearch(). Similar to
+ * vac_cmp_itemptr, but right hand argument is LVRedBlueChain struct pointer.
+ */
+static int
+vac_cmp_redblue_chain(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber(&((LVRedBlueChain *) right)->chain_tid);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
}
/*
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index d62d2de..3e49a8f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -405,7 +405,8 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
root_tid, /* tid of heap or root tuple */
heapRelation, /* heap relation */
checkUnique, /* type of uniqueness check to do */
- indexInfo); /* index AM may need this */
+ indexInfo, /* index AM may need this */
+ (modified_attrs != NULL)); /* is this a WARM update? */
/*
* If the index has an associated exclusion constraint, check that.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..7a9b48a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -347,7 +347,7 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
static void
DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
- uint8 info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+ uint8 info = XLogRecGetInfo(buf->record) & XLOG_HEAP2_OPMASK;
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
@@ -359,10 +359,6 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
switch (info)
{
- case XLOG_HEAP2_MULTI_INSERT:
- if (SnapBuildProcessChange(builder, xid, buf->origptr))
- DecodeMultiInsert(ctx, buf);
- break;
case XLOG_HEAP2_NEW_CID:
{
xl_heap_new_cid *xlrec;
@@ -390,6 +386,7 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
case XLOG_HEAP2_CLEANUP_INFO:
case XLOG_HEAP2_VISIBLE:
case XLOG_HEAP2_LOCK_UPDATED:
+ case XLOG_HEAP2_WARMCLEAR:
break;
default:
elog(ERROR, "unexpected RM_HEAP2_ID record type: %u", info);
@@ -418,6 +415,10 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (SnapBuildProcessChange(builder, xid, buf->origptr))
DecodeInsert(ctx, buf);
break;
+ case XLOG_HEAP_MULTI_INSERT:
+ if (SnapBuildProcessChange(builder, xid, buf->origptr))
+ DecodeMultiInsert(ctx, buf);
+ break;
/*
* Treat HOT update as normal updates. There is no useful
@@ -809,7 +810,7 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
/*
- * Decode XLOG_HEAP2_MULTI_INSERT_insert record into multiple tuplebufs.
+ * Decode XLOG_HEAP_MULTI_INSERT record into multiple tuplebufs.
*
* Currently MULTI_INSERT will always contain the full tuples.
*/
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index baff998..6a2e2f2 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -106,7 +106,7 @@ HeapTupleHeaderGetCmin(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)));
if (tup->t_infomask & HEAP_COMBOCID)
@@ -120,7 +120,7 @@ HeapTupleHeaderGetCmax(HeapTupleHeader tup)
{
CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
- Assert(!(tup->t_infomask & HEAP_MOVED));
+ Assert(!(HeapTupleHeaderIsMoved(tup)));
/*
* Because GetUpdateXid() performs memory allocations if xmax is a
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 703bdce..0df5a44 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -186,7 +186,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -205,7 +205,7 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -377,7 +377,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -396,7 +396,7 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -471,7 +471,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return HeapTupleInvisible;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -490,7 +490,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -753,7 +753,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -772,7 +772,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -974,7 +974,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
return false;
/* Used by pre-9.0 binary upgrades */
- if (tuple->t_infomask & HEAP_MOVED_OFF)
+ if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -993,7 +993,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1180,7 +1180,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
if (HeapTupleHeaderXminInvalid(tuple))
return HEAPTUPLE_DEAD;
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_OFF)
+ else if (HeapTupleHeaderIsMovedOff(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
@@ -1198,7 +1198,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
InvalidTransactionId);
}
/* Used by pre-9.0 binary upgrades */
- else if (tuple->t_infomask & HEAP_MOVED_IN)
+ else if (HeapTupleHeaderIsMovedIn(tuple))
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d7702e5..68859f2 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -75,6 +75,14 @@ typedef bool (*aminsert_function) (Relation indexRelation,
Relation heapRelation,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+/* insert this WARM tuple */
+typedef bool (*amwarminsert_function) (Relation indexRelation,
+ Datum *values,
+ bool *isnull,
+ ItemPointer heap_tid,
+ Relation heapRelation,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
/* bulk delete */
typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -203,6 +211,7 @@ typedef struct IndexAmRoutine
ambuild_function ambuild;
ambuildempty_function ambuildempty;
aminsert_function aminsert;
+ amwarminsert_function amwarminsert;
ambulkdelete_function ambulkdelete;
amvacuumcleanup_function amvacuumcleanup;
amcanreturn_function amcanreturn; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index f467b18..bf1e6bd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,12 +75,29 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
+ double num_red_pointers; /* # red pointers found */
+ double num_blue_pointers; /* # blue pointers found */
+ double pointers_colored; /* # red pointers colored blue */
+ double red_pointers_removed; /* # red pointers removed */
+ double blue_pointers_removed; /* # blue pointers removed */
BlockNumber pages_deleted; /* # unused pages in index */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
+/*
+ * IndexBulkDeleteCallback should return one of the following
+ */
+typedef enum IndexBulkDeleteCallbackResult
+{
+ IBDCR_KEEP, /* index tuple should be preserved */
+ IBDCR_DELETE, /* index tuple should be deleted */
+ IBDCR_COLOR_BLUE /* index tuple should be colored blue */
+} IndexBulkDeleteCallbackResult;
+
/* Typedef for callback function to determine if a tuple is bulk-deletable */
-typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
+typedef IndexBulkDeleteCallbackResult (*IndexBulkDeleteCallback) (
+ ItemPointer itemptr,
+ bool is_red, void *state);
/* struct definitions appear in relscan.h */
typedef struct IndexScanDescData *IndexScanDesc;
@@ -135,7 +152,8 @@ extern bool index_insert(Relation indexRelation,
ItemPointer heap_t_ctid,
Relation heapRelation,
IndexUniqueCheck checkUnique,
- struct IndexInfo *indexInfo);
+ struct IndexInfo *indexInfo,
+ bool warm_update);
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 0af6b4e..97d9cfb 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -269,6 +269,11 @@ typedef HashMetaPageData *HashMetaPage;
#define HASHPROC 1
#define HASHNProcs 1
+/*
+ * Flags overloaded on t_tid.ip_posid field. They are managed by
+ * ItemPointerSetFlags and corresponding routines.
+ */
+#define HASH_INDEX_RED_POINTER 0x01
/* public routines */
@@ -279,6 +284,10 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+extern bool hashwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
@@ -348,6 +357,8 @@ extern void _hash_expandtable(Relation rel, Buffer metabuf);
extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
Bucket obucket, uint32 maxbucket, uint32 highmask,
uint32 lowmask);
+extern void _hash_color_items(Page page, OffsetNumber *coloritemsno,
+ uint16 ncoloritems);
/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9412c3a..719a725 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,20 @@ typedef struct HeapUpdateFailureData
CommandId cmax;
} HeapUpdateFailureData;
+typedef int HeapCheckWarmChainStatus;
+
+#define HCWC_BLUE_TUPLE 0x0001
+#define HCWC_RED_TUPLE 0x0002
+#define HCWC_WARM_TUPLE 0x0004
+
+#define HCWC_IS_MIXED(status) \
+ (((status) & (HCWC_BLUE_TUPLE | HCWC_RED_TUPLE)) != 0)
+#define HCWC_IS_ALL_RED(status) \
+ (((status) & HCWC_BLUE_TUPLE) == 0)
+#define HCWC_IS_ALL_BLUE(status) \
+ (((status) & HCWC_RED_TUPLE) == 0)
+#define HCWC_IS_WARM(status) \
+ (((status) & HCWC_WARM_TUPLE) != 0)
/* ----------------
* function prototypes for heap access method
@@ -183,6 +197,10 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
bool *warm_update);
extern void heap_sync(Relation relation);
+extern HeapCheckWarmChainStatus heap_check_warm_chain(Page dp,
+ ItemPointer tid, bool stop_at_warm);
+extern int heap_clear_warm_chain(Page dp, ItemPointer tid,
+ OffsetNumber *cleared_offnums);
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 9b081bf..66fd0ea 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -32,7 +32,7 @@
#define XLOG_HEAP_INSERT 0x00
#define XLOG_HEAP_DELETE 0x10
#define XLOG_HEAP_UPDATE 0x20
-/* 0x030 is free, was XLOG_HEAP_MOVE */
+#define XLOG_HEAP_MULTI_INSERT 0x30
#define XLOG_HEAP_HOT_UPDATE 0x40
#define XLOG_HEAP_CONFIRM 0x50
#define XLOG_HEAP_LOCK 0x60
@@ -47,18 +47,23 @@
/*
* We ran out of opcodes, so heapam.c now has a second RmgrId. These opcodes
* are associated with RM_HEAP2_ID, but are not logically different from
- * the ones above associated with RM_HEAP_ID. XLOG_HEAP_OPMASK applies to
- * these, too.
+ * the ones above associated with RM_HEAP_ID.
+ *
+ * In PG 10, we moved XLOG_HEAP2_MULTI_INSERT to RM_HEAP_ID. That allows us to
+ * use the 0x80 bit in RM_HEAP2_ID, thus potentially creating another 8 possible
+ * opcodes in RM_HEAP2_ID.
*/
#define XLOG_HEAP2_REWRITE 0x00
#define XLOG_HEAP2_CLEAN 0x10
#define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
-#define XLOG_HEAP2_MULTI_INSERT 0x50
+#define XLOG_HEAP2_WARMCLEAR 0x50
#define XLOG_HEAP2_LOCK_UPDATED 0x60
#define XLOG_HEAP2_NEW_CID 0x70
+#define XLOG_HEAP2_OPMASK 0x70
+
/*
* xl_heap_insert/xl_heap_multi_insert flag values, 8 bits are available.
*/
@@ -226,6 +231,14 @@ typedef struct xl_heap_clean
#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+typedef struct xl_heap_warmclear
+{
+ uint16 ncleared;
+ /* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_warmclear;
+
+#define SizeOfHeapWarmClear (offsetof(xl_heap_warmclear, ncleared) + sizeof(uint16))
+
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
* Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
@@ -389,6 +402,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
+extern XLogRecPtr log_heap_warmclear(Relation reln, Buffer buffer,
+ OffsetNumber *cleared, int ncleared);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index b5891ca..ba5e94d 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -201,6 +201,21 @@ struct HeapTupleHeaderData
* upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
+/*
+ * A WARM chain usually consists of two parts. Each of these parts is a HOT
+ * chain in itself, i.e. all indexed columns have the same value, but a WARM
+ * update separates the two parts. We call these two parts the Blue chain and
+ * the Red chain. We need a mechanism to identify which part a tuple belongs
+ * to. We can't just check HeapTupleHeaderIsHeapWarmTuple() because during a
+ * WARM update, both the old and the new tuple are marked as WARM tuples.
+ *
+ * We need another infomask bit for this, but we reuse the infomask bit that
+ * was earlier used by old-style VACUUM FULL. This is safe because
+ * HEAP_WARM_RED is only ever set together with HEAP_WARM_TUPLE. So if both
+ * HEAP_WARM_TUPLE and HEAP_WARM_RED are set, we know the bit refers to the
+ * Red part of the WARM chain.
+ */
+#define HEAP_WARM_RED 0x4000
#define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */
/*
@@ -397,7 +412,7 @@ struct HeapTupleHeaderData
/* SetCmin is reasonably simple since we never need a combo CID */
#define HeapTupleHeaderSetCmin(tup, cid) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
(tup)->t_infomask &= ~HEAP_COMBOCID; \
} while (0)
@@ -405,7 +420,7 @@ do { \
/* SetCmax must be used after HeapTupleHeaderAdjustCmax; see combocid.c */
#define HeapTupleHeaderSetCmax(tup, cid, iscombo) \
do { \
- Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+ Assert(!HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_cid = (cid); \
if (iscombo) \
(tup)->t_infomask |= HEAP_COMBOCID; \
@@ -415,7 +430,7 @@ do { \
#define HeapTupleHeaderGetXvac(tup) \
( \
- ((tup)->t_infomask & HEAP_MOVED) ? \
+ HeapTupleHeaderIsMoved(tup) ? \
(tup)->t_choice.t_heap.t_field3.t_xvac \
: \
InvalidTransactionId \
@@ -423,7 +438,7 @@ do { \
#define HeapTupleHeaderSetXvac(tup, xid) \
do { \
- Assert((tup)->t_infomask & HEAP_MOVED); \
+ Assert(HeapTupleHeaderIsMoved(tup)); \
(tup)->t_choice.t_heap.t_field3.t_xvac = (xid); \
} while (0)
@@ -651,6 +666,58 @@ do { \
)
/*
+ * Macros to check if a tuple was moved off/in by old-style VACUUM FULL from
+ * the pre-9.0 era. Such a tuple must not have the HEAP_WARM_TUPLE flag set.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsMovedOff(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_OFF) \
+)
+
+#define HeapTupleHeaderIsMovedIn(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED_IN) \
+)
+
+#define HeapTupleHeaderIsMoved(tuple) \
+( \
+ !HeapTupleHeaderIsHeapWarmTuple((tuple)) && \
+ ((tuple)->t_infomask & HEAP_MOVED) \
+)
+
+/*
+ * Check if tuple belongs to the Red part of the WARM chain.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderIsWarmRed(tuple) \
+( \
+ HeapTupleHeaderIsHeapWarmTuple(tuple) && \
+ (((tuple)->t_infomask & HEAP_WARM_RED) != 0) \
+)
+
+/*
+ * Mark tuple as a member of the Red chain. Must only be done on a tuple which
+ * is already marked a WARM-tuple.
+ *
+ * Beware of multiple evaluations of the argument.
+ */
+#define HeapTupleHeaderSetWarmRed(tuple) \
+( \
+ AssertMacro(HeapTupleHeaderIsHeapWarmTuple(tuple)), \
+ (tuple)->t_infomask |= HEAP_WARM_RED \
+)
+
+#define HeapTupleHeaderClearWarmRed(tuple) \
+( \
+ (tuple)->t_infomask &= ~HEAP_WARM_RED \
+)
+
+/*
* BITMAPLEN(NATTS) -
* Computes size of null bitmap given number of data columns.
*/
@@ -810,6 +877,15 @@ struct MinimalTupleData
#define HeapTupleClearHeapWarmTuple(tuple) \
HeapTupleHeaderClearHeapWarmTuple((tuple)->t_data)
+#define HeapTupleIsHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderIsWarmRed((tuple)->t_data)
+
+#define HeapTupleSetHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderSetWarmRed((tuple)->t_data)
+
+#define HeapTupleClearHeapWarmTupleRed(tuple) \
+ HeapTupleHeaderClearWarmRed((tuple)->t_data)
+
#define HeapTupleGetOid(tuple) \
HeapTupleHeaderGetOid((tuple)->t_data)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4b35ca..1f4f0bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -427,6 +427,12 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
/*
+ * Flags overloaded on t_tid.ip_posid field. They are managed by
+ * ItemPointerSetFlags and corresponding routines.
+ */
+#define BTREE_INDEX_RED_POINTER 0x01
+
+/*
* external entry points for btree, in nbtree.c
*/
extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -436,6 +442,10 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique,
struct IndexInfo *indexInfo);
+extern bool btwarminsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -487,10 +497,12 @@ extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
-extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed);
+extern void _bt_handleitems_vacuum(Relation rel, Buffer buf,
+ OffsetNumber *delitemnos, int ndelitems,
+ OffsetNumber *coloritemnos, int ncoloritems);
extern int _bt_pagedel(Relation rel, Buffer buf);
+extern void _bt_color_items(Page page, OffsetNumber *coloritemnos,
+ uint16 ncoloritems);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index d6a3085..5555742 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -142,34 +142,20 @@ typedef struct xl_btree_reuse_page
/*
* This is what we need to know about vacuum of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
- * For a non-MVCC index scans there is an additional correctness requirement
- * for applying these changes during recovery, which is that we must do one
- * of these two things for every block in the index:
- * * lock the block for cleanup and apply any required changes
- * * EnsureBlockUnpinned()
- * The purpose of this is to ensure that no index scans started before we
- * finish scanning the index are still running by the time we begin to remove
- * heap tuples.
- *
- * Any changes to any one block are registered on just one WAL record. All
- * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
- * starting from the last block vacuumed through until this one. Individual
- * block numbers aren't given.
+ * single index page when executed by VACUUM. It also includes tuples whose
+ * color is changed from red to blue by VACUUM.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
* have a zero length array of offsets. Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
- BlockNumber lastBlockVacuumed;
-
- /* TARGET OFFSET NUMBERS FOLLOW */
+ uint16 ndelitems;
+ uint16 ncoloritems;
+ /* ndelitems + ncoloritems TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ncoloritems) + sizeof(uint16))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9472ecc..b355b61 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -25,6 +25,7 @@
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_HEAP_BLKS_WARMCLEARED 7
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
--
2.1.4
On Wed, Mar 8, 2017 at 12:14 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Alvaro Herrera wrote:
Here's a rebased set of patches. This is the same Pavan posted; I only
fixed some whitespace and a trivial conflict in indexam.c, per 9b88f27cb42f.
Jaime noted that I forgot the attachments. Here they are.
If I recall correctly, the main concern about 0001 was whether it
might negatively affect performance, and testing showed that, if
anything, it was a little better. Does that sound right?
Regarding 0002, I think this could use some documentation someplace
explaining the overall theory of operation. README.HOT, maybe?
+ * Most often and unless we are dealing with a pg-upgraded cluster, the
+ * root offset information should be cached. So there should not be too
+ * much overhead of fetching this information. Also, once a tuple is
+ * updated, the information will be copied to the new version. So it's not
+ * as if we're going to pay this price forever.
What if a tuple is updated -- presumably clearing the
HEAP_LATEST_TUPLE on the tuple at the end of the chain -- and then the
update aborts? Then we must be back to not having this information.
One overall question about this patch series is how we feel about
using up this many bits. 0002 uses a bit from infomask, and 0005 uses
a bit from infomask2. I'm not sure if that's everything, and then I
think we're steeling some bits from the item pointers, too. While the
performance benefits of the patch sound pretty good based on the test
results so far, this is definitely the very last time we'll be able to
implement a feature that requires this many bits.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Mar 8, 2017 at 12:14 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:Alvaro Herrera wrote:
Here's a rebased set of patches. This is the same Pavan posted; I only
fixed some whitespace and a trivial conflict in indexam.c, per 9b88f27cb42f.
Jaime noted that I forgot the attachments. Here they are.
If I recall correctly, the main concern about 0001 was whether it
might negatively affect performance, and testing showed that, if
anything, it was a little better. Does that sound right?
Not really -- it's a bit slower actually in a synthetic case measuring
exactly the slowed-down case. See
/messages/by-id/CAD__OugK12ZqMWWjZiM-YyuD1y8JmMy6x9YEctNiF3rPp6hy0g@mail.gmail.com
I bet in normal cases it's unnoticeable. If WARM flies, then it's going
to provide a larger improvement than is lost to this.
Regarding 0002, I think this could use some documentation someplace
explaining the overall theory of operation. README.HOT, maybe?
Hmm. Yeah, we should have something to that effect. 0005 includes
README.WARM, but I think there should be some place unified that
explains the whole thing.
+ * Most often and unless we are dealing with a pg-upgraded cluster, the
+ * root offset information should be cached. So there should not be too
+ * much overhead of fetching this information. Also, once a tuple is
+ * updated, the information will be copied to the new version. So it's not
+ * as if we're going to pay this price forever.
What if a tuple is updated -- presumably clearing the
HEAP_LATEST_TUPLE on the tuple at the end of the chain -- and then the
update aborts? Then we must be back to not having this information.
I will leave this question until I have grokked how this actually works.
One overall question about this patch series is how we feel about
using up this many bits. 0002 uses a bit from infomask, and 0005 uses
a bit from infomask2. I'm not sure if that's everything, and then I
think we're steeling some bits from the item pointers, too. While the
performance benefits of the patch sound pretty good based on the test
results so far, this is definitely the very last time we'll be able to
implement a feature that requires this many bits.
Yeah, this patch series uses a lot of bits. At some point we should
really add the "last full-scanned by version X" we discussed a long time
ago, and free the MOVED_IN / MOVED_OFF bits that have been unused for so
long. Sadly, once we add that, we need to wait one more release before
we can use the bits anyway.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 8, 2017 at 2:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Not really -- it's a bit slower actually in a synthetic case measuring
exactly the slowed-down case. See
/messages/by-id/CAD__OugK12ZqMWWjZiM-YyuD1y8JmMy6x9YEctNiF3rPp6hy0g@mail.gmail.com
I bet in normal cases it's unnoticeable. If WARM flies, then it's going
to provide a larger improvement than is lost to this.
Hmm, that test case isn't all that synthetic. It's just a single
column bulk update, which isn't anything all that crazy, and 5-10%
isn't nothing.
I'm kinda surprised it made that much difference, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Mar 8, 2017 at 2:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Not really -- it's a bit slower actually in a synthetic case measuring
exactly the slowed-down case. See
/messages/by-id/CAD__OugK12ZqMWWjZiM-YyuD1y8JmMy6x9YEctNiF3rPp6hy0g@mail.gmail.com
I bet in normal cases it's unnoticeable. If WARM flies, then it's going
to provide a larger improvement than is lost to this.
Hmm, that test case isn't all that synthetic. It's just a single
column bulk update, which isn't anything all that crazy,
The problem is that the update touches the second indexed column. With
the original code we would have stopped checking at that point, but with
the patched code we continue to verify all the other indexed columns for
changes.
Maybe we need more than one bitmapset to be given -- multiple ones for
for "any of these" checks (such as HOT, KEY and Identity) which can be
stopped as soon as one is found, and one for "all of these" (for WARM,
indirect indexes) which needs to be checked to completion.
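To illustrate the idea, here is a minimal standalone sketch (the names and the
attr_changed() stub are invented for illustration; the real code would work on
Bitmapsets of attribute numbers): an "any of these" set can short-circuit at
the first changed column, while an "all of these" set has to be walked to the
end to learn the full set of modified index columns.

/* illustrative sketch only -- not part of the patch */
#include <stdbool.h>
#include <stdio.h>

/* stand-in for comparing one attribute between old and new tuple versions */
static bool attr_changed(int attnum) { return attnum == 2; }

/* "any of these" (HOT, KEY, identity): stop at the first changed member */
static bool any_changed(const int *attrs, int nattrs)
{
	for (int i = 0; i < nattrs; i++)
		if (attr_changed(attrs[i]))
			return true;
	return false;
}

/* "all of these" (WARM, indirect indexes): must visit every member, so no
 * early exit is possible */
static int collect_changed(const int *attrs, int nattrs, int *out)
{
	int n = 0;

	for (int i = 0; i < nattrs; i++)
		if (attr_changed(attrs[i]))
			out[n++] = attrs[i];
	return n;
}

int main(void)
{
	int hot_attrs[] = {1, 2, 3};
	int warm_attrs[] = {1, 2, 3, 4};
	int changed[4];

	printf("HOT broken: %d\n", any_changed(hot_attrs, 3));
	printf("modified index columns: %d\n",
		   collect_changed(warm_attrs, 4, changed));
	return 0;
}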
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
@@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;+ /* + * If the index supports recheck, make sure that index tuple is saved + * during index scans. + * + * XXX Ideally, we should look at all indexes on the table and check if + * WARM is at all supported on the base table. If WARM is not supported + * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does + * do that and sets rd_supportswarm after looking at all indexes. But we + * don't know if the function was called earlier in the session when we're + * here. We can't call it now because there exists a risk of causing + * deadlock. + */ + if (indexRelation->rd_amroutine->amrecheck) + scan->xs_want_itup = true; + return scan; }
I didn't like this comment very much. But it's not necessary: you have
already given relcache responsibility for setting rd_supportswarm. The
only problem seems to be that you set it in RelationGetIndexAttrBitmap
instead of RelationGetIndexList, but it's not clear to me why. I think
if the latter function is in charge, then we can trust the flag more
than the current situation. Let's set the value to false on relcache
entry build, for safety's sake.
I noticed that nbtinsert.c and nbtree.c have a bunch of new includes
that they don't actually need. Let's remove those. nbtutils.c does
need them because of btrecheck(). Speaking of which:
I have already commented about the executor involvement in btrecheck();
that doesn't seem good. I previously suggested to pass the EState down
from caller, but that's not a great idea either since you still need to
do the actual FormIndexDatum. I now think that a workable option would
be to compute the values/isnulls arrays so that btrecheck gets them
already computed. With that, this function would be no more of a
modularity violation than HeapSatisfiesHOTAndKey() itself.
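To make the suggested division of labour concrete, here is a tiny standalone
sketch (names and types are invented stand-ins, not the patch's code; in the
real tree the caller would run FormIndexDatum() and pass Datum/bool arrays):
the caller computes values/isnull once, and the AM-side recheck only compares
them against the index tuple, so it never has to touch the executor.

/* illustrative sketch only */
#include <stdbool.h>
#include <stdio.h>

#define MAX_KEYS 4

typedef struct IndexTupleStub
{
	int		nkeys;
	long	values[MAX_KEYS];
	bool	isnull[MAX_KEYS];
} IndexTupleStub;

/* AM-side recheck: compares precomputed heap-derived values with the index
 * tuple; knows nothing about EState or expression evaluation */
static bool
am_recheck(const IndexTupleStub *itup, const long *values, const bool *isnull)
{
	for (int i = 0; i < itup->nkeys; i++)
	{
		if (itup->isnull[i] != isnull[i])
			return false;
		if (!isnull[i] && itup->values[i] != values[i])
			return false;
	}
	return true;
}

/* caller-side: stands in for the FormIndexDatum() step done once per tuple */
static void
form_index_datum(long *values, bool *isnull)
{
	values[0] = 42;	isnull[0] = false;
	values[1] = 0;	isnull[1] = true;
}

int main(void)
{
	IndexTupleStub itup = { .nkeys = 2, .values = {42, 0}, .isnull = {false, true} };
	long	values[MAX_KEYS];
	bool	isnull[MAX_KEYS];

	form_index_datum(values, isnull);
	printf("index tuple still matches heap tuple: %s\n",
		   am_recheck(&itup, values, isnull) ? "yes" : "no");
	return 0;
}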
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
After looking at how index_fetch_heap and heap_hot_search_buffer
interact, I can't say I'm in love with the idea. I started thinking
that we should not have index_fetch_heap release the buffer lock only to
re-acquire it five lines later, so it should keep the buffer lock, do
the recheck and only release it afterwards (I realize that this means
there'd be need for two additional "else release buffer lock" branches);
but then this got me thinking that perhaps it would be better to have
another routine that both calls heap_hot_search_buffer and then does the
recheck -- it occurs to me that what we're doing here is essentially
heap_warm_search_buffer.
Does that make sense?
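As a control-flow sketch only (the helpers below are invented stubs, not
PostgreSQL's API; the real routine would take the buffer, snapshot and scan
state that heap_hot_search_buffer() already takes), the idea is to hold the
buffer content lock across both the chain walk and the recheck and release it
once:

/* illustrative sketch only */
#include <stdbool.h>
#include <stdio.h>

static void lock_buffer_share(void)	{ puts("LockBuffer(SHARE)"); }
static void unlock_buffer(void)		{ puts("LockBuffer(UNLOCK)"); }

/* stand-ins for heap_hot_search_buffer() and the AM's recheck callback */
static bool hot_search_buffer(void)	{ return true; }	/* found a visible tuple */
static bool warm_recheck(void)		{ return true; }	/* index keys still match */

/* hypothetical combined routine: one lock/unlock pair around both steps */
static bool
warm_search_buffer(bool need_recheck)
{
	bool	found;

	lock_buffer_share();
	found = hot_search_buffer();
	if (found && need_recheck)
		found = warm_recheck();		/* recheck under the same lock */
	unlock_buffer();

	return found;
}

int main(void)
{
	printf("tuple qualifies: %s\n", warm_search_buffer(true) ? "yes" : "no");
	return 0;
}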
Another thing is BuildIndexInfo being called over and over for each
recheck(). Surely we need to cache the indexinfo for each indexscan.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
@@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;+ /* + * If the index supports recheck, make sure that index tuple issaved
+ * during index scans. + * + * XXX Ideally, we should look at all indexes on the table andcheck if
+ * WARM is at all supported on the base table. If WARM is not
supported
+ * then we don't need to do any recheck.
RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes.
But we
+ * don't know if the function was called earlier in the session
when we're
+ * here. We can't call it now because there exists a risk of
causing
+ * deadlock. + */ + if (indexRelation->rd_amroutine->amrecheck) + scan->xs_want_itup = true; + return scan; }I didn't like this comment very much. But it's not necessary: you have
already given relcache responsibility for setting rd_supportswarm. The
only problem seems to be that you set it in RelationGetIndexAttrBitmap
instead of RelationGetIndexList, but it's not clear to me why.
Hmm. I think you're right. Will fix that way and test.
I noticed that nbtinsert.c and nbtree.c have a bunch of new includes
that they don't actually need. Let's remove those. nbtutils.c does
need them because of btrecheck().
Right. It's probably a leftover from the way I wrote the first version.
Will fix.
Speaking of which:
I have already commented about the executor involvement in btrecheck();
that doesn't seem good. I previously suggested to pass the EState down
from caller, but that's not a great idea either since you still need to
do the actual FormIndexDatum. I now think that a workable option would
be to compute the values/isnulls arrays so that btrecheck gets them
already computed.
I agree with your complaint about the modularity violation. What I am unclear
about is how passing the values/isnulls arrays will fix that. The way the code
is currently structured, recheck routines are called by index_fetch_heap(). So
if we try to compute values/isnulls in that function, we'll still need to
access the EState, which AFAIU will lead to a similar violation. Or am I
mis-reading your idea?
I wonder if we should instead invent something similar to IndexRecheck(),
but instead of running ExecQual(), this new routine will compare the index
values computed from the given HeapTuple against the given IndexTuple. ISTM that for this
to work we'll need to modify all callers of index_getnext() and teach them
to invoke the AM specific recheck method if xs_tuple_recheck flag is set to
true by index_getnext().
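A rough caller-side sketch of that alternative (the stub names below are
invented; they merely stand in for index_getnext() and the AM-specific recheck
callback): index_getnext() only sets a flag, and the caller decides whether to
run the recheck before using the tuple.

/* illustrative sketch only */
#include <stdbool.h>
#include <stdio.h>

typedef struct ScanStub
{
	int		remaining;			/* tuples left to return */
	bool	xs_tuple_recheck;	/* set when the heap tuple came off a WARM chain */
} ScanStub;

static bool
index_getnext_stub(ScanStub *scan)
{
	if (scan->remaining <= 0)
		return false;
	scan->remaining--;
	scan->xs_tuple_recheck = (scan->remaining % 2 == 0);	/* pretend every other tuple is WARM */
	return true;
}

static bool
am_recheck_stub(void)
{
	return true;				/* compare heap keys against the index tuple */
}

int main(void)
{
	ScanStub	scan = { .remaining = 4, .xs_tuple_recheck = false };

	while (index_getnext_stub(&scan))
	{
		if (scan.xs_tuple_recheck && !am_recheck_stub())
			continue;			/* index entry does not match this heap tuple */
		puts("process tuple");	/* passed visibility and (if needed) recheck */
	}
	return 0;
}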
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 14, 2017 at 5:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
After looking at how index_fetch_heap and heap_hot_search_buffer
interact, I can't say I'm in love with the idea. I started thinking
that we should not have index_fetch_heap release the buffer lock only to
re-acquire it five lines later, so it should keep the buffer lock, do
the recheck and only release it afterwards (I realize that this means
there'd be need for two additional "else release buffer lock" branches);
Yes, it makes sense.
but then this got me thinking that perhaps it would be better to have
another routine that does both call heap_hot_search_buffer and then call
recheck -- it occurs to me that what we're doing here is essentially
heap_warm_search_buffer.Does that make sense?
We can do that, but it's not clear to me if that would be a huge
improvement. Also, I think we need to first decide on how to model the
recheck logic since that might affect this function significantly. For
example, if we decide to do recheck at a higher level then we will most
likely end up releasing and reacquiring the lock anyway.
Another thing is BuildIndexInfo being called over and over for each
recheck(). Surely we need to cache the indexinfo for each indexscan.
Good point. What should that place be though? Can we just cache them in the
relcache and maintain them along with the list of indexes? Looking at the
current callers, ExecOpenIndices() usually caches them in the ResultRelInfo,
which is sufficient because INSERT/UPDATE/DELETE code paths are the most
relevant paths where caching definitely helps. The only other place where
it may get called once per tuple is unique_key_recheck(), which is used for
deferred unique key tests and hence probably not very common.
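For what it's worth, the build-once shape being discussed looks roughly like
this (a minimal sketch with invented names; the real structures would be
IndexInfo plus the index scan or ResultRelInfo): the BuildIndexInfo()-style
work happens once when the scan is set up and every per-tuple recheck reuses
the cached copy.

/* illustrative sketch only */
#include <stdio.h>

typedef struct IndexInfoStub { int nkeyatts; } IndexInfoStub;

typedef struct IndexScanStub
{
	IndexInfoStub cached_info;	/* filled once in "beginscan" */
} IndexScanStub;

static IndexInfoStub
build_index_info_stub(void)
{
	return (IndexInfoStub) { .nkeyatts = 2 };
}

static void
begin_scan(IndexScanStub *scan)
{
	scan->cached_info = build_index_info_stub();	/* pay the cost once per scan */
}

static void
recheck_tuple(const IndexScanStub *scan)
{
	/* reuse scan->cached_info; no per-tuple BuildIndexInfo() call */
	printf("recheck using %d key attribute(s)\n", scan->cached_info.nkeyatts);
}

int main(void)
{
	IndexScanStub scan;

	begin_scan(&scan);
	for (int i = 0; i < 3; i++)
		recheck_tuple(&scan);
	return 0;
}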
BTW I wanted to share some more numbers from a recent performance test. I
thought it's important because the latest patch has fully functional chain
conversion code as well as all WAL-logging related pieces are in place
too. I ran these tests on a box borrowed from Tomas (thanks!). This has
64GB RAM and 350GB SSD with 1GB on-board RAM. I used the same test setup
that I used for the first test results reported on this thread i.e. a
modified pgbench_accounts table with additional columns and additional
indexes (one index on abalance so that every UPDATE is a potential WARM
update).
In a test where table + indexes exceeds RAM, running for 8hrs and
auto-vacuum parameters set such that we get 2-3 autovacuums on the table
during the test, we see WARM delivering more than 100% TPS as compared to
master. In this graph, I've plotted a moving average of TPS and the spikes
that we see coincide with the checkpoints (checkpoint_timeout is set to
20mins and max_wal_size large enough to avoid any xlog-based checkpoints).
The spikes are more prominent on WARM but I guess that's purely because it
delivers much higher TPS. I haven't shown it here, but I see WARM updates close
to 65-70% of the total updates. Also there is a significant reduction in WAL
generated per txn.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
Moderate_AV_4Indexes_100FF_SF1200_Duration28800s_Run2.pdf (application/pdf)