Multiple FPI_FOR_HINT records for the same block while killing btree index items
Hi all,
While investigating an issue one of our customers hit, I realized that
_bt_killitems() can record an FPI for the same page multiple times.
This can happen when several concurrent index scans are processing
pages that contain killable tuples. Because killing index items is
done while holding the buffer lock only in shared mode, concurrent
processes can each record an FPI_FOR_HINT for the same block.
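For context, MarkBufferDirtyHint() only emits the FPI when the buffer
is not already dirty; the gate looks roughly like this (paraphrased
from bufmgr.c, so check the source for details):

    if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        ...
        if (XLogHintBitIsNeeded() &&
            (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
        ...
    }

With only a share lock on the page, two backends can both pass that
check before either of them marks the buffer dirty, and then each of
them logs an FPI.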
Here is the reproducer:
cat <<EOF | psql -d postgres
drop table if exists tbl;
create table tbl (c int primary key) with (autovacuum_enabled = off);
insert into tbl select generate_series(1,300);
update tbl set c = c * -1 where c = 100;
checkpoint;
EOF
for n in `seq 1 4`
do
psql -d postgres -c "select from tbl where c = 100" &
done
The server needs wal_log_hints enabled, and the script might need to
be run several times. After running it, we can see the issue with
pg_waldump:
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8,
prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408,
prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
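For reference, output like the above can be obtained with an
invocation along these lines (the WAL segment name is inferred from
the LSNs above and will differ on your system):

pg_waldump -p $PGDATA/pg_wal 00000001000000010000008F | grep FPI_FOR_HINT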
This is an excerpt from _bt_killitems() in version 12.2. Recent
changes mean the code on HEAD looks quite different, but the part in
question is essentially unchanged; the issue is reproducible on HEAD
as well.
    for (i = 0; i < numKilled; i++)
    {
        int         itemIndex = so->killedItems[i];
        BTScanPosItem *kitem = &so->currPos.items[itemIndex];
        OffsetNumber offnum = kitem->indexOffset;

        Assert(itemIndex >= so->currPos.firstItem &&
               itemIndex <= so->currPos.lastItem);
        if (offnum < minoff)
            continue;           /* pure paranoia */
        while (offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
            {
                /* found the item */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }
            offnum = OffsetNumberNext(offnum);
        }
    }

    /*
     * Since this can be redone later if needed, mark as dirty hint.
     *
     * Whenever we mark anything LP_DEAD, we also set the page's
     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
     */
    if (killedsomething)
    {
        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);
    }
The test at the "found the item" comment never checks whether the item
is already dead, so maybe we can add !ItemIdIsDead(iid) to that
condition. But even then there is still a race in which multiple FPIs
can be recorded. Maybe a better solution is to take the lock in
exclusive mode, at least when wal_log_hints = on, so that only one
process can run this code -- the reduction in concurrency might be won
back by the fact that we don't WAL-log the page multiple times.
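To sketch that first idea against the excerpt above (only the changed
condition; untested):

    if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid) &&
        !ItemIdIsDead(iid))
    {
        /* found the item, and it is not already dead */
        ItemIdMarkDead(iid);
        killedsomething = true;
        break;          /* out of inner search loop */
    }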
I understand that we can call MarkBufferDirtyHint() while holding the
buffer lock in share mode, as the comment in MarkBufferDirtyHint()
says, but I'd like to improve this behavior so that we can avoid
multiple FPI_FOR_HINT records for the same block.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-09, Masahiko Sawada wrote:
> The test at the "found the item" comment never checks whether the
> item is already dead, so maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then there is still a race in which multiple FPIs
> can be recorded. Maybe a better solution is to take the lock in
> exclusive mode, at least when wal_log_hints = on, so that only one
> process can run this code -- the reduction in concurrency might be
> won back by the fact that we don't WAL-log the page multiple times.
I agree.
It seems worth pointing out that when this code was written, these hint
bit changes were not logged, so this consideration did not apply then.
But we added data checksums and wal_log_hints, which changed the
equation.
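For reference, the macro that gates this is XLogHintBitIsNeeded(); if
I remember correctly it boils down to the following (see
src/include/access/xlog.h):

    #define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)

so enabling either data checksums or wal_log_hints turns a
hint-bit-only page dirtying into a potential full-page image after
each checkpoint.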
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> Here is the reproducer:
What version of Postgres did you notice the actual customer issue on?
I ask because I wonder if the work on B-Tree indexes in Postgres 12
affects the precise behavior you get here with real world workloads.
It probably makes _bt_killitems() more effective with some workloads,
which naturally increases the likelihood of multiple FPIs being issued
in the manner you describe. OTOH, it might make it less likely
with low cardinality indexes, since large groups of garbage duplicate
tuples tend to get concentrated on just a few leaf pages.
> The test at the "found the item" comment never checks whether the
> item is already dead, so maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then there is still a race in which multiple FPIs
> can be recorded. Maybe a better solution is to take the lock in
> exclusive mode, at least when wal_log_hints = on, so that only one
> process can run this code -- the reduction in concurrency might be
> won back by the fact that we don't WAL-log the page multiple times.
I like the idea of checking !ItemIdIsDead(iid) as a further condition
of killing the item -- there is clearly no point in doing work to kill
an item that is already dead. I don't like the idea of using an
exclusive buffer lock (even if it's just with wal_log_hints = on),
though.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of multiple FPIs being issued
> in the manner you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.
We saw the issue on our PG11 clusters. The specific index we noticed
in the WAL dump (I don't think we confirmed whether there were others)
was one on a `created_at` column, to give you an idea of cardinality.
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
I don't have a strong opinion on the lock.
James
On Thu, Apr 9, 2020 at 1:37 PM James Coleman <jtc331@gmail.com> wrote:
> We saw the issue on our PG11 clusters. The specific index we noticed
> in the WAL dump (I don't think we confirmed whether there were others)
> was one on a `created_at` column, to give you an idea of cardinality.
You tend to get a lot of problems with indexes like that when there
are consistent updates (actually, that's more of a thing with an
updated_at index). But non-HOT updates alone might result in what you
could describe as "updates" to the index.
With Postgres 11, a low cardinality index could place new/successor
duplicate index tuples (those needed for non-HOT updates) on a more or
less random leaf page (you'll recall that this is determined by the
old "getting tired" logic). This is the kind of thing I had in mind
when I asked Sawada-san about it.
Was this a low cardinality index in the way I describe? If it was,
then we can hope (and maybe even verify) that the Postgres 12 work
noticeably ameliorates the problem.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Was this a low cardinality index in the way I describe? If it was,
> then we can hope (and maybe even verify) that the Postgres 12 work
> noticeably ameliorates the problem.
What I really meant was an index where hundreds or even thousands of
rows for each distinct timestamp value are expected. Not an index
where almost every row has a distinct timestamp value. Both timestamp
index patterns are common, obviously.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Was this a low cardinality index in the way I describe? If it was,
> > then we can hope (and maybe even verify) that the Postgres 12 work
> > noticeably ameliorates the problem.
>
> What I really meant was an index where hundreds or even thousands of
> rows for each distinct timestamp value are expected. Not an index
> where almost every row has a distinct timestamp value. Both timestamp
> index patterns are common, obviously.
I'll try to run some numbers tomorrow to confirm, but I believe that
the created_at value is almost (if not completely) unique. So, no,
it's not a low cardinality case like that.
I believe the write pattern to this table likely looks like:
- INSERT
- UPDATE
- DELETE
for every row. But tomorrow I can do some more digging if needed.
James
On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> I believe the write pattern to this table likely looks like:
> - INSERT
> - UPDATE
> - DELETE
> for every row. But tomorrow I can do some more digging if needed.
The pg_stats.null_frac for the column/index might be interesting here.
I believe that Active Record will sometimes generate created_at columns
that end up containing NULL values. Not sure why.
--
Peter Geoghegan
On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
Okay. I think just adding that check would also help reduce the
likelihood. How about the changes for current HEAD I've attached?
Related to this behavior on btree indexes, the same thing can happen
on heap pages while searching for heap tuples. To reduce the
likelihood more generally, I wonder if we could acquire the buffer
header lock right before XLogSaveBufferForHint() and set a flag in the
buffer descriptor indicating that we're about to log an FPI for a hint
bit, so that concurrent processes can be aware of it.
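A rough sketch of that idea, where BM_HINT_FPI_INPROGRESS is a
hypothetical new flag bit that doesn't exist today:

    buf_state = LockBufHdr(bufHdr);
    if (buf_state & BM_HINT_FPI_INPROGRESS)
    {
        /* someone else is already logging an FPI for this page */
        UnlockBufHdr(bufHdr, buf_state);
    }
    else
    {
        /* announce our intention, then emit the FPI ourselves */
        UnlockBufHdr(bufHdr, buf_state | BM_HINT_FPI_INPROGRESS);
        lsn = XLogSaveBufferForHint(buffer, buffer_std);
        /* ... mark the buffer dirty, then clear the flag again ... */
    }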
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
btree_fpi.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 08d7201f19..3b8865331a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1858,7 +1858,17 @@ _bt_killitems(IndexScanDesc scan)
             else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
                 killtuple = true;
 
-            if (killtuple)
+            /*
+             * Mark the index item dead, if it isn't already. Since
+             * killing dead index items is done while holding a buffer
+             * lock in share mode, it's possible that concurrent
+             * processes attempt to kill the same index item
+             * simultaneously, which in turn leads to writing multiple
+             * full-page images for hint bits. Checking whether the
+             * index item is already marked dead helps reduce the
+             * likelihood of multiple full-page writes.
+             */
+            if (killtuple && !ItemIdIsDead(iid))
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
At Fri, 10 Apr 2020 12:32:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> > I like the idea of checking !ItemIdIsDead(iid) as a further
> > condition of killing the item -- there is clearly no point in doing
> > work to kill an item that is already dead. I don't like the idea of
> > using an exclusive buffer lock (even if it's just with
> > wal_log_hints = on), though.
>
> Okay. I think just adding that check would also help reduce the
> likelihood. How about the changes for current HEAD I've attached?
FWIW, looks good to me.
> Related to this behavior on btree indexes, the same thing can happen
> on heap pages while searching for heap tuples. To reduce the
> likelihood more generally, I wonder if we could acquire the buffer
> header lock right before XLogSaveBufferForHint() and set a flag in
> the buffer descriptor indicating that we're about to log an FPI for a
> hint bit, so that concurrent processes can be aware of it.
Makes sense if the lock were acquired just before the (BM_DIRTY |
BM_JUST_DIRTIED) check. Could we use double-checking, similar to the
ItemIdIsDead() test in the patch?

    if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        ...
         * essential that CreateCheckpoint waits for virtual transactions
         * rather than full transactionids.
         */
        /* blah, blah */
        buf_state = LockBufHdr(bufHdr);
        if ((buf_state & (BM_ | BM_JUST)) != (..))
        {
            MyProc->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
        }
    }
    else
        buf_state = LockBufHdr(bufHdr);
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The pg_stats.null_frac for the column/index might be interesting
> here. I believe that Active Record will sometimes generate created_at
> columns that end up containing NULL values. Not sure why.
null_frac is 0 for created_at (what I expected). Also (in the current
data) all created_at values are unique except for a single duplicated
row. That being said, remember the write pattern above: every row gets
deleted eventually, so there'd be a lot of dead tuples overall.
James
On 2020-Apr-10, Masahiko Sawada wrote:
> Okay. I think just adding that check would also help reduce the
> likelihood. How about the changes for current HEAD I've attached?
Pushed this to all branches. (Branches 12 and older obviously needed an
adjustment.) Thanks!
> Related to this behavior on btree indexes, the same thing can happen
> on heap pages while searching for heap tuples. To reduce the
> likelihood more generally, I wonder if we could acquire the buffer
> header lock right before XLogSaveBufferForHint() and set a flag in
> the buffer descriptor indicating that we're about to log an FPI for a
> hint bit, so that concurrent processes can be aware of it.
I'm not sure how that helps; the other process would have to go back and
redo their whole operation from scratch in order to find out whether
there's still something alive that needs killing.
I think you need to acquire the exclusive lock sooner: if, when scanning
the page, you find a killable item, *then* upgrade the lock to exclusive
and restart the scan. This means that we'll have to wait for any other
process that's doing the scan, and they will all give up their share
lock to wait for the exclusive lock they need. So the one that gets it
first will do all the killing, log the page, then release the lock. At
that point the other processes will wake up and see that items have been
killed, so they will return having done nothing.
Like the attached. I didn't verify that it works well or that it
actually improves performance ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
excl-killitems.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..95ba3cc401 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1726,6 +1726,7 @@ _bt_killitems(IndexScanDesc scan)
     int         numKilled = so->numKilled;
     bool        killedsomething = false;
     bool        droppedpin PG_USED_FOR_ASSERTS_ONLY;
+    bool        have_exclusive = false;
 
     Assert(BTScanPosIsValid(so->currPos));
 
@@ -1767,6 +1768,7 @@ _bt_killitems(IndexScanDesc scan)
         }
     }
 
+restart:
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     minoff = P_FIRSTDATAKEY(opaque);
     maxoff = PageGetMaxOffsetNumber(page);
@@ -1864,6 +1866,19 @@ _bt_killitems(IndexScanDesc scan)
              */
             if (killtuple && !ItemIdIsDead(iid))
             {
+                if (!have_exclusive && XLogHintBitIsNeeded())
+                {
+                    /*
+                     * If wal_log_hints is enabled, we only want to do this
+                     * with an exclusive lock, so exchange our lock and
+                     * restart from the top.
+                     */
+                    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+                    LockBuffer(so->currPos.buf, BT_WRITE);
+                    have_exclusive = true;
+                    goto restart;
+                }
+
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give
> up their share lock to wait for the exclusive lock they need. So the
> one that gets it first will do all the killing, log the page, then
> release the lock. At that point the other processes will wake up and
> see that items have been killed, so they will return having done
> nothing.
>
> Like the attached. I didn't verify that it works well or that it
> actually improves performance ...
This is not related to your latest patch, but I believe I can improve
the performance. Consider the case where:
1. killedsomething stays false;
2. killtuple is true, but (!ItemIdIsDead(iid)) is false;
3. so nothing is done.
Why do all the work and then discard it? We can eliminate the current
item much earlier by testing whether it is already dead.
regards,
Ranier VIlela
Attachments:
avoid_killing_btree_items_aready_dead.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..ee7e8edf58 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1776,6 +1776,7 @@ _bt_killitems(IndexScanDesc scan)
         int         itemIndex = so->killedItems[i];
         BTScanPosItem *kitem = &so->currPos.items[itemIndex];
         OffsetNumber offnum = kitem->indexOffset;
+        OffsetNumber current_offnum PG_USED_FOR_ASSERTS_ONLY;
 
         Assert(itemIndex >= so->currPos.firstItem &&
                itemIndex <= so->currPos.lastItem);
@@ -1784,9 +1785,19 @@ _bt_killitems(IndexScanDesc scan)
         while (offnum <= maxoff)
         {
             ItemId      iid = PageGetItemId(page, offnum);
-            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);
-            bool        killtuple = false;
+            IndexTuple  ituple;
+            bool        killtuple;
+
+            /* Save the current offnum and advance, avoiding calling OffsetNumberNext twice */
+            current_offnum = offnum;
+            offnum = OffsetNumberNext(offnum);
+
+            /* Item is already dead, go to the next one */
+            if (ItemIdIsDead(iid))
+                continue;
+            killtuple = false;
+            ituple = (IndexTuple) PageGetItem(page, iid);
             if (BTreeTupleIsPosting(ituple))
             {
                 int         pi = i + 1;
@@ -1819,7 +1830,7 @@ _bt_killitems(IndexScanDesc scan)
              * though only in the common case where the page can't
              * have been concurrently modified
              */
-            Assert(kitem->indexOffset == offnum || !droppedpin);
+            Assert(kitem->indexOffset == current_offnum || !droppedpin);
 
             /*
              * Read-ahead to later kitems here.
@@ -1862,14 +1873,13 @@ _bt_killitems(IndexScanDesc scan)
              * set to WAL (if wal_log_hints or data checksums are enabled),
              * which is undesirable.
              */
-            if (killtuple && !ItemIdIsDead(iid))
+            if (killtuple)
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
                 break;          /* out of inner search loop */
             }
-            offnum = OffsetNumberNext(offnum);
         }
     }
On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give
> up their share lock to wait for the exclusive lock they need. So the
> one that gets it first will do all the killing, log the page, then
> release the lock. At that point the other processes will wake up and
> see that items have been killed, so they will return having done
> nothing.
Regarding the lock, I disagree in part: in the worst case the
exclusive lock may be requested only for the last item analyzed,
redoing all the work from the beginning. If we are in _bt_killitems,
it is because there is a high probability that there are items to be
deleted, so why not request the lock right away, if these conditions
are met?
1. XLogHintBitIsNeeded()
2. !AutoVacuumingActive()
3. A new configuration variable option to activate the exclusive lock?
Masahiko reported that it occurs only with (autovacuum_enabled = off).
regards,
Ranier Vilela
Attachments:
avoid_killing_btree_items_already_dead_v2.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..e1b1110bf1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1767,6 +1767,15 @@ _bt_killitems(IndexScanDesc scan)
         }
     }
 
+    if (XLogHintBitIsNeeded() && !AutoVacuumingActive())
+    {
+        /*
+         * If wal_log_hints is enabled, we only want to do this
+         * with an exclusive lock.
+         */
+        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+        LockBuffer(so->currPos.buf, BT_WRITE);
+    }
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     minoff = P_FIRSTDATAKEY(opaque);
     maxoff = PageGetMaxOffsetNumber(page);
@@ -1776,6 +1785,7 @@ _bt_killitems(IndexScanDesc scan)
         int         itemIndex = so->killedItems[i];
         BTScanPosItem *kitem = &so->currPos.items[itemIndex];
         OffsetNumber offnum = kitem->indexOffset;
+        OffsetNumber current_offnum PG_USED_FOR_ASSERTS_ONLY;
 
         Assert(itemIndex >= so->currPos.firstItem &&
                itemIndex <= so->currPos.lastItem);
@@ -1784,9 +1794,19 @@ _bt_killitems(IndexScanDesc scan)
         while (offnum <= maxoff)
         {
             ItemId      iid = PageGetItemId(page, offnum);
-            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);
-            bool        killtuple = false;
+            IndexTuple  ituple;
+            bool        killtuple;
+
+            /* Save the current offnum and advance, avoiding calling OffsetNumberNext twice */
+            current_offnum = offnum;
+            offnum = OffsetNumberNext(offnum);
+
+            /* Item is already dead, go to the next one */
+            if (ItemIdIsDead(iid))
+                continue;
+            killtuple = false;
+            ituple = (IndexTuple) PageGetItem(page, iid);
             if (BTreeTupleIsPosting(ituple))
             {
                 int         pi = i + 1;
@@ -1819,7 +1839,7 @@ _bt_killitems(IndexScanDesc scan)
              * though only in the common case where the page can't
              * have been concurrently modified
              */
-            Assert(kitem->indexOffset == offnum || !droppedpin);
+            Assert(kitem->indexOffset == current_offnum || !droppedpin);
 
             /*
              * Read-ahead to later kitems here.
@@ -1862,14 +1882,13 @@ _bt_killitems(IndexScanDesc scan)
              * set to WAL (if wal_log_hints or data checksums are enabled),
              * which is undesirable.
              */
-            if (killtuple && !ItemIdIsDead(iid))
+            if (killtuple)
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
                 break;          /* out of inner search loop */
             }
-            offnum = OffsetNumberNext(offnum);
         }
     }