Multiple FPI_FOR_HINT records for the same block while killing btree index items
Hi all,
While investigating an issue one of our customers hit, I realized that
_bt_killitems() can record an FPI for the same page multiple times.
This can happen when several concurrent index scans are processing
pages that contain killable tuples. Because killing index items is
done while holding the buffer lock only in shared mode, concurrent
processes can each record an FPI_FOR_HINT for the same block.
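For context, MarkBufferDirtyHint() only emits the FPI when the buffer
is not already dirty; the gate looks roughly like this (paraphrased
from bufmgr.c, so check the source for details):

    if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        ...
        if (XLogHintBitIsNeeded() &&
            (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
        ...
    }

With only a share lock on the page, two backends can both pass that
check before either of them marks the buffer dirty, and then each of
them logs an FPI.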
Here is the reproducer:
cat <<EOF | psql -d postgres
drop table if exists tbl;
create table tbl (c int primary key) with (autovacuum_enabled = off);
insert into tbl select generate_series(1,300);
update tbl set c = c * -1 where c = 100;
checkpoint;
EOF
for n in `seq 1 4`
do
psql -d postgres -c "select from tbl where c = 100" &
done
The server needs wal_log_hints enabled, and the script might need to
be run several times. After running it, we can see the issue with
pg_waldump:
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8,
prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408,
prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
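For reference, output like the above can be obtained with an
invocation along these lines (the WAL segment name is inferred from
the LSNs above and will differ on your system):

pg_waldump -p $PGDATA/pg_wal 00000001000000010000008F | grep FPI_FOR_HINT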
This is an excerpt from _bt_killitems() in version 12.2. Recent
changes mean the code on HEAD looks quite different, but the part in
question is essentially unchanged; the issue is reproducible on HEAD
as well.
    for (i = 0; i < numKilled; i++)
    {
        int         itemIndex = so->killedItems[i];
        BTScanPosItem *kitem = &so->currPos.items[itemIndex];
        OffsetNumber offnum = kitem->indexOffset;

        Assert(itemIndex >= so->currPos.firstItem &&
               itemIndex <= so->currPos.lastItem);
        if (offnum < minoff)
            continue;           /* pure paranoia */
        while (offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
            {
                /* found the item */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }
            offnum = OffsetNumberNext(offnum);
        }
    }

    /*
     * Since this can be redone later if needed, mark as dirty hint.
     *
     * Whenever we mark anything LP_DEAD, we also set the page's
     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
     */
    if (killedsomething)
    {
        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);
    }
The test at the "found the item" comment never checks whether the item
is already dead, so maybe we can add !ItemIdIsDead(iid) to that
condition. But even then there is still a race in which multiple FPIs
can be recorded. Maybe a better solution is to take the lock in
exclusive mode, at least when wal_log_hints = on, so that only one
process can run this code -- the reduction in concurrency might be won
back by the fact that we don't WAL-log the page multiple times.
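To sketch that first idea against the excerpt above (only the changed
condition; untested):

    if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid) &&
        !ItemIdIsDead(iid))
    {
        /* found the item, and it is not already dead */
        ItemIdMarkDead(iid);
        killedsomething = true;
        break;          /* out of inner search loop */
    }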
I understand that we can call MarkBufferDirtyHint() while holding the
buffer lock in share mode, as the comment in MarkBufferDirtyHint()
says, but I'd like to improve this behavior so that we can avoid
multiple FPI_FOR_HINT records for the same block.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-09, Masahiko Sawada wrote:
> The test at the "found the item" comment never checks whether the
> item is already dead, so maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then there is still a race in which multiple FPIs
> can be recorded. Maybe a better solution is to take the lock in
> exclusive mode, at least when wal_log_hints = on, so that only one
> process can run this code -- the reduction in concurrency might be
> won back by the fact that we don't WAL-log the page multiple times.
I agree.
It seems worth pointing out that when this code was written, these hint
bit changes were not logged, so this consideration did not apply then.
But we added data checksums and wal_log_hints, which changed the
equation.
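For reference, the macro that gates this is XLogHintBitIsNeeded(); if
I remember correctly it boils down to the following (see
src/include/access/xlog.h):

    #define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)

so enabling either data checksums or wal_log_hints turns a
hint-bit-only page dirtying into a potential full-page image after
each checkpoint.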
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> Here is the reproducer:
What version of Postgres did you notice the actual customer issue on?
I ask because I wonder if the work on B-Tree indexes in Postgres 12
affects the precise behavior you get here with real world workloads.
It probably makes _bt_killitems() more effective with some workloads,
which naturally increases the likelihood of multiple FPIs being issued
in the manner you describe. OTOH, it might make it less likely
with low cardinality indexes, since large groups of garbage duplicate
tuples tend to get concentrated on just a few leaf pages.
> The test at the "found the item" comment never checks whether the
> item is already dead, so maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then there is still a race in which multiple FPIs
> can be recorded. Maybe a better solution is to take the lock in
> exclusive mode, at least when wal_log_hints = on, so that only one
> process can run this code -- the reduction in concurrency might be
> won back by the fact that we don't WAL-log the page multiple times.
I like the idea of checking !ItemIdIsDead(iid) as a further condition
of killing the item -- there is clearly no point in doing work to kill
an item that is already dead. I don't like the idea of using an
exclusive buffer lock (even if it's just with wal_log_hints = on),
though.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of multiple FPIs being issued
> in the manner you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.
We saw the issue on our PG11 clusters. The specific index we noticed
in the WAL dump (I don't think we confirmed whether there were others)
was one on a `created_at` column, to give you an idea of cardinality.
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
I don't have a strong opinion on the lock.
James
On Thu, Apr 9, 2020 at 1:37 PM James Coleman <jtc331@gmail.com> wrote:
> We saw the issue on our PG11 clusters. The specific index we noticed
> in the WAL dump (I don't think we confirmed whether there were others)
> was one on a `created_at` column, to give you an idea of cardinality.
You tend to get a lot of problems with indexes like that when there
are consistent updates (actually, that's more of a thing with an
updated_at index). But non-HOT updates alone might result in what you
could describe as "updates" to the index.
With Postgres 11, a low cardinality index could place new/successor
duplicate index tuples (those needed for non-HOT updates) on a more or
less random leaf page (you'll recall that this is determined by the
old "getting tired" logic). This is the kind of thing I had in mind
when I asked Sawada-san about it.
Was this a low cardinality index in the way I describe? If it was,
then we can hope (and maybe even verify) that the Postgres 12 work
noticeably ameliorates the problem.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Was this a low cardinality index in the way I describe? If it was,
> then we can hope (and maybe even verify) that the Postgres 12 work
> noticeably ameliorates the problem.
What I really meant was an index where hundreds or even thousands of
rows for each distinct timestamp value are expected. Not an index
where almost every row has a distinct timestamp value. Both timestamp
index patterns are common, obviously.
--
Peter Geoghegan
On Thu, Apr 9, 2020 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Was this a low cardinality index in the way I describe? If it was,
> > then we can hope (and maybe even verify) that the Postgres 12 work
> > noticeably ameliorates the problem.
>
> What I really meant was an index where hundreds or even thousands of
> rows for each distinct timestamp value are expected. Not an index
> where almost every row has a distinct timestamp value. Both timestamp
> index patterns are common, obviously.
I'll try to run some numbers tomorrow to confirm, but I believe that
the created_at value is almost (if not completely) unique. So, no,
it's not a low cardinality case like that.
I believe the write pattern to this table likely looks like:
- INSERT
- UPDATE
- DELETE
for every row. But tomorrow I can do some more digging if needed.
James
On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> I believe the write pattern to this table likely looks like:
> - INSERT
> - UPDATE
> - DELETE
> for every row. But tomorrow I can do some more digging if needed.
The pg_stats.null_frac for the column/index might be interesting here.
I believe that Active Record will sometimes generate created_at columns
that end up containing NULL values. Not sure why.
--
Peter Geoghegan
On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
Okay. I think just adding that check would also help reduce the
likelihood. How about the changes for current HEAD I've attached?
Related to this behavior on btree indexes, the same thing can happen
on heap pages while searching for heap tuples. To reduce the
likelihood more generally, I wonder if we could acquire the buffer
header lock right before XLogSaveBufferForHint() and set a flag in the
buffer descriptor indicating that we're about to log an FPI for a hint
bit, so that concurrent processes can be aware of it.
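A rough sketch of that idea, where BM_HINT_FPI_INPROGRESS is a
hypothetical new flag bit that doesn't exist today:

    buf_state = LockBufHdr(bufHdr);
    if (buf_state & BM_HINT_FPI_INPROGRESS)
    {
        /* someone else is already logging an FPI for this page */
        UnlockBufHdr(bufHdr, buf_state);
    }
    else
    {
        /* announce our intention, then emit the FPI ourselves */
        UnlockBufHdr(bufHdr, buf_state | BM_HINT_FPI_INPROGRESS);
        lsn = XLogSaveBufferForHint(buffer, buffer_std);
        /* ... mark the buffer dirty, then clear the flag again ... */
    }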
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
btree_fpi.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 08d7201f19..3b8865331a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1858,7 +1858,17 @@ _bt_killitems(IndexScanDesc scan)
             else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
                 killtuple = true;
 
-            if (killtuple)
+            /*
+             * Mark the index item dead, if it isn't already. Since
+             * killing dead index items is done while holding a buffer
+             * lock in share mode, it's possible that concurrent
+             * processes attempt to kill the same index item
+             * simultaneously, which in turn leads to writing multiple
+             * full-page images for hint bits. Checking whether the
+             * index item is already marked dead helps reduce the
+             * likelihood of multiple full-page writes.
+             */
+            if (killtuple && !ItemIdIsDead(iid))
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
At Fri, 10 Apr 2020 12:32:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> > I like the idea of checking !ItemIdIsDead(iid) as a further
> > condition of killing the item -- there is clearly no point in doing
> > work to kill an item that is already dead. I don't like the idea of
> > using an exclusive buffer lock (even if it's just with
> > wal_log_hints = on), though.
>
> Okay. I think just adding that check would also help reduce the
> likelihood. How about the changes for current HEAD I've attached?
FWIW, looks good to me.
> Related to this behavior on btree indexes, the same thing can happen
> on heap pages while searching for heap tuples. To reduce the
> likelihood more generally, I wonder if we could acquire the buffer
> header lock right before XLogSaveBufferForHint() and set a flag in
> the buffer descriptor indicating that we're about to log an FPI for a
> hint bit, so that concurrent processes can be aware of it.
Makes sense if the lock were acquired just before the (BM_DIRTY |
BM_JUST_DIRTIED) check. Could we use double-checking, similar to the
ItemIdIsDead() test in the patch?

    if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        ...
         * essential that CreateCheckpoint waits for virtual transactions
         * rather than full transactionids.
         */
        /* blah, blah */
        buf_state = LockBufHdr(bufHdr);
        if ((buf_state & (BM_ | BM_JUST)) != (..))
        {
            MyProc->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
        }
    }
    else
        buf_state = LockBufHdr(bufHdr);
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The pg_stats.null_frac for the column/index might be interesting
> here. I believe that Active Record will sometimes generate created_at
> columns that end up containing NULL values. Not sure why.
null_frac is 0 for created_at (what I expected). Also (in the current
data) all created_at values are unique except for a single duplicated
row. That being said, remember the write pattern above: every row gets
deleted eventually, so there'd be a lot of dead tuples overall.
James
On 2020-Apr-10, Masahiko Sawada wrote:
> Okay. I think just adding that check would also help reduce the
> likelihood. How about the changes for current HEAD I've attached?
Pushed this to all branches. (Branches 12 and older obviously needed an
adjustment.) Thanks!
> Related to this behavior on btree indexes, the same thing can happen
> on heap pages while searching for heap tuples. To reduce the
> likelihood more generally, I wonder if we could acquire the buffer
> header lock right before XLogSaveBufferForHint() and set a flag in
> the buffer descriptor indicating that we're about to log an FPI for a
> hint bit, so that concurrent processes can be aware of it.
I'm not sure how that helps; the other process would have to go back and
redo their whole operation from scratch in order to find out whether
there's still something alive that needs killing.
I think you need to acquire the exclusive lock sooner: if, when scanning
the page, you find a killable item, *then* upgrade the lock to exclusive
and restart the scan. This means that we'll have to wait for any other
process that's doing the scan, and they will all give up their share
lock to wait for the exclusive lock they need. So the one that gets it
first will do all the killing, log the page, then release the lock. At
that point the other processes will wake up and see that items have been
killed, so they will return having done nothing.
Like the attached. I didn't verify that it works well or that it
actually improves performance ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
excl-killitems.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..95ba3cc401 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1726,6 +1726,7 @@ _bt_killitems(IndexScanDesc scan)
     int         numKilled = so->numKilled;
     bool        killedsomething = false;
     bool        droppedpin PG_USED_FOR_ASSERTS_ONLY;
+    bool        have_exclusive = false;
 
     Assert(BTScanPosIsValid(so->currPos));
 
@@ -1767,6 +1768,7 @@ _bt_killitems(IndexScanDesc scan)
         }
     }
 
+restart:
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     minoff = P_FIRSTDATAKEY(opaque);
     maxoff = PageGetMaxOffsetNumber(page);
@@ -1864,6 +1866,19 @@ _bt_killitems(IndexScanDesc scan)
              */
             if (killtuple && !ItemIdIsDead(iid))
             {
+                if (!have_exclusive && XLogHintBitIsNeeded())
+                {
+                    /*
+                     * If wal_log_hints is enabled, we only want to do this
+                     * with an exclusive lock, so exchange our lock and
+                     * restart from the top.
+                     */
+                    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+                    LockBuffer(so->currPos.buf, BT_WRITE);
+                    have_exclusive = true;
+                    goto restart;
+                }
+
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give
> up their share lock to wait for the exclusive lock they need. So the
> one that gets it first will do all the killing, log the page, then
> release the lock. At that point the other processes will wake up and
> see that items have been killed, so they will return having done
> nothing.
>
> Like the attached. I didn't verify that it works well or that it
> actually improves performance ...
This is not related to your latest patch, but I believe I can improve
the performance. Consider the case where:
1. killedsomething stays false;
2. killtuple is true, but (!ItemIdIsDead(iid)) is false;
3. so nothing is done.
Why do all the work and then discard it? We can eliminate the current
item much earlier by testing whether it is already dead.
regards,
Ranier VIlela
Attachments:
avoid_killing_btree_items_aready_dead.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..ee7e8edf58 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1776,6 +1776,7 @@ _bt_killitems(IndexScanDesc scan)
         int         itemIndex = so->killedItems[i];
         BTScanPosItem *kitem = &so->currPos.items[itemIndex];
         OffsetNumber offnum = kitem->indexOffset;
+        OffsetNumber current_offnum PG_USED_FOR_ASSERTS_ONLY;
 
         Assert(itemIndex >= so->currPos.firstItem &&
                itemIndex <= so->currPos.lastItem);
@@ -1784,9 +1785,19 @@ _bt_killitems(IndexScanDesc scan)
         while (offnum <= maxoff)
         {
             ItemId      iid = PageGetItemId(page, offnum);
-            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);
-            bool        killtuple = false;
+            IndexTuple  ituple;
+            bool        killtuple;
+
+            /* Save the current offnum and advance, avoiding calling OffsetNumberNext twice */
+            current_offnum = offnum;
+            offnum = OffsetNumberNext(offnum);
+
+            /* Item is already dead, go to the next one */
+            if (ItemIdIsDead(iid))
+                continue;
+            killtuple = false;
+            ituple = (IndexTuple) PageGetItem(page, iid);
             if (BTreeTupleIsPosting(ituple))
             {
                 int         pi = i + 1;
@@ -1819,7 +1830,7 @@ _bt_killitems(IndexScanDesc scan)
              * though only in the common case where the page can't
              * have been concurrently modified
              */
-            Assert(kitem->indexOffset == offnum || !droppedpin);
+            Assert(kitem->indexOffset == current_offnum || !droppedpin);
 
             /*
              * Read-ahead to later kitems here.
@@ -1862,14 +1873,13 @@ _bt_killitems(IndexScanDesc scan)
              * set to WAL (if wal_log_hints or data checksums are enabled),
              * which is undesirable.
              */
-            if (killtuple && !ItemIdIsDead(iid))
+            if (killtuple)
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
                 break;          /* out of inner search loop */
             }
-            offnum = OffsetNumberNext(offnum);
         }
     }
On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give
> up their share lock to wait for the exclusive lock they need. So the
> one that gets it first will do all the killing, log the page, then
> release the lock. At that point the other processes will wake up and
> see that items have been killed, so they will return having done
> nothing.
Regarding the lock, I disagree in part: in the worst case the
exclusive lock may be requested only for the last item analyzed,
redoing all the work from the beginning. If we are in _bt_killitems,
it is because there is a high probability that there are items to be
deleted, so why not request the lock right away, if these conditions
are met?
1. XLogHintBitIsNeeded()
2. !AutoVacuumingActive()
3. A new configuration variable option to activate the exclusive lock?
Masahiko reported that it occurs only with (autovacuum_enabled = off).
regards,
Ranier Vilela
Attachments:
avoid_killing_btree_items_already_dead_v2.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..e1b1110bf1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1767,6 +1767,15 @@ _bt_killitems(IndexScanDesc scan)
         }
     }
 
+    if (XLogHintBitIsNeeded() && !AutoVacuumingActive())
+    {
+        /*
+         * If wal_log_hints is enabled, we only want to do this
+         * with an exclusive lock.
+         */
+        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+        LockBuffer(so->currPos.buf, BT_WRITE);
+    }
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     minoff = P_FIRSTDATAKEY(opaque);
     maxoff = PageGetMaxOffsetNumber(page);
@@ -1776,6 +1785,7 @@ _bt_killitems(IndexScanDesc scan)
         int         itemIndex = so->killedItems[i];
         BTScanPosItem *kitem = &so->currPos.items[itemIndex];
         OffsetNumber offnum = kitem->indexOffset;
+        OffsetNumber current_offnum PG_USED_FOR_ASSERTS_ONLY;
 
         Assert(itemIndex >= so->currPos.firstItem &&
                itemIndex <= so->currPos.lastItem);
@@ -1784,9 +1794,19 @@ _bt_killitems(IndexScanDesc scan)
         while (offnum <= maxoff)
         {
             ItemId      iid = PageGetItemId(page, offnum);
-            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);
-            bool        killtuple = false;
+            IndexTuple  ituple;
+            bool        killtuple;
+
+            /* Save the current offnum and advance, avoiding calling OffsetNumberNext twice */
+            current_offnum = offnum;
+            offnum = OffsetNumberNext(offnum);
+
+            /* Item is already dead, go to the next one */
+            if (ItemIdIsDead(iid))
+                continue;
+            killtuple = false;
+            ituple = (IndexTuple) PageGetItem(page, iid);
             if (BTreeTupleIsPosting(ituple))
             {
                 int         pi = i + 1;
@@ -1819,7 +1839,7 @@ _bt_killitems(IndexScanDesc scan)
              * though only in the common case where the page can't
              * have been concurrently modified
              */
-            Assert(kitem->indexOffset == offnum || !droppedpin);
+            Assert(kitem->indexOffset == current_offnum || !droppedpin);
 
             /*
              * Read-ahead to later kitems here.
@@ -1862,14 +1882,13 @@ _bt_killitems(IndexScanDesc scan)
              * set to WAL (if wal_log_hints or data checksums are enabled),
              * which is undesirable.
              */
-            if (killtuple && !ItemIdIsDead(iid))
+            if (killtuple)
             {
                 /* found the item/all posting list items */
                 ItemIdMarkDead(iid);
                 killedsomething = true;
                 break;          /* out of inner search loop */
             }
-            offnum = OffsetNumberNext(offnum);
         }
     }