Issues with hash and GiST LP_DEAD setting for kill_prior_tuple
Both hash and GiST indexes set LP_DEAD bits for kill_prior_tuple,
using an approach based on that of nbtree. hash gained this ability in
commit 6977b8b7f4, while GiST gained it in commit 013ebc0a7b.
gistkillitems() and _hash_kill_items() both have similar bugs:
* gistkillitems() correctly checks whether the page's LSN has changed
in the period between when we initially read the leaf page and
when/after we accessed the heap. But (unlike nbtree) it fails to
account for unlogged relations, where the LSN won't ever change, even
when there was unsafe concurrent heap TID recycling by VACUUM.
This bug is likely related to the way that GiST and SP-GiST never hold
on to buffer pins on leaf pages. There are other known bugs [1][2]
where index-only scans are broken for the same reason -- we also need
to hold on to a pin to make those safe. (Using an MVCC snapshot is
necessary but not sufficient to make both of these things safe. This
isn't acknowledged by the amgettuple sgml docs, but is nevertheless
true.)
* _hash_kill_items makes roughly the opposite mistake: while hash
might hang on to "leaf" page buffer pins (which makes it okay to set
LP_DEAD bits even when the page has been modified), there appear to
be cases where we don't hang on to a pin. When we don't, we just
acquire the pin within _hash_kill_items for ourselves. This is wrong:
it neglects to compare a stashed page LSN against the leaf page's
now-current LSN in those cases where no pin was held throughout, which
is what nbtree does (that's the only way we can safely drop a pin
eagerly like this). There is no stashed LSN to compare in
HashScanPosData. (The nbtree rule is sketched just below.)
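For comparison, the rule that nbtree follows in _bt_killitems() looks
roughly like this. This is a simplified sketch from memory, not the
actual source; the point is the shape of the check:

    if (!so->dropPin)
    {
        /*
         * Pin was held since the page was read: VACUUM cannot have
         * recycled any of our TIDs, so no LSN check is needed.
         */
        buf = so->currPos.buf;
        _bt_lockbuf(rel, buf, BT_READ);
    }
    else
    {
        /*
         * Pin was dropped eagerly: re-read the page, and only set
         * LP_DEAD bits if its LSN still matches the stashed LSN.
         */
        buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
        if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
        {
            /* Page was modified; TIDs may have been recycled. Give up. */
            _bt_relbuf(rel, buf);
            return;
        }
    }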
FWIW _hash_readpage has a comment about a stashed LSN, so it seems as
if this was barely missed by the work on hash indexes around 2017:
        /*
         * Remember next and previous block numbers for scrollable
         * cursors to know the start position and return false
         * indicating that no more matching tuples were found. Also,
         * don't reset currPage or lsn, because we expect
         * _hash_kill_items to be called for the old page after this
         * function returns.
         */
        so->currPos.prevPage = prev_blkno;
        so->currPos.nextPage = InvalidBlockNumber;
        so->currPos.buf = buf;
        return false;
* Minor issue: aren't the calls to _hash_kill_items that happen in
_hash_readpage() useless/impossible to reach? I'm referring to this:
    /*
     * Could not find any matching tuples in the current page, move to
     * the next page. Before leaving the current page, deal with any
     * killed items.
     */
    if (so->numKilled > 0)
        _hash_kill_items(scan);
If we couldn't find any matching tuples, then there won't have been
any visibility checks of any tuple/TID. So how can we possibly have
any LP_DEAD items to set in passing?
Background info: I looked into this kill_prior_tuple infrastructure as
part of work that will move most of the logic shared by index AMs that
implement amgettuple out of individual index AMs, and into indexam.c
[3]. That work should make these kinds of problems impossible.
[1]: https://commitfest.postgresql.org/patch/5721/
[2]: https://commitfest.postgresql.org/patch/5542/
[3]: /messages/by-id/CAH2-WzmVvU2KmKyq8sUeYXrc_roA5LfOgDE1-vdtorpk_M3DfA@mail.gmail.com
--
Peter Geoghegan
On Tue, Jul 15, 2025 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * gistkillitems() correctly checks whether the page's LSN has changed
> in the period between when we initially read the leaf page and
> when/after we accessed the heap. But (unlike nbtree) it fails to
> account for unlogged relations, where the LSN won't ever change, even
> when there was unsafe concurrent heap TID recycling by VACUUM.
On second thought, I think that GiST is okay here: the gistGetFakeLSN
stuff (added by commit 2edc5cd493 and improved in commit 62401db45c)
is sufficient to make the approach taken by gistkillitems() correct.
All modifications to GiST pages that would be WAL-logged with a logged
relation instead get fake LSNs with an unlogged relation, which is
sufficient for gistkillitems' purposes.
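To spell out why: gistGetFakeLSN() guarantees that the page LSN still
advances on every write, even without WAL. Roughly (paraphrasing
gistutil.c from memory, with the permanent-but-not-yet-logged branch
elided):

    XLogRecPtr
    gistGetFakeLSN(Relation rel)
    {
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        {
            /* temp relations are backend-local; a static counter will do */
            static XLogRecPtr counter = FirstNormalUnloggedLSN;

            return counter++;
        }

        /* (case of a permanent relation being built without WAL elided) */

        /*
         * Unlogged relations use the shared fake-LSN counter, which is
         * advanced on every call, so the page LSN changes on every write.
         */
        return GetFakeLSNForUnloggedRel();
    }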
Aside: Obviously, we don't use gistGetFakeLSN in nbtree. That's why we
currently cannot set "BTScanOpaqueData.dropPin = true" with an
unlogged relation. I wonder if it would make sense to make nbtree
follow gist here. We could generalize gistGetFakeLSN, and use it in
nbtree, too. That way, we could remove the
"RelationNeedsWAL(scan->indexRelation)" special case when setting
BTScanOpaqueData.dropPin at the start of each scan. This approach is
arguably simpler, and comes with a performance benefit: nbtree
wouldn't block progress by VACUUM when scanning unlogged relations by
holding on to its pin across amgettuple calls, exactly like with
logged relations. In general, I see no reason why amgettuple index AMs
shouldn't all take exactly the same approach to holding on to/dropping
leaf page buffer pins.
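Concretely, the condition set at the start of each btrescan() currently
looks something like this (paraphrased from memory, not the exact
source); the idea would be to remove the RelationNeedsWAL() test once
unlogged nbtree indexes can rely on fake LSNs:

    so->dropPin = (!scan->xs_want_itup &&
                   IsMVCCSnapshot(scan->xs_snapshot) &&
                   RelationNeedsWAL(scan->indexRelation) &&
                   scan->heapRelation != NULL);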
> * _hash_kill_items makes roughly the opposite mistake: while hash
> might hang on to "leaf" page buffer pins (which makes it okay to set
> LP_DEAD bits even when the page has been modified), there appear to
> be cases where we don't hang on to a pin. When we don't, we just
> acquire the pin within _hash_kill_items for ourselves. This is wrong:
> it neglects to compare a stashed page LSN against the leaf page's
> now-current LSN in those cases where no pin was held throughout, which
> is what nbtree does (that's the only way we can safely drop a pin
> eagerly like this). There is no stashed LSN to compare in
> HashScanPosData.
Still no reason to doubt that _hash_kill_items is faulty, though.
--
Peter Geoghegan
Hello, Peter!
> FWIW _hash_readpage has a comment about a stashed LSN, so it seems as
> if this was barely missed by the work on hash indexes around 2017:
I think commit 22c5e735 [0] (Remove lsn from HashScanPosData) is the
thing you are looking for in relation to hash.
Best regards,
Mikhail.
[0]: https://github.com/postgres/postgres/commit/22c5e73562c53437979efec4c26cd9fff408777c
On Thu, Jul 17, 2025 at 7:27 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
>> FWIW _hash_readpage has a comment about a stashed LSN, so it seems as
>> if this was barely missed by the work on hash indexes around 2017:
> I think commit 22c5e735 [0] (Remove lsn from HashScanPosData) is the
> thing you are looking for in relation to hash.
Oh yeah, good catch.
I wonder if this is already correct. It definitely *looks* wrong,
because _hash_kill_items() is prepared to re-pin and then LP_DEAD-mark
a page, without the LSN check that would make that safe. But it seems
possible that that never actually happens -- maybe the
HashScanPosIsPinned() check will invariably find that the page is
already pinned (in which case commit 22c5e735 merely neglected to
remove some dead code from _hash_kill_items, but was otherwise
correct).
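For context, the branch in question looks roughly like this (a
simplified paraphrase of _hash_kill_items(), not the exact source):

    if (HashScanPosIsPinned(so->currPos))
    {
        /* We never dropped the pin; just lock the buffer we already hold. */
        havePin = true;
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else
    {
        /*
         * No pin held: re-read the page. Note the absence of any check
         * that the page is unchanged (there is no stashed LSN to compare).
         */
        buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ,
                           LH_OVERFLOW_PAGE);
    }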
Do you happen to recall anything about this, Robert and Amit?
FWIW, I'm asking about this now because I want to switch index AMs
that use amgettuple over to a new amgetbatch interface that enables
index prefetching, with prefetch windows that can span multiple leaf
pages. This involves moving management of buffer pins out of affected
index AMs, and into a new layer that will decide when to release each
index page buffer pin (and call _hash_kill_items-like routines) based
on its own criteria.
--
Peter Geoghegan
On Mon, Jul 21, 2025 at 9:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jul 17, 2025 at 7:27 PM Mihail Nikalayeu
> <mihailnikalayeu@gmail.com> wrote:
>>> FWIW _hash_readpage has a comment about a stashed LSN, so it seems as
>>> if this was barely missed by the work on hash indexes around 2017:
>> I think commit 22c5e735 [0] (Remove lsn from HashScanPosData) is the
>> thing you are looking for in relation to hash.
>
> Oh yeah, good catch.
>
> I wonder if this is already correct. It definitely *looks* wrong,
> because _hash_kill_items() is prepared to re-pin and then LP_DEAD-mark
> a page, without the LSN check that would make that safe. But it seems
> possible that that never actually happens -- maybe the
> HashScanPosIsPinned() check will invariably find that the page is
> already pinned (in which case commit 22c5e735 merely neglected to
> remove some dead code from _hash_kill_items, but was otherwise
> correct).
As per my understanding, during a scan we retain the pin on the primary
bucket page and release the lock, whereas for overflow pages we neither
retain the pin nor the lock. However, IIUC, this is still safe because
of the interlocking provided by the vacuum scan. See the comments atop
_hash_kill_items() [There are never any scans active in a bucket at
the time VACUUM begins ...]. I think we only need to worry about
vacuum processing the page after we have collected the killed items,
which shouldn't happen, as explained in the comments atop
_hash_kill_items().
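If that reasoning holds, then the unpinned path is implicitly relying
on the pin that the scan still holds on the primary bucket page. A
purely illustrative way to state that assumption (hypothetical, not a
proposed patch) would be an assertion in the unpinned branch:

    /* In the branch where no overflow-page pin is held: */
    Assert(BufferIsValid(so->hashso_bucket_buf));   /* bucket pin interlocks
                                                     * against VACUUM */
    buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ,
                       LH_OVERFLOW_PAGE);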
--
With Regards,
Amit Kapila.