Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
The code in gistvacuum.c is closely based on similar code in nbtree.c,
except that it only acquires an exclusive lock -- not a
super-exclusive lock. I suspect that that's because it seemed
unnecessary; nbtree plain index scans have their own special reasons
for this, that don't apply to GiST. Namely: nbtree plain index scans
that don't use an MVCC snapshot clearly need some other interlock to
protect against concurrent recycling of pointed-to-by-leaf-page TIDs.
And so as a general rule nbtree VACUUM needs a full
super-exclusive/cleanup lock, just in case there is a plain index scan
that uses some other kind of snapshot (logical replication, say).
To say the same thing another way: nbtree follows "the third rule"
described by "62.4. Index Locking Considerations" in the docs [1]https://www.postgresql.org/docs/devel/index-locking.html -- Peter Geoghegan, but
GiST does not. The idea that GiST's behavior is okay here does seem
consistent with what the same docs go on to say about it: "When using
an MVCC-compliant snapshot, there is no problem because the new
occupant of the slot is certain to be too new to pass the snapshot
test".
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
The (admittedly undocumented) reason why we can never drop the leaf
page pin for an index-only scan in nbtree (but can do so for plain
index scans) also relates to heap interlocking -- but with a twist.
Here's the twist: the second heap pass by VACUUM can set visibility
map bits independently of the first (once LP_DEAD items from the first
pass over the heap are set to LP_UNUSED, which renders the page
all-visible) -- this all happens at the end of
lazy_vacuum_heap_page(). That's why index-only scans can't just assume
that VACUUM won't have deleted the TID from the leaf page they're
scanning immediately after they're done reading it. VACUUM could even
manage to set the visibility map bit for a relevant heap page inside
lazy_vacuum_heap_page(), before the index-only scan can read the
visibility map. If that is allowed to happen, the index-only scan would
give wrong answers if one of the TID references held in local memory
by the index-only scan happens to be marked LP_UNUSED inside
lazy_vacuum_heap_page(). IOW, it looks like we run the risk of a
concurrently recycled dead-to-everybody TID becoming visible during
GiST index-only scans, just because we have no interlock.
In summary:
IIUC this is only safe for nbtree because 1.) It acquires a
super-exclusive lock when vacuuming leaf pages, and 2.) Index-only
scans never drop their pin on the leaf page when accessing the
visibility map "in-sync" with the scan (of course we hope not to
access the heap proper at all for index-only scans). These precautions
are both necessary to make the race condition I describe impossible,
because they ensure that VACUUM cannot reach lazy_vacuum_heap_page()
until after our index-only scan reads the visibility map (and then has
to read the heap for at least that one dead-to-all TID, discovering
that the TID is dead to its snapshot). Why wouldn't GiST need to take
the same precautions, though?
[1]: https://www.postgresql.org/docs/devel/index-locking.html
--
Peter Geoghegan
04.11.2021, 04:33, "Peter Geoghegan" <pg@bowt.ie>:
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
Let's enumerate steps how things can go wrong.
Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
Backend2: Remove index_tuple1 and heap_tuple1
Backend3: Mark page1 all-visible
Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do LockBufferForCleanup(index_page1). Do I get things right?
Best regards, Andrey Borodin.
On Thu, Nov 4, 2021 at 8:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Let's enumerate steps how things can go wrong.
Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
Backend2: Remove index_tuple1 and heap_tuple1
Backend3: Mark page1 all-visible
Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do LockBufferForCleanup(index_page1). Do I get things right?
Almost. Backend3 is actually Backend2 here (there is no 3) -- it runs
VACUUM throughout.
Note that it's not particularly likely that Backend2/VACUUM will "win"
this race, because it typically has to do much more work than
Backend1. It has to actually remove the index tuples from the leaf
page, then any other index work (for this and other indexes). Then it
has to arrive back in vacuumlazy.c to set the VM bit in
lazy_vacuum_heap_page(). That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
--
Peter Geoghegan
4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.
Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer too?
Best regards, Andrey Borodin.
On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.
I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.
Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer too?
No, because there is no heap vacuuming involved (because that doesn't
happen outside vacuumlazy.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work, and index-only scans. Making LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.
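Put in terms of the line pointer macros (just an illustrative contrast, with made-up page/offset variables):

/* Microvacuum, e.g. nbtree's _bt_killitems(): marks *index* line pointers
 * dead, under only a shared lock.  Heap line pointers are untouched, so
 * no heap TID can be recycled here -- no cleanup lock needed. */
ItemIdMarkDead(PageGetItemId(index_page, index_offnum));

/* VACUUM's second heap pass (lazy_vacuum_heap_page()): the only place a
 * dead heap TID actually becomes reusable, and the same call can then go
 * on to set the page's all-visible bit. */
ItemIdSetUnused(PageGetItemId(heap_page, heap_offnum));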
I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated. But
AFAICT it's always some kind of "interlock with heapam VACUUM" issue,
with TID recycling, with no protection from our MVCC snapshot. I would
say that that's the "real problem" here, when I get to first principles.
Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.
We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan
Attachments:
v1-0001-nbtree-README-Improve-VACUUM-interlock-section.patch
From ea6612300e010f1f2b02119b5a0de95e46d1157d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 3 Nov 2021 14:38:15 -0700
Subject: [PATCH v1] nbtree README: Improve VACUUM interlock section.
Also document a related issue for index-only scans in vacuumlazy.c.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 10 ++
src/backend/access/nbtree/README | 145 ++++++++++++---------------
2 files changed, 75 insertions(+), 80 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 282b44f87..8bfe196bf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2384,6 +2384,16 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD items on the page that were determined to be LP_DEAD items back
* when the same page was visited by lazy_scan_prune() (i.e. those whose TID
* was recorded in the dead_items array at the time).
+ *
+ * We can opportunistically set the visibility map bit for the page here,
+ * which is valuable when lazy_scan_prune couldn't earlier on, owing only to
+ * the fact that there were LP_DEAD items that we'll now mark as unused. This
+ * is why index AMs that support index-only scans have to hold a pin on an
+ * index page as an interlock against VACUUM while accessing the visibility
+ * map (which is really just a dense summary of visibility information in the
+ * heap). If they didn't do this then there would be rare race conditions
+ * where a heap TID that is actually dead appears alive to an unlucky
+ * index-only scan.
*/
static int
lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2a7332d07..c6f04d856 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -89,25 +89,28 @@ Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
-not holding any page lock within the index. We do continue to hold a pin
-on the leaf page in some circumstances, to protect against concurrent
-deletions (see below). In this state the scan is effectively stopped
-"between" pages, either before or after the page it has pinned. This is
-safe in the presence of concurrent insertions and even page splits, because
-items are never moved across pre-existing page boundaries --- so the scan
-cannot miss any items it should have seen, nor accidentally return the same
-item twice. The scan must remember the page's right-link at the time it
-was scanned, since that is the page to move right to; if we move right to
-the current right-link then we'd re-scan any items moved by a page split.
-We don't similarly remember the left-link, since it's best to use the most
-up-to-date left-link when trying to move left (see detailed move-left
-algorithm below).
+not holding any page lock within the index. Plain indexscans can opt to
+hold a pin on the leaf page, to protect against concurrent heap TID
+recycling by VACUUM, but that has nothing to do with the B-Tree physical
+data structure itself. See "VACUUM's superexclusive lock" section below
+for more information.
-In most cases we release our lock and pin on a page before attempting
-to acquire pin and lock on the page we are moving to. In a few places
-it is necessary to lock the next page before releasing the current one.
-This is safe when moving right or up, but not when moving left or down
-(else we'd create the possibility of deadlocks).
+When an index scan finishes processing a leaf page, it has effectively
+stopped "between" pages. This is safe in the presence of concurrent
+insertions and even page splits, because items are never moved across
+pre-existing page boundaries --- so the scan cannot miss any items it
+should have seen, nor accidentally return the same item twice. The scan
+must remember the page's right-link at the time it is read for all this
+to work, since that is the page to visit next if the scan needs to
+continue. It's more complicated with backwards scans, though -- see
+section below.
+
+In most cases we release our lock on a page before attempting to acquire
+a lock on the sibling page we are moving to. In a few places (reached
+only during inserts or VACUUM) it might be necessary to lock the next
+page before releasing the lock on the current page. This is safe when
+moving right or up, but not when moving left or down (else we'd create
+the possibility of deadlocks).
Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
@@ -163,56 +166,44 @@ pages (though suffix truncation is also considered). Note we must include
the incoming item in this calculation, otherwise it is possible to find
that the incoming item doesn't fit on the split page where it needs to go!
-Deleting index tuples during VACUUM
------------------------------------
+VACUUM's superexclusive lock, unsafe concurrent heap TID recycling
+--------------------------------------------------------------------
-Before deleting a leaf item, we get a super-exclusive lock on the target
-page, so that no other backend has a pin on the page when the deletion
-starts. This is not necessary for correctness in terms of the btree index
-operations themselves; as explained above, index scans logically stop
-"between" pages and so can't lose their place. The reason we do it is to
-provide an interlock between VACUUM and indexscans. Since VACUUM deletes
-index entries before reclaiming heap tuple line pointers, the
-super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
-line pointer that an indexscanning process might be about to visit. This
-guarantee works only for simple indexscans that visit the heap in sync
-with the index scan, not for bitmap scans. We only need the guarantee
-when using non-MVCC snapshot rules; when using an MVCC snapshot, it
-doesn't matter if the heap tuple is replaced with an unrelated tuple at
-the same TID, because the new tuple won't be visible to our scan anyway.
-Therefore, a scan using an MVCC snapshot which has no other confounding
-factors will not hold the pin after the page contents are read. The
-current reasons for exceptions, where a pin is still needed, are if the
-index is not WAL-logged or if the scan is an index-only scan. If later
-work allows the pin to be dropped for all cases we will be able to
-simplify the vacuum code, since the concept of a super-exclusive lock
-for btree indexes will no longer be needed.
+Before deleting items from leaf pages, VACUUM gets a super-exclusive
+lock on the target page, so that no other backend has a pin on the page
+when the deletion starts. This is not necessary for correctness in
+terms of the btree index operations themselves; as explained above,
+index scans logically stop "between" pages and so can't lose their
+place. It's necessary to avoid unsafe concurrent recycling of heap
+TIDs.
-Because a pin is not always held, and a page can be split even while
-someone does hold a pin on it, it is possible that an indexscan will
-return items that are no longer stored on the page it has a pin on, but
-rather somewhere to the right of that page. To ensure that VACUUM can't
-prematurely remove such heap tuples, we require btbulkdelete to obtain a
-super-exclusive lock on every leaf page in the index, even pages that
-don't contain any deletable tuples. Any scan which could yield incorrect
-results if the tuple at a TID matching the scan's range and filter
-conditions were replaced by a different tuple while the scan is in
-progress must hold the pin on each index page until all index entries read
-from the page have been processed. This guarantees that the btbulkdelete
-call cannot return while any indexscan is still holding a copy of a
-deleted index tuple if the scan could be confused by that. Note that this
-requirement does not say that btbulkdelete must visit the pages in any
-particular order. (See also simple deletion and bottom-up deletion,
-below.)
+Requiring superexclusive locks in nbtree VACUUM enables interlocking
+between heap vacuuming (where VACUUM recycles heap TIDs) and index scans
+that visit the heap "in sync". Since VACUUM (but not simple deletion or
+bottom-up deletion) always removes index tuples before recycling heap
+line pointers (after btbulkdelete returns, during the second pass over
+the heap), the super-exclusive lock guarantees that VACUUM can't
+concurrently recycle a heap TID that a plain index scan might still need
+to visit. Note that VACUUM must do this for every leaf page, not just
+those that are known to have index tuples that must be removed
+(optimizing away the super-exclusive lock would be wrong, since we're
+not concerned about the physical leaf page itself; the scan has already
+finished reading the leaf page at the point that it begins visiting its
+heap TIDs).
-There is no such interlocking for deletion of items in internal pages,
-since backends keep no lock nor pin on a page they have descended past.
-Hence, when a backend is ascending the tree using its stack, it must
-be prepared for the possibility that the item it wants is to the left of
-the recorded position (but it can't have moved left out of the recorded
-page). Since we hold a lock on the lower page (per L&Y) until we have
-re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.
+The interlock only applies when an index scan opts-in by holding on to a
+pin on each just-read leaf page (until the scan is done visiting TIDs
+found on the leaf page in the heap). Most index scans just drop the pin
+instead, which is generally preferable because it avoids unnecessarily
+blocking index vacuuming. Only index scans using non-MVCC snapshots
+really need the interlock, because they cannot just depend on MVCC rules
+to avoid returning unrelated heap tuples that happened to reuse the
+original heap line pointer. (Actually, certain implementation
+restrictions that affect the kill_prior_tuple/LP_DEAD optimization also
+affect whether or not we hold a pin here, even with an MVCC snapshot;
+see simple deletion section below. This doesn't change the fact that
+holding on to a pin is fundamentally optional for index scans that use
+an MVCC snapshot.)
VACUUM's linear scan, concurrent page splits
--------------------------------------------
@@ -518,21 +509,15 @@ that's required for the deletion process to perform granular removal of
groups of dead TIDs from posting list tuples (without the situation ever
being allowed to get out of hand).
-It's sufficient to have an exclusive lock on the index page, not a
-super-exclusive lock, to do deletion of LP_DEAD items. It might seem
-that this breaks the interlock between VACUUM and indexscans, but that is
-not so: as long as an indexscanning process has a pin on the page where
-the index item used to be, VACUUM cannot complete its btbulkdelete scan
-and so cannot remove the heap tuple. This is another reason why
-btbulkdelete has to get a super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
-
-LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
-it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
-drops its pin in the meantime. It must conservatively also remember the
-LSN of the page, and only act to set LP_DEAD bits when the LSN has not
-changed at all. (Avoiding dropping the pin entirely also makes it safe, of
-course.)
+LP_DEAD setting by index scans (via the kill_prior_tuple optimization)
+cannot be sure that a TID whose index tuple it had planned on
+LP_DEAD-setting has not been recycled by VACUUM if it drops its pin in
+the meantime. It must conservatively also remember the LSN of the page,
+and only act to set LP_DEAD bits when the LSN has not changed at all.
+Avoiding dropping the pin entirely makes it safe even when the LSN has
+changed (see related discussion about VACUUM's superexclusive lock
+above), but in practice most index scans opt to not hold onto a pin, to
+avoid blocking VACUUM.
Bottom-Up deletion
------------------
--
2.30.2
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated.
Minor correction: we actually also have to worry about plain index
scans that don't use an MVCC snapshot, which is possible within
nbtree. It's quite likely when using logical replication, actually.
See the patch for more.
Like with the index-only scan case, a non-MVCC snapshot + plain nbtree
index scan cannot rely on heap access within the index scan node -- it
won't reliably notice that any newer heap tuples (that are really the
result of concurrent TID recycling) are not actually visible to its
MVCC snapshot -- because there isn't an MVCC snapshot. The only
difference in the index-only scan scenario is that we use the
visibility map (not the heap) -- which is racey in a way that makes
our MVCC snapshot (IOSs always have an MVCC snapshot) an ineffective
protection.
In summary, to be safe against confusion from concurrent TID recycling
during index/index-only scans, we can do either of the following
things:
1. Hold a pin of our leaf page while accessing the heap -- that'll
definitely conflict with the cleanup lock that nbtree VACUUM will
inevitably try to acquire on our leaf page.
OR:
2. Hold an MVCC snapshot, AND do an actual heap page access during the
plain index scan -- do both together.
With approach 2, our plain index scan must determine visibility using
real XIDs (against something like a dirty snapshot), rather than using
a visibility map bit. That is also necessary because the VM might
become invalid or ambiguous, in a way that's clearly not possible when
looking at full heap tuple headers with XIDs -- concurrent recycling
becomes safe if we know that we'll reliably notice it and not give
wrong answers.
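Concretely, the contrast might look something like this (an illustrative sketch only; the variable names are made up, but VM_ALL_VISIBLE() and table_index_fetch_tuple() are the real interfaces involved):

/*
 * Approach 1 (index-only scan): consult the VM while still holding a pin
 * on the leaf page the TID came from.  The pin blocks VACUUM's cleanup
 * lock, so VACUUM cannot reach its second heap pass (and set the VM bit)
 * until we're done with this page's TIDs.
 */
if (VM_ALL_VISIBLE(heaprel, ItemPointerGetBlockNumber(&tid), &vmbuffer))
    return stored_index_tuple;      /* heap never visited */

/*
 * Approach 2 (plain index scan with an MVCC snapshot): actually fetch the
 * heap tuple and test visibility against real XIDs.  If the TID was
 * concurrently recycled, the replacement tuple is certain to be too new
 * for our snapshot, so we notice -- no pin interlock required.
 */
found = table_index_fetch_tuple(heapfetch, &tid, snapshot, slot,
                                &call_again, &all_dead);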
Does that make sense?
--
Peter Geoghegan
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.
I pushed a commit that described how this works for nbtree, in the README file.
I think that there might be an even more subtle race condition in
nbtree itself, though, during recovery. We no longer do a "pin scan"
during recovery these days (see commits 9f83468b, 3e4b7d87, and
687f2cd7 for full information). I think that it might be necessary to
do that, just for the benefit of index-only scans -- if it's necessary
during original execution, then why not during recovery?
The work to remove "pin scans" was justified by pointing out that we
no longer use various kinds of snapshots during recovery, but it said
nothing about index-only scans, which need the TID recycling interlock
(i.e. need to hold onto a leaf page while accessing the heap in sync)
even with an MVCC snapshot. It's easy to imagine how it might have
been missed: nobody ever documented the general issue with index-only
scans, until now. Commit 2ed5b87f recognized they were unsafe for the
optimization that it added (to avoid blocking VACUUM), but never
explained why they were unsafe.
Going back to doing pin scans during recovery seems deeply
unappealing, especially to fix a totally narrow race condition.
--
Peter Geoghegan
On Wed, Nov 3, 2021 at 7:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
(Reviving this old thread after 3 years)
I was reminded of this old thread during today's discussion of a
tangentially related TID-recycle-safety bug that affects bitmap index
scans that use the visibility map as an optimization [1]. Turns out I
was right to be concerned.
This GiST bug is causally unrelated to that other bug, so I thought it
would be helpful to move discussion of the GiST bug to this old
thread.
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
Offhand, I think that the only way to fix this is to bring GiST in
line with nbtree: we ought to teach GiST VACUUM to start acquiring
cleanup locks (previously known as super-exclusive locks), and then
teach GiST index-only scans to hold onto a leaf page buffer pin as the
visibility map (or the heap proper) is accessed for the TIDs to be
returned from the leaf page. Arguably, GiST isn't obeying the current
contract for amgettuple index AMs at all right now (whether or not it
violates the contract as written is open to interpretation, I suppose,
but either way the current behavior is wrong).
We probably shouldn't hold onto a buffer pin during plain GiST
index scans, though -- plain GiST index scans *aren't* broken,
and so we should change as little as possible there. More concretely,
we should probably copy more nbtree scan related code into GiST to
deal with all this: we could copy nbtree's _bt_drop_lock_and_maybe_pin
into GiST to fix this bug, while avoiding changing the performance
characteristics of GiST plain index scans. This will also entail
adding a new buffer field to GiST's GISTScanOpaqueData struct --
something similar to nbtree's BTScanOpaqueData.currPos.buf field
(it'll go next to the current GISTScanOpaqueData.curBlkno field, just
like the nbtree equivalent goes next to its own currPage blkno field).
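Roughly, I'm thinking of something like the following (the names here are hypothetical -- a sketch of what a GiST equivalent of nbtree's behavior might look like, not code from any patch):

/* Hypothetical addition to GISTScanOpaqueData, next to curBlkno: */
Buffer      curBuf;     /* pinned leaf buffer during index-only scans,
                         * InvalidBuffer otherwise */

/*
 * Hypothetical helper, modeled on nbtree's _bt_drop_lock_and_maybe_pin():
 * always drop the lock once the leaf page's matches are copied into local
 * memory, but keep the pin for index-only scans, so that VACUUM's cleanup
 * lock has to wait until we've consulted the VM for those TIDs.
 */
static void
gist_drop_lock_and_maybe_pin(IndexScanDesc scan, Buffer buf)
{
    GISTScanOpaque so = (GISTScanOpaque) scan->opaque;

    LockBuffer(buf, GIST_UNLOCK);

    if (!scan->xs_want_itup)
    {
        /* plain index scan: don't block VACUUM, drop the pin as before */
        ReleaseBuffer(buf);
        so->curBuf = InvalidBuffer;
    }
    else
        so->curBuf = buf;   /* released once all of this page's TIDs are returned */
}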
Long term, code like nbtree's _bt_drop_lock_and_maybe_pin should be
generalized and removed from every individual index AM, nbtree
included -- I think that the concepts generalize to every index AM
that supports amgettuple (the race condition in question has
essentially nothing to do with individual index AM requirements). I've
discussed this kind of approach with Tomas Vondra (CC'd) recently.
That's not going to be possible within the scope of a backpatchable
fix, though.
[1]: /messages/by-id/873c33c5-ef9e-41f6-80b2-2f5e11869f1c@garret.ru
[2]: /messages/by-id/CAH2-Wzm6gBqc99iEKO6540ynwpjOqWESt5yjg-bHbt0hc8DPsw@mail.gmail.com
--
Peter Geoghegan
Attachments:
v2-0001-isolationtester-showing-broken-index-only-scans-w.patch
From 0cb759784cbdfe34c6285df55079439e4004d454 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 2 Dec 2024 15:50:04 -0500
Subject: [PATCH v2] isolationtester showing broken index-only scans with GiST
---
contrib/btree_gist/.gitignore | 2 +
contrib/btree_gist/Makefile | 3 +
contrib/btree_gist/expected/btree_gist.out | 28 ++++++++
contrib/btree_gist/meson.build | 6 ++
contrib/btree_gist/specs/btree_gist.spec | 80 ++++++++++++++++++++++
5 files changed, 119 insertions(+)
create mode 100644 contrib/btree_gist/expected/btree_gist.out
create mode 100644 contrib/btree_gist/specs/btree_gist.spec
diff --git a/contrib/btree_gist/.gitignore b/contrib/btree_gist/.gitignore
index 5dcb3ff97..b4903eba6 100644
--- a/contrib/btree_gist/.gitignore
+++ b/contrib/btree_gist/.gitignore
@@ -1,4 +1,6 @@
# Generated subdirectories
/log/
/results/
+/output_iso/
/tmp_check/
+/tmp_check_iso/
diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 7ac2df26c..e8cdef227 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -42,6 +42,9 @@ REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
bytea bit varbit numeric uuid not_equal enum bool partitions \
stratnum without_overlaps
+ISOLATION = btree_gist
+ISOLATION_OPTS = --load-extension=btree_gist
+
SHLIB_LINK += $(filter -lm, $(LIBS))
ifdef USE_PGXS
diff --git a/contrib/btree_gist/expected/btree_gist.out b/contrib/btree_gist/expected/btree_gist.out
new file mode 100644
index 000000000..0668d318c
--- /dev/null
+++ b/contrib/btree_gist/expected/btree_gist.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/contrib/btree_gist/meson.build b/contrib/btree_gist/meson.build
index 73b1bbf52..51adf635e 100644
--- a/contrib/btree_gist/meson.build
+++ b/contrib/btree_gist/meson.build
@@ -94,4 +94,10 @@ tests += {
'without_overlaps',
],
},
+ 'isolation': {
+ 'specs': [
+ 'btree_gist',
+ ],
+ 'regress_args': ['--load-extension=btree_gist'],
+ },
}
diff --git a/contrib/btree_gist/specs/btree_gist.spec b/contrib/btree_gist/specs/btree_gist.spec
new file mode 100644
index 000000000..e48381e84
--- /dev/null
+++ b/contrib/btree_gist/specs/btree_gist.spec
@@ -0,0 +1,80 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.
I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
--
Peter Geoghegan
On Tue, 3 Dec 2024 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.
I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
I think I have a fix for GiST which can be found attached in patch 0002.
As to how this works: the patch tracks (for IOS) the pages for which
there are some entries yet to be returned by gistgettuple(), and keeps
a pin on those pages using a new AMscan tracking mechanism that
utilizes buffer refcounts. Even if it might not be a very elegant
solution and IMV still has rough edges, it works, and fixes the issue
with incorrect results from the GiST index.
One side effect of this change to keep pins in GiST-IOS is that
this could realistically keep pins on a huge portion of the index,
thus exhausting shared buffers and increasing prevalence of "no
unpinned buffers"-related errors.
I haven't looked much at SP-GiST yet, so I don't have anything for the
VACUUM+IOS bug there.
0001 is a slight modification of your v2-0001, a version which now
(critically) doesn't expect VACUUM to run to completion before
s1_fetch_all starts; this is important for 0002 as that causes vacuum
to block and wait for the cursor to return more tuples, which the
isolation tester doesn't (can't?) detect. With only 0001, the new test
fails with incorrect results; with 0002 applied, the test succeeds.
I'm looking forward to any feedback.
Kind regards,
Matthias van de Meent
Attachments:
v3-0001-isolationtester-showing-broken-index-only-scans-w.patch
From afa0310803bf72bdaccff265afbdecf26da88435 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 2 Dec 2024 15:50:04 -0500
Subject: [PATCH v3 1/2] isolationtester showing broken index-only scans with
GiST v3
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
---
contrib/btree_gist/.gitignore | 2 +
contrib/btree_gist/Makefile | 3 +
contrib/btree_gist/expected/btree_gist.out | 69 ++++++++++++
contrib/btree_gist/meson.build | 6 ++
contrib/btree_gist/specs/btree_gist.spec | 117 +++++++++++++++++++++
5 files changed, 197 insertions(+)
create mode 100644 contrib/btree_gist/expected/btree_gist.out
create mode 100644 contrib/btree_gist/specs/btree_gist.spec
diff --git a/contrib/btree_gist/.gitignore b/contrib/btree_gist/.gitignore
index 5dcb3ff9723..b4903eba657 100644
--- a/contrib/btree_gist/.gitignore
+++ b/contrib/btree_gist/.gitignore
@@ -1,4 +1,6 @@
# Generated subdirectories
/log/
/results/
+/output_iso/
/tmp_check/
+/tmp_check_iso/
diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 7ac2df26c10..e8cdef2277d 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -42,6 +42,9 @@ REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
bytea bit varbit numeric uuid not_equal enum bool partitions \
stratnum without_overlaps
+ISOLATION = btree_gist
+ISOLATION_OPTS = --load-extension=btree_gist
+
SHLIB_LINK += $(filter -lm, $(LIBS))
ifdef USE_PGXS
diff --git a/contrib/btree_gist/expected/btree_gist.out b/contrib/btree_gist/expected/btree_gist.out
new file mode 100644
index 00000000000..88dad12a415
--- /dev/null
+++ b/contrib/btree_gist/expected/btree_gist.out
@@ -0,0 +1,69 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0 ORDER BY a <-> 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/contrib/btree_gist/meson.build b/contrib/btree_gist/meson.build
index 73b1bbf52a6..51adf635eb9 100644
--- a/contrib/btree_gist/meson.build
+++ b/contrib/btree_gist/meson.build
@@ -94,4 +94,10 @@ tests += {
'without_overlaps',
],
},
+ 'isolation': {
+ 'specs': [
+ 'btree_gist',
+ ],
+ 'regress_args': ['--load-extension=btree_gist'],
+ },
}
diff --git a/contrib/btree_gist/specs/btree_gist.spec b/contrib/btree_gist/specs/btree_gist.spec
new file mode 100644
index 00000000000..18ad7b4dbd5
--- /dev/null
+++ b/contrib/btree_gist/specs/btree_gist.spec
@@ -0,0 +1,117 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0 ORDER BY a <-> 0;
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v3-0002-RFC-Extend-buffer-pin-lifetime-for-GIST-IOS.patch
From 86ca31af76acd71fce3fd36ddf25f63d6d699b77 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 4 Jan 2025 01:02:15 +0100
Subject: [PATCH v3 2/2] RFC: Extend buffer pin lifetime for GIST IOS
This should fix issues with incorrect results when IOS encounters concurrent vacuum.
---
src/include/access/gist_private.h | 16 ++++
src/backend/access/gist/gistget.c | 116 ++++++++++++++++++++++++++-
src/backend/access/gist/gistscan.c | 60 ++++++++++++++
src/backend/access/gist/gistvacuum.c | 4 +-
4 files changed, 192 insertions(+), 4 deletions(-)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..cf5fd4336c7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -17,6 +17,7 @@
#include "access/amapi.h"
#include "access/gist.h"
#include "access/itup.h"
+#include "lib/ilist.h"
#include "lib/pairingheap.h"
#include "storage/bufmgr.h"
#include "storage/buffile.h"
@@ -124,6 +125,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint32 pagepinOffset; /* Pinned page's offset into GISTScanOpaqueData.pinned */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -148,6 +150,13 @@ typedef struct GISTSearchItem
(offsetof(GISTSearchItem, distances) + \
sizeof(IndexOrderByDistance) * (n_distances))
+typedef struct GISTPagePin
+{
+ slist_node nextFree;
+ Buffer buffer; /* pinned buffer */
+ int count; /* number of results not yet returned */
+} GISTPagePin;
+
/*
* GISTScanOpaqueData: private state for a scan of a GiST index
*/
@@ -176,6 +185,12 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+
+ GISTPagePin *pinned; /* page pins, used in index-only scans.
+ * otherwise NULL */
+ slist_head nextFreePin; /* next free page pin, if available */
+ BlockNumber pincapacity; /* current max pin count in pinned */
+ uint32 releasePinOffset; /* reduce pin count on this buffer next */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
@@ -463,6 +478,7 @@ extern XLogRecPtr gistXLogAssignLSN(void);
extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern bool gistcanreturn(Relation index, int attno);
+extern uint32 gistgetpin(GISTScanOpaque opaque);
/* gistvalidate.c */
extern bool gistvalidate(Oid opclassoid);
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a97577..779478a44b2 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -337,6 +337,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
OffsetNumber maxoff;
OffsetNumber i;
MemoryContext oldcxt;
+ GISTPagePin *pin = NULL;
+ uint32 pagepinoff;
Assert(!GISTSearchItemIsHeap(*pageItem));
@@ -471,6 +473,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ if (pin == NULL)
+ {
+ pagepinoff = gistgetpin(so);
+ pin = &so->pinned[pagepinoff];
+ pin->buffer = buffer;
+ pin->count = 0;
+ }
+ so->pageData[so->nPageData].pagepinOffset = pagepinoff;
+ pin->count++;
}
so->nPageData++;
}
@@ -501,7 +513,19 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+
+ if (!PointerIsValid(pin))
+ {
+ pagepinoff = gistgetpin(so);
+ pin = &so->pinned[pagepinoff];
+ pin->buffer = buffer;
+ pin->count = 0;
+ }
+ pin->count++;
+ item->data.heap.pagepinOffset = pagepinoff;
+ }
}
else
{
@@ -526,7 +550,67 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ if (scan->xs_want_itup && GistPageIsLeaf(page) && PointerIsValid(pin))
+ {
+ Assert(pin->count > 0);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ /* Pin to be released when scan pin->count reaches 0 */
+ }
+ else
+ UnlockReleaseBuffer(buffer);
+}
+
+uint32
+gistgetpin(GISTScanOpaque opaque)
+{
+ slist_node *node;
+ GISTPagePin *nextFree;
+
+ if (opaque->pincapacity == 0)
+ {
+ opaque->pincapacity = 1;
+ opaque->pinned = palloc0(sizeof(GISTPagePin));
+
+ return 0;
+ }
+
+ if (slist_is_empty(&opaque->nextFreePin))
+ {
+ BlockNumber firstEntry = opaque->pincapacity;
+
+ /*
+ * We don't have any entries in the linked list, so repalloc is used
+ * safely here. (If there were entries left, they'd be stale
+ * references after this.)
+ *
+ * Note that before overflowing uint32 we'll always first fail the
+ * repalloc(), which has a limit of ~1GiB - much less than the 16GiB
+ * that we'd have to consume before we'd overflow pincapacity.
+ */
+ opaque->pincapacity *= 2;
+
+ opaque->pinned = repalloc(opaque->pinned,
+ mul_size(sizeof(GISTPagePin),
+ opaque->pincapacity));
+
+ /*
+ * Fill nextFreePin list back-to-front, to push higher indexes into
+ * the back of the slist.
+ */
+ for (BlockNumber i = opaque->pincapacity - 1; i >= firstEntry; i--)
+ {
+ slist_push_head(&opaque->nextFreePin,
+ &opaque->pinned[i].nextFree);
+ opaque->pinned[i].buffer = InvalidBuffer;
+ opaque->pinned[i].count = 0;
+ }
+ }
+
+ node = slist_pop_head_node(&opaque->nextFreePin);
+ nextFree = slist_container(GISTPagePin, nextFree, node);
+ nextFree->count = 0;
+ nextFree->buffer = 0;
+ return (uint32) (nextFree - opaque->pinned);
}
/*
@@ -588,7 +672,10 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ so->releasePinOffset = item->data.heap.pagepinOffset;
+ }
res = true;
}
else
@@ -637,6 +724,28 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
gistScanPage(scan, &fakeItem, NULL, NULL, NULL);
}
+ if (so->releasePinOffset != UINT32_MAX)
+ {
+ GISTPagePin *pin;
+
+ Assert(so->pincapacity > so->releasePinOffset);
+
+ pin = &so->pinned[so->releasePinOffset];
+
+ Assert(pin->count > 0);
+ Assert(BufferIsValid(pin->buffer));
+
+ pin->count--;
+
+ if (pin->count == 0)
+ {
+ ReleaseBuffer(pin->buffer);
+ pin->buffer = InvalidBuffer;
+ pin->count = 0;
+ slist_push_head(&so->nextFreePin, &pin->nextFree);
+ }
+ }
+
if (scan->numberOfOrderBys > 0)
{
/* Must fetch tuples in strict distance order */
@@ -651,7 +760,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
{
if (scan->kill_prior_tuple && so->curPageData > 0)
{
-
if (so->killedItems == NULL)
{
MemoryContext oldCxt =
@@ -673,7 +781,11 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ so->releasePinOffset =
+ so->pageData[so->curPageData].pagepinOffset;
+ }
so->curPageData++;
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index de472e16373..738fd0d4ace 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -111,6 +111,11 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pincapacity = 0;
+ so->pinned = NULL;
+ slist_init(&so->nextFreePin);
+ so->releasePinOffset = UINT32_MAX;
+
scan->opaque = so;
/*
@@ -209,6 +214,34 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
ALLOCSET_DEFAULT_SIZES);
}
+ if (PointerIsValid(so->pinned))
+ {
+ Assert(scan->xs_want_itup);
+
+ slist_init(&so->nextFreePin);
+
+ /*
+ * Fill nextFreePin list back-to-front, to push higher indexes into
+ * the back of the slist.
+ */
+ for (int p = so->pincapacity - 1; p >= 0; p--)
+ {
+ GISTPagePin *pin = &so->pinned[p];
+
+ if (BufferIsValid(pin->buffer))
+ {
+ Assert(pin->count > 0);
+ ReleaseBuffer(pin->buffer);
+ pin->buffer = InvalidBuffer;
+ pin->count = 0;
+ }
+
+ slist_push_head(&so->nextFreePin, &pin->nextFree);
+ }
+
+ so->releasePinOffset = UINT32_MAX;
+ }
+
/* create new, empty pairing heap for search queue */
oldCxt = MemoryContextSwitchTo(so->queueCxt);
so->queue = pairingheap_allocate(pairingheap_GISTSearchItem_cmp, scan);
@@ -348,6 +381,33 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (PointerIsValid(so->pinned))
+ {
+ Assert(scan->xs_want_itup);
+ slist_init(&so->nextFreePin);
+
+ /*
+ * Fill nextFreePin list back-to-front, to push higher indexes into
+ * the back of the slist.
+ */
+ for (int p = so->pincapacity - 1; p >= 0; p--)
+ {
+ GISTPagePin *pin = &so->pinned[p];
+
+ if (BufferIsValid(pin->buffer))
+ {
+ Assert(pin->count > 0);
+ ReleaseBuffer(pin->buffer);
+ pin->buffer = InvalidBuffer;
+ pin->count = 0;
+ }
+
+ slist_push_head(&so->nextFreePin, &pin->nextFree);
+ }
+
+ so->releasePinOffset = UINT32_MAX;
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..b0f0401a8ae 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -290,9 +290,9 @@ restart:
/*
* We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * exclusive lock for cleanup.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
--
2.45.2
On Sat, 4 Jan 2025 at 02:00, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 3 Dec 2024 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.
I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
I think I have a fix for GiST which can be found attached in patch 0002.
As to how this works: the patch tracks (for IOS) the pages for which
there are some entries yet to be returned by gistgettuple(), and keeps
a pin on those pages using a new AMscan tracking mechanism that
utilizes buffer refcounts. Even if it might not be a very elegant
solution and IMV still has rough edges, it works, and fixes the issue
with incorrect results from the GiST index.
In the attached v4 of the fix I've opted to replace the bespoke pin
tracker of my previous fix with the default buffer pin tracking
mechanism, which I realised has been designed for the same items and
doesn't require additional memory management on the AM side.
It massively simplifies the code, reduces allocation overhead, and
allowed me to port the fix to SP-GiST much quicker.
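Schematically, the idea is roughly the following (a minimal sketch with made-up field names, assuming the stock bufmgr refcounting is what's doing the tracking; the real patch differs in details):

/* When stashing a match from a leaf page during an index-only scan: */
IncrBufferRefCount(buffer);                  /* one extra pin per stashed item */
so->pageData[so->nPageData].buf = buffer;    /* hypothetical field */

/* When gistgettuple() later hands that item to the executor: */
ReleaseBuffer(so->pageData[so->curPageData].buf);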
I haven't looked much at SP-GiST yet, so I don't have anything for the
VACUUM+IOS bug there.
I've attached a fix that uses the same approach as GiST in v4-0003. I
couldn't find any spgist extensions in contrib to copy-paste the tests
to, but manual testing did show vacuum now does wait for the index
scan to finish page processing.
I'm looking forward to any feedback.
Like patch 0001, the status of that has not changed.
Kind regards,
Matthias van de Meent
Attachments:
v4-0001-isolationtester-showing-broken-index-only-scans-w.patch
From afa0310803bf72bdaccff265afbdecf26da88435 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 2 Dec 2024 15:50:04 -0500
Subject: [PATCH v4 1/3] isolationtester showing broken index-only scans with
GiST v3
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
---
contrib/btree_gist/.gitignore | 2 +
contrib/btree_gist/Makefile | 3 +
contrib/btree_gist/expected/btree_gist.out | 69 ++++++++++++
contrib/btree_gist/meson.build | 6 ++
contrib/btree_gist/specs/btree_gist.spec | 117 +++++++++++++++++++++
5 files changed, 197 insertions(+)
create mode 100644 contrib/btree_gist/expected/btree_gist.out
create mode 100644 contrib/btree_gist/specs/btree_gist.spec
diff --git a/contrib/btree_gist/.gitignore b/contrib/btree_gist/.gitignore
index 5dcb3ff9723..b4903eba657 100644
--- a/contrib/btree_gist/.gitignore
+++ b/contrib/btree_gist/.gitignore
@@ -1,4 +1,6 @@
# Generated subdirectories
/log/
/results/
+/output_iso/
/tmp_check/
+/tmp_check_iso/
diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 7ac2df26c10..e8cdef2277d 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -42,6 +42,9 @@ REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
bytea bit varbit numeric uuid not_equal enum bool partitions \
stratnum without_overlaps
+ISOLATION = btree_gist
+ISOLATION_OPTS = --load-extension=btree_gist
+
SHLIB_LINK += $(filter -lm, $(LIBS))
ifdef USE_PGXS
diff --git a/contrib/btree_gist/expected/btree_gist.out b/contrib/btree_gist/expected/btree_gist.out
new file mode 100644
index 00000000000..88dad12a415
--- /dev/null
+++ b/contrib/btree_gist/expected/btree_gist.out
@@ -0,0 +1,69 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0 ORDER BY a <-> 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/contrib/btree_gist/meson.build b/contrib/btree_gist/meson.build
index 73b1bbf52a6..51adf635eb9 100644
--- a/contrib/btree_gist/meson.build
+++ b/contrib/btree_gist/meson.build
@@ -94,4 +94,10 @@ tests += {
'without_overlaps',
],
},
+ 'isolation': {
+ 'specs': [
+ 'btree_gist',
+ ],
+ 'regress_args': ['--load-extension=btree_gist'],
+ },
}
diff --git a/contrib/btree_gist/specs/btree_gist.spec b/contrib/btree_gist/specs/btree_gist.spec
new file mode 100644
index 00000000000..18ad7b4dbd5
--- /dev/null
+++ b/contrib/btree_gist/specs/btree_gist.spec
@@ -0,0 +1,117 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0 ORDER BY a <-> 0;
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE a > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v4-0003-RFC-Extend-buffer-pinning-for-SP-GIST-IOS.patch
From 961251fd3fe5e462c4eccda56ad54842ea300956 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 7 Jan 2025 00:00:32 +0100
Subject: [PATCH v4 3/3] RFC: Extend buffer pinning for SP-GIST IOS
This should fix issues with incorrect results when a SP-GIST
IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/include/access/spgist_private.h | 3 +
src/backend/access/spgist/spgscan.c | 112 ++++++++++++++++++++++----
src/backend/access/spgist/spgvacuum.c | 2 +-
3 files changed, 100 insertions(+), 17 deletions(-)
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index e7cbe10a89b..c29d1d58c47 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -175,6 +175,8 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
+ Buffer buffer; /* buffer pinned for this leaf tuple
+ * (IOS-only) */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
@@ -226,6 +228,7 @@ typedef struct SpGistScanOpaqueData
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
+ Buffer pagePin; /* output tuple's pinned buffer, if IOS */
ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */
bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 3017861859f..ea344740f49 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ Buffer pin);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -300,6 +301,38 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
}
+/*
+ * Note: This removes all items from the pairingheap.
+ */
+static void
+spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so)
+{
+ /* Guaranteed no pinned pages */
+ if (so->scanQueue == NULL || !scan->xs_want_itup)
+ return;
+
+ if (so->nPtrs > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ while (!pairingheap_is_empty(so->scanQueue))
+ {
+ pairingheap_node *node;
+ SpGistSearchItem *item;
+
+ node = pairingheap_remove_first(so->scanQueue);
+ item = pairingheap_container(SpGistSearchItem, phNode, node);
+ if (!item->isLeaf)
+ continue;
+
+ Assert(BufferIsValid(item->buffer));
+ ReleaseBuffer(item->buffer);
+ }
+}
+
IndexScanDesc
spgbeginscan(Relation rel, int keysz, int orderbysz)
{
@@ -416,6 +449,9 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* preprocess scankeys, set up the representation in *so */
spgPrepareScanKeys(scan);
+ /* release any pinned buffers from earlier rescans */
+ spgScanEndDropAllPagePins(scan, so);
+
/* set up starting queue entries */
resetSpGistScanOpaque(so);
@@ -428,6 +464,12 @@ spgendscan(IndexScanDesc scan)
{
SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ /*
+ * release any pinned buffers from earlier rescans, before we drop their
+ * data by dropping the memory contexts.
+ */
+ spgScanEndDropAllPagePins(scan, so);
+
MemoryContextDelete(so->tempCxt);
MemoryContextDelete(so->traversalCxt);
@@ -460,7 +502,7 @@ spgendscan(IndexScanDesc scan)
static SpGistSearchItem *
spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
Datum leafValue, bool recheck, bool recheckDistances,
- bool isnull, double *distances)
+ bool isnull, double *distances, Buffer addPin)
{
SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
@@ -479,6 +521,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
datumCopy(leafValue, so->state.attType.attbyval,
so->state.attType.attlen);
+ Assert(BufferIsValid(addPin));
+ IncrBufferRefCount(addPin);
+ item->buffer = addPin;
+
/*
* If we're going to need to reconstruct INCLUDE attributes, store the
* whole leaf tuple so we can get the INCLUDE attributes out of it.
@@ -495,6 +541,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
{
item->value = (Datum) 0;
item->leafTuple = NULL;
+ item->buffer = InvalidBuffer;
}
item->traversalValue = NULL;
item->isLeaf = true;
@@ -513,7 +560,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
static bool
spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
SpGistLeafTuple leafTuple, bool isnull,
- bool *reportedSome, storeRes_func storeRes)
+ bool *reportedSome, storeRes_func storeRes, Buffer buffer)
{
Datum leafValue;
double *distances;
@@ -580,7 +627,8 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
recheck,
recheckDistances,
isnull,
- distances);
+ distances,
+ buffer);
spgAddSearchItemToQueue(so, heapItem);
@@ -591,7 +639,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, InvalidBuffer);
*reportedSome = true;
}
}
@@ -760,7 +808,7 @@ enum SpGistSpecialOffsetNumbers
static OffsetNumber
spgTestLeafTuple(SpGistScanOpaque so,
SpGistSearchItem *item,
- Page page, OffsetNumber offset,
+ Page page, OffsetNumber offset, Buffer buffer,
bool isnull, bool isroot,
bool *reportedSome,
storeRes_func storeRes)
@@ -799,7 +847,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
Assert(ItemPointerIsValid(&leafTuple->heapPtr));
- spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+ spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes,
+ buffer);
return SGLT_GET_NEXTOFFSET(leafTuple);
}
@@ -835,7 +884,8 @@ redirect:
Assert(so->numberOfNonNullOrderBys > 0);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->buffer);
reportedSome = true;
}
else
@@ -873,7 +923,7 @@ redirect:
/* When root is a leaf, examine all its tuples */
for (offset = FirstOffsetNumber; offset <= max; offset++)
(void) spgTestLeafTuple(so, item, page, offset,
- isnull, true,
+ buffer, isnull, true,
&reportedSome, storeRes);
}
else
@@ -883,10 +933,24 @@ redirect:
{
Assert(offset >= FirstOffsetNumber && offset <= max);
offset = spgTestLeafTuple(so, item, page, offset,
- isnull, false,
+ buffer, isnull, false,
&reportedSome, storeRes);
if (offset == SpGistRedirectOffsetNumber)
+ {
+ Assert(so->nPtrs == 0);
goto redirect;
+ }
+ }
+
+ /*
+ * IOS: Make sure we have one additional pin on the buffer,
+ * so that vacuum won't remove any deleted TIDs and mark
+ * their pages ALL_VISIBLE while we still have a copy.
+ */
+ if (so->want_itup && reportedSome)
+ {
+ IncrBufferRefCount(buffer);
+ so->pagePin = buffer;
}
}
}
@@ -929,9 +993,10 @@ static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ Buffer pin)
{
- Assert(!recheckDistances && !distances);
+ Assert(!recheckDistances && !distances && !BufferIsValid(pin));
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
so->ntids++;
}
@@ -954,10 +1019,9 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
/* storeRes subroutine for gettuple case */
static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
- Datum leafValue, bool isnull,
- SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr, Datum leafValue,
+ bool isnull, SpGistLeafTuple leafTuple, bool recheck,
+ bool recheckDistances, double *nonNullDistances, Buffer pin)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
@@ -1016,6 +1080,10 @@ storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
leafDatums,
leafIsnulls);
+
+ /* move the buffer pin, if required */
+ if (BufferIsValid(pin))
+ so->pagePin = pin;
}
so->nPtrs++;
}
@@ -1065,7 +1133,19 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < so->nPtrs; i++)
pfree(so->reconTups[i]);
+
+ if (so->nPtrs > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
}
+ else
+ {
+ Assert(!BufferIsValid(so->pagePin));
+ }
+
so->iPtr = so->nPtrs = 0;
spgWalk(scan->indexRelation, so, false, storeGettuple);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0da069fd4d7..d0680a5073e 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
--
2.45.2
v4-0002-RFC-Extend-buffer-pinning-for-GIST-IOS.patch
From e70c19b523cc010662f0c29324e416d889c919fa Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 6 Jan 2025 20:54:08 +0100
Subject: [PATCH v4 2/3] RFC: Extend buffer pinning for GIST IOS
This should fix issues with incorrect results when a GIST IOS
encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/include/access/gist_private.h | 2 +
src/backend/access/gist/README | 16 ++++
src/backend/access/gist/gistget.c | 34 +++++++-
src/backend/access/gist/gistscan.c | 115 ++++++++++++++++++++++++---
src/backend/access/gist/gistvacuum.c | 6 +-
5 files changed, 159 insertions(+), 14 deletions(-)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..ef2b6cab915 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -124,6 +124,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ Buffer buffer; /* buffer to unpin, when IOS */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -176,6 +177,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ Buffer pagePin; /* buffer of page, if pinned */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 8015ff19f05..c7c2afad088 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -287,6 +287,22 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
+Index-only scans and VACUUM
+---------------------------
+
+Index-only scans require that any tuple returned by the index scan has not
+been removed from the index by a call to ambulkdelete through VACUUM.
+To ensure this invariant, bulkdelete now requires a buffer cleanup lock, and
+every Index-only scan (IOS) will keep a pin on each page that it is returning
+tuples from. For ordered scans, we keep one pin for each matching leaf tuple,
+for unordered scans we just keep an additional pin while we're still working
+on the page's tuples. This ensures that pages seen by the scan won't be
+cleaned up until after the tuples have been returned.
+
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.
+
Buffering build algorithm
-------------------------
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a97577..99fb8d2d4fa 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -395,6 +395,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
so->nPageData = so->curPageData = 0;
+ Assert(so->pagePin == InvalidBuffer);
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -460,6 +461,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
so->pageData[so->nPageData].offnum = i;
+ so->pageData[so->nPageData].buffer = InvalidBuffer;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -471,7 +473,18 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ /*
+ * Only maintain a single additional buffer pin for unordered
+ * IOS scans; as we have all data already in one place.
+ */
+ if (so->nPageData == 0)
+ {
+ so->pagePin = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
+
so->nPageData++;
}
else
@@ -501,7 +514,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ item->data.heap.buffer = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
else
{
@@ -567,6 +584,10 @@ getNextNearest(IndexScanDesc scan)
/* free previously returned tuple */
pfree(scan->xs_hitup);
scan->xs_hitup = NULL;
+
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
}
do
@@ -588,7 +609,11 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
scan->xs_hitup = item->data.heap.recontup;
+ so->pagePin = item->data.heap.buffer;
+ }
res = true;
}
else
@@ -688,7 +713,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
&& so->curPageData > 0
&& so->curPageData == so->nPageData)
{
-
if (so->killedItems == NULL)
{
MemoryContext oldCxt =
@@ -704,6 +728,14 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
}
+
+ if (scan->xs_want_itup && so->nPageData > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
/* find and process the next index page */
do
{
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index de472e16373..091bdd08c38 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -110,6 +110,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pagePin = InvalidBuffer;
scan->opaque = so;
@@ -151,18 +152,73 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
Assert(so->queueCxt == so->giststate->scanCxt);
first_time = true;
}
- else if (so->queueCxt == so->giststate->scanCxt)
- {
- /* second time through */
- so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST queue context",
- ALLOCSET_DEFAULT_SIZES);
- first_time = false;
- }
else
{
- /* third or later time through */
- MemoryContextReset(so->queueCxt);
+ /*
+ * In the first scan of a query we allocate IOS items in the scan
+ * context, which is never reset. To not leak this memory, we
+ * manually free the queue entries.
+ */
+ const bool freequeue = so->queueCxt == so->giststate->scanCxt;
+ /*
+ * Index-only scans require that vacuum can't clean up entries that
+ * we're still planning to return, so we hold a pin on the buffer until
+ * we're past the returned item (1 pin count for every index tuple).
+ * When rescan is called, however, we need to clean up the pins that
+ * we still hold, lest we leak them and lose a buffer entry to that
+ * page.
+ */
+ const bool unpinqueue = scan->xs_want_itup;
+
+ if (freequeue || unpinqueue)
+ {
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ /*
+ * If we need to unpin a buffer for IOS' heap items, do so
+ * now.
+ */
+ if (unpinqueue && item->blkno != InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+
+ /*
+ * item->data.heap.recontup is stored in the separate memory
+ * context so->pageDataCxt, which is always reset; so we don't
+ * need to free that.
+ * "item" itself is allocated into the queue context, which is
+ * generally reset in rescan.
+ * However, only in the first scan, we allocate these items
+ * into the main scan context, which isn't reset; so we must
+ * free these items, or else we'd leak the memory for the
+ * duration of the query.
+ */
+ if (freequeue)
+ pfree(item);
+ }
+ }
+
+ if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_SIZES);
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ }
+
first_time = false;
}
@@ -341,6 +397,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
/* any previous xs_hitup will have been pfree'd in context resets above */
scan->xs_hitup = NULL;
+
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+ }
}
void
@@ -348,6 +413,36 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ /* unpin any leftover buffers */
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ /*
+ * Note: unlike gistrescan, there is no need to actually free the
+ * items here, as that's handled by memory context reset in the
+ * call to freeGISTstate() below.
+ */
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ if (item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+ }
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..e0da6e37dca 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
--
2.45.2
Hello.
One thing I think we could add to the patches is to adapt the 10-year-old
comment below with a note about IOS:
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
* the pin for MVCC scans, which allows vacuum to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
Also, I think it is a good idea to add "Assert(!scan->xs_want_itup);"
to gistkillitems.
Best regards,
Mikhail.
On Thu, 9 Jan 2025 at 22:00, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Hello.
> One thing I think we could add to the patches is to adapt the 10-year-old comment below with a note about IOS:
> /*
>  * We save the LSN of the page as we read it, so that we know whether it
>  * safe to apply LP_DEAD hints to the page later. This allows us to drop
>  * the pin for MVCC scans, which allows vacuum to avoid blocking.
>  */
> so->curPageLSN = BufferGetLSNAtomic(buffer);
I don't quite read it as covering IOS. To me, the comment is more
along the lines of (extensively extended):
"""
We'd like to use kill_prior_tuple, but that requires us to apply
changes to the page when we're already done with it for all intents
and purposes (because we scan the page once and buffer results). We
can either keep a pin on the buffer, or re-acquire that page after
finishing producing the tuples from this page.
Pinning the page blocks vacuum [^1], so instead we drop the pin, then
collect LP_DEAD marks, and then later we re-acquire the page to mark
the tuples dead. However, in the meantime the page may have changed;
by keeping tabs on changes in the LSN we have a cheap method of detecting
changes to the page itself. [^2]
"""
... and that doesn't seem to cover much of IOS. MVCC index scans
aren't that special; practically every user query uses an MVCC snapshot.
I think this "MVCC scan" even means a non-IOS scan, as I can't think of
a reason why dropping the pin would otherwise be valid behaviour (see
this thread's main issue).
[^1] Well, it should. In practice, the code in HEAD doesn't, but this
patchset fixes that disagreement.
[^2] If the page changed, i.e. the LSN changed, GIST accepts that it
can't use the collected LP_DEAD marks. We may be able to improve on
that (or not) by matching LP_DEAD offsets and TIDs with the on-page
TIDs, but that's far outside the scope of this patch; we'd first have
to build an understanding about why it's correct to assume vacuum
hasn't finished reaping old tuples and other sessions also finished
inserting new ones with the same TIDs in the meantime.
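For reference, the re-check described above is roughly this fragment of
gistkillitems() (condensed; error handling and the actual hint-setting
loop omitted):

	buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
	LockBuffer(buffer, GIST_SHARE);

	/*
	 * The pin was dropped after the page was scanned, so the page may have
	 * changed in the meantime.  The LSN saved at scan time is the cheap
	 * change detector: if it moved, the collected offsets may no longer
	 * point at the tuples we meant to kill, so give up on the hints.
	 */
	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
	{
		UnlockReleaseBuffer(buffer);
		so->numKilled = 0;	/* reset counter */
		return;
	}

	/* ...otherwise mark so->killedItems[] entries LP_DEAD and dirty the hint... */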
> Also, I think it is a good idea to add "Assert(!scan->xs_want_itup);" to gistkillitems.
Why would it be incorrect or invalid to kill items in an index-only scan?
If we hit the heap (due to ! VM_ALL_VISIBLE) and detected the heap
tuple was dead, why couldn't we mark it as dead in the index? IOS
assumes a high rate of all-visible pages, but it's hardly unheard of
to access pages with dead tuples.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
Sorry, I should have expressed my thoughts in more detail; they weren't
worth as much as the time you took to answer them.
> I don't quite read it as covering IOS. To me, the comment is more
> along the lines of (extensively extended):
My idea was just to add a few more details about the locking rule, such as:
* safe to apply LP_DEAD hints to the page later. This allows us to drop
* the pin for MVCC scans (except in cases of index-only scans due to XXX),
* which allows vacuum to avoid blocking.
I think this "MVCC scan" even means non-IOS scan
Maybe, but I think it’s better to clarify that, since IOS scans still use
the MVCC snapshot. For me, a non-MVCC scan is something like SnapshotSelf
or SnapshotDirty.
> Why would it be incorrect or invalid to kill items in an index-only scan?
Oh, I was comparing the logic to that of btree and somehow made a logical
error in my conclusions. But at least I hope I got some useful thoughts out
of it: since we hold a pin during gistkillitems in the case of IOS, we can
skip the "if (BufferGetLSNAtomic(buffer) != so->curPageLSN)" check in
that case, because vacuum is blocked by the pin.
It doesn't make up for the performance penalty caused by the buffer pin
during IOS, but it is at least something.
I hope this time my conclusions are correct :)
Thanks,
Mikhail.
Hello, Matthias!
Updated patches attached.
Changes:
* clean up the test logic a little bit
* resolve an issue with rescan in GiST (item->blkno == InvalidBlockNumber)
* move the test to the main isolation suite
* add a test for SP-GiST
* update the comment I mentioned before
* allow GiST to set LP_DEAD in cases where it is safe even if the LSN has
  changed (see the condensed excerpt below)
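For that last item, the relevant change is this condensed excerpt from
gistkillitems() in the attached v5-0002 (pagePinned is the new parameter,
passed as BufferIsValid(so->pagePin) from gistgettuple()):

	if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
	{
		/*
		 * The page changed since we scanned it and we held no pin, so the
		 * collected offsets may be stale: give up on the hints.  With a pin
		 * held (index-only scan), VACUUM could not have removed tuples, so
		 * the LSN check can be skipped.
		 */
		UnlockReleaseBuffer(buffer);
		so->numKilled = 0;
		return;
	}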
Also, it seems the SP-GiST version is broken; it fails like this:
TRAP: failed Assert("BufferIsValid(so->pagePin)"), File:
"../src/backend/access/spgist/spgscan.c", Line: 1139, PID: 612214
FETCH(ExceptionalCondition+0xbe)[0x644a0b9dfdbc]
FETCH(spggettuple+0x289)[0x644a0b3743c6]
FETCH(index_getnext_tid+0x166)[0x644a0b3382f7]
FETCH(+0x3b392b)[0x644a0b56d92b]
FETCH(+0x3887df)[0x644a0b5427df]
FETCH(ExecScan+0x77)[0x644a0b542858]
FETCH(+0x3b3b9b)[0x644a0b56db9b]
FETCH(+0x376d8b)[0x644a0b530d8b]
FETCH(+0x379bd9)[0x644a0b533bd9]
FETCH(standard_ExecutorRun+0x19f)[0x644a0b531457]
FETCH(ExecutorRun+0x5a)[0x644a0b5312b5]
FETCH(+0x6276dc)[0x644a0b7e16dc]
FETCH(+0x628936)[0x644a0b7e2936]
FETCH(PortalRunFetch+0x1a0)[0x644a0b7e229c]
FETCH(PerformPortalFetch+0x13b)[0x644a0b49d7e5]
FETCH(standard_ProcessUtility+0x5f0)[0x644a0b7e3aab]
FETCH(ProcessUtility+0x140)[0x644a0b7e34b4]
FETCH(+0x627ceb)[0x644a0b7e1ceb]
FETCH(+0x627a28)[0x644a0b7e1a28]
FETCH(PortalRun+0x273)[0x644a0b7e12bb]
FETCH(+0x61fae1)[0x644a0b7d9ae1]
FETCH(PostgresMain+0x9eb)[0x644a0b7df170]
FETCH(+0x61b3e2)[0x644a0b7d53e2]
FETCH(postmaster_child_launch+0x137)[0x644a0b6e6e2d]
FETCH(+0x53384b)[0x644a0b6ed84b]
FETCH(+0x530f31)[0x644a0b6eaf31]
FETCH(PostmasterMain+0x161f)[0x644a0b6ea812]
FETCH(main+0x3a1)[0x644a0b5c29cf]
Best regards,
Mikhail.
Attachments:
v5-0002-RFC-Extend-buffer-pinning-for-GIST-IOS.patch
From 59b7746c96cca144c1d1d0362a96d8aa019e2a23 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Fri, 10 Jan 2025 17:55:30 +0100
Subject: [PATCH v5 2/3] RFC: Extend buffer pinning for GIST IOS
This should fix issues with incorrect results when a GIST IOS encounters tuples removed from pages by a concurrent vacuum operation.
Also, add the ability to set LP_DEAD bits in more cases for IOS scans over GiST.
---
src/backend/access/gist/README | 16 ++++
src/backend/access/gist/gistget.c | 46 +++++++++--
src/backend/access/gist/gistscan.c | 115 ++++++++++++++++++++++++---
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 2 +
5 files changed, 166 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 8015ff19f05..c7c2afad088 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -287,6 +287,22 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
+Index-only scans and VACUUM
+---------------------------
+
+Index-only scans require that any tuple returned by the index scan has not
+been removed from the index by a call to ambulkdelete through VACUUM.
+To ensure this invariant, bulkdelete now requires a buffer cleanup lock, and
+every Index-only scan (IOS) will keep a pin on each page that it is returning
+tuples from. For ordered scans, we keep one pin for each matching leaf tuple,
+for unordered scans we just keep an additional pin while we're still working
+on the page's tuples. This ensures that pages seen by the scan won't be
+cleaned up until after the tuples have been returned.
+
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.
+
Buffering build algorithm
-------------------------
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index cc40e928e0a..adf86fed67b 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -35,7 +35,7 @@
* away and the TID was re-used by a completely different heap tuple.
*/
static void
-gistkillitems(IndexScanDesc scan)
+gistkillitems(IndexScanDesc scan, bool pagePinned)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
Buffer buffer;
@@ -60,9 +60,10 @@ gistkillitems(IndexScanDesc scan)
/*
* If page LSN differs it means that the page was modified since the last
* read. killedItems could be not valid so LP_DEAD hints applying is not
- * safe.
+ * safe. But if the page was pinned, it is safe, because VACUUM is
+ * unable to remove tuples due to the locking protocol.
*/
- if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+ if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
{
UnlockReleaseBuffer(buffer);
so->numKilled = 0; /* reset counter */
@@ -395,6 +396,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
so->nPageData = so->curPageData = 0;
+ Assert(so->pagePin == InvalidBuffer);
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -402,7 +404,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ * the pin for MVCC scans (except index-only scans), which allows vacuum
+ * to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
@@ -460,6 +463,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
so->pageData[so->nPageData].offnum = i;
+ so->pageData[so->nPageData].buffer = InvalidBuffer;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -471,7 +475,18 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ /*
+ * Only maintain a single additional buffer pin for unordered
+ * IOS scans; as we have all data already in one place.
+ */
+ if (so->nPageData == 0)
+ {
+ so->pagePin = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
+
so->nPageData++;
}
else
@@ -501,7 +516,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ item->data.heap.buffer = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
else
{
@@ -567,6 +586,10 @@ getNextNearest(IndexScanDesc scan)
/* free previously returned tuple */
pfree(scan->xs_hitup);
scan->xs_hitup = NULL;
+
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
}
do
@@ -588,7 +611,11 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
scan->xs_hitup = item->data.heap.recontup;
+ so->pagePin = item->data.heap.buffer;
+ }
res = true;
}
else
@@ -688,7 +715,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
&& so->curPageData > 0
&& so->curPageData == so->nPageData)
{
-
if (so->killedItems == NULL)
{
MemoryContext oldCxt =
@@ -704,13 +730,21 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
}
+
+ if (scan->xs_want_itup && so->nPageData > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
/* find and process the next index page */
do
{
GISTSearchItem *item;
if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
- gistkillitems(scan);
+ gistkillitems(scan, BufferIsValid(so->pagePin));
item = getNextGISTSearchItem(so);
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..932c2271510 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -110,6 +110,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pagePin = InvalidBuffer;
scan->opaque = so;
@@ -151,18 +152,73 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
Assert(so->queueCxt == so->giststate->scanCxt);
first_time = true;
}
- else if (so->queueCxt == so->giststate->scanCxt)
- {
- /* second time through */
- so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST queue context",
- ALLOCSET_DEFAULT_SIZES);
- first_time = false;
- }
else
{
- /* third or later time through */
- MemoryContextReset(so->queueCxt);
+ /*
+ * In the first scan of a query we allocate IOS items in the scan
+ * context, which is never reset. To not leak this memory, we
+ * manually free the queue entries.
+ */
+ const bool freequeue = so->queueCxt == so->giststate->scanCxt;
+ /*
+ * Index-only scans require that vacuum can't clean up entries that
+ * we're still planning to return, so we hold a pin on the buffer until
+ * we're past the returned item (1 pin count for every index tuple).
+ * When rescan is called, however, we need to clean up the pins that
+ * we still hold, lest we leak them and lose a buffer entry to that
+ * page.
+ */
+ const bool unpinqueue = scan->xs_want_itup;
+
+ if (freequeue || unpinqueue)
+ {
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ /*
+ * If we need to unpin a buffer for IOS' heap items, do so
+ * now.
+ */
+ if (unpinqueue && item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+
+ /*
+ * item->data.heap.recontup is stored in the separate memory
+ * context so->pageDataCxt, which is always reset; so we don't
+ * need to free that.
+ * "item" itself is allocated into the queue context, which is
+ * generally reset in rescan.
+ * However, only in the first scan, we allocate these items
+ * into the main scan context, which isn't reset; so we must
+ * free these items, or else we'd leak the memory for the
+ * duration of the query.
+ */
+ if (freequeue)
+ pfree(item);
+ }
+ }
+
+ if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_SIZES);
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ }
+
first_time = false;
}
@@ -341,6 +397,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
/* any previous xs_hitup will have been pfree'd in context resets above */
scan->xs_hitup = NULL;
+
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+ }
}
void
@@ -348,6 +413,36 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ /* unpin any leftover buffers */
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ /*
+ * Note: unlike gistrescan, there is no need to actually free the
+ * items here, as that's handled by memory context reset in the
+ * call to freeGISTstate() below.
+ */
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ if (item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+ }
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..840b3d586ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..e559117e7d7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -124,6 +124,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ Buffer buffer; /* buffer to unpin, when IOS */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -176,6 +177,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ Buffer pagePin; /* buffer of page, if pinned */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.43.0
v5-0001-isolation-tester-showing-broken-index-only-scans-.patch
From 5423affdba594ca0c1575f4f35c6ec479f82b216 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Fri, 10 Jan 2025 16:22:29 +0100
Subject: [PATCH v5 1/3] isolation tester showing broken index-only scans with
GiST and SP-GiST
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>, Michail Nikolaev <michail.nikolaev@gmail.com>
---
.../expected/index-only-scan-gist-vacuum.out | 67 +++++++++++
.../index-only-scan-spgist-vacuum.out | 67 +++++++++++
src/test/isolation/isolation_schedule | 2 +
.../specs/index-only-scan-gist-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 113 ++++++++++++++++++
5 files changed, 362 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..9720c9a2dc8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,8 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..b1688d44fa7
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..b414c5d1695
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with SPGiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.43.0
v5-0003-RFC-Extend-buffer-pinning-for-SP-GIST-IOS.patch
From bf5e033d291a9dcfdd95ca95d576f5b40a8d34c2 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Fri, 10 Jan 2025 18:00:49 +0100
Subject: [PATCH v5 3/3] RFC: Extend buffer pinning for SP-GIST IOS
This should fix issues with incorrect results when a SP-GIST
IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/backend/access/spgist/spgscan.c | 112 ++++++++++++++++++++++----
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 3 +
3 files changed, 100 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 986362a777f..32e6a0a8a03 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ Buffer pin);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -300,6 +301,38 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
}
+/*
+ * Note: This removes all items from the pairingheap.
+ */
+static void
+spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so)
+{
+ /* Guaranteed no pinned pages */
+ if (so->scanQueue == NULL || !scan->xs_want_itup)
+ return;
+
+ if (so->nPtrs > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ while (!pairingheap_is_empty(so->scanQueue))
+ {
+ pairingheap_node *node;
+ SpGistSearchItem *item;
+
+ node = pairingheap_remove_first(so->scanQueue);
+ item = pairingheap_container(SpGistSearchItem, phNode, node);
+ if (!item->isLeaf)
+ continue;
+
+ Assert(BufferIsValid(item->buffer));
+ ReleaseBuffer(item->buffer);
+ }
+}
+
IndexScanDesc
spgbeginscan(Relation rel, int keysz, int orderbysz)
{
@@ -416,6 +449,9 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* preprocess scankeys, set up the representation in *so */
spgPrepareScanKeys(scan);
+ /* release any pinned buffers from earlier rescans */
+ spgScanEndDropAllPagePins(scan, so);
+
/* set up starting queue entries */
resetSpGistScanOpaque(so);
@@ -428,6 +464,12 @@ spgendscan(IndexScanDesc scan)
{
SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ /*
+ * release any pinned buffers from earlier rescans, before we drop their
+ * data by dropping the memory contexts.
+ */
+ spgScanEndDropAllPagePins(scan, so);
+
MemoryContextDelete(so->tempCxt);
MemoryContextDelete(so->traversalCxt);
@@ -460,7 +502,7 @@ spgendscan(IndexScanDesc scan)
static SpGistSearchItem *
spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
Datum leafValue, bool recheck, bool recheckDistances,
- bool isnull, double *distances)
+ bool isnull, double *distances, Buffer addPin)
{
SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
@@ -479,6 +521,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
datumCopy(leafValue, so->state.attType.attbyval,
so->state.attType.attlen);
+ Assert(BufferIsValid(addPin));
+ IncrBufferRefCount(addPin);
+ item->buffer = addPin;
+
/*
* If we're going to need to reconstruct INCLUDE attributes, store the
* whole leaf tuple so we can get the INCLUDE attributes out of it.
@@ -495,6 +541,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
{
item->value = (Datum) 0;
item->leafTuple = NULL;
+ item->buffer = InvalidBuffer;
}
item->traversalValue = NULL;
item->isLeaf = true;
@@ -513,7 +560,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
static bool
spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
SpGistLeafTuple leafTuple, bool isnull,
- bool *reportedSome, storeRes_func storeRes)
+ bool *reportedSome, storeRes_func storeRes, Buffer buffer)
{
Datum leafValue;
double *distances;
@@ -580,7 +627,8 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
recheck,
recheckDistances,
isnull,
- distances);
+ distances,
+ buffer);
spgAddSearchItemToQueue(so, heapItem);
@@ -591,7 +639,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, InvalidBuffer);
*reportedSome = true;
}
}
@@ -760,7 +808,7 @@ enum SpGistSpecialOffsetNumbers
static OffsetNumber
spgTestLeafTuple(SpGistScanOpaque so,
SpGistSearchItem *item,
- Page page, OffsetNumber offset,
+ Page page, OffsetNumber offset, Buffer buffer,
bool isnull, bool isroot,
bool *reportedSome,
storeRes_func storeRes)
@@ -799,7 +847,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
Assert(ItemPointerIsValid(&leafTuple->heapPtr));
- spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+ spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes,
+ buffer);
return SGLT_GET_NEXTOFFSET(leafTuple);
}
@@ -835,7 +884,8 @@ redirect:
Assert(so->numberOfNonNullOrderBys > 0);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->buffer);
reportedSome = true;
}
else
@@ -873,7 +923,7 @@ redirect:
/* When root is a leaf, examine all its tuples */
for (offset = FirstOffsetNumber; offset <= max; offset++)
(void) spgTestLeafTuple(so, item, page, offset,
- isnull, true,
+ buffer, isnull, true,
&reportedSome, storeRes);
}
else
@@ -883,10 +933,24 @@ redirect:
{
Assert(offset >= FirstOffsetNumber && offset <= max);
offset = spgTestLeafTuple(so, item, page, offset,
- isnull, false,
+ buffer, isnull, false,
&reportedSome, storeRes);
if (offset == SpGistRedirectOffsetNumber)
+ {
+ Assert(so->nPtrs == 0);
goto redirect;
+ }
+ }
+
+ /*
+ * IOS: Make sure we have one additional pin on the buffer,
+ * so that vacuum won't remove any deleted TIDs and mark
+ * their pages ALL_VISIBLE while we still have a copy.
+ */
+ if (so->want_itup && reportedSome)
+ {
+ IncrBufferRefCount(buffer);
+ so->pagePin = buffer;
}
}
}
@@ -929,9 +993,10 @@ static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ Buffer pin)
{
- Assert(!recheckDistances && !distances);
+ Assert(!recheckDistances && !distances && !BufferIsValid(pin));
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
so->ntids++;
}
@@ -954,10 +1019,9 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
/* storeRes subroutine for gettuple case */
static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
- Datum leafValue, bool isnull,
- SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr, Datum leafValue,
+ bool isnull, SpGistLeafTuple leafTuple, bool recheck,
+ bool recheckDistances, double *nonNullDistances, Buffer pin)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
@@ -1016,6 +1080,10 @@ storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
leafDatums,
leafIsnulls);
+
+ /* move the buffer pin, if required */
+ if (BufferIsValid(pin))
+ so->pagePin = pin;
}
so->nPtrs++;
}
@@ -1065,7 +1133,19 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < so->nPtrs; i++)
pfree(so->reconTups[i]);
+
+ if (so->nPtrs > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
}
+ else
+ {
+ Assert(!BufferIsValid(so->pagePin));
+ }
+
so->iPtr = so->nPtrs = 0;
spgWalk(scan->indexRelation, so, false, storeGettuple);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..f02c270c5cc 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..1948e53e2ff 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -175,6 +175,8 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
+ Buffer buffer; /* buffer pinned for this leaf tuple
+ * (IOS-only) */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
@@ -226,6 +228,7 @@ typedef struct SpGistScanOpaqueData
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
+ Buffer pagePin; /* output tuple's pinned buffer, if IOS */
ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */
bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
--
2.43.0
Hello everyone, and Matthias!
I have fixed the SP-GiST-related crash and a few issues in the implementation.
It now passes the tests and (in my opinion) feels simpler.
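To illustrate the core of the SP-GiST change, here is a condensed sketch (not the literal patch code: the helper names are invented for illustration; only the buffer-manager calls and the buffer/pagePin fields correspond to the patch). Each leaf item queued by an index-only scan carries one extra pin on its page, taken with IncrBufferRefCount(), and that pin is dropped only after the corresponding tuple has been handed back:

    /* hypothetical helpers condensing the patch's pin hand-off */
    static void
    queue_leaf_item_with_pin(SpGistSearchItem *item, Buffer leafbuf)
    {
        /* keep the leaf page pinned for as long as this queued item lives */
        Assert(BufferIsValid(leafbuf));
        IncrBufferRefCount(leafbuf);
        item->buffer = leafbuf;
    }

    static void
    release_returned_item_pin(SpGistScanOpaque so)
    {
        /* the returned tuple is consumed; VACUUM may now get its cleanup lock */
        if (BufferIsValid(so->pagePin))
        {
            ReleaseBuffer(so->pagePin);
            so->pagePin = InvalidBuffer;
        }
    }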
I'll register this thread in the commitfest to honor the bureaucracy.
Best regards,
Mikhail.
Attachments:
v6-0003-This-should-fix-issues-with-incorrect-results-whe.patch
From 0c451d19d33a2d640acad3160f8794cfaa8763d2 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Wed, 5 Feb 2025 10:59:37 +0100
Subject: [PATCH v6 3/3] This should fix issues with incorrect results when a
SP-GIST IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/backend/access/spgist/spgscan.c | 108 ++++++++++++++++++++++----
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 5 ++
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 53f910e9d89..72ea14dfe2c 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ Buffer pin);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -95,6 +96,11 @@ spgFreeSearchItem(SpGistScanOpaque so, SpGistSearchItem *item)
if (item->traversalValue)
pfree(item->traversalValue);
+ if (so->want_itup && item->isLeaf)
+ {
+ Assert(BufferIsValid(item->buffer));
+ ReleaseBuffer(item->buffer);
+ }
pfree(item);
}
@@ -142,6 +148,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->buffer = InvalidBuffer;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -416,6 +423,9 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* preprocess scankeys, set up the representation in *so */
spgPrepareScanKeys(scan);
+ /* release any pinned buffers from earlier rescans */
+ spgScanEndDropAllPagePins(scan, so);
+
/* set up starting queue entries */
resetSpGistScanOpaque(so);
@@ -428,6 +438,12 @@ spgendscan(IndexScanDesc scan)
{
SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ /*
+ * release any pinned buffers from earlier rescans, before we drop their
+ * data by dropping the memory contexts.
+ */
+ spgScanEndDropAllPagePins(scan, so);
+
MemoryContextDelete(so->tempCxt);
MemoryContextDelete(so->traversalCxt);
@@ -460,7 +476,7 @@ spgendscan(IndexScanDesc scan)
static SpGistSearchItem *
spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
Datum leafValue, bool recheck, bool recheckDistances,
- bool isnull, double *distances)
+ bool isnull, double *distances, Buffer addPin)
{
SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
@@ -479,6 +495,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
datumCopy(leafValue, so->state.attType.attbyval,
so->state.attType.attlen);
+ Assert(BufferIsValid(addPin));
+ IncrBufferRefCount(addPin);
+ item->buffer = addPin;
+
/*
* If we're going to need to reconstruct INCLUDE attributes, store the
* whole leaf tuple so we can get the INCLUDE attributes out of it.
@@ -495,6 +515,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
{
item->value = (Datum) 0;
item->leafTuple = NULL;
+ item->buffer = InvalidBuffer;
}
item->traversalValue = NULL;
item->isLeaf = true;
@@ -513,7 +534,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
static bool
spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
SpGistLeafTuple leafTuple, bool isnull,
- bool *reportedSome, storeRes_func storeRes)
+ bool *reportedSome, storeRes_func storeRes, Buffer buffer)
{
Datum leafValue;
double *distances;
@@ -580,7 +601,8 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
recheck,
recheckDistances,
isnull,
- distances);
+ distances,
+ buffer);
spgAddSearchItemToQueue(so, heapItem);
@@ -591,7 +613,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, buffer);
*reportedSome = true;
}
}
@@ -750,6 +772,35 @@ spgGetNextQueueItem(SpGistScanOpaque so)
return (SpGistSearchItem *) pairingheap_remove_first(so->scanQueue);
}
+/*
+ * Note: This removes all items from the pairingheap.
+ */
+void
+spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so)
+{
+ /* Guaranteed no pinned pages */
+ if (so->scanQueue == NULL || !scan->xs_want_itup)
+ return;
+
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ while (!pairingheap_is_empty(so->scanQueue))
+ {
+ SpGistSearchItem *item;
+
+ item = spgGetNextQueueItem(so);
+ if (!item->isLeaf)
+ continue;
+
+ Assert(BufferIsValid(item->buffer));
+ spgFreeSearchItem(so, item);
+ }
+}
+
enum SpGistSpecialOffsetNumbers
{
SpGistBreakOffsetNumber = InvalidOffsetNumber,
@@ -761,6 +812,7 @@ static OffsetNumber
spgTestLeafTuple(SpGistScanOpaque so,
SpGistSearchItem *item,
Page page, OffsetNumber offset,
+ Buffer buffer,
bool isnull, bool isroot,
bool *reportedSome,
storeRes_func storeRes)
@@ -799,7 +851,7 @@ spgTestLeafTuple(SpGistScanOpaque so,
Assert(ItemPointerIsValid(&leafTuple->heapPtr));
- spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+ spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes, buffer);
return SGLT_GET_NEXTOFFSET(leafTuple);
}
@@ -835,7 +887,8 @@ redirect:
Assert(so->numberOfNonNullOrderBys > 0);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->buffer);
reportedSome = true;
}
else
@@ -873,7 +926,7 @@ redirect:
/* When root is a leaf, examine all its tuples */
for (offset = FirstOffsetNumber; offset <= max; offset++)
(void) spgTestLeafTuple(so, item, page, offset,
- isnull, true,
+ buffer, isnull, true,
&reportedSome, storeRes);
}
else
@@ -883,7 +936,7 @@ redirect:
{
Assert(offset >= FirstOffsetNumber && offset <= max);
offset = spgTestLeafTuple(so, item, page, offset,
- isnull, false,
+ buffer, isnull, false,
&reportedSome, storeRes);
if (offset == SpGistRedirectOffsetNumber)
goto redirect;
@@ -929,9 +982,9 @@ static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ Buffer pin)
{
- Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
so->ntids++;
}
@@ -954,10 +1007,9 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
/* storeRes subroutine for gettuple case */
static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
- Datum leafValue, bool isnull,
- SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr, Datum leafValue,
+ bool isnull, SpGistLeafTuple leafTuple, bool recheck,
+ bool recheckDistances, double *nonNullDistances, Buffer pin)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
@@ -1016,6 +1068,25 @@ storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
leafDatums,
leafIsnulls);
+
+ /*
+ * IOS: Make sure we hold one additional pin on the buffer
+ * of the tuple we are going to return.
+ *
+ * If the buffer changes, unpin the previous one first.
+ *
+ * For an ordered scan we may switch buffers almost at random,
+ * but in that case each item in the queue holds its own
+ * pin.
+ */
+ if (so->pagePin != pin)
+ {
+ if (BufferIsValid(so->pagePin))
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = pin;
+ if (BufferIsValid(so->pagePin))
+ IncrBufferRefCount(so->pagePin);
+ }
}
so->nPtrs++;
}
@@ -1065,6 +1136,13 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < so->nPtrs; i++)
pfree(so->reconTups[i]);
+
+ /* Unpin page of last returned tuple if any */
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
}
so->iPtr = so->nPtrs = 0;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..f02c270c5cc 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..c886e51e996 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -175,6 +175,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
+ Buffer buffer; /* buffer pinned for this leaf tuple (IOS-only) */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
@@ -226,6 +227,7 @@ typedef struct SpGistScanOpaqueData
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
+ Buffer pagePin; /* output tuple's pinned buffer, if IOS */
ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */
bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
@@ -488,6 +490,9 @@ typedef SpGistDeadTupleData *SpGistDeadTuple;
#define GBUF_REQ_LEAF(flags) (((flags) & GBUF_PARITY_MASK) == GBUF_LEAF)
#define GBUF_REQ_NULLS(flags) ((flags) & GBUF_NULLS)
+/* spgscan.c */
+void spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so);
+
/* spgutils.c */
/* reloption parameters */
--
2.43.0
v6-0002-Also-add-ability-to-set-LP_DEAD-bits-in-more-case.patch
From ff3b71dd141d37c5548ab70f0f5f6ab749876f99 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 4 Feb 2025 21:42:20 +0100
Subject: [PATCH v6 2/3] Also, add ability to set LP_DEAD bits in more cases of
IOS scans overs GIST.
---
src/backend/access/gist/gistget.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 3788a855c50..60c2d7a2531 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -35,7 +35,7 @@
* away and the TID was re-used by a completely different heap tuple.
*/
static void
-gistkillitems(IndexScanDesc scan)
+gistkillitems(IndexScanDesc scan, bool pagePinned)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
Buffer buffer;
@@ -60,9 +60,10 @@ gistkillitems(IndexScanDesc scan)
/*
* If page LSN differs it means that the page was modified since the last
* read. killedItems could be not valid so LP_DEAD hints applying is not
- * safe.
+ * safe. But if the page was kept pinned, it is safe, because VACUUM is
+ * unable to remove tuples due to the locking protocol.
*/
- if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+ if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
{
UnlockReleaseBuffer(buffer);
so->numKilled = 0; /* reset counter */
@@ -403,7 +404,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ * the pin for MVCC scans (except index-only scans), which allows vacuum
+ * to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
@@ -732,6 +734,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->xs_want_itup && so->nPageData > 0)
{
Assert(BufferIsValid(so->pagePin));
+ /*
+ * Kill items while the page is still pinned.
+ * so->numKilled is reset to 0 by the call, so the later
+ * gistkillitems() call for the same page is guaranteed to be skipped.
+ */
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan, BufferIsValid(so->pagePin));
ReleaseBuffer(so->pagePin);
so->pagePin = InvalidBuffer;
}
@@ -742,7 +751,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
GISTSearchItem *item;
if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
- gistkillitems(scan);
+ gistkillitems(scan, BufferIsValid(so->pagePin));
item = getNextGISTSearchItem(so);
--
2.43.0
v6-0001-This-should-fix-issues-with-incorrect-results-whe.patch
From 8811fc2123a25aa9497ccd2028a97003d3a89008 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 4 Feb 2025 21:34:46 +0100
Subject: [PATCH v6 1/3] This should fix issues with incorrect results when a
GIST IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/backend/access/gist/README | 16 ++++
src/backend/access/gist/gistget.c | 32 ++++++++
src/backend/access/gist/gistscan.c | 115 ++++++++++++++++++++++++---
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 2 +
5 files changed, 158 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 8015ff19f05..c7c2afad088 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -287,6 +287,22 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
+Index-only scans and VACUUM
+---------------------------
+
+Index-only scans require that any tuple returned by the index scan has not
+been removed from the index by a call to ambulkdelete through VACUUM.
+To ensure this invariant, bulkdelete now requires a buffer cleanup lock, and
+every Index-only scan (IOS) will keep a pin on each page that it is returning
+tuples from. For ordered scans, we keep one pin for each matching leaf tuple;
+for unordered scans we just keep an additional pin while we're still working
+on the page's tuples. This ensures that pages seen by the scan won't be
+cleaned up until after the tuples have been returned.
+
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages with similar
+ordering; future work can figure out how best to address that.
+
Buffering build algorithm
-------------------------
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index cc40e928e0a..3788a855c50 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -395,6 +395,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
so->nPageData = so->curPageData = 0;
+ Assert(so->pagePin == InvalidBuffer);
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -460,6 +461,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
so->pageData[so->nPageData].offnum = i;
+ so->pageData[so->nPageData].buffer = InvalidBuffer;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -471,6 +473,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ /*
+ * Only maintain a single additional buffer pin for unordered
+ * IOS scans; as we have all data already in one place.
+ */
+ if (so->nPageData == 0)
+ {
+ so->pagePin = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
so->nPageData++;
}
@@ -501,7 +513,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ item->data.heap.buffer = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
else
{
@@ -567,6 +583,10 @@ getNextNearest(IndexScanDesc scan)
/* free previously returned tuple */
pfree(scan->xs_hitup);
scan->xs_hitup = NULL;
+
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
}
do
@@ -588,7 +608,11 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
scan->xs_hitup = item->data.heap.recontup;
+ so->pagePin = item->data.heap.buffer;
+ }
res = true;
}
else
@@ -704,6 +728,14 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
}
+
+ if (scan->xs_want_itup && so->nPageData > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
/* find and process the next index page */
do
{
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..932c2271510 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -110,6 +110,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pagePin = InvalidBuffer;
scan->opaque = so;
@@ -151,18 +152,73 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
Assert(so->queueCxt == so->giststate->scanCxt);
first_time = true;
}
- else if (so->queueCxt == so->giststate->scanCxt)
- {
- /* second time through */
- so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST queue context",
- ALLOCSET_DEFAULT_SIZES);
- first_time = false;
- }
else
{
- /* third or later time through */
- MemoryContextReset(so->queueCxt);
+ /*
+ * In the first scan of a query we allocate IOS items in the scan
+ * context, which is never reset. To not leak this memory, we
+ * manually free the queue entries.
+ */
+ const bool freequeue = so->queueCxt == so->giststate->scanCxt;
+ /*
+ * Index-only scans require that vacuum can't clean up entries that
+ * we're still planning to return, so we hold a pin on the buffer until
+ * we're past the returned item (1 pin count for every index tuple).
+ * When rescan is called, however, we need to clean up the pins that
+ * we still hold, lest we leak them and lose a buffer entry to that
+ * page.
+ */
+ const bool unpinqueue = scan->xs_want_itup;
+
+ if (freequeue || unpinqueue)
+ {
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ /*
+ * If we need to unpin a buffer for IOS' heap items, do so
+ * now.
+ */
+ if (unpinqueue && item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+
+ /*
+ * item->data.heap.recontup is stored in the separate memory
+ * context so->pageDataCxt, which is always reset; so we don't
+ * need to free that.
+ * "item" itself is allocated into the queue context, which is
+ * generally reset in rescan.
+ * However, only in the first scan, we allocate these items
+ * into the main scan context, which isn't reset; so we must
+ * free these items, or else we'd leak the memory for the
+ * duration of the query.
+ */
+ if (freequeue)
+ pfree(item);
+ }
+ }
+
+ if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_SIZES);
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ }
+
first_time = false;
}
@@ -341,6 +397,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
/* any previous xs_hitup will have been pfree'd in context resets above */
scan->xs_hitup = NULL;
+
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+ }
}
void
@@ -348,6 +413,36 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ /* unpin any leftover buffers */
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ /*
+ * Note: unlike gistrescan, there is no need to actually free the
+ * items here, as that's handled by memory context reset in the
+ * call to freeGISTstate() below.
+ */
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ if (item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+ }
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..840b3d586ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..e559117e7d7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -124,6 +124,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ Buffer buffer; /* buffer to unpin, when IOS */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -176,6 +177,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ Buffer pagePin; /* buffer of page, if pinned */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.43.0
Oops, I missed one commit - fixed (the logic related to LP_DEAD in GiST
has been extracted into a separate commit).
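The reasoning behind that separate commit, as a condensed sketch (the function name here is invented; the fields and buffer calls are the ones gistkillitems() already uses, and the safety claim in the comment is the patch's, not a new one): if the backend has kept the page pinned since reading it, VACUUM cannot have taken a cleanup lock and removed tuples in the meantime, so the page-LSN check is only needed when the pin was dropped:

    /* hypothetical condensation of the gistkillitems() change */
    static void
    kill_items_sketch(IndexScanDesc scan, bool pagePinned)
    {
        GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
        Buffer      buffer = ReadBuffer(scan->indexRelation, so->curBlkno);

        LockBuffer(buffer, GIST_SHARE);

        /*
         * Without a continuously held pin, a changed LSN means the remembered
         * offsets may be stale, so give up.  With the pin held the whole time,
         * the patch treats the offsets as still valid, because VACUUM could
         * not have removed tuples from the page.
         */
        if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
        {
            UnlockReleaseBuffer(buffer);
            so->numKilled = 0;
            return;
        }

        /* ... mark the offsets in so->killedItems[] LP_DEAD, as in the patch ... */

        UnlockReleaseBuffer(buffer);
        so->numKilled = 0;
    }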
Also, the commitfest entry is here: https://commitfest.postgresql.org/52/5542/
Attachments:
v7-0004-This-should-fix-issues-with-incorrect-results-whe.patch
From 0c451d19d33a2d640acad3160f8794cfaa8763d2 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Wed, 5 Feb 2025 10:59:37 +0100
Subject: [PATCH v7 4/4] This should fix issues with incorrect results when a
SP-GIST IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/backend/access/spgist/spgscan.c | 108 ++++++++++++++++++++++----
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 5 ++
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 53f910e9d89..72ea14dfe2c 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ Buffer pin);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -95,6 +96,11 @@ spgFreeSearchItem(SpGistScanOpaque so, SpGistSearchItem *item)
if (item->traversalValue)
pfree(item->traversalValue);
+ if (so->want_itup && item->isLeaf)
+ {
+ Assert(BufferIsValid(item->buffer));
+ ReleaseBuffer(item->buffer);
+ }
pfree(item);
}
@@ -142,6 +148,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->buffer = InvalidBuffer;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -416,6 +423,9 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* preprocess scankeys, set up the representation in *so */
spgPrepareScanKeys(scan);
+ /* release any pinned buffers from earlier rescans */
+ spgScanEndDropAllPagePins(scan, so);
+
/* set up starting queue entries */
resetSpGistScanOpaque(so);
@@ -428,6 +438,12 @@ spgendscan(IndexScanDesc scan)
{
SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ /*
+ * release any pinned buffers from earlier rescans, before we drop their
+ * data by dropping the memory contexts.
+ */
+ spgScanEndDropAllPagePins(scan, so);
+
MemoryContextDelete(so->tempCxt);
MemoryContextDelete(so->traversalCxt);
@@ -460,7 +476,7 @@ spgendscan(IndexScanDesc scan)
static SpGistSearchItem *
spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
Datum leafValue, bool recheck, bool recheckDistances,
- bool isnull, double *distances)
+ bool isnull, double *distances, Buffer addPin)
{
SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
@@ -479,6 +495,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
datumCopy(leafValue, so->state.attType.attbyval,
so->state.attType.attlen);
+ Assert(BufferIsValid(addPin));
+ IncrBufferRefCount(addPin);
+ item->buffer = addPin;
+
/*
* If we're going to need to reconstruct INCLUDE attributes, store the
* whole leaf tuple so we can get the INCLUDE attributes out of it.
@@ -495,6 +515,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
{
item->value = (Datum) 0;
item->leafTuple = NULL;
+ item->buffer = InvalidBuffer;
}
item->traversalValue = NULL;
item->isLeaf = true;
@@ -513,7 +534,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
static bool
spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
SpGistLeafTuple leafTuple, bool isnull,
- bool *reportedSome, storeRes_func storeRes)
+ bool *reportedSome, storeRes_func storeRes, Buffer buffer)
{
Datum leafValue;
double *distances;
@@ -580,7 +601,8 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
recheck,
recheckDistances,
isnull,
- distances);
+ distances,
+ buffer);
spgAddSearchItemToQueue(so, heapItem);
@@ -591,7 +613,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, buffer);
*reportedSome = true;
}
}
@@ -750,6 +772,35 @@ spgGetNextQueueItem(SpGistScanOpaque so)
return (SpGistSearchItem *) pairingheap_remove_first(so->scanQueue);
}
+/*
+ * Note: This removes all items from the pairingheap.
+ */
+void
+spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so)
+{
+ /* Guaranteed no pinned pages */
+ if (so->scanQueue == NULL || !scan->xs_want_itup)
+ return;
+
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ while (!pairingheap_is_empty(so->scanQueue))
+ {
+ SpGistSearchItem *item;
+
+ item = spgGetNextQueueItem(so);
+ if (!item->isLeaf)
+ continue;
+
+ Assert(BufferIsValid(item->buffer));
+ spgFreeSearchItem(so, item);
+ }
+}
+
enum SpGistSpecialOffsetNumbers
{
SpGistBreakOffsetNumber = InvalidOffsetNumber,
@@ -761,6 +812,7 @@ static OffsetNumber
spgTestLeafTuple(SpGistScanOpaque so,
SpGistSearchItem *item,
Page page, OffsetNumber offset,
+ Buffer buffer,
bool isnull, bool isroot,
bool *reportedSome,
storeRes_func storeRes)
@@ -799,7 +851,7 @@ spgTestLeafTuple(SpGistScanOpaque so,
Assert(ItemPointerIsValid(&leafTuple->heapPtr));
- spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+ spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes, buffer);
return SGLT_GET_NEXTOFFSET(leafTuple);
}
@@ -835,7 +887,8 @@ redirect:
Assert(so->numberOfNonNullOrderBys > 0);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->buffer);
reportedSome = true;
}
else
@@ -873,7 +926,7 @@ redirect:
/* When root is a leaf, examine all its tuples */
for (offset = FirstOffsetNumber; offset <= max; offset++)
(void) spgTestLeafTuple(so, item, page, offset,
- isnull, true,
+ buffer, isnull, true,
&reportedSome, storeRes);
}
else
@@ -883,7 +936,7 @@ redirect:
{
Assert(offset >= FirstOffsetNumber && offset <= max);
offset = spgTestLeafTuple(so, item, page, offset,
- isnull, false,
+ buffer, isnull, false,
&reportedSome, storeRes);
if (offset == SpGistRedirectOffsetNumber)
goto redirect;
@@ -929,9 +982,9 @@ static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ Buffer pin)
{
- Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
so->ntids++;
}
@@ -954,10 +1007,9 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
/* storeRes subroutine for gettuple case */
static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
- Datum leafValue, bool isnull,
- SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr, Datum leafValue,
+ bool isnull, SpGistLeafTuple leafTuple, bool recheck,
+ bool recheckDistances, double *nonNullDistances, Buffer pin)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
@@ -1016,6 +1068,25 @@ storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
leafDatums,
leafIsnulls);
+
+ /*
+ * IOS: Make sure we hold one additional pin on the buffer
+ * of the tuple we are going to return.
+ *
+ * If the buffer changes, unpin the previous one first.
+ *
+ * For an ordered scan we may switch buffers almost at random,
+ * but in that case each item in the queue holds its own
+ * pin.
+ */
+ if (so->pagePin != pin)
+ {
+ if (BufferIsValid(so->pagePin))
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = pin;
+ if (BufferIsValid(so->pagePin))
+ IncrBufferRefCount(so->pagePin);
+ }
}
so->nPtrs++;
}
@@ -1065,6 +1136,13 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < so->nPtrs; i++)
pfree(so->reconTups[i]);
+
+ /* Unpin page of last returned tuple if any */
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
}
so->iPtr = so->nPtrs = 0;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..f02c270c5cc 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..c886e51e996 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -175,6 +175,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
+ Buffer buffer; /* buffer pinned for this leaf tuple (IOS-only) */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
@@ -226,6 +227,7 @@ typedef struct SpGistScanOpaqueData
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
+ Buffer pagePin; /* output tuple's pinned buffer, if IOS */
ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */
bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
@@ -488,6 +490,9 @@ typedef SpGistDeadTupleData *SpGistDeadTuple;
#define GBUF_REQ_LEAF(flags) (((flags) & GBUF_PARITY_MASK) == GBUF_LEAF)
#define GBUF_REQ_NULLS(flags) ((flags) & GBUF_NULLS)
+/* spgscan.c */
+void spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so);
+
/* spgutils.c */
/* reloption parameters */
--
2.43.0
v7-0002-This-should-fix-issues-with-incorrect-results-whe.patch
From 8811fc2123a25aa9497ccd2028a97003d3a89008 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 4 Feb 2025 21:34:46 +0100
Subject: [PATCH v7 2/4] This should fix issues with incorrect results when a
GIST IOS encounters tuples removed from pages by a concurrent vacuum
operation.
---
src/backend/access/gist/README | 16 ++++
src/backend/access/gist/gistget.c | 32 ++++++++
src/backend/access/gist/gistscan.c | 115 ++++++++++++++++++++++++---
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 2 +
5 files changed, 158 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 8015ff19f05..c7c2afad088 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -287,6 +287,22 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
+Index-only scans and VACUUM
+---------------------------
+
+Index-only scans require that any tuple returned by the index scan has not
+been removed from the index by a call to ambulkdelete through VACUUM.
+To ensure this invariant, bulkdelete now requires a buffer cleanup lock, and
+every Index-only scan (IOS) will keep a pin on each page that it is returning
+tuples from. For ordered scans, we keep one pin for each matching leaf tuple;
+for unordered scans we just keep an additional pin while we're still working
+on the page's tuples. This ensures that pages seen by the scan won't be
+cleaned up until after the tuples have been returned.
+
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages with similar
+ordering; future work can figure out how best to address that.
+
Buffering build algorithm
-------------------------
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index cc40e928e0a..3788a855c50 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -395,6 +395,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
so->nPageData = so->curPageData = 0;
+ Assert(so->pagePin == InvalidBuffer);
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -460,6 +461,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
so->pageData[so->nPageData].offnum = i;
+ so->pageData[so->nPageData].buffer = InvalidBuffer;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -471,6 +473,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ /*
+ * Only maintain a single additional buffer pin for unordered
+ * IOS scans; as we have all data already in one place.
+ */
+ if (so->nPageData == 0)
+ {
+ so->pagePin = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
so->nPageData++;
}
@@ -501,7 +513,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ item->data.heap.buffer = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
else
{
@@ -567,6 +583,10 @@ getNextNearest(IndexScanDesc scan)
/* free previously returned tuple */
pfree(scan->xs_hitup);
scan->xs_hitup = NULL;
+
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
}
do
@@ -588,7 +608,11 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
scan->xs_hitup = item->data.heap.recontup;
+ so->pagePin = item->data.heap.buffer;
+ }
res = true;
}
else
@@ -704,6 +728,14 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
}
+
+ if (scan->xs_want_itup && so->nPageData > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
/* find and process the next index page */
do
{
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..932c2271510 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -110,6 +110,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pagePin = InvalidBuffer;
scan->opaque = so;
@@ -151,18 +152,73 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
Assert(so->queueCxt == so->giststate->scanCxt);
first_time = true;
}
- else if (so->queueCxt == so->giststate->scanCxt)
- {
- /* second time through */
- so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST queue context",
- ALLOCSET_DEFAULT_SIZES);
- first_time = false;
- }
else
{
- /* third or later time through */
- MemoryContextReset(so->queueCxt);
+ /*
+ * In the first scan of a query we allocate IOS items in the scan
+ * context, which is never reset. To not leak this memory, we
+ * manually free the queue entries.
+ */
+ const bool freequeue = so->queueCxt == so->giststate->scanCxt;
+ /*
+ * Index-only scans require that vacuum can't clean up entries that
+ * we're still planning to return, so we hold a pin on the buffer until
+ * we're past the returned item (1 pin count for every index tuple).
+ * When rescan is called, however, we need to clean up the pins that
+ * we still hold, lest we leak them and lose a buffer entry to that
+ * page.
+ */
+ const bool unpinqueue = scan->xs_want_itup;
+
+ if (freequeue || unpinqueue)
+ {
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ /*
+ * If we need to unpin a buffer for IOS' heap items, do so
+ * now.
+ */
+ if (unpinqueue && item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+
+ /*
+ * item->data.heap.recontup is stored in the separate memory
+ * context so->pageDataCxt, which is always reset; so we don't
+ * need to free that.
+ * "item" itself is allocated into the queue context, which is
+ * generally reset in rescan.
+ * However, only in the first scan, we allocate these items
+ * into the main scan context, which isn't reset; so we must
+ * free these items, or else we'd leak the memory for the
+ * duration of the query.
+ */
+ if (freequeue)
+ pfree(item);
+ }
+ }
+
+ if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_SIZES);
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ }
+
first_time = false;
}
@@ -341,6 +397,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
/* any previous xs_hitup will have been pfree'd in context resets above */
scan->xs_hitup = NULL;
+
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+ }
}
void
@@ -348,6 +413,36 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ /* unpin any leftover buffers */
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ /*
+ * Note: unlike gistrescan, there is no need to actually free the
+ * items here, as that's handled by memory context reset in the
+ * call to freeGISTstate() below.
+ */
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ if (item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+ }
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..840b3d586ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..e559117e7d7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -124,6 +124,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ Buffer buffer; /* buffer to unpin, when IOS */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -176,6 +177,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ Buffer pagePin; /* buffer of page, if pinned */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.43.0
v7-0003-Also-add-ability-to-set-LP_DEAD-bits-in-more-case.patch
From ff3b71dd141d37c5548ab70f0f5f6ab749876f99 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 4 Feb 2025 21:42:20 +0100
Subject: [PATCH v7 3/4] Also, add ability to set LP_DEAD bits in more cases of
IOS scans overs GIST.
---
src/backend/access/gist/gistget.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 3788a855c50..60c2d7a2531 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -35,7 +35,7 @@
* away and the TID was re-used by a completely different heap tuple.
*/
static void
-gistkillitems(IndexScanDesc scan)
+gistkillitems(IndexScanDesc scan, bool pagePinned)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
Buffer buffer;
@@ -60,9 +60,10 @@ gistkillitems(IndexScanDesc scan)
/*
* If page LSN differs it means that the page was modified since the last
* read. killedItems could be not valid so LP_DEAD hints applying is not
- * safe.
+ * safe. But if the page was kept pinned, it is safe, because VACUUM is
+ * unable to remove tuples due to the locking protocol.
*/
- if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+ if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
{
UnlockReleaseBuffer(buffer);
so->numKilled = 0; /* reset counter */
@@ -403,7 +404,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ * the pin for MVCC scans (except index-only scans), which allows vacuum
+ * to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
@@ -732,6 +734,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->xs_want_itup && so->nPageData > 0)
{
Assert(BufferIsValid(so->pagePin));
+ /*
+ * Kill items while the page is still pinned.
+ * so->numKilled is reset to 0 by the call, so the later
+ * gistkillitems() call for the same page is guaranteed to be skipped.
+ */
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan, BufferIsValid(so->pagePin));
ReleaseBuffer(so->pagePin);
so->pagePin = InvalidBuffer;
}
@@ -742,7 +751,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
GISTSearchItem *item;
if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
- gistkillitems(scan);
+ gistkillitems(scan, BufferIsValid(so->pagePin));
item = getNextGISTSearchItem(so);
--
2.43.0
v7-0001-isolation-tester-showing-broken-index-only-scans-.patch
From 36fbbeb6047283a7f4d78cd717aa671ca279a85a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 3 Feb 2025 21:17:06 +0100
Subject: [PATCH v7 1/4] isolation tester showing broken index-only scans with
GiST and SP-GiST
---
.../expected/index-only-scan-gist-vacuum.out | 67 +++++++++++
.../index-only-scan-spgist-vacuum.out | 67 +++++++++++
src/test/isolation/isolation_schedule | 2 +
.../specs/index-only-scan-gist-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 113 ++++++++++++++++++
5 files changed, 362 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..9720c9a2dc8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,8 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..b1688d44fa7
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..b414c5d1695
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with SPGiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.43.0
Hello, everyone!
Just some commit messages + few cleanups.
Best regards,
Mikhail.
Attachments:
v8-0002-Fix-index-only-scan-race-condition-in-GiST-implem.patch
From 6211b0e943ec317c163212f856cf4acaf638cd5d Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 7 Feb 2025 21:41:50 +0100
Subject: [PATCH v8 2/4] Fix index-only scan race condition in GiST
implementation
Prevent incorrect results in index-only scans caused by concurrent VACUUM.
1. Use LockBufferForCleanup instead of LockBuffer(GIST_EXCLUSIVE) in gistvacuum.c, so VACUUM takes a cleanup lock on the pages it processes
2. Add buffer pinning so that index pages remain pinned in accordance with the new locking protocol
3. Update documentation in gist/README to explain the new locking scheme
and note potential buffer exhaustion concerns with ordered scans.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>,
Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
src/backend/access/gist/README | 16 ++++
src/backend/access/gist/gistget.c | 32 ++++++++
src/backend/access/gist/gistscan.c | 115 ++++++++++++++++++++++++---
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 2 +
5 files changed, 158 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 8015ff19f05..c7c2afad088 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -287,6 +287,22 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
+Index-only scans and VACUUM
+---------------------------
+
+Index-only scans require that any tuple returned by the index scan has not
+been removed from the index by a call to ambulkdelete through VACUUM.
+To ensure this invariant, bulkdelete now requires a buffer cleanup lock, and
+every Index-only scan (IOS) will keep a pin on each page that it is returning
+tuples from. For ordered scans, we keep one pin for each matching leaf tuple,
+for unordered scans we just keep an additional pin while we're still working
+on the page's tuples. This ensures that pages seen by the scan won't be
+cleaned up until after the tuples have been returned.
+
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.
+
Buffering build algorithm
-------------------------
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index cc40e928e0a..3788a855c50 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -395,6 +395,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
so->nPageData = so->curPageData = 0;
+ Assert(so->pagePin == InvalidBuffer);
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -460,6 +461,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
so->pageData[so->nPageData].offnum = i;
+ so->pageData[so->nPageData].buffer = InvalidBuffer;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -471,6 +473,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
so->pageData[so->nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
+
+ /*
+ * Only maintain a single additional buffer pin for unordered
+ * IOS scans, since all the data is already in one place.
+ */
+ if (so->nPageData == 0)
+ {
+ so->pagePin = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
so->nPageData++;
}
@@ -501,7 +513,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ item->data.heap.buffer = buffer;
+ IncrBufferRefCount(buffer);
+ }
}
else
{
@@ -567,6 +583,10 @@ getNextNearest(IndexScanDesc scan)
/* free previously returned tuple */
pfree(scan->xs_hitup);
scan->xs_hitup = NULL;
+
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
}
do
@@ -588,7 +608,11 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
scan->xs_hitup = item->data.heap.recontup;
+ so->pagePin = item->data.heap.buffer;
+ }
res = true;
}
else
@@ -704,6 +728,14 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
}
+
+ if (scan->xs_want_itup && so->nPageData > 0)
+ {
+ Assert(BufferIsValid(so->pagePin));
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
/* find and process the next index page */
do
{
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..932c2271510 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -110,6 +110,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ so->pagePin = InvalidBuffer;
scan->opaque = so;
@@ -151,18 +152,73 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
Assert(so->queueCxt == so->giststate->scanCxt);
first_time = true;
}
- else if (so->queueCxt == so->giststate->scanCxt)
- {
- /* second time through */
- so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST queue context",
- ALLOCSET_DEFAULT_SIZES);
- first_time = false;
- }
else
{
- /* third or later time through */
- MemoryContextReset(so->queueCxt);
+ /*
+ * In the first scan of a query we allocate IOS items in the scan
+ * context, which is never reset. To not leak this memory, we
+ * manually free the queue entries.
+ */
+ const bool freequeue = so->queueCxt == so->giststate->scanCxt;
+ /*
+ * Index-only scans require that vacuum can't clean up entries that
+ * we're still planning to return, so we hold a pin on the buffer until
+ * we're past the returned item (1 pin count for every index tuple).
+ * When rescan is called, however, we need to clean up the pins that
+ * we still hold, lest we leak them and lose a buffer entry to that
+ * page.
+ */
+ const bool unpinqueue = scan->xs_want_itup;
+
+ if (freequeue || unpinqueue)
+ {
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ /*
+ * If we need to unpin a buffer for IOS' heap items, do so
+ * now.
+ */
+ if (unpinqueue && item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+
+ /*
+ * item->data.heap.recontup is stored in the separate memory
+ * context so->pageDataCxt, which is always reset; so we don't
+ * need to free that.
+ * "item" itself is allocated into the queue context, which is
+ * generally reset in rescan.
+ * However, only in the first scan, we allocate these items
+ * into the main scan context, which isn't reset; so we must
+ * free these items, or else we'd leak the memory for the
+ * duration of the query.
+ */
+ if (freequeue)
+ pfree(item);
+ }
+ }
+
+ if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_SIZES);
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ }
+
first_time = false;
}
@@ -341,6 +397,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
/* any previous xs_hitup will have been pfree'd in context resets above */
scan->xs_hitup = NULL;
+
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+ }
}
void
@@ -348,6 +413,36 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (scan->xs_want_itup)
+ {
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ /* unpin any leftover buffers */
+ while (!pairingheap_is_empty(so->queue))
+ {
+ pairingheap_node *node;
+ GISTSearchItem *item;
+
+ /*
+ * Note: unlike gistrescan, there is no need to actually free the
+ * items here, as that's handled by memory context reset in the
+ * call to freeGISTstate() below.
+ */
+ node = pairingheap_remove_first(so->queue);
+ item = pairingheap_container(GISTSearchItem, phNode, node);
+
+ if (item->blkno == InvalidBlockNumber)
+ {
+ Assert(BufferIsValid(item->data.heap.buffer));
+ ReleaseBuffer(item->data.heap.buffer);
+ }
+ }
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..840b3d586ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..e559117e7d7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -124,6 +124,7 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ Buffer buffer; /* buffer to unpin, when IOS */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -176,6 +177,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ Buffer pagePin; /* buffer of page, if pinned */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.43.0
v8-0003-Improve-buffer-handling-for-killed-items-in-GiST-.patch
From 80c5c821dabb2123a3fe7d6c3ecddae3c31c621d Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 7 Feb 2025 21:43:37 +0100
Subject: [PATCH v8 3/4] Improve buffer handling for killed items in GiST
index-only scans
Modify gistkillitems() to accept a pagePinned parameter, allowing it to
safely apply LP_DEAD hints even when the page LSN has changed, if the
page is still pinned.
Author: Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
src/backend/access/gist/gistget.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 3788a855c50..60c2d7a2531 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -35,7 +35,7 @@
* away and the TID was re-used by a completely different heap tuple.
*/
static void
-gistkillitems(IndexScanDesc scan)
+gistkillitems(IndexScanDesc scan, bool pagePinned)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
Buffer buffer;
@@ -60,9 +60,10 @@ gistkillitems(IndexScanDesc scan)
/*
* If page LSN differs it means that the page was modified since the last
* read. killedItems could be not valid so LP_DEAD hints applying is not
- * safe.
+ * safe. But if the page is still pinned, it is safe, because VACUUM
+ * cannot have removed any tuples under the locking protocol.
*/
- if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+ if (!pagePinned && BufferGetLSNAtomic(buffer) != so->curPageLSN)
{
UnlockReleaseBuffer(buffer);
so->numKilled = 0; /* reset counter */
@@ -403,7 +404,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ * the pin for MVCC scans (except index-only scans), which allows vacuum
+ * to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
@@ -732,6 +734,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->xs_want_itup && so->nPageData > 0)
{
Assert(BufferIsValid(so->pagePin));
+ /*
+ * Kill items while the page is still pinned.
+ * so->numKilled is reset to 0 by the call, so the other call
+ * for the same page is guaranteed to be skipped.
+ */
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan, BufferIsValid(so->pagePin));
ReleaseBuffer(so->pagePin);
so->pagePin = InvalidBuffer;
}
@@ -742,7 +751,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
GISTSearchItem *item;
if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
- gistkillitems(scan);
+ gistkillitems(scan, BufferIsValid(so->pagePin));
item = getNextGISTSearchItem(so);
--
2.43.0
v8-0004-Fix-index-only-scan-race-condition-in-SP-GiST-imp.patch
From c4c72f2eac60ace32d0c6084625f1ac3cfd949b8 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 7 Feb 2025 21:46:01 +0100
Subject: [PATCH v8 4/4] Fix index-only scan race condition in SP-GiST
implementation
Apply the buffer management improvements previously made to GiST index-only scans to SP-GiST as well.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>, Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
src/backend/access/spgist/spgscan.c | 108 ++++++++++++++++++++++----
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 5 ++
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 53f910e9d89..72ea14dfe2c 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ Buffer pin);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -95,6 +96,11 @@ spgFreeSearchItem(SpGistScanOpaque so, SpGistSearchItem *item)
if (item->traversalValue)
pfree(item->traversalValue);
+ if (so->want_itup && item->isLeaf)
+ {
+ Assert(BufferIsValid(item->buffer));
+ ReleaseBuffer(item->buffer);
+ }
pfree(item);
}
@@ -142,6 +148,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->buffer = InvalidBuffer;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -416,6 +423,9 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* preprocess scankeys, set up the representation in *so */
spgPrepareScanKeys(scan);
+ /* release any pinned buffers from earlier rescans */
+ spgScanEndDropAllPagePins(scan, so);
+
/* set up starting queue entries */
resetSpGistScanOpaque(so);
@@ -428,6 +438,12 @@ spgendscan(IndexScanDesc scan)
{
SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ /*
+ * release any pinned buffers from earlier rescans, before we drop their
+ * data by dropping the memory contexts.
+ */
+ spgScanEndDropAllPagePins(scan, so);
+
MemoryContextDelete(so->tempCxt);
MemoryContextDelete(so->traversalCxt);
@@ -460,7 +476,7 @@ spgendscan(IndexScanDesc scan)
static SpGistSearchItem *
spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
Datum leafValue, bool recheck, bool recheckDistances,
- bool isnull, double *distances)
+ bool isnull, double *distances, Buffer addPin)
{
SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
@@ -479,6 +495,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
datumCopy(leafValue, so->state.attType.attbyval,
so->state.attType.attlen);
+ Assert(BufferIsValid(addPin));
+ IncrBufferRefCount(addPin);
+ item->buffer = addPin;
+
/*
* If we're going to need to reconstruct INCLUDE attributes, store the
* whole leaf tuple so we can get the INCLUDE attributes out of it.
@@ -495,6 +515,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
{
item->value = (Datum) 0;
item->leafTuple = NULL;
+ item->buffer = InvalidBuffer;
}
item->traversalValue = NULL;
item->isLeaf = true;
@@ -513,7 +534,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
static bool
spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
SpGistLeafTuple leafTuple, bool isnull,
- bool *reportedSome, storeRes_func storeRes)
+ bool *reportedSome, storeRes_func storeRes, Buffer buffer)
{
Datum leafValue;
double *distances;
@@ -580,7 +601,8 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
recheck,
recheckDistances,
isnull,
- distances);
+ distances,
+ buffer);
spgAddSearchItemToQueue(so, heapItem);
@@ -591,7 +613,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, buffer);
*reportedSome = true;
}
}
@@ -750,6 +772,35 @@ spgGetNextQueueItem(SpGistScanOpaque so)
return (SpGistSearchItem *) pairingheap_remove_first(so->scanQueue);
}
+/*
+ * Note: This removes all items from the pairingheap.
+ */
+void
+spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so)
+{
+ /* Guaranteed no pinned pages */
+ if (so->scanQueue == NULL || !scan->xs_want_itup)
+ return;
+
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
+
+ while (!pairingheap_is_empty(so->scanQueue))
+ {
+ SpGistSearchItem *item;
+
+ item = spgGetNextQueueItem(so);
+ if (!item->isLeaf)
+ continue;
+
+ Assert(BufferIsValid(item->buffer));
+ spgFreeSearchItem(so, item);
+ }
+}
+
enum SpGistSpecialOffsetNumbers
{
SpGistBreakOffsetNumber = InvalidOffsetNumber,
@@ -761,6 +812,7 @@ static OffsetNumber
spgTestLeafTuple(SpGistScanOpaque so,
SpGistSearchItem *item,
Page page, OffsetNumber offset,
+ Buffer buffer,
bool isnull, bool isroot,
bool *reportedSome,
storeRes_func storeRes)
@@ -799,7 +851,7 @@ spgTestLeafTuple(SpGistScanOpaque so,
Assert(ItemPointerIsValid(&leafTuple->heapPtr));
- spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+ spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes, buffer);
return SGLT_GET_NEXTOFFSET(leafTuple);
}
@@ -835,7 +887,8 @@ redirect:
Assert(so->numberOfNonNullOrderBys > 0);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->buffer);
reportedSome = true;
}
else
@@ -873,7 +926,7 @@ redirect:
/* When root is a leaf, examine all its tuples */
for (offset = FirstOffsetNumber; offset <= max; offset++)
(void) spgTestLeafTuple(so, item, page, offset,
- isnull, true,
+ buffer, isnull, true,
&reportedSome, storeRes);
}
else
@@ -883,7 +936,7 @@ redirect:
{
Assert(offset >= FirstOffsetNumber && offset <= max);
offset = spgTestLeafTuple(so, item, page, offset,
- isnull, false,
+ buffer, isnull, false,
&reportedSome, storeRes);
if (offset == SpGistRedirectOffsetNumber)
goto redirect;
@@ -929,9 +982,9 @@ static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ Buffer pin)
{
- Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
so->ntids++;
}
@@ -954,10 +1007,9 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
/* storeRes subroutine for gettuple case */
static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
- Datum leafValue, bool isnull,
- SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr, Datum leafValue,
+ bool isnull, SpGistLeafTuple leafTuple, bool recheck,
+ bool recheckDistances, double *nonNullDistances, Buffer pin)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
@@ -1016,6 +1068,25 @@ storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
leafDatums,
leafIsnulls);
+
+ /*
+ * IOS: Make sure we hold one additional pin on the buffer
+ * containing the tuple we are about to return.
+ *
+ * If the buffer changes, unpin the previous one.
+ *
+ * For an ordered scan we may switch buffers almost at random,
+ * but in that case each item in the queue holds its own
+ * pin.
+ */
+ if (so->pagePin != pin)
+ {
+ if (BufferIsValid(so->pagePin))
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = pin;
+ if (BufferIsValid(so->pagePin))
+ IncrBufferRefCount(so->pagePin);
+ }
}
so->nPtrs++;
}
@@ -1065,6 +1136,13 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < so->nPtrs; i++)
pfree(so->reconTups[i]);
+
+ /* Unpin page of last returned tuple if any */
+ if (BufferIsValid(so->pagePin))
+ {
+ ReleaseBuffer(so->pagePin);
+ so->pagePin = InvalidBuffer;
+ }
}
so->iPtr = so->nPtrs = 0;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..f02c270c5cc 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..c886e51e996 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -175,6 +175,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
+ Buffer buffer; /* buffer pinned for this leaf tuple (IOS-only) */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
@@ -226,6 +227,7 @@ typedef struct SpGistScanOpaqueData
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
+ Buffer pagePin; /* output tuple's pinned buffer, if IOS */
ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */
bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
@@ -488,6 +490,9 @@ typedef SpGistDeadTupleData *SpGistDeadTuple;
#define GBUF_REQ_LEAF(flags) (((flags) & GBUF_PARITY_MASK) == GBUF_LEAF)
#define GBUF_REQ_NULLS(flags) ((flags) & GBUF_NULLS)
+/* spgscan.c */
+void spgScanEndDropAllPagePins(IndexScanDesc scan, SpGistScanOpaque so);
+
/* spgutils.c */
/* reloption parameters */
--
2.43.0
v8-0001-Tests-for-index-only-scan-race-condition-with-con.patch
From f9db2dc907d8b25e81406a7b1f814b8f54d73714 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 7 Feb 2025 21:38:51 +0100
Subject: [PATCH v8 1/4] Tests for index-only scan race condition with
concurrent VACUUM in GiST/SP-GiST
Add regression tests that demonstrate wrong results can occur with index-only
scans in GiST and SP-GiST indexes when encountering tuples removed by a
concurrent VACUUM operation. The issue occurs because these index types don't
acquire the proper cleanup lock on index buffers during VACUUM, unlike btree
indexes.
Author: Peter Geoghegan <pg@bowt.ie>, Matthias van de Meent <boekewurm+postgres@gmail.com>, Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-gist-vacuum.out | 67 +++++++++++
.../index-only-scan-spgist-vacuum.out | 67 +++++++++++
src/test/isolation/isolation_schedule | 2 +
.../specs/index-only-scan-gist-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 113 ++++++++++++++++++
5 files changed, 362 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..9720c9a2dc8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,8 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..b1688d44fa7
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..b414c5d1695
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with SPGiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.43.0
On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Just some commit messages + few cleanups.

I'm worried about this:

> +These longer pin lifetimes can cause buffer exhaustion with messages like "no
> +unpinned buffers available" when the index has many pages that have similar
> +ordering; but future work can figure out how to best work that out.
I think that we should have some kind of upper bound on the number of
pins that can be acquired at any one time, in order to completely
avoid these problems. Solving that problem will probably require GiST
expertise that I don't have right now.
--
Peter Geoghegan
On 28/02/2025 03:53, Peter Geoghegan wrote:
> On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
> <michail.nikolaev@gmail.com> wrote:
>> Just some commit messages + few cleanups.
>
> I'm worried about this:
>
>> +These longer pin lifetimes can cause buffer exhaustion with messages like "no
>> +unpinned buffers available" when the index has many pages that have similar
>> +ordering; but future work can figure out how to best work that out.
>
> I think that we should have some kind of upper bound on the number of
> pins that can be acquired at any one time, in order to completely
> avoid these problems. Solving that problem will probably require GiST
> expertise that I don't have right now.
+1. With no limit, it seems pretty easy to hold thousands of buffer pins
with this.
The index can set IndexScanDesc->xs_recheck to indicate that the quals
must be rechecked. Perhaps we should have a similar flag to indicate
that the visibility must be rechecked.
Matthias's earlier patch
(/messages/by-id/CAEze2Wg1kbpo_Q1=9X68JRsgfkyPCk4T0QN+qKz10+FVzCAoGA@mail.gmail.com)
had a more complicated mechanism to track the pinned buffers. Later
patch got rid of that, which simplified things a lot. I wonder if we
need something like that, after all.
Here's a completely different line of attack: Instead of holding buffer
pins for longer, what if we checked the visibility map earlier? We could
check the visibility map already when we construct the
GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
IndexOnlyNext() that we have already done that. IndexOnlyNext() would
have three cases:
1. The index AM has not checked the visibility map. Check it in
IndexOnlyNext(), and fetch the tuple if it's not set. This is what it
always does today.
2. The index AM has checked the visibility map, and the VM bit was set.
IndexOnlyNext() can skip the VM check and use the tuple directly.
3. The index AM has checked the visibility map, and the VM bit was not
set. IndexOnlyNext() will fetch the tuple to check its visibility.
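
To make that three-way control flow concrete, here is a minimal standalone sketch (plain C with stand-in types; the VISCHECK_* names and need_heap_fetch() are made up for illustration and are not existing PostgreSQL APIs):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result of a VM check done (or not done) by the index AM. */
typedef enum
{
    VISCHECK_NOT_CHECKED,   /* case 1: AM did not look at the VM */
    VISCHECK_ALL_VISIBLE,   /* case 2: AM saw the VM bit set */
    VISCHECK_NOT_VISIBLE    /* case 3: AM saw the VM bit clear */
} VisCheckResult;

/*
 * Decide whether the executor must fetch the heap tuple.  Only when the AM
 * has already seen the VM bit set can the heap fetch be skipped without
 * consulting the VM again.
 */
static bool
need_heap_fetch(VisCheckResult amcheck, bool vm_bit_set_now)
{
    switch (amcheck)
    {
        case VISCHECK_NOT_CHECKED:
            return !vm_bit_set_now;     /* check VM now, as today */
        case VISCHECK_ALL_VISIBLE:
            return false;               /* trust the AM's earlier check */
        case VISCHECK_NOT_VISIBLE:
            return true;                /* always verify against the heap */
    }
    return true;                        /* not reached */
}

int
main(void)
{
    printf("not checked, VM set now -> fetch: %d\n",
           need_heap_fetch(VISCHECK_NOT_CHECKED, true));
    printf("AM saw all-visible      -> fetch: %d\n",
           need_heap_fetch(VISCHECK_ALL_VISIBLE, false));
    printf("AM saw not-visible      -> fetch: %d\n",
           need_heap_fetch(VISCHECK_NOT_VISIBLE, true));
    return 0;
}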
--
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, 5 Mar 2025 at 10:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 28/02/2025 03:53, Peter Geoghegan wrote:
>> On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
>> <michail.nikolaev@gmail.com> wrote:
>>> Just some commit messages + few cleanups.
>>
>> I'm worried about this:
>>
>>> +These longer pin lifetimes can cause buffer exhaustion with messages like "no
>>> +unpinned buffers available" when the index has many pages that have similar
>>> +ordering; but future work can figure out how to best work that out.
>>
>> I think that we should have some kind of upper bound on the number of
>> pins that can be acquired at any one time, in order to completely
>> avoid these problems. Solving that problem will probably require GiST
>> expertise that I don't have right now.
>
> +1. With no limit, it seems pretty easy to hold thousands of buffer pins
> with this.
>
> The index can set IndexScanDesc->xs_recheck to indicate that the quals
> must be rechecked. Perhaps we should have a similar flag to indicate
> that the visibility must be rechecked.
>
> Matthias's earlier patch
> (/messages/by-id/CAEze2Wg1kbpo_Q1=9X68JRsgfkyPCk4T0QN+qKz10+FVzCAoGA@mail.gmail.com)
> had a more complicated mechanism to track the pinned buffers. Later
> patch got rid of that, which simplified things a lot. I wonder if we
> need something like that, after all.

I dropped that because it effectively duplicates the current
per-backend pin tracking system. Adding it back in will probably
complicate matters by a lot again.

> Here's a completely different line of attack: Instead of holding buffer
> pins for longer, what if we checked the visibility map earlier? We could
> check the visibility map already when we construct the
> GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
> IndexOnlyNext() that we have already done that. IndexOnlyNext() would
> have three cases:
I don't like integrating a heap-specific thing like VM_ALL_VISIBLE()
to indexes, but given that IOS code already uses that exact code my
dislike is not to the point of a -1. I'd like it better if we had a
TableAM API for higher-level visibility checks (e.g.
table_tids_could_be_invisible?()) which gives us those responses
instead; dropping the requirement to maintain VM in pg's preferred
format to support efficient IOS.
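
For illustration only, a toy sketch of what such a table-AM-level callback could look like; the routine name, struct, and stand-in types below are hypothetical and not an existing PostgreSQL interface:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for BlockNumber. */
typedef uint32_t BlockNumber;

typedef struct TableAmVisibilityRoutine
{
    /*
     * Hypothetical callback: may any TID on this block be invisible to an
     * MVCC snapshot?  A heap AM could answer by probing the visibility map;
     * other AMs could use whatever summary structure they maintain.
     */
    bool (*tids_could_be_invisible) (void *rel, BlockNumber blkno);
} TableAmVisibilityRoutine;

/* Toy "heap" implementation backed by a bitmap of all-visible blocks. */
static bool toy_all_visible[8] = {true, false, true, true, false, true, true, true};

static bool
toy_tids_could_be_invisible(void *rel, BlockNumber blkno)
{
    (void) rel;
    return !toy_all_visible[blkno % 8];
}

int
main(void)
{
    TableAmVisibilityRoutine am = {toy_tids_could_be_invisible};

    for (BlockNumber blk = 0; blk < 4; blk++)
        printf("block %u: could be invisible? %d\n",
               (unsigned) blk, am.tids_could_be_invisible(NULL, blk));
    return 0;
}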
I am a bit worried about even more random IO happening before we've
returned even a single tuple, but that's probably much less of an
issue than "unlimited pins".
With VM-checking in the index, we would potentially have another
benefit: By checking all tids on the page at once, we can deduplicate
and reduce the VM lookups. The gains might not be all that impressive,
but could be significant in certain hot cases.
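
A toy sketch of that deduplication idea (not the actual visibilitymap.c interface): cache the answer for the most recently probed heap block, so consecutive TIDs pointing into the same block cost a single lookup:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static int vm_probes = 0;

/* Stand-in for a visibility map probe; counts how often it is called. */
static bool
vm_block_all_visible(BlockNumber blkno)
{
    vm_probes++;
    return (blkno % 2) == 0;    /* arbitrary toy answer */
}

int
main(void)
{
    /* Heap blocks of the TIDs as an index scan might produce them. */
    BlockNumber tid_blocks[] = {4, 4, 4, 7, 7, 2, 2, 2, 2, 9};
    BlockNumber cached_blk = InvalidBlockNumber;
    bool        cached_vis = false;

    for (int i = 0; i < 10; i++)
    {
        if (tid_blocks[i] != cached_blk)
        {
            cached_blk = tid_blocks[i];
            cached_vis = vm_block_all_visible(cached_blk);
        }
        printf("tid on block %u: all-visible=%d\n",
               (unsigned) tid_blocks[i], cached_vis);
    }
    printf("VM probes: %d (instead of 10)\n", vm_probes);
    return 0;
}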
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 5 Mar 2025 at 19:19, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Wed, 5 Mar 2025 at 10:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 28/02/2025 03:53, Peter Geoghegan wrote:
>>> On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
>>> <michail.nikolaev@gmail.com> wrote:
>>>> Just some commit messages + few cleanups.
>>>
>>> I'm worried about this:
>>>
>>>> +These longer pin lifetimes can cause buffer exhaustion with messages like "no
>>>> +unpinned buffers available" when the index has many pages that have similar
>>>> +ordering; but future work can figure out how to best work that out.
>>>
>>> I think that we should have some kind of upper bound on the number of
>>> pins that can be acquired at any one time, in order to completely
>>> avoid these problems. Solving that problem will probably require GiST
>>> expertise that I don't have right now.
>>
>> +1. With no limit, it seems pretty easy to hold thousands of buffer pins
>> with this.
>>
>> The index can set IndexScanDesc->xs_recheck to indicate that the quals
>> must be rechecked. Perhaps we should have a similar flag to indicate
>> that the visibility must be rechecked.

Added as xs_visrecheck in 0001.

>> Here's a completely different line of attack: Instead of holding buffer
>> pins for longer, what if we checked the visibility map earlier? We could
>> check the visibility map already when we construct the
>> GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
>> IndexOnlyNext() that we have already done that. IndexOnlyNext() would
>> have three cases:
>
> I don't like integrating a heap-specific thing like VM_ALL_VISIBLE()
> to indexes, but given that IOS code already uses that exact code my
> dislike is not to the point of a -1. I'd like it better if we had a
> TableAM API for higher-level visibility checks (e.g.
> table_tids_could_be_invisible?()) which gives us those responses
> instead; dropping the requirement to maintain VM in pg's preferred
> format to support efficient IOS.
Here's a patchset that uses that approach. Naming of functions, types,
fields and arguments TBD. The patch works and passes the new
VACUUM-conflict tests, though I suspect the SP-GiST tests have bugs:
an intermediate version of my 0003 patch did not make the tests fail,
even though it did not hold a pin on (all) sorted items' data while
they were being checked for visibility and/or returned from the scan.
Patch 0001 details the important changes, while 0002/0003 use this new
API to make GIST and SP-GIST's IOS work correctly when concurrent
VACUUM is/was running.
0004 is the existing patch with tests (v8-0001).
> With VM-checking in the index, we would potentially have another
> benefit: By checking all tids on the page at once, we can deduplicate
> and reduce the VM lookups. The gains might not be all that impressive,
> but could be significant in certain hot cases.
That is also included in this patchset, but any performance impact
hasn't been tested or validated.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v9-0004-Tests-for-index-only-scan-race-condition-with-con.patch
From e6c6d48556c57ba9c89a53e41872957859a3ead9 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 7 Feb 2025 21:38:51 +0100
Subject: [PATCH v9 4/4] Tests for index-only scan race condition with
concurrent VACUUM in GiST/SP-GiST
Add regression tests that demonstrate wrong results can occur with index-only
scans in GiST and SP-GiST indexes when encountering tuples removed by a
concurrent VACUUM operation. The issue occurs because these index types don't
acquire the proper cleanup lock on index buffers during VACUUM, unlike btree
indexes.
Author: Peter Geoghegan <pg@bowt.ie>, Matthias van de Meent <boekewurm+postgres@gmail.com>, Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-gist-vacuum.out | 67 +++++++++++
.../index-only-scan-spgist-vacuum.out | 67 +++++++++++
src/test/isolation/isolation_schedule | 2 +
.../specs/index-only-scan-gist-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 113 ++++++++++++++++++
5 files changed, 362 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..19117402f52
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,67 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+x
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock; <waiting ...>
+step s1_fetch_all:
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+
+pg_sleep_for
+------------
+
+(1 row)
+
+a
+-
+(0 rows)
+
+step s2_vacuum: <... completed>
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..9720c9a2dc8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,8 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..b1688d44fa7
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just a few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..b414c5d1695
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing wrong results with SPGiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just a few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ SELECT pg_sleep_for(INTERVAL '50ms');
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum (*)
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v9-0001-IOS-TableAM-Support-AM-supplied-fast-visibility-c.patch
From e1d0580fa7680dcdb7e03665daa0a02d657e9257 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 17:39:23 +0100
Subject: [PATCH v9 1/4] IOS/TableAM: Support AM-supplied fast visibility
checking
Previously, we assumed VM_ALL_VISIBLE is universal across all AMs. This
is probably not the case, so we introduce a new table method called
"table_index_vischeck_tuples" which allows anyone to ask the AM whether
a tuple is definitely visible to everyone or might be invisible to
someone.
The API is intended to replace direct calls to VM_ALL_VISIBLE and as such
doesn't include "definitely dead to everyone", which would be too
expensive for the Heap AM (and would require additional work in indexes
to manage).
A future commit will use this inside GIST and SP-GIST to fix a race
condition between IOS and VACUUM, which causes a bug with tuple
visibility.
---
src/include/access/heapam.h | 2 +
src/include/access/relscan.h | 5 ++
src/include/access/tableam.h | 57 ++++++++++++++++++
src/backend/access/heap/heapam.c | 63 +++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/index/indexam.c | 6 ++
src/backend/access/table/tableamapi.c | 1 +
src/backend/executor/nodeIndexonlyscan.c | 77 +++++++++++++++---------
src/backend/utils/adt/selfuncs.c | 76 ++++++++++++++---------
9 files changed, 231 insertions(+), 57 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..a820f150509 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -378,6 +378,8 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
+extern void heap_index_vischeck_tuples(Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
/* in heap/pruneheap.c */
struct GlobalVisState;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..759c9dd164e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -26,6 +26,9 @@
struct ParallelTableScanDescData;
+enum TMVC_Result;
+
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -168,6 +171,8 @@ typedef struct IndexScanDescData
bool xs_recheck; /* T means scan keys must be rechecked */
+ int xs_visrecheck; /* TM_VisCheckResult from tableam.h */
+
/*
* When fetching with an ordering operator, the values of the ORDER BY
* expressions of the last returned tuple, according to the index. If
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 131c050c15f..8570b9589a6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -255,6 +255,33 @@ typedef struct TM_IndexDeleteOp
TM_IndexStatus *status;
} TM_IndexDeleteOp;
+/*
+ * Possible results of a visibility map check on a TID; index-only scans
+ * require the result to be known before a tuple can be returned
+ */
+typedef enum TMVC_Result
+{
+ TMVC_Unchecked,
+ TMVC_MaybeInvisible,
+ TMVC_AllVisible,
+} TMVC_Result;
+
+typedef struct TM_VisCheck
+{
+ ItemPointerData tid;
+ OffsetNumber idxoffnum;
+ TMVC_Result vischeckresult;
+} TM_VisCheck;
+
+/*
+ * A batch of TIDs for the table AM to check against the visibility map;
+ * see table_index_vischeck_tuples().
+ */
+typedef struct TM_IndexVisibilityCheckOp
+{
+ int nchecktids;
+ Buffer *vmbuf;
+ TM_VisCheck *checktids;
+} TM_IndexVisibilityCheckOp;
+
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -501,6 +528,10 @@ typedef struct TableAmRoutine
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
+ /* see table_index_delete_tuples() */
+ void (*index_vischeck_tuples) (Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
+
/* ------------------------------------------------------------------------
* Manipulations of physical tuples.
@@ -1364,6 +1395,32 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
+static inline void
+table_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ return rel->rd_tableam->index_vischeck_tuples(rel, checkop);
+}
+
+static inline TMVC_Result
+table_index_vischeck_tuple(Relation rel, Buffer *vmbuffer, ItemPointer tid)
+{
+ TM_IndexVisibilityCheckOp checkOp;
+ TM_VisCheck op;
+
+ op.idxoffnum = 0;
+ op.tid = *tid;
+ op.vischeckresult = TMVC_Unchecked;
+ checkOp.checktids = &op;
+ checkOp.nchecktids = 1;
+ checkOp.vmbuf = vmbuffer;
+
+ rel->rd_tableam->index_vischeck_tuples(rel, &checkOp);
+
+ Assert(op.vischeckresult != TMVC_Unchecked);
+
+ return op.vischeckresult;
+}
+
/* ----------------------------------------------------------------------------
* Functions for manipulations of physical tuples.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..cd6264544bd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -101,6 +101,7 @@ static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status
uint16 infomask, Relation rel, int *remaining);
static void index_delete_sort(TM_IndexDeleteOp *delstate);
static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
+static int heap_cmp_index_vischeck(const void *a, const void *b);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
@@ -8692,6 +8693,68 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
+/*
+ * heapam implementation of tableam's index_vischeck_tuples interface.
+ *
+ * This helper function is called by index AMs during index-only scans,
+ * to do VM-based visibility checks on individual tuples, so that the AM
+ * can hold the tuple in memory for e.g. reordering for extended periods of
+ * time without holding thousands of pins that would conflict with VACUUM.
+ *
+ * It's possible for this to generate a fair amount of I/O, since we may be
+ * checking hundreds of tuples from a single index block, but that is
+ * preferred over holding thousands of pins.
+ */
+void
+heap_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ BlockNumber prevBlk = InvalidBlockNumber;
+ TMVC_Result lastResult = TMVC_Unchecked;
+ Buffer *vmbuf = checkop->vmbuf;
+ TM_VisCheck *checkTids = checkop->checktids;
+
+ if (checkop->nchecktids > 1)
+ qsort(checkTids, checkop->nchecktids, sizeof(TM_VisCheck),
+ heap_cmp_index_vischeck);
+ /*
+ * XXX: In the future we should probably reorder these operations so
+ * we can apply the checks in block order, rather than index order.
+ */
+ for (int i = 0; i < checkop->nchecktids; i++)
+ {
+ TM_VisCheck *check = &checkop->checktids[i];
+ ItemPointer tid = &check->tid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+ Assert(BlockNumberIsValid(blkno));
+ Assert(check->vischeckresult == TMVC_Unchecked);
+
+ if (blkno != prevBlk)
+ {
+ if (VM_ALL_VISIBLE(rel, blkno, vmbuf))
+ lastResult = TMVC_AllVisible;
+ else
+ lastResult = TMVC_MaybeInvisible;
+
+ prevBlk = blkno;
+ }
+
+ check->vischeckresult = lastResult;
+ }
+}
+
+/*
+ * Compare TM_VisChecks for an efficient ordering.
+ */
+static int
+heap_cmp_index_vischeck(const void *a, const void *b)
+{
+ const TM_VisCheck *visa = (const TM_VisCheck *) a;
+ const TM_VisCheck *visb = (const TM_VisCheck *) b;
+ return ItemPointerCompare(unconstify(ItemPointerData *, &visa->tid),
+ unconstify(ItemPointerData *, &visb->tid));
+}
+
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3cef..26e3da04eb1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2667,6 +2667,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.index_delete_tuples = heap_index_delete_tuples,
+ .index_vischeck_tuples = heap_index_vischeck_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 8b1f555435b..8b4fadc0743 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -583,6 +583,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* XXX: we should assert that a snapshot is pushed or registered */
Assert(TransactionIdIsValid(RecentXmin));
+ /*
+ * Reset xs_visrecheck, so we don't confuse the next tuple's visibility
+ * state with that of the previous.
+ */
+ scan->xs_visrecheck = TMVC_Unchecked;
+
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_heaptid. It should also set
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 760a36fd2a1..c14d01b3bb1 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -61,6 +61,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->index_delete_tuples != NULL);
+ Assert(routine->index_vischeck_tuples != NULL);
Assert(routine->tuple_insert != NULL);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index e6635233155..d4dfc4f2456 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -120,6 +120,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
+ TMVC_Result vischeck = scandesc->xs_visrecheck;
CHECK_FOR_INTERRUPTS();
@@ -157,36 +158,56 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!VM_ALL_VISIBLE(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
- {
- /*
- * Rats, we have to visit the heap to check visibility.
- */
- InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
- continue; /* no visible tuple, try next index entry */
-
- ExecClearTuple(node->ioss_TableSlot);
-
- /*
- * Only MVCC snapshots are supported here, so there should be no
- * need to keep following the HOT chain once a visible entry has
- * been found. If we did want to allow that, we'd need to keep
- * more state to remember not to call index_getnext_tid next time.
- */
- if (scandesc->xs_heap_continue)
- elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+ if (vischeck == TMVC_Unchecked)
+ vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
+ &node->ioss_VMBuffer,
+ tid);
- /*
- * Note: at this point we are holding a pin on the heap page, as
- * recorded in scandesc->xs_cbuf. We could release that pin now,
- * but it's not clear whether it's a win to do so. The next index
- * entry might require a visit to the same heap page.
- */
+ Assert(vischeck != TMVC_Unchecked);
- tuple_from_heap = true;
+ switch (vischeck)
+ {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ /*
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't exit, and which have -Wimplicit-fallthrough:
+ */
+ /* fallthrough */
+ case TMVC_MaybeInvisible:
+ {
+ /*
+ * Rats, we have to visit the heap to check visibility.
+ */
+ InstrCountTuples2(node, 1);
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+ continue; /* no visible tuple, try next index entry */
+
+ ExecClearTuple(node->ioss_TableSlot);
+
+ /*
+ * Only MVCC snapshots are supported here, so there should be
+ * no need to keep following the HOT chain once a visible
+ * entry has been found. If we did want to allow that, we'd
+ * need to keep more state to remember not to call
+ * index_getnext_tid next time.
+ */
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
+ /*
+ * Note: at this point we are holding a pin on the heap page,
+ * as recorded in scandesc->xs_cbuf. We could release that
+ * pin now, but it's not clear whether it's a win to do so.
+ * The next index entry might require a visit to the same heap
+ * page.
+ */
+
+ tuple_from_heap = true;
+ break;
+ }
+ case TMVC_AllVisible:
+ break;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c2918c9c831..29c71762cf8 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6386,44 +6386,62 @@ get_actual_variable_endpoint(Relation heapRel,
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
BlockNumber block = ItemPointerGetBlockNumber(tid);
+ TMVC_Result visres = index_scan->xs_visrecheck;
- if (!VM_ALL_VISIBLE(heapRel,
- block,
- &vmbuffer))
+ if (visres == TMVC_Unchecked)
+ visres = table_index_vischeck_tuple(heapRel, &vmbuffer, tid);
+
+ Assert(visres != TMVC_Unchecked);
+
+ switch (visres)
{
- /* Rats, we have to visit the heap to check visibility */
- if (!index_fetch_heap(index_scan, tableslot))
- {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
/*
- * No visible tuple for this index entry, so we need to
- * advance to the next entry. Before doing so, count heap
- * page fetches and give up if we've done too many.
- *
- * We don't charge a page fetch if this is the same heap page
- * as the previous tuple. This is on the conservative side,
- * since other recently-accessed pages are probably still in
- * buffers too; but it's good enough for this heuristic.
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't exit, and which have -Wimplicit-fallthrough:
*/
+ /* fallthrough */
+ case TMVC_MaybeInvisible:
+ {
+ /* Rats, we have to visit the heap to check visibility */
+ if (!index_fetch_heap(index_scan, tableslot))
+ {
+ /*
+ * No visible tuple for this index entry, so we need to
+ * advance to the next entry. Before doing so, count heap
+ * page fetches and give up if we've done too many.
+ *
+ * We don't charge a page fetch if this is the same heap
+ * page as the previous tuple. This is on the
+ * conservative side, since other recently-accessed pages
+ * are probably still in buffers too; but it's good enough
+ * for this heuristic.
+ */
#define VISITED_PAGES_LIMIT 100
- if (block != last_heap_block)
- {
- last_heap_block = block;
- n_visited_heap_pages++;
- if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
- break;
- }
+ if (block != last_heap_block)
+ {
+ last_heap_block = block;
+ n_visited_heap_pages++;
+ if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
+ break;
+ }
- continue; /* no visible tuple, try next index entry */
- }
+ continue; /* no visible tuple, try next index entry */
+ }
- /* We don't actually need the heap tuple for anything */
- ExecClearTuple(tableslot);
+ /* We don't actually need the heap tuple for anything */
+ ExecClearTuple(tableslot);
- /*
- * We don't care whether there's more than one visible tuple in
- * the HOT chain; if any are visible, that's good enough.
- */
+ /*
+ * We don't care whether there's more than one visible tuple in
+ * the HOT chain; if any are visible, that's good enough.
+ */
+ break;
+ }
+ case TMVC_AllVisible:
+ break;
}
/*
--
2.45.2
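To make the calling convention of 0001 concrete, here is a condensed sketch of the consumer side, adapted from the nodeIndexonlyscan.c hunk above. It is not a verbatim excerpt: it is meant to live inside IndexOnlyNext()'s existing loop over index_getnext_tid(), and "scandesc", "node", "tid" and "tuple_from_heap" are the local variables of that loop.

    TMVC_Result vischeck = scandesc->xs_visrecheck;

    /* The index AM may have pre-checked the TID while it still held its pin */
    if (vischeck == TMVC_Unchecked)
        vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
                                              &node->ioss_VMBuffer,
                                              tid);

    if (vischeck == TMVC_MaybeInvisible)
    {
        /* Rats, we have to visit the heap to check visibility */
        InstrCountTuples2(node, 1);
        if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
            continue;           /* no visible tuple, try the next index entry */

        ExecClearTuple(node->ioss_TableSlot);
        tuple_from_heap = true;
    }
    /* TMVC_AllVisible: the index tuple can be returned without a heap fetch */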
v9-0003-SP-GIST-Fix-visibility-issues-in-IOS.patch
From ba72cc0cbb44dc6c4da64e49f2fd98b6f8a00528 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 8 Mar 2025 01:15:08 +0100
Subject: [PATCH v9 3/4] SP-GIST: Fix visibility issues in IOS
Previously, SP-GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with SP-GIST vacuum, and we now do
preliminary visibility checks to be used by IOS so that the IOS
infrastructure knows to recheck the heap page even if that page is now
ALL_VISIBLE.
Note: For PG17 and below, this needs some adaptations to use e.g.
VM_ALL_VISIBLE, and pack its fields in places that work fine on 32-bit
systems, too.
Idea from Heikki Linnakangas
Backpatch: 17-
---
src/include/access/spgist_private.h | 9 +-
src/backend/access/spgist/spgscan.c | 175 ++++++++++++++++++++++++--
src/backend/access/spgist/spgvacuum.c | 2 +-
3 files changed, 172 insertions(+), 14 deletions(-)
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..63e970468c7 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "utils/geo_decls.h"
#include "utils/relcache.h"
+#include "tableam.h"
typedef struct SpGistOptions
@@ -175,7 +176,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
-
+ uint8 visrecheck; /* IOS: TMVC_Result of contained heap tuple */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
} SpGistSearchItem;
@@ -223,6 +224,7 @@ typedef struct SpGistScanOpaqueData
/* These fields are only used in amgettuple scans: */
bool want_itup; /* are we reconstructing tuples? */
+ Buffer vmbuf; /* IOS: used for table_index_vischeck_tuples */
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
@@ -235,6 +237,11 @@ typedef struct SpGistScanOpaqueData
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
+ /* support for IOS */
+ int nReorderThisPage;
+ uint8 *visrecheck; /* IOS vis check results, counted by nPtrs */
+ SpGistSearchItem **items; /* counted by nReorderThisPage */
+
/*
* Note: using MaxIndexTuplesPerPage above is a bit hokey since
* SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 53f910e9d89..3a7c0c308e6 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ TMVC_Result visrecheck);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -142,6 +143,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->visrecheck = TMVC_Unchecked;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -386,6 +388,19 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
if (scankey && scan->numberOfKeys > 0)
memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+ /* prepare index-only scan requirements */
+ so->nReorderThisPage = 0;
+ if (scan->xs_want_itup)
+ {
+ if (so->visrecheck == NULL)
+ so->visrecheck = palloc(MaxIndexTuplesPerPage);
+
+ if (scan->numberOfOrderBys > 0 && so->items == NULL)
+ {
+ so->items = palloc_array(SpGistSearchItem *, MaxIndexTuplesPerPage);
+ }
+ }
+
/* initialize order-by data if needed */
if (orderbys && scan->numberOfOrderBys > 0)
{
@@ -451,6 +466,9 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
pfree(so);
}
@@ -500,6 +518,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
item->isLeaf = true;
item->recheck = recheck;
item->recheckDistances = recheckDistances;
+ item->visrecheck = TMVC_Unchecked;
return item;
}
@@ -582,6 +601,14 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ if (so->want_itup)
+ {
+ Assert(PointerIsValid(so->items));
+
+ so->items[so->nReorderThisPage] = heapItem;
+ so->nReorderThisPage++;
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -591,7 +618,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, TMVC_Unchecked);
*reportedSome = true;
}
}
@@ -804,6 +831,84 @@ spgTestLeafTuple(SpGistScanOpaque so,
return SGLT_GET_NEXTOFFSET(leafTuple);
}
+/* populate so->visrecheck based on the currently cached tuples */
+static void
+spgPopulateUnorderedVischecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+ Assert(so->nPtrs > 0);
+ Assert(scan->numberOfOrderBys == 0);
+
+ op.nchecktids = so->nPtrs;
+ op.checktids = palloc_array(TM_VisCheck, so->nPtrs);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ op.checktids[i].idxoffnum = i;
+ op.checktids[i].vischeckresult = TMVC_Unchecked;
+ op.checktids[i].tid = so->heapPtrs[i];
+
+ Assert(ItemPointerIsValid(&op.checktids[i].tid));
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(ItemPointerEquals(&so->heapPtrs[check->idxoffnum],
+ &check->tid));
+ Assert(check->idxoffnum < op.nchecktids);
+
+ so->visrecheck[check->idxoffnum] = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+}
+
+/* populate item->visrecheck for the currently queued ordered items */
+static void
+spgPopulateOrderedVisChecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ Assert(so->nReorderThisPage > 0);
+ Assert(scan->numberOfOrderBys > 0);
+ Assert(PointerIsValid(so->items));
+
+ op.nchecktids = so->nReorderThisPage;
+ op.checktids = palloc_array(TM_VisCheck, so->nReorderThisPage);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ op.checktids[i].idxoffnum = i;
+ op.checktids[i].vischeckresult = TMVC_Unchecked;
+ op.checktids[i].tid = so->items[i]->heapPtr;
+
+ Assert(ItemPointerIsValid(&so->items[i]->heapPtr));
+ Assert(so->items[i]->isLeaf);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&check->tid,
+ &so->items[check->idxoffnum]->heapPtr));
+
+ so->items[check->idxoffnum]->visrecheck = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+ so->nReorderThisPage = 0;
+}
+
/*
* Walk the tree and report all tuples passing the scan quals to the storeRes
* subroutine.
@@ -812,8 +917,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
* next page boundary once we have reported at least one tuple.
*/
static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
- storeRes_func storeRes)
+spgWalk(IndexScanDesc scan, Relation index, SpGistScanOpaque so,
+ bool scanWholeIndex, storeRes_func storeRes)
{
Buffer buffer = InvalidBuffer;
bool reportedSome = false;
@@ -833,9 +938,22 @@ redirect:
{
/* We store heap items in the queue only in case of ordered search */
Assert(so->numberOfNonNullOrderBys > 0);
+ /*
+ * If an item we found on a page is retrieved immediately after
+ * processing that page, we won't yet have released the page pin,
+ * and thus won't yet have processed the visibility data of the
+ * page's (now) ordered tuples.
+ * Do that now, so that the tuple we're about to store does have
+ * accurate data.
+ */
+ if (so->want_itup && so->nReorderThisPage)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ Assert(!so->want_itup || item->visrecheck != TMVC_Unchecked);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->visrecheck);
reportedSome = true;
}
else
@@ -852,7 +970,12 @@ redirect:
}
else if (blkno != BufferGetBlockNumber(buffer))
{
- UnlockReleaseBuffer(buffer);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ if (so->nReorderThisPage > 0)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ ReleaseBuffer(buffer);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
}
@@ -920,16 +1043,37 @@ redirect:
}
if (buffer != InvalidBuffer)
- UnlockReleaseBuffer(buffer);
-}
+ {
+ /*
+ * If we're in an index-only scan, pre-check visibility of the tuples,
+ * so we can drop the pin quickly.
+ */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ if (so->want_itup)
+ {
+ if (scan->numberOfOrderBys > 0 && so->nReorderThisPage > 0)
+ {
+ spgPopulateOrderedVisChecks(scan, so);
+ }
+ if (scan->numberOfOrderBys == 0 && so->nPtrs > 0)
+ {
+ spgPopulateUnorderedVischecks(scan, so);
+ }
+ }
+
+ ReleaseBuffer(buffer);
+ }
+}
/* storeRes subroutine for getbitmap case */
static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ TMVC_Result visres)
{
Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
@@ -947,7 +1091,7 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
so->tbm = tbm;
so->ntids = 0;
- spgWalk(scan->indexRelation, so, true, storeBitmap);
+ spgWalk(scan, scan->indexRelation, so, true, storeBitmap);
return so->ntids;
}
@@ -957,12 +1101,15 @@ static void
storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+ bool recheckDistances, double *nonNullDistances,
+ TMVC_Result visres)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
so->recheck[so->nPtrs] = recheck;
so->recheckDistances[so->nPtrs] = recheckDistances;
+ if (so->want_itup)
+ so->visrecheck[so->nPtrs] = visres;
if (so->numberOfOrderBys > 0)
{
@@ -1039,6 +1186,10 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_heaptid = so->heapPtrs[so->iPtr];
scan->xs_recheck = so->recheck[so->iPtr];
scan->xs_hitup = so->reconTups[so->iPtr];
+ if (so->want_itup)
+ scan->xs_visrecheck = so->visrecheck[so->iPtr];
+
+ Assert(!scan->xs_want_itup || scan->xs_visrecheck != TMVC_Unchecked);
if (so->numberOfOrderBys > 0)
index_store_float8_orderby_distances(scan, so->orderByTypes,
@@ -1068,7 +1219,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
}
so->iPtr = so->nPtrs = 0;
- spgWalk(scan->indexRelation, so, false, storeGettuple);
+ spgWalk(scan, scan->indexRelation, so, false, storeGettuple);
if (so->nPtrs == 0)
break; /* must have completed scan */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..993d4a5b662 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
--
2.45.2
v9-0002-GIST-Fix-visibility-issues-in-IOS.patch
From 6eb8314fb3e07595e4016ade40fe8dade7fb9c37 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 22:55:24 +0100
Subject: [PATCH v9 2/4] GIST: Fix visibility issues in IOS
Previously, GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with GIST vacuum, and we now do
preliminary visibility checks to be used by IOS so that the IOS
infrastructure knows to recheck the heap page even if that page is now
ALL_VISIBLE.
Note: For PG17 and below, this needs some adaptations to use e.g.
VM_ALL_VISIBLE, and pack its fields in places that work fine on 32-bit
systems, too.
Idea from Heikki Linnakangas
Backpatch: 17-
---
src/include/access/gist_private.h | 27 +++++--
src/backend/access/gist/gistget.c | 104 ++++++++++++++++++++++++++-
src/backend/access/gist/gistscan.c | 5 ++
src/backend/access/gist/gistvacuum.c | 6 +-
4 files changed, 131 insertions(+), 11 deletions(-)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..4261565b5ad 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/buffile.h"
#include "utils/hsearch.h"
#include "access/genam.h"
+#include "tableam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,8 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint8 visrecheck; /* Cached visibility check result for this
+ * heap pointer. */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -170,12 +173,24 @@ typedef struct GISTScanOpaqueData
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
- /* In a non-ordered search, returnable heap items are stored here: */
- GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
- OffsetNumber nPageData; /* number of valid items in array */
- OffsetNumber curPageData; /* next item to return */
- MemoryContext pageDataCxt; /* context holding the fetched tuples, for
- * index-only scans */
+ /* info used by Index-Only Scans */
+ Buffer vmbuf; /* reusable buffer for IOS' vm lookups */
+
+ union {
+ struct {
+ /* In a non-ordered search, returnable heap items are stored here: */
+ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nPageData; /* number of valid items in array */
+ OffsetNumber curPageData; /* next item to return */
+ MemoryContext pageDataCxt; /* context holding the fetched tuples,
+ * for index-only scans */
+ };
+ struct {
+ /* In an ordered search, we use this as scratch space */
+ GISTSearchHeapItem *sortData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nsortData; /* number of items in sortData */
+ };
+ };
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index cc40e928e0a..bb4caa6c310 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -24,6 +24,7 @@
#include "utils/float.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "access/tableam.h"
/*
* gistkillitems() -- set LP_DEAD state for items an indexscan caller has
@@ -394,7 +395,15 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
return;
}
- so->nPageData = so->curPageData = 0;
+ if (scan->numberOfOrderBys)
+ {
+ so->nsortData = 0;
+ }
+ else
+ {
+ so->nPageData = so->curPageData = 0;
+ }
+
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -501,7 +510,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ so->sortData[so->nsortData] = &item->data.heap;
+ so->nsortData += 1;
+ }
}
else
{
@@ -526,7 +539,88 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ /* Allow writes to the buffer */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp op;
+ op.vmbuf = &so->vmbuf;
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ op.nchecktids = so->nsortData;
+
+ if (op.nchecktids > 0)
+ {
+ op.checktids = palloc(op.nchecktids * sizeof(TM_VisCheck));
+
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ op.checktids[off].vischeckresult = TMVC_Unchecked;
+ op.checktids[off].tid = so->sortData[off]->heapPtr;
+ op.checktids[off].idxoffnum = off;
+ Assert(ItemPointerIsValid(&op.checktids[off].tid));
+ }
+ }
+ }
+ else
+ {
+ op.nchecktids = so->nPageData;
+
+ if (op.nchecktids > 0)
+ {
+ op.checktids = palloc_array(TM_VisCheck, op.nchecktids);
+
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ op.checktids[off].vischeckresult = TMVC_Unchecked;
+ op.checktids[off].tid = so->pageData[off].heapPtr;
+ op.checktids[off].idxoffnum = off;
+ Assert(ItemPointerIsValid(&op.checktids[off].tid));
+ }
+ }
+ }
+
+ if (op.nchecktids > 0)
+ {
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = so->sortData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&item->heapPtr, &check->tid));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ /* reset state */
+ so->nsortData = 0;
+ }
+ else
+ {
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = &so->pageData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&item->heapPtr, &check->tid));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ }
+
+ pfree(op.checktids);
+ }
+ }
+
+ /* Allow VACUUM to process the buffer again */
+ ReleaseBuffer(buffer);
}
/*
@@ -588,7 +682,10 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ scan->xs_visrecheck = item->data.heap.visrecheck;
+ }
res = true;
}
else
@@ -673,7 +770,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ scan->xs_visrecheck = so->pageData[so->curPageData].visrecheck;
+ }
so->curPageData++;
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..52f5f144ccd 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -347,6 +347,11 @@ void
gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index dd0d9d5006c..5a95b93236e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
--
2.45.2
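The core of the GIST and SP-GIST fixes is the same batched check. As a condensed restatement of the new tail of gistScanPage() (unordered-scan case only; names as in the 0001/0002 patches above, not a verbatim excerpt): the buffer lock is released first, the batched visibility check runs while the pin is still held, and only then is the pin dropped so that VACUUM's LockBufferForCleanup() can proceed.

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* keep the pin for now */

    if (scan->xs_want_itup && so->nPageData > 0)
    {
        TM_IndexVisibilityCheckOp op;

        op.nchecktids = so->nPageData;
        op.checktids = palloc_array(TM_VisCheck, op.nchecktids);
        op.vmbuf = &so->vmbuf;

        for (int off = 0; off < op.nchecktids; off++)
        {
            op.checktids[off].tid = so->pageData[off].heapPtr;
            op.checktids[off].idxoffnum = off;
            op.checktids[off].vischeckresult = TMVC_Unchecked;
        }

        table_index_vischeck_tuples(scan->heapRelation, &op);

        /* store each result with the buffered item it belongs to */
        for (int off = 0; off < op.nchecktids; off++)
            so->pageData[op.checktids[off].idxoffnum].visrecheck =
                op.checktids[off].vischeckresult;

        pfree(op.checktids);
    }

    ReleaseBuffer(buffer);      /* only now can VACUUM get its cleanup lock */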
Hello, Matthias!
> though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail
It all fails on master - could you please detail what is "intermediate" in
that case? Also, I think it is a good idea to add the same type of test to
btree.
> * XXX: In the future we should probably reorder these operations so
> * we can apply the checks in block order, rather than index order.
I think it is already done in your patch, no?
Should we then use that mechanism for btree as well? It seems to be
straightforward and non-invasive. In that case, "Unchecked" goes away, and
it becomes each AM's responsibility to call the check while holding the pin.
Best regards,
Mikhail.
On Sat, 8 Mar 2025 at 08:06, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Here's a patchset that uses that approach. Naming of functions, types,
> fields and arguments TBD. The patch works and passes the new
> VACUUM-conflict tests, though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail, even though it did not hold a pin on (all) sorted
> items' data when it was being checked for visibility and/or returned
> from the scan.
>
> Patch 0001 details the important changes, while 0002/0003 use this new
> API to make GIST and SP-GIST's IOS work correctly when concurrent
> VACUUM is/was running.
>
> 0004 is the existing patch with tests (v8-0001).
I noticed that Mikhail's feedback from [1] is not yet addressed. I
have changed the status of the commitfest entry to Waiting on Author;
kindly address it and update the status to Needs review.

[1]: /messages/by-id/CANtu0ojz0apXnVia0reTL28eL2=__ev8aLsiH=1XfD_Z3dnkTw@mail.gmail.com
Regards,
Vignesh
On Sun, 16 Mar 2025 at 13:58, vignesh C <vignesh21@gmail.com> wrote:
> On Sat, 8 Mar 2025 at 08:06, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Here's a patchset that uses that approach. Naming of functions, types,
>> fields and arguments TBD. The patch works and passes the new
>> VACUUM-conflict tests, though I suspect the SP-GIST tests to have
>> bugs, as an intermediate version of my 0003 patch didn't trigger the
>> tests to fail, even though it did not hold a pin on (all) sorted
>> items' data when it was being checked for visibility and/or returned
>> from the scan.
>>
>> Patch 0001 details the important changes, while 0002/0003 use this new
>> API to make GIST and SP-GIST's IOS work correctly when concurrent
>> VACUUM is/was running.
>>
>> 0004 is the existing patch with tests (v8-0001).
>
> I noticed that Mikhail's feedback from [1] is not yet addressed. I
> have changed the status of the commitfest entry to Waiting on Author,
> kindly address them and update the status to Needs review.
>
> [1] - /messages/by-id/CANtu0ojz0apXnVia0reTL28eL2=__ev8aLsiH=1XfD_Z3dnkTw@mail.gmail.com
While there has indeed been some feedback, so far I've been looking
for architectural feedback on how the bug should be solved, not so much
the names of variables or the exact details of the comments on the
new code: I usually prefer to wait with polishing my patches until
we've made sure the approach doesn't need a full rewrite due to
architectural issues (like what happened in the previous two iterations).
Attached is v10, which polishes the previous patches, adds a patch that
makes nbtree use the new visibility checking strategy so that it too can
release its index pages much earlier, and adds a similar visibility check
test for nbtree.
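In shape, the nbtree change in the attached 0004 patch boils down to the following condensed sketch of _bt_drop_lock_and_maybe_pin() (names as in the patch; "vmbuf" is the scan-level VM buffer passed in from BTScanOpaque; not a verbatim excerpt): the lock is dropped first, the saved items get their visibility pre-checked in one batch, and index-only scans no longer force the pin to be kept while tuples are returned.

    _bt_unlockbuf(scan->indexRelation, sp->buf);

    if (scan->xs_want_itup)
    {
        TM_IndexVisibilityCheckOp visCheck;

        visCheck.nchecktids = 1 + sp->lastItem - sp->firstItem;
        visCheck.checktids = palloc_array(TM_VisCheck, visCheck.nchecktids);
        visCheck.vmbuf = vmbuf;

        for (int i = 0; i < visCheck.nchecktids; i++)
        {
            int itemidx = sp->firstItem + i;

            visCheck.checktids[i].tid = sp->items[itemidx].heapTid;
            visCheck.checktids[i].idxoffnum = itemidx;
            visCheck.checktids[i].vischeckresult = TMVC_Unchecked;
        }

        table_index_vischeck_tuples(scan->heapRelation, &visCheck);

        /* store each result with the saved item it belongs to */
        for (int i = 0; i < visCheck.nchecktids; i++)
            sp->items[visCheck.checktids[i].idxoffnum].visrecheck =
                visCheck.checktids[i].vischeckresult;

        pfree(visCheck.checktids);
    }

    /* the !scan->xs_want_itup condition is gone: IOS may drop the pin too */
    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation))
    {
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }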
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v10-0004-NBTree-Reduce-Index-Only-Scan-pinning-requiremen.patch
From 0474c4028e644fa99fa4dc1cf2479220782e7a30 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 20 Mar 2025 23:12:25 +0100
Subject: [PATCH v10 4/5] NBTree: Reduce Index-Only Scan pinning requirements
Previously, we would keep a pin on every leaf page while we were returning
tuples to the scan. With this patch, we utilize the newly introduced
table_index_vischeck_tuples API to pre-check visibility of all TIDs, and
thus unpin the page well before we have finished returning and processing
all index tuple results. This reduces the time VACUUM may have to wait for
a pin, and can improve performance by reducing redundant VM checks.
---
src/include/access/nbtree.h | 4 ++
src/backend/access/nbtree/nbtree.c | 5 ++
src/backend/access/nbtree/nbtsearch.c | 100 ++++++++++++++++++++++++--
3 files changed, 103 insertions(+), 6 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 0c43767f8c3..2423ddf7bfd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -951,6 +951,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
ItemPointerData heapTid; /* TID of referenced heap item */
OffsetNumber indexOffset; /* index item's location within page */
LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */
+ uint8 visrecheck; /* visibility recheck status, if any */
} BTScanPosItem;
typedef struct BTScanPosData
@@ -1053,6 +1054,9 @@ typedef struct BTScanOpaqueData
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ /* buffer used for index-only scan visibility checks */
+ Buffer vmbuf;
+
/*
* If we are doing an index-only scan, these are the tuple storage
* workspaces for the currPos and markPos respectively. Each is of size
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c0a8833e068..e7b7f7cfa4c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -345,6 +345,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->vmbuf = InvalidBuffer;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -436,6 +438,9 @@ btendscan(IndexScanDesc scan)
so->markItemIndex = -1;
BTScanPosUnpinIfPinned(so->markPos);
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
/* No need to invalidate positions, the RAM is about to be freed. */
/* Release storage */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 22b27d01d00..7445ee9daeb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,7 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp, Buffer *vmbuf);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -64,13 +64,88 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* See nbtree/README section on making concurrent TID recycling safe.
*/
static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp, Buffer *vmbuf)
{
_bt_unlockbuf(scan->indexRelation, sp->buf);
+ /*
+ * Do some visibility checks if this is an index-only scan; allowing us to
+ * drop the pin on this page before we have returned all tuples from this
+ * IOS to the executor.
+ */
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp visCheck;
+ int offset = sp->firstItem;
+
+ visCheck.nchecktids = 1 + sp->lastItem - offset;
+ visCheck.checktids = palloc_array(TM_VisCheck,
+ visCheck.nchecktids);
+ visCheck.vmbuf = vmbuf;
+
+ for (int i = 0; i < visCheck.nchecktids; i++)
+ {
+ int itemidx = offset + i;
+
+ Assert(sp->items[itemidx].visrecheck == TMVC_Unchecked);
+ Assert(ItemPointerIsValid(&sp->items[itemidx].heapTid));
+
+ visCheck.checktids[i].tid = sp->items[itemidx].heapTid;
+ visCheck.checktids[i].idxoffnum = itemidx;
+ visCheck.checktids[i].vischeckresult = TMVC_Unchecked;
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &visCheck);
+
+ for (int i = 0; i < visCheck.nchecktids; i++)
+ {
+ TM_VisCheck *check = &visCheck.checktids[i];
+ BTScanPosItem *item = &sp->items[check->idxoffnum];
+
+ /* We must have a valid visibility check result */
+ Assert(check->vischeckresult != TMVC_Unchecked);
+ /* The offset number should still indicate the right item */
+ Assert(ItemPointerEquals(&check->tid, &item->heapTid));
+
+ /* Store the visibility check result */
+ item->visrecheck = check->vischeckresult;
+ }
+
+ /* release temporary resources */
+ pfree(visCheck.checktids);
+ }
+
+ /*
+ * We may need to hold a pin on the page for one of several reasons:
+ *
+ * 1.) To safely apply kill_prior_tuple, we need to know that the tuples
+ * were not removed from the page (and subsequently re-inserted).
+ * A page's LSN can also allow us to detect modifications on the page,
+ * which then allows us to bail out of setting the hint bits, but that
+ * requires the index to be WAL-logged; so unless the index is WAL-logged
+ * we must hold a pin on the page to apply the kill_prior_tuple
+ * optimization.
+ *
+ * 2.) Non-MVCC scans need pin coupling to make sure the scan covers
+ * exactly the whole index keyspace.
+ *
+ * 3.) For Index-Only Scans, the scan needs to check the visibility of the
+ * table tuple while the relevant index tuple is guaranteed to still be
+ * contained in the index (so that vacuum hasn't yet marked any pages that
+ * could contain the value as ALL_VISIBLE after reclaiming a dead tuple
+ * that might be buffered in the scan). A pin must therefore be held
+ * at least while the basic visibility of the page's tuples is being
+ * checked.
+ *
+ * For cases 1 and 2, we must hold the pin after we've finished processing
+ * the index page.
+ *
+ * For case 3, we can release the pin if we first do the visibility checks
+ * of to-be-returned tuples using table_index_vischeck_tuples, which we've
+ * done just above.
+ */
if (IsMVCCSnapshot(scan->xs_snapshot) &&
- RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ RelationNeedsWAL(scan->indexRelation))
{
ReleaseBuffer(sp->buf);
sp->buf = InvalidBuffer;
@@ -1904,6 +1979,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
Size itupsz = IndexTupleSize(itup);
@@ -1934,6 +2011,8 @@ _bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
/* Save base IndexTuple (truncate posting list) */
@@ -1970,6 +2049,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
/*
* Have index-only scans return the same base IndexTuple for every TID
@@ -1995,6 +2075,14 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
/* Return next item, per amgettuple contract */
scan->xs_heaptid = currItem->heapTid;
+
+ if (scan->xs_want_itup)
+ {
+ scan->xs_visrecheck = currItem->visrecheck;
+ Assert(currItem->visrecheck != TMVC_Unchecked ||
+ BufferIsValid(so->currPos.buf));
+ }
+
if (so->currTuples)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
}
@@ -2153,7 +2241,7 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos, &so->vmbuf);
return true;
}
@@ -2310,7 +2398,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos, &so->vmbuf);
return true;
}
--
2.45.2
v10-0005-Test-for-IOS-Vacuum-race-conditions-in-index-AMs.patch
From 92bcf6190a81969a116f7663ad96f8adea935394 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 21 Mar 2025 16:41:31 +0100
Subject: [PATCH v10 5/5] Test for IOS/Vacuum race conditions in index AMs
Add regression tests demonstrating the wrong results that could occur with
index-only scans in GiST and SP-GiST indexes when tuples are removed by a
concurrent VACUUM operation.
With these tests the index AMs are also expected not to block VACUUM, even
when the scans are used inside a cursor.
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Co-authored-by: Peter Geoghegan <pg@bowt.ie>
Co-authored-by: Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-btree-vacuum.out | 59 +++++++++
.../expected/index-only-scan-gist-vacuum.out | 53 ++++++++
.../index-only-scan-spgist-vacuum.out | 53 ++++++++
src/test/isolation/isolation_schedule | 3 +
.../specs/index-only-scan-btree-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-gist-vacuum.spec | 112 +++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 112 +++++++++++++++++
7 files changed, 505 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-btree-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-btree-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-btree-vacuum.out b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
new file mode 100644
index 00000000000..9a9d94c86f6
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_asc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_asc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_desc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_desc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..cb1668a40ff 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,9 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-scan-btree-vacuum
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-btree-vacuum.spec b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
new file mode 100644
index 00000000000..9a00804c2c5
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing correct results with btree even with concurrent
+# vacuum
+
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just a few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_btree_a ON ios_needs_cleanup_lock USING btree (a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted_asc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+}
+step s1_prepare_sorted_desc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete rows 1 or 10, so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum
+{
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+}
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_asc
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_desc
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # If the index scan doesn't correctly interlock its visibility tests with
+ # concurrent VACUUM cleanup then VACUUM will mark pages as all-visible that
+ # the scan in the next steps may then consider all-visible, despite some of
+ # those rows having been removed.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..9d241b25920
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with GiST even with concurrent vacuum
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just a few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..cd621d4f7f2
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with SP-GiST even with concurrent vacuum
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just a few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor; this ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
Attachment: v10-0001-IOS-TableAM-Support-AM-specific-fast-visibility-.patch
From c113d0c4f7d4bd7209c06fa73f2153140d8afb9c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 17:39:23 +0100
Subject: [PATCH v10 1/5] IOS/TableAM: Support AM-specific fast visibility
tests
Previously, we assumed VM_ALL_VISIBLE is universal across all AMs. This
is probably not the case, so we introduce a new table method called
"table_index_vischeck_tuples" which allows anyone to ask the AM whether
a tuple is definitely visible to everyone or might be invisible to
someone.
The API is intended to replace direct calls to VM_ALL_VISIBLE, and as such
doesn't include a "definitely dead to everyone" result: the Heap AM's VM
doesn't support *definitely dead* as an output of its lookups, and thus it
would be too expensive for the Heap AM to produce such results.
A future commit will use this inside GIST and SP-GIST to fix a race
condition between IOS and VACUUM, which causes a bug with tuple
visibility, and a further patch will add support for this to nbtree.
---
src/include/access/heapam.h | 2 +
src/include/access/relscan.h | 5 ++
src/include/access/tableam.h | 73 +++++++++++++++++++++
src/backend/access/heap/heapam.c | 64 ++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/index/indexam.c | 6 ++
src/backend/access/table/tableamapi.c | 1 +
src/backend/executor/nodeIndexonlyscan.c | 83 ++++++++++++++++--------
src/backend/utils/adt/selfuncs.c | 76 +++++++++++++---------
9 files changed, 254 insertions(+), 57 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..a820f150509 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -378,6 +378,8 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
+extern void heap_index_vischeck_tuples(Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
/* in heap/pruneheap.c */
struct GlobalVisState;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..93a6f65ab0e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -26,6 +26,9 @@
struct ParallelTableScanDescData;
+enum TMVC_Result;
+
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -176,6 +179,8 @@ typedef struct IndexScanDescData
bool xs_recheck; /* T means scan keys must be rechecked */
+ int xs_visrecheck; /* TM_VisCheckResult from tableam.h */
+
/*
* When fetching with an ordering operator, the values of the ORDER BY
* expressions of the last returned tuple, according to the index. If
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..2dbdb9287f1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -255,6 +255,49 @@ typedef struct TM_IndexDeleteOp
TM_IndexStatus *status;
} TM_IndexDeleteOp;
+/*
+ * State used when calling table_index_delete_tuples()
+ *
+ * Index-only scans need to know the visibility of the associated table tuples
+ * before they can return the index tuple. If the index tuple is known to be
+ * visible with a cheap check, we can return it directly, without having to
+ * visit the table to check the tuple's visibility.
+ *
+ * This AM API exposes a cheap visibility check to indexes, allowing an
+ * index AM to check the visibility info of multiple tuples at once and to
+ * store the results. This improves the pinning ergonomics of index AMs: a
+ * scan can cache index tuples in memory without holding pins on those
+ * tuples' pages until the index tuples are returned.
+ *
+ * The AM is called with a list of TIDs, and its output will indicate the
+ * visibility state of each tuple: MaybeVisible or Visible.
+ *
+ * HeapAM's implementation of visibility maps only allows for cheap checks of
+ * *definitely visible*; all other results are *maybe visible*. A *definitely
+ * not visible* (dead) result is not provided, for lack of table AMs which
+ * support such visibility lookups cheaply.
+ */
+typedef enum TMVC_Result
+{
+ TMVC_Unchecked,
+ TMVC_MaybeVisible,
+ TMVC_Visible,
+} TMVC_Result;
+
+typedef struct TM_VisCheck
+{
+ ItemPointerData tid; /* table TID from index tuple */
+ OffsetNumber idxoffnum; /* identifier for the TID in this call */
+ TMVC_Result vischeckresult; /* output of the visibilitycheck */
+} TM_VisCheck;
+
+typedef struct TM_IndexVisibilityCheckOp
+{
+ int nchecktids; /* number of TIDs to check */
+ Buffer *vmbuf; /* pointer to VM buffer to reuse across calls */
+ TM_VisCheck *checktids; /* the checks to execute */
+} TM_IndexVisibilityCheckOp;
+
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -501,6 +544,10 @@ typedef struct TableAmRoutine
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
+ /* see table_index_vischeck_tuples() */
+ void (*index_vischeck_tuples) (Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
+
/* ------------------------------------------------------------------------
* Manipulations of physical tuples.
@@ -1328,6 +1375,32 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
+static inline void
+table_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ return rel->rd_tableam->index_vischeck_tuples(rel, checkop);
+}
+
+static inline TMVC_Result
+table_index_vischeck_tuple(Relation rel, Buffer *vmbuffer, ItemPointer tid)
+{
+ TM_IndexVisibilityCheckOp checkOp;
+ TM_VisCheck op;
+
+ op.idxoffnum = 0;
+ op.tid = *tid;
+ op.vischeckresult = TMVC_Unchecked;
+ checkOp.checktids = &op;
+ checkOp.nchecktids = 1;
+ checkOp.vmbuf = vmbuffer;
+
+ rel->rd_tableam->index_vischeck_tuples(rel, &checkOp);
+
+ Assert(op.vischeckresult != TMVC_Unchecked);
+
+ return op.vischeckresult;
+}
+
/* ----------------------------------------------------------------------------
* Functions for manipulations of physical tuples.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b12b583c4d9..a58f19761aa 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -102,6 +102,7 @@ static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status
bool logLockFailure);
static void index_delete_sort(TM_IndexDeleteOp *delstate);
static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
+static int heap_cmp_index_vischeck(const void *a, const void *b);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
@@ -8775,6 +8776,69 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
+/*
+ * heapam implementation of tableam's index_vischeck_tuples interface.
+ *
+ * This helper function is called by index AMs during index-only scans,
+ * to do VM-based visibility checks on individual tuples, so that the AM
+ * can hold the tuple in memory for e.g. reordering for extended periods of
+ * time while without holding thousands of pins to conflict with VACUUM.
+ *
+ * It's possible for this to generate a fair amount of I/O, since we may be
+ * checking hundreds of tuples from a single index block, but that is
+ * preferred over holding thousands of pins.
+ */
+void
+heap_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ BlockNumber prevBlk = InvalidBlockNumber;
+ TMVC_Result lastResult = TMVC_Unchecked;
+ Buffer *vmbuf = checkop->vmbuf;
+ TM_VisCheck *checkTids = checkop->checktids;
+
+ /*
+ * Order the TIDs to heap order, so that we will only need to visit every
+ * VM page at most once.
+ */
+ if (checkop->nchecktids > 1)
+ qsort(checkTids, checkop->nchecktids, sizeof(TM_VisCheck),
+ heap_cmp_index_vischeck);
+
+ for (int i = 0; i < checkop->nchecktids; i++)
+ {
+ TM_VisCheck *check = &checkop->checktids[i];
+ ItemPointer tid = &check->tid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+ /* Visibility should be checked just once per tuple. */
+ Assert(check->vischeckresult == TMVC_Unchecked);
+
+ if (blkno != prevBlk)
+ {
+ if (VM_ALL_VISIBLE(rel, blkno, vmbuf))
+ lastResult = TMVC_Visible;
+ else
+ lastResult = TMVC_MaybeVisible;
+
+ prevBlk = blkno;
+ }
+
+ check->vischeckresult = lastResult;
+ }
+}
+
+/*
+ * Compare TM_VisChecks for an efficient ordering.
+ */
+static int
+heap_cmp_index_vischeck(const void *a, const void *b)
+{
+ const TM_VisCheck *visa = (const TM_VisCheck *) a;
+ const TM_VisCheck *visb = (const TM_VisCheck *) b;
+ return ItemPointerCompare(unconstify(ItemPointerData *, &visa->tid),
+ unconstify(ItemPointerData *, &visb->tid));
+}
+
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4da4dc84580..65c23ff6658 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2676,6 +2676,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.index_delete_tuples = heap_index_delete_tuples,
+ .index_vischeck_tuples = heap_index_vischeck_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 55ec4c10352..370e442e24e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -627,6 +627,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* XXX: we should assert that a snapshot is pushed or registered */
Assert(TransactionIdIsValid(RecentXmin));
+ /*
+ * Reset xs_visrecheck, so we don't confuse the next tuple's visibility
+ * state with that of the previous.
+ */
+ scan->xs_visrecheck = TMVC_Unchecked;
+
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_heaptid. It should also set
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 476663b66aa..b3ce90ceaea 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -61,6 +61,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->index_delete_tuples != NULL);
+ Assert(routine->index_vischeck_tuples != NULL);
Assert(routine->tuple_insert != NULL);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f464cca9507..e02fc1652ff 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -121,6 +121,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
+ TMVC_Result vischeck = scandesc->xs_visrecheck;
CHECK_FOR_INTERRUPTS();
@@ -128,6 +129,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
+ * The index may have already pre-checked the visibility of the tuple
+ * for us, and stored the result in xs_visrecheck, in which case we
+ * can skip the call.
*
* Note on Memory Ordering Effects: visibilitymap_get_status does not
* lock the visibility map buffer, and therefore the result we read
@@ -157,37 +161,60 @@ IndexOnlyNext(IndexOnlyScanState *node)
*
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
+ *
+ * The index doing these checks for us doesn't materially change these
+ * considerations.
*/
- if (!VM_ALL_VISIBLE(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
- {
- /*
- * Rats, we have to visit the heap to check visibility.
- */
- InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
- continue; /* no visible tuple, try next index entry */
+ if (vischeck == TMVC_Unchecked)
+ vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
+ &node->ioss_VMBuffer,
+ tid);
- ExecClearTuple(node->ioss_TableSlot);
-
- /*
- * Only MVCC snapshots are supported here, so there should be no
- * need to keep following the HOT chain once a visible entry has
- * been found. If we did want to allow that, we'd need to keep
- * more state to remember not to call index_getnext_tid next time.
- */
- if (scandesc->xs_heap_continue)
- elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+ Assert(vischeck != TMVC_Unchecked);
- /*
- * Note: at this point we are holding a pin on the heap page, as
- * recorded in scandesc->xs_cbuf. We could release that pin now,
- * but it's not clear whether it's a win to do so. The next index
- * entry might require a visit to the same heap page.
- */
-
- tuple_from_heap = true;
+ switch (vischeck)
+ {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ /*
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't return, and which have -Wimplicit-fallthrough:
+ */
+ /* fallthrough */
+ case TMVC_MaybeVisible:
+ {
+ /*
+ * Rats, we have to visit the heap to check visibility.
+ */
+ InstrCountTuples2(node, 1);
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+ continue; /* no visible tuple, try next index entry */
+
+ ExecClearTuple(node->ioss_TableSlot);
+
+ /*
+ * Only MVCC snapshots are supported here, so there should be
+ * no need to keep following the HOT chain once a visible
+ * entry has been found. If we did want to allow that, we'd
+ * need to keep more state to remember not to call
+ * index_getnext_tid next time.
+ */
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
+ /*
+ * Note: at this point we are holding a pin on the heap page,
+ * as recorded in scandesc->xs_cbuf. We could release that
+ * pin now, but it's not clear whether it's a win to do so.
+ * The next index entry might require a visit to the same heap
+ * page.
+ */
+
+ tuple_from_heap = true;
+ break;
+ }
+ case TMVC_Visible:
+ break;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5b35debc8ff..7e9ca616a67 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6561,44 +6561,62 @@ get_actual_variable_endpoint(Relation heapRel,
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
BlockNumber block = ItemPointerGetBlockNumber(tid);
+ TMVC_Result visres = index_scan->xs_visrecheck;
- if (!VM_ALL_VISIBLE(heapRel,
- block,
- &vmbuffer))
+ if (visres == TMVC_Unchecked)
+ visres = table_index_vischeck_tuple(heapRel, &vmbuffer, tid);
+
+ Assert(visres != TMVC_Unchecked);
+
+ switch (visres)
{
- /* Rats, we have to visit the heap to check visibility */
- if (!index_fetch_heap(index_scan, tableslot))
- {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
/*
- * No visible tuple for this index entry, so we need to
- * advance to the next entry. Before doing so, count heap
- * page fetches and give up if we've done too many.
- *
- * We don't charge a page fetch if this is the same heap page
- * as the previous tuple. This is on the conservative side,
- * since other recently-accessed pages are probably still in
- * buffers too; but it's good enough for this heuristic.
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't return, and which have -Wimplicit-fallthrough:
*/
+ /* fallthrough */
+ case TMVC_MaybeVisible:
+ {
+ /* Rats, we have to visit the heap to check visibility */
+ if (!index_fetch_heap(index_scan, tableslot))
+ {
+ /*
+ * No visible tuple for this index entry, so we need to
+ * advance to the next entry. Before doing so, count heap
+ * page fetches and give up if we've done too many.
+ *
+ * We don't charge a page fetch if this is the same heap
+ * page as the previous tuple. This is on the
+ * conservative side, since other recently-accessed pages
+ * are probably still in buffers too; but it's good enough
+ * for this heuristic.
+ */
#define VISITED_PAGES_LIMIT 100
- if (block != last_heap_block)
- {
- last_heap_block = block;
- n_visited_heap_pages++;
- if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
- break;
- }
+ if (block != last_heap_block)
+ {
+ last_heap_block = block;
+ n_visited_heap_pages++;
+ if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
+ break;
+ }
- continue; /* no visible tuple, try next index entry */
- }
+ continue; /* no visible tuple, try next index entry */
+ }
- /* We don't actually need the heap tuple for anything */
- ExecClearTuple(tableslot);
+ /* We don't actually need the heap tuple for anything */
+ ExecClearTuple(tableslot);
- /*
- * We don't care whether there's more than one visible tuple in
- * the HOT chain; if any are visible, that's good enough.
- */
+ /*
+ * We don't care whether there's more than one visible tuple in
+ * the HOT chain; if any are visible, that's good enough.
+ */
+ break;
+ }
+ case TMVC_Visible:
+ break;
}
/*
--
2.45.2
Attachment: v10-0002-GIST-Fix-visibility-issues-in-IOS.patch
From 02d9d600bb384ee0d83f7165e6336f4dfd2ae038 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 22:55:24 +0100
Subject: [PATCH v10 2/5] GIST: Fix visibility issues in IOS
Previously, GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with GIST vacuum, and we now do
preliminary visibility checks to be used by IOS so that the IOS
infrastructure knows to recheck the heap page even if that page is now
ALL_VISIBLE.
Note: For PG17 and below, this needs some adaptations to use e.g.
VM_ALL_VISIBLE, and pack its fields in places that don't cause ABI
issues on 32-bit systems.
Idea from Heikki Linnakangas
Backpatch: 17-
---
src/include/access/gist_private.h | 27 ++++-
src/backend/access/gist/gistget.c | 159 ++++++++++++++++++++++-----
src/backend/access/gist/gistscan.c | 11 +-
src/backend/access/gist/gistvacuum.c | 6 +-
4 files changed, 164 insertions(+), 39 deletions(-)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..a4bc344381c 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/buffile.h"
#include "utils/hsearch.h"
#include "access/genam.h"
+#include "tableam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,8 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint8 visrecheck; /* Cached visibility check result for this
+ * heap pointer. */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -170,12 +173,24 @@ typedef struct GISTScanOpaqueData
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
- /* In a non-ordered search, returnable heap items are stored here: */
- GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
- OffsetNumber nPageData; /* number of valid items in array */
- OffsetNumber curPageData; /* next item to return */
- MemoryContext pageDataCxt; /* context holding the fetched tuples, for
- * index-only scans */
+ /* info used by Index-Only Scans */
+ Buffer vmbuf; /* reusable buffer for IOS' vm lookups */
+
+ union {
+ struct {
+ /* In a non-ordered search, returnable heap items are stored here: */
+ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nPageData; /* number of valid items in array */
+ OffsetNumber curPageData; /* next item to return */
+ MemoryContext pageDataCxt; /* context holding the fetched tuples,
+ * for index-only scans */
+ } nos;
+ struct {
+ /* In an ordered search, we use this as scratch space */
+ GISTSearchHeapItem *sortData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nsortData; /* number of items in sortData */
+ } os;
+ };
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 387d9972345..05a6eb0c300 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -17,6 +17,7 @@
#include "access/genam.h"
#include "access/gist_private.h"
#include "access/relscan.h"
+#include "access/tableam.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -394,10 +395,14 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
return;
}
- so->nPageData = so->curPageData = 0;
+ if (scan->numberOfOrderBys)
+ so->os.nsortData = 0;
+ else
+ so->nos.nPageData = so->nos.curPageData = 0;
+
scan->xs_hitup = NULL; /* might point into pageDataCxt */
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
/*
* We save the LSN of the page as we read it, so that we know whether it
@@ -457,9 +462,9 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* Non-ordered scan, so report tuples in so->pageData[]
*/
- so->pageData[so->nPageData].heapPtr = it->t_tid;
- so->pageData[so->nPageData].recheck = recheck;
- so->pageData[so->nPageData].offnum = i;
+ so->nos.pageData[so->nos.nPageData].heapPtr = it->t_tid;
+ so->nos.pageData[so->nos.nPageData].recheck = recheck;
+ so->nos.pageData[so->nos.nPageData].offnum = i;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -467,12 +472,12 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
*/
if (scan->xs_want_itup)
{
- oldcxt = MemoryContextSwitchTo(so->pageDataCxt);
- so->pageData[so->nPageData].recontup =
+ oldcxt = MemoryContextSwitchTo(so->nos.pageDataCxt);
+ so->nos.pageData[so->nos.nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
}
- so->nPageData++;
+ so->nos.nPageData++;
}
else
{
@@ -501,7 +506,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ so->os.sortData[so->os.nsortData] = &item->data.heap;
+ so->os.nsortData += 1;
+ }
}
else
{
@@ -526,7 +535,97 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ /* Allow writes to the buffer, but don't yet allow VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * If we're in an index-only scan, we need to do visibility checks before
+ * we release the pin, so that VACUUM can't clean up dead tuples from this
+ * index page and mark the page ALL_VISIBLE before the tuple was returned.
+ *
+ * See also docs section "Index Locking Considerations".
+ */
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp op;
+ op.vmbuf = &so->vmbuf;
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ op.nchecktids = so->os.nsortData;
+
+ if (op.nchecktids > 0)
+ {
+ op.checktids = palloc(op.nchecktids * sizeof(TM_VisCheck));
+
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ op.checktids[off].vischeckresult = TMVC_Unchecked;
+ op.checktids[off].tid = so->os.sortData[off]->heapPtr;
+ op.checktids[off].idxoffnum = off;
+ Assert(ItemPointerIsValid(&op.checktids[off].tid));
+ }
+ }
+ }
+ else
+ {
+ op.nchecktids = so->nos.nPageData;
+
+ if (op.nchecktids > 0)
+ {
+ op.checktids = palloc_array(TM_VisCheck, op.nchecktids);
+
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ op.checktids[off].vischeckresult = TMVC_Unchecked;
+ op.checktids[off].tid = so->nos.pageData[off].heapPtr;
+ op.checktids[off].idxoffnum = off;
+ Assert(ItemPointerIsValid(&op.checktids[off].tid));
+ }
+ }
+ }
+
+ if (op.nchecktids > 0)
+ {
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = so->os.sortData[check->idxoffnum];
+
+ /* sanity checks */
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&item->heapPtr, &check->tid));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ /* reset state */
+ so->os.nsortData = 0;
+ }
+ else
+ {
+ for (int off = 0; off < op.nchecktids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = &so->nos.pageData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&item->heapPtr, &check->tid));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ }
+
+ /* clean up the used resources */
+ pfree(op.checktids);
+ }
+ }
+
+ /* Allow VACUUM to process the buffer again */
+ ReleaseBuffer(buffer);
}
/*
@@ -588,7 +687,10 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ scan->xs_visrecheck = item->data.heap.visrecheck;
+ }
res = true;
}
else
@@ -629,10 +731,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->instrument->nsearches++;
so->firstCall = false;
- so->curPageData = so->nPageData = 0;
+ so->nos.curPageData = so->nos.nPageData = 0;
scan->xs_hitup = NULL;
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
fakeItem.blkno = GIST_ROOT_BLKNO;
memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
@@ -649,9 +751,9 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* Fetch tuples index-page-at-a-time */
for (;;)
{
- if (so->curPageData < so->nPageData)
+ if (so->nos.curPageData < so->nos.nPageData)
{
- if (scan->kill_prior_tuple && so->curPageData > 0)
+ if (scan->kill_prior_tuple && so->nos.curPageData > 0)
{
if (so->killedItems == NULL)
@@ -667,17 +769,20 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
}
if (so->numKilled < MaxIndexTuplesPerPage)
so->killedItems[so->numKilled++] =
- so->pageData[so->curPageData - 1].offnum;
+ so->nos.pageData[so->nos.curPageData - 1].offnum;
}
/* continuing to return tuples from a leaf page */
- scan->xs_heaptid = so->pageData[so->curPageData].heapPtr;
- scan->xs_recheck = so->pageData[so->curPageData].recheck;
+ scan->xs_heaptid = so->nos.pageData[so->nos.curPageData].heapPtr;
+ scan->xs_recheck = so->nos.pageData[so->nos.curPageData].recheck;
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
- scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ {
+ scan->xs_hitup = so->nos.pageData[so->nos.curPageData].recontup;
+ scan->xs_visrecheck = so->nos.pageData[so->nos.curPageData].visrecheck;
+ }
- so->curPageData++;
+ so->nos.curPageData++;
return true;
}
@@ -687,8 +792,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
* necessary
*/
if (scan->kill_prior_tuple
- && so->curPageData > 0
- && so->curPageData == so->nPageData)
+ && so->nos.curPageData > 0
+ && so->nos.curPageData == so->nos.nPageData)
{
if (so->killedItems == NULL)
@@ -704,7 +809,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
}
if (so->numKilled < MaxIndexTuplesPerPage)
so->killedItems[so->numKilled++] =
- so->pageData[so->curPageData - 1].offnum;
+ so->nos.pageData[so->nos.curPageData - 1].offnum;
}
/* find and process the next index page */
do
@@ -733,7 +838,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
gistScanPage(scan, item, item->distances, NULL, NULL);
pfree(item);
- } while (so->nPageData == 0);
+ } while (so->nos.nPageData == 0);
}
}
}
@@ -756,10 +861,10 @@ gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
scan->instrument->nsearches++;
/* Begin the scan by processing the root page */
- so->curPageData = so->nPageData = 0;
+ so->nos.curPageData = so->nos.nPageData = 0;
scan->xs_hitup = NULL;
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
fakeItem.blkno = GIST_ROOT_BLKNO;
memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..a59672ce979 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -204,9 +204,9 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
scan->xs_hitupdesc = so->giststate->fetchTupdesc;
/* Also create a memory context that will hold the returned tuples */
- so->pageDataCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST page data context",
- ALLOCSET_DEFAULT_SIZES);
+ so->nos.pageDataCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST page data context",
+ ALLOCSET_DEFAULT_SIZES);
}
/* create new, empty pairing heap for search queue */
@@ -347,6 +347,11 @@ void
gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index dd0d9d5006c..5a95b93236e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -289,10 +289,10 @@ restart:
info->strategy);
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
--
2.45.2
Attachment: v10-0003-SP-GIST-Fix-visibility-issues-in-IOS.patch
From 4190f1a502d6d9c5798eba18c82cfd34d5fa95ee Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 8 Mar 2025 01:15:08 +0100
Subject: [PATCH v10 3/5] SP-GIST: Fix visibility issues in IOS
Previously, SP-GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with SP-GIST vacuum, and we now do
preliminary visibility checks to be used by IOS so that the IOS
infrastructure knows to recheck the heap page even if that page is now
ALL_VISIBLE.
Note: For PG17 and below, this needs some adaptations to use e.g.
VM_ALL_VISIBLE, and pack its fields in places that work fine on 32-bit
systems, too.
Idea from Heikki Linnakangas
Backpatch: 17-
---
src/include/access/spgist_private.h | 9 +-
src/backend/access/spgist/spgscan.c | 187 ++++++++++++++++++++++++--
src/backend/access/spgist/spgvacuum.c | 2 +-
3 files changed, 184 insertions(+), 14 deletions(-)
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..63e970468c7 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "utils/geo_decls.h"
#include "utils/relcache.h"
+#include "tableam.h"
typedef struct SpGistOptions
@@ -175,7 +176,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
-
+ uint8 visrecheck; /* IOS: TMVC_Result of contained heap tuple */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
} SpGistSearchItem;
@@ -223,6 +224,7 @@ typedef struct SpGistScanOpaqueData
/* These fields are only used in amgettuple scans: */
bool want_itup; /* are we reconstructing tuples? */
+ Buffer vmbuf; /* IOS: used for table_index_vischeck_tuples */
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
@@ -235,6 +237,11 @@ typedef struct SpGistScanOpaqueData
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
+ /* support for IOS */
+ int nReorderThisPage;
+ uint8 *visrecheck; /* IOS vis check results, counted by nPtrs */
+ SpGistSearchItem **items; /* counted by nReorderThisPage */
+
/*
* Note: using MaxIndexTuplesPerPage above is a bit hokey since
* SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 25893050c58..06dc255c32d 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ TMVC_Result visrecheck);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -142,6 +143,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->visrecheck = TMVC_Unchecked;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -386,6 +388,19 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
if (scankey && scan->numberOfKeys > 0)
memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+ /* prepare index-only scan requirements */
+ so->nReorderThisPage = 0;
+ if (scan->xs_want_itup)
+ {
+ if (so->visrecheck == NULL)
+ so->visrecheck = palloc(MaxIndexTuplesPerPage);
+
+ if (scan->numberOfOrderBys > 0 && so->items == NULL)
+ {
+ so->items = palloc_array(SpGistSearchItem *, MaxIndexTuplesPerPage);
+ }
+ }
+
/* initialize order-by data if needed */
if (orderbys && scan->numberOfOrderBys > 0)
{
@@ -453,6 +468,9 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
pfree(so);
}
@@ -502,6 +520,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
item->isLeaf = true;
item->recheck = recheck;
item->recheckDistances = recheckDistances;
+ item->visrecheck = TMVC_Unchecked;
return item;
}
@@ -584,6 +603,14 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ if (so->want_itup)
+ {
+ Assert(PointerIsValid(so->items));
+
+ so->items[so->nReorderThisPage] = heapItem;
+ so->nReorderThisPage++;
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -593,7 +620,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, TMVC_Unchecked);
*reportedSome = true;
}
}
@@ -806,6 +833,93 @@ spgTestLeafTuple(SpGistScanOpaque so,
return SGLT_GET_NEXTOFFSET(leafTuple);
}
+/*
+ * Populate so->visrecheck based on tuples which are cached for a currently
+ * pinned page.
+ */
+static void
+spgPopulateUnorderedVischecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ Assert(scan->numberOfOrderBys == 0);
+
+ if (so->nPtrs == 0)
+ return;
+
+ op.nchecktids = so->nPtrs;
+ op.checktids = palloc_array(TM_VisCheck, so->nPtrs);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ op.checktids[i].idxoffnum = i;
+ op.checktids[i].vischeckresult = TMVC_Unchecked;
+ op.checktids[i].tid = so->heapPtrs[i];
+
+ Assert(ItemPointerIsValid(&op.checktids[i].tid));
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(ItemPointerEquals(&so->heapPtrs[check->idxoffnum],
+ &check->tid));
+ Assert(check->idxoffnum < op.nchecktids);
+
+ so->visrecheck[check->idxoffnum] = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+}
+
+/* Populate so->visrecheck based on currently cached tuples */
+static void
+spgPopulateOrderedVisChecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ if (so->nReorderThisPage == 0)
+ return;
+
+ Assert(so->nReorderThisPage > 0);
+ Assert(scan->numberOfOrderBys > 0);
+ Assert(PointerIsValid(so->items));
+
+ op.nchecktids = so->nReorderThisPage;
+ op.checktids = palloc_array(TM_VisCheck, so->nReorderThisPage);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ op.checktids[i].idxoffnum = i;
+ op.checktids[i].vischeckresult = TMVC_Unchecked;
+ op.checktids[i].tid = so->items[i]->heapPtr;
+
+ Assert(ItemPointerIsValid(&so->items[i]->heapPtr));
+ Assert(so->items[i]->isLeaf);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.nchecktids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->idxoffnum < op.nchecktids);
+ Assert(ItemPointerEquals(&check->tid,
+ &so->items[check->idxoffnum]->heapPtr));
+
+ so->items[check->idxoffnum]->visrecheck = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+ so->nReorderThisPage = 0;
+}
+
/*
* Walk the tree and report all tuples passing the scan quals to the storeRes
* subroutine.
@@ -814,8 +928,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
* next page boundary once we have reported at least one tuple.
*/
static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
- storeRes_func storeRes)
+spgWalk(IndexScanDesc scan, Relation index, SpGistScanOpaque so,
+ bool scanWholeIndex, storeRes_func storeRes)
{
Buffer buffer = InvalidBuffer;
bool reportedSome = false;
@@ -835,9 +949,23 @@ redirect:
{
/* We store heap items in the queue only in case of ordered search */
Assert(so->numberOfNonNullOrderBys > 0);
+
+ /*
+ * If an item we found on a page is retrieved immediately after
+ * processing that page, we won't yet have released the page pin,
+ * and thus won't yet have processed the visibility data of the
+ * page's (now) ordered tuples.
+ * Do that now, so that all tuples on the page we're about to
+ * unpin have been checked for visibility before we return any.
+ */
+ if (so->want_itup && so->nReorderThisPage)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ Assert(!so->want_itup || item->visrecheck != TMVC_Unchecked);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->visrecheck);
reportedSome = true;
}
else
@@ -854,7 +982,15 @@ redirect:
}
else if (blkno != BufferGetBlockNumber(buffer))
{
- UnlockReleaseBuffer(buffer);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ Assert(so->numberOfOrderBys >= 0);
+ if (so->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+
+ ReleaseBuffer(buffer);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
}
@@ -922,16 +1058,36 @@ redirect:
}
if (buffer != InvalidBuffer)
- UnlockReleaseBuffer(buffer);
-}
+ {
+ /* Unlock the buffer for concurrent accesses except VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * If we're in an index-only scan, pre-check visibility of the tuples,
+ * so we can drop the pin without causing visibility bugs.
+ */
+ if (so->want_itup)
+ {
+ Assert(scan->numberOfOrderBys >= 0);
+ if (scan->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+ }
+
+ /* Release the page */
+ ReleaseBuffer(buffer);
+ }
+}
/* storeRes subroutine for getbitmap case */
static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ TMVC_Result visres)
{
Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
@@ -949,7 +1105,7 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
so->tbm = tbm;
so->ntids = 0;
- spgWalk(scan->indexRelation, so, true, storeBitmap);
+ spgWalk(scan, scan->indexRelation, so, true, storeBitmap);
return so->ntids;
}
@@ -959,12 +1115,15 @@ static void
storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+ bool recheckDistances, double *nonNullDistances,
+ TMVC_Result visres)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
so->recheck[so->nPtrs] = recheck;
so->recheckDistances[so->nPtrs] = recheckDistances;
+ if (so->want_itup)
+ so->visrecheck[so->nPtrs] = visres;
if (so->numberOfOrderBys > 0)
{
@@ -1041,6 +1200,10 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_heaptid = so->heapPtrs[so->iPtr];
scan->xs_recheck = so->recheck[so->iPtr];
scan->xs_hitup = so->reconTups[so->iPtr];
+ if (so->want_itup)
+ scan->xs_visrecheck = so->visrecheck[so->iPtr];
+
+ Assert(!scan->xs_want_itup || scan->xs_visrecheck != TMVC_Unchecked);
if (so->numberOfOrderBys > 0)
index_store_float8_orderby_distances(scan, so->orderByTypes,
@@ -1070,7 +1233,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
}
so->iPtr = so->nPtrs = 0;
- spgWalk(scan->indexRelation, so, false, storeGettuple);
+ spgWalk(scan, scan->indexRelation, so, false, storeGettuple);
if (so->nPtrs == 0)
break; /* must have completed scan */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..993d4a5b662 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -629,7 +629,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
--
2.45.2
On Fri, 21 Mar 2025 at 17:14, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached is v10, which polishes the previous patches, and adds a patch
> for nbtree to use the new visibility checking strategy so that it too
> can release its index pages much earlier, and adds a similar
> visibility check test to nbtree.
And here's v12. v11 (skipped) would've been a rebase, but after
finishing the rebase I noticed a severe regression in btree's IOS with
the new code, so v12 here applies some optimizations which reduce the
overhead of the new code.
Given its TableAM API changes it'd be nice to have a review of 0001,
though the additions could be rewritten to not (yet) add
TableAMRoutine.
I think patches 1, 2 and 3 are relevant to PG18 (as long as we don't
have a beta, and this is only a bit more than a bugfix). Patch 4 is
for PG19 to get btree to implement the new API, too, and patch 5
contains tests similar to the bitmap scan tests, validating that IOS
doesn't block VACUUM but still returns correct results.
I'll try to figure out a patch that's backpatchable, as an alternative to
patches 2 and 3, or at least for back-patching into PG17-. That will
arrive separately, though.
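
To make the overall pattern easier to review: each index AM in these
patches collects the heap TIDs it found on a leaf page, runs one batched
visibility check while it still holds the pin on that page, and only then
releases the pin. Here's a rough, illustrative sketch of that pattern
using the names introduced in 0001 (the helpers in the attached patches
are the authoritative versions; this loop is not taken from them, and
'tids'/'visresults' stand in for whatever per-page state the AM keeps):

    #include "access/relscan.h"
    #include "access/tableam.h"

    /*
     * Illustrative only: batch visibility checks for the TIDs collected
     * from one leaf page while the page pin is still held.
     */
    static void
    example_batch_vischeck(IndexScanDesc scan, ItemPointerData *tids,
                           uint8 *visresults, int ntids, Buffer *vmbuf)
    {
        TM_IndexVisibilityCheckOp op;

        op.checkntids = ntids;
        op.checktids = palloc_array(TM_VisCheck, ntids);
        op.vmbuf = vmbuf;        /* VM buffer reused across calls */

        /* queue one check per TID; idxoffnum maps results back to our arrays */
        for (int i = 0; i < ntids; i++)
            PopulateTMVischeck(&op.checktids[i], &tids[i], i);

        /* one call resolves every queued check against the visibility map */
        table_index_vischeck_tuples(scan->heapRelation, &op);

        for (int i = 0; i < ntids; i++)
            visresults[op.checktids[i].idxoffnum] = op.checktids[i].vischeckresult;

        pfree(op.checktids);
        /* only now is it safe to drop the pin on the index leaf page */
    }

The cached results are then consulted by nodeIndexonlyscan.c through
scan->xs_visrecheck instead of a fresh VM_ALL_VISIBLE lookup.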
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v12-0002-GIST-Fix-visibility-issues-in-IOS.patch
From d0f1f588b42522ab5c177ad55be368f3d281a565 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 22:55:24 +0100
Subject: [PATCH v12 2/5] GIST: Fix visibility issues in IOS
Previously, GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, index page pins conflict with GIST vacuum, and we do
preliminary visibility checks whose results are passed along to the IOS
infrastructure, so that it knows to recheck the heap page even if that
page has since been marked ALL_VISIBLE.
Idea from Heikki Linnakangas
---
src/backend/access/gist/gistget.c | 163 ++++++++++++++++++++++-----
src/backend/access/gist/gistscan.c | 11 +-
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 27 ++++-
4 files changed, 168 insertions(+), 39 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 387d9972345..16fda28d4ad 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -17,6 +17,7 @@
#include "access/genam.h"
#include "access/gist_private.h"
#include "access/relscan.h"
+#include "access/tableam.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -394,10 +395,14 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
return;
}
- so->nPageData = so->curPageData = 0;
+ if (scan->numberOfOrderBys)
+ so->os.nsortData = 0;
+ else
+ so->nos.nPageData = so->nos.curPageData = 0;
+
scan->xs_hitup = NULL; /* might point into pageDataCxt */
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
/*
* We save the LSN of the page as we read it, so that we know whether it
@@ -457,9 +462,9 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
/*
* Non-ordered scan, so report tuples in so->pageData[]
*/
- so->pageData[so->nPageData].heapPtr = it->t_tid;
- so->pageData[so->nPageData].recheck = recheck;
- so->pageData[so->nPageData].offnum = i;
+ so->nos.pageData[so->nos.nPageData].heapPtr = it->t_tid;
+ so->nos.pageData[so->nos.nPageData].recheck = recheck;
+ so->nos.pageData[so->nos.nPageData].offnum = i;
/*
* In an index-only scan, also fetch the data from the tuple. The
@@ -467,12 +472,12 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
*/
if (scan->xs_want_itup)
{
- oldcxt = MemoryContextSwitchTo(so->pageDataCxt);
- so->pageData[so->nPageData].recontup =
+ oldcxt = MemoryContextSwitchTo(so->nos.pageDataCxt);
+ so->nos.pageData[so->nos.nPageData].recontup =
gistFetchTuple(giststate, r, it);
MemoryContextSwitchTo(oldcxt);
}
- so->nPageData++;
+ so->nos.nPageData++;
}
else
{
@@ -501,7 +506,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
* In an index-only scan, also fetch the data from the tuple.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ so->os.sortData[so->os.nsortData] = &item->data.heap;
+ so->os.nsortData += 1;
+ }
}
else
{
@@ -526,7 +535,101 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ /* Allow writes to the buffer, but don't yet allow VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * If we're in an index-only scan, we need to do visibility checks before
+ * we release the pin, so that VACUUM can't clean up dead tuples from this
+ * index page and mark the page ALL_VISIBLE before the tuple was returned.
+ *
+ * See also docs section "Index Locking Considerations".
+ */
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp op;
+ op.vmbuf = &so->vmbuf;
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ op.checkntids = so->os.nsortData;
+
+ if (op.checkntids > 0)
+ {
+ op.checktids = palloc(op.checkntids * sizeof(TM_VisCheck));
+
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->os.sortData[off]->heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->os.sortData[off]->heapPtr,
+ off);
+ }
+ }
+ }
+ else
+ {
+ op.checkntids = so->nos.nPageData;
+
+ if (op.checkntids > 0)
+ {
+ op.checktids = palloc_array(TM_VisCheck, op.checkntids);
+
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->nos.pageData[off].heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->nos.pageData[off].heapPtr,
+ off);
+ }
+ }
+ }
+
+ if (op.checkntids > 0)
+ {
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = so->os.sortData[check->idxoffnum];
+
+ /* sanity checks */
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ /* reset state */
+ so->os.nsortData = 0;
+ }
+ else
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = &so->nos.pageData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ }
+
+ /* clean up the used resources */
+ pfree(op.checktids);
+ }
+ }
+
+ /* Allow VACUUM to process the buffer again */
+ ReleaseBuffer(buffer);
}
/*
@@ -588,7 +691,10 @@ getNextNearest(IndexScanDesc scan)
/* in an index-only scan, also return the reconstructed tuple. */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ scan->xs_visrecheck = item->data.heap.visrecheck;
+ }
res = true;
}
else
@@ -629,10 +735,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->instrument->nsearches++;
so->firstCall = false;
- so->curPageData = so->nPageData = 0;
+ so->nos.curPageData = so->nos.nPageData = 0;
scan->xs_hitup = NULL;
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
fakeItem.blkno = GIST_ROOT_BLKNO;
memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
@@ -649,9 +755,9 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* Fetch tuples index-page-at-a-time */
for (;;)
{
- if (so->curPageData < so->nPageData)
+ if (so->nos.curPageData < so->nos.nPageData)
{
- if (scan->kill_prior_tuple && so->curPageData > 0)
+ if (scan->kill_prior_tuple && so->nos.curPageData > 0)
{
if (so->killedItems == NULL)
@@ -667,17 +773,20 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
}
if (so->numKilled < MaxIndexTuplesPerPage)
so->killedItems[so->numKilled++] =
- so->pageData[so->curPageData - 1].offnum;
+ so->nos.pageData[so->nos.curPageData - 1].offnum;
}
/* continuing to return tuples from a leaf page */
- scan->xs_heaptid = so->pageData[so->curPageData].heapPtr;
- scan->xs_recheck = so->pageData[so->curPageData].recheck;
+ scan->xs_heaptid = so->nos.pageData[so->nos.curPageData].heapPtr;
+ scan->xs_recheck = so->nos.pageData[so->nos.curPageData].recheck;
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
- scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ {
+ scan->xs_hitup = so->nos.pageData[so->nos.curPageData].recontup;
+ scan->xs_visrecheck = so->nos.pageData[so->nos.curPageData].visrecheck;
+ }
- so->curPageData++;
+ so->nos.curPageData++;
return true;
}
@@ -687,8 +796,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
* necessary
*/
if (scan->kill_prior_tuple
- && so->curPageData > 0
- && so->curPageData == so->nPageData)
+ && so->nos.curPageData > 0
+ && so->nos.curPageData == so->nos.nPageData)
{
if (so->killedItems == NULL)
@@ -704,7 +813,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
}
if (so->numKilled < MaxIndexTuplesPerPage)
so->killedItems[so->numKilled++] =
- so->pageData[so->curPageData - 1].offnum;
+ so->nos.pageData[so->nos.curPageData - 1].offnum;
}
/* find and process the next index page */
do
@@ -733,7 +842,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
gistScanPage(scan, item, item->distances, NULL, NULL);
pfree(item);
- } while (so->nPageData == 0);
+ } while (so->nos.nPageData == 0);
}
}
}
@@ -756,10 +865,10 @@ gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
scan->instrument->nsearches++;
/* Begin the scan by processing the root page */
- so->curPageData = so->nPageData = 0;
+ so->nos.curPageData = so->nos.nPageData = 0;
scan->xs_hitup = NULL;
- if (so->pageDataCxt)
- MemoryContextReset(so->pageDataCxt);
+ if (so->nos.pageDataCxt)
+ MemoryContextReset(so->nos.pageDataCxt);
fakeItem.blkno = GIST_ROOT_BLKNO;
memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 700fa959d03..a59672ce979 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -204,9 +204,9 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
scan->xs_hitupdesc = so->giststate->fetchTupdesc;
/* Also create a memory context that will hold the returned tuples */
- so->pageDataCxt = AllocSetContextCreate(so->giststate->scanCxt,
- "GiST page data context",
- ALLOCSET_DEFAULT_SIZES);
+ so->nos.pageDataCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST page data context",
+ ALLOCSET_DEFAULT_SIZES);
}
/* create new, empty pairing heap for search queue */
@@ -347,6 +347,11 @@ void
gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 6a359c98c60..d0b8afc252f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -325,10 +325,10 @@ restart:
recurse_to = InvalidBlockNumber;
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..a4bc344381c 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/buffile.h"
#include "utils/hsearch.h"
#include "access/genam.h"
+#include "tableam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,8 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint8 visrecheck; /* Cached visibility check result for this
+ * heap pointer. */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -170,12 +173,24 @@ typedef struct GISTScanOpaqueData
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
- /* In a non-ordered search, returnable heap items are stored here: */
- GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
- OffsetNumber nPageData; /* number of valid items in array */
- OffsetNumber curPageData; /* next item to return */
- MemoryContext pageDataCxt; /* context holding the fetched tuples, for
- * index-only scans */
+ /* info used by Index-Only Scans */
+ Buffer vmbuf; /* reusable buffer for IOS' vm lookups */
+
+ union {
+ struct {
+ /* In a non-ordered search, returnable heap items are stored here: */
+ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nPageData; /* number of valid items in array */
+ OffsetNumber curPageData; /* next item to return */
+ MemoryContext pageDataCxt; /* context holding the fetched tuples,
+ * for index-only scans */
+ } nos;
+ struct {
+ /* In an ordered search, we use this as scratch space */
+ GISTSearchHeapItem *sortData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nsortData; /* number of items in sortData */
+ } os;
+ };
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.48.1
v12-0005-Test-for-IOS-Vacuum-race-conditions-in-index-AMs.patch
From e3a3f426b63c0ef20ddfd5d5bfb9f7edacc0641c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 21 Mar 2025 16:41:31 +0100
Subject: [PATCH v12 5/5] Test for IOS/Vacuum race conditions in index AMs
Add regression tests that demonstrate wrong results can occur with index-only
scans in GiST and SP-GiST indexes when encountering tuples being removed by a
concurrent VACUUM operation.
With these tests the index AMs are also expected to not block VACUUM even when
they're used inside a cursor.
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Co-authored-by: Peter Geoghegan <pg@bowt.ie>
Co-authored-by: Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-btree-vacuum.out | 59 +++++++++
.../expected/index-only-scan-gist-vacuum.out | 53 ++++++++
.../index-only-scan-spgist-vacuum.out | 53 ++++++++
src/test/isolation/isolation_schedule | 3 +
.../specs/index-only-scan-btree-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-gist-vacuum.spec | 112 +++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 112 +++++++++++++++++
7 files changed, 505 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-btree-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-btree-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-btree-vacuum.out b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
new file mode 100644
index 00000000000..9a9d94c86f6
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_asc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_asc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_desc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_desc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index e3c669a29c7..69909d4d911 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -18,6 +18,9 @@ test: two-ids
test: multiple-row-versions
test: index-only-scan
test: index-only-bitmapscan
+test: index-only-scan-btree-vacuum
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-btree-vacuum.spec b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
new file mode 100644
index 00000000000..9a00804c2c5
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing correct results with btree even with concurrent
+# vacuum
+
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_btree_a ON ios_needs_cleanup_lock USING btree (a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted_asc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+}
+step s1_prepare_sorted_desc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1, nor 10, so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum
+{
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+}
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_asc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_desc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # If the index scan doesn't correctly interlock its visibility tests with
+ # concurrent VACUUM cleanup then VACUUM will mark pages as all-visible that
+ # the scan in the next steps may then consider all-visible, despite some of
+ # those rows having been removed.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..9d241b25920
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with GiST even with concurrent vacuum
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..cd621d4f7f2
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with SP-GiST even with concurrent vacuum
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.48.1
v12-0001-IOS-TableAM-Support-AM-specific-fast-visibility-.patch
From 6bf950901f00bba930738d77ca16831ffdec8dd3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 7 Mar 2025 17:39:23 +0100
Subject: [PATCH v12 1/5] IOS/TableAM: Support AM-specific fast visibility
tests
Previously, we assumed VM_ALL_VISIBLE is universal across all AMs. This
is probably not the case, so we introduce a new table method called
"table_index_vischeck_tuples" which allows anyone to ask the AM whether
a tuple is definitely visible to everyone or might be invisible to
someone.
The API is intended to replace direct calls to VM_ALL_VISIBLE, and as such
doesn't include a "definitely dead to everyone" result: the Heap AM's VM
doesn't support *definitely dead* as an output of its lookups, so it would
be too expensive for the Heap AM to produce such results.
A future commit will use this inside GIST and SP-GIST to fix a race
condition between IOS and VACUUM, which causes a bug with tuple
visibility, and a further patch will add support for this to nbtree.
---
src/backend/access/heap/heapam.c | 177 +++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/heap/visibilitymap.c | 39 ++---
src/backend/access/index/indexam.c | 6 +
src/backend/access/table/tableamapi.c | 1 +
src/backend/executor/nodeIndexonlyscan.c | 83 +++++++----
src/backend/utils/adt/selfuncs.c | 76 ++++++----
src/include/access/heapam.h | 2 +
src/include/access/relscan.h | 5 +
src/include/access/tableam.h | 103 +++++++++++++
src/include/access/visibilitymapdefs.h | 19 +++
11 files changed, 430 insertions(+), 82 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c1a4de14a59..34acd2c06c0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -101,11 +101,37 @@ static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status
uint16 infomask, Relation rel, int *remaining,
bool logLockFailure);
static void index_delete_sort(TM_IndexDeleteOp *delstate);
+static inline int heap_ivc_process_block(Relation rel, Buffer *vmbuf,
+ TM_VisCheck *checks, int nchecks);
+static void heap_ivc_process_all(Relation rel, Buffer *vmbuf,
+ TM_VisCheck *checks, int nchecks);
static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+/* sort templates for TM_VisCheck arrays: by heap block number, and by index offset */
+#define ST_SORT heap_ivc_sortby_tidheapblk
+#define ST_ELEMENT_TYPE TM_VisCheck
+#define ST_DECLARE
+#define ST_DEFINE
+#define ST_SCOPE static inline
+#define ST_COMPARE(a, b) ( \
+ a->tidblkno < b->tidblkno ? -1 : ( \
+ a->tidblkno > b->tidblkno ? 1 : 0 \
+ ) \
+)
+
+#include "lib/sort_template.h"
+
+#define ST_SORT heap_ivc_sortby_idx
+#define ST_ELEMENT_TYPE TM_VisCheck
+#define ST_DECLARE
+#define ST_DEFINE
+#define ST_SCOPE static inline
+#define ST_COMPARE(a, b) (((int) a->idxoffnum) - ((int) b->idxoffnum))
+#include "lib/sort_template.h"
+
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -8750,6 +8776,157 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
+/*
+ * heapam implementation of tableam's index_vischeck_tuples interface.
+ *
+ * This helper function is called by index AMs during index-only scans,
+ * to do VM-based visibility checks on individual tuples, so that the AM
+ * can hold tuples in memory (e.g. for reordering) for extended periods of
+ * time without holding thousands of pins that would conflict with VACUUM.
+ *
+ * It's possible for this to generate a fair amount of I/O, since we may be
+ * checking hundreds of tuples from a single index block, but that is
+ * preferred over holding thousands of pins.
+ *
+ * We use heuristics to balance the costs of sorting TIDs with VM page
+ * lookups.
+ */
+void
+heap_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ Buffer vmbuf = *checkop->vmbuf;
+ Buffer storvmbuf = vmbuf;
+ TM_VisCheck *checks = checkop->checktids;
+ int checkntids = checkop->checkntids;
+ int upcomingvmbufchanges = 0;
+
+ /*
+ * The first index scan will have to pin the VM buffer, and that first
+ * change in the vm buffer shouldn't put us into the expensive VM page &
+ * sort path; so we special-case this operation.
+ */
+ if (!BufferIsValid(vmbuf))
+ {
+ int processed;
+ processed = heap_ivc_process_block(rel, &vmbuf, checks, checkntids);
+ checkntids -= processed;
+ checks += processed;
+ storvmbuf = vmbuf;
+ Assert(processed > 0);
+ }
+
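+ /*
+ * Fast path: keep processing runs of TIDs whose VM bits live on the VM
+ * page we already have pinned. Once a lookup had to switch to a
+ * different VM page, fall through to the heuristics below.
+ */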
+ while (vmbuf == storvmbuf && checkntids > 0)
+ {
+ int processed;
+
+ processed = heap_ivc_process_block(rel, &vmbuf, checks, checkntids);
+
+ Assert(processed <= checkntids);
+
+ checkntids -= processed;
+ checks += processed;
+ }
+
+ *checkop->vmbuf = vmbuf;
+
+ if (checkntids == 0)
+ {
+ return;
+ }
+
+ upcomingvmbufchanges = 0;
+
+ for (int i = 1; i < checkntids; i++)
+ {
+ /*
+ * Instead of storing the previous iteration's result, we only match
+ * the block numbers
+ */
+ BlockNumber lastblkno = checks[i - 1].tidblkno;
+ BlockNumber newblkno = checks[i].tidblkno;
+ /*
+ * divide-by-constant can be faster than BufferGetBlockNumber()
+ */
+ BlockNumber lastvmblkno = HEAPBLK_TO_VMBLOCK(lastblkno);
+ BlockNumber newvmblkno = HEAPBLK_TO_VMBLOCK(newblkno);
+
+ if (lastvmblkno != newvmblkno)
+ upcomingvmbufchanges++;
+ }
+
+ if (upcomingvmbufchanges <= pg_ceil_log2_32(checkntids))
+ {
+ /*
+ * No big amount of VM buf changes, so do all visibility checks
+ * without sorting.
+ */
+ heap_ivc_process_all(rel, checkop->vmbuf, checks, checkntids);
+
+ return;
+ }
+
+ /*
+ * Order the TIDs to heap order, so that we will only need to visit every
+ * VM page at most once.
+ */
+ heap_ivc_sortby_tidheapblk(checks, checkntids);
+
+ /* do all visibility checks */
+ heap_ivc_process_all(rel, checkop->vmbuf, checks, checkntids);
+
+ /* put the checks back in index order */
+ heap_ivc_sortby_idx(checks, checkntids);
+}
+
+
+static inline int
+heap_ivc_process_block(Relation rel, Buffer *vmbuf, TM_VisCheck *checks,
+ int nchecks)
+{
+ BlockNumber blkno;
+ BlockNumber prevblkno = blkno = checks->tidblkno;
+ TMVC_Result result;
+ int processed = 0;
+
+ if (VM_ALL_VISIBLE(rel, blkno, vmbuf))
+ result = TMVC_Visible;
+ else
+ result = TMVC_MaybeVisible;
+
+ do
+ {
+ checks->vischeckresult = result;
+
+ nchecks--;
+ processed++;
+ checks++;
+
+ if (nchecks <= 0)
+ return processed;
+
+ blkno = checks->tidblkno;
+ } while (blkno == prevblkno);
+
+ return processed;
+}
+
+static void
+heap_ivc_process_all(Relation rel, Buffer *vmbuf,
+ TM_VisCheck *checks, int nchecks)
+{
+ while (nchecks > 0)
+ {
+ int processed;
+
+ processed = heap_ivc_process_block(rel, vmbuf, checks, nchecks);
+
+ Assert(processed <= nchecks);
+
+ nchecks -= processed;
+ checks += processed;
+ }
+}
+
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..fe4b0b39da7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2648,6 +2648,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.index_delete_tuples = heap_index_delete_tuples,
+ .index_vischeck_tuples = heap_index_vischeck_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 745a04ef26e..ae71c0a6d6e 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -107,17 +107,6 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of heap blocks we can represent in one byte */
-#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
-
-/* Number of heap blocks we can represent in one visibility map page. */
-#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
-
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
-
/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
@@ -137,9 +126,9 @@ static Buffer vm_extend(Relation rel, BlockNumber vm_nblocks);
bool
visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- int mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+ BlockNumber mapBlock = HEAPBLK_TO_VMBLOCK(heapBlk);
+ int mapByte = HEAPBLK_TO_VMBYTE(heapBlk);
+ int mapOffset = HEAPBLK_TO_VMOFFSET(heapBlk);
uint8 mask = flags << mapOffset;
char *map;
bool cleared = false;
@@ -190,7 +179,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags
void
visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ BlockNumber mapBlock = HEAPBLK_TO_VMBLOCK(heapBlk);
/* Reuse the old pinned buffer if possible */
if (BufferIsValid(*vmbuf))
@@ -214,7 +203,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
bool
visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ BlockNumber mapBlock = HEAPBLK_TO_VMBLOCK(heapBlk);
return BufferIsValid(vmbuf) && BufferGetBlockNumber(vmbuf) == mapBlock;
}
@@ -247,9 +236,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
uint8 flags)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+ BlockNumber mapBlock = HEAPBLK_TO_VMBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_VMBYTE(heapBlk);
+ uint8 mapOffset = HEAPBLK_TO_VMOFFSET(heapBlk);
Page page;
uint8 *map;
uint8 status;
@@ -340,9 +329,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8
visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+ BlockNumber mapBlock = HEAPBLK_TO_VMBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_VMBYTE(heapBlk);
+ uint8 mapOffset = HEAPBLK_TO_VMOFFSET(heapBlk);
char *map;
uint8 result;
@@ -445,9 +434,9 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
BlockNumber newnblocks;
/* last remaining block, byte, and bit */
- BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
- uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
- uint8 truncOffset = HEAPBLK_TO_OFFSET(nheapblocks);
+ BlockNumber truncBlock = HEAPBLK_TO_VMBLOCK(nheapblocks);
+ uint32 truncByte = HEAPBLK_TO_VMBYTE(nheapblocks);
+ uint8 truncOffset = HEAPBLK_TO_VMOFFSET(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 219df1971da..61d1f08220d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -628,6 +628,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* XXX: we should assert that a snapshot is pushed or registered */
Assert(TransactionIdIsValid(RecentXmin));
+ /*
+ * Reset xs_visrecheck, so we don't confuse the next tuple's visibility
+ * state with that of the previous.
+ */
+ scan->xs_visrecheck = TMVC_Unchecked;
+
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_heaptid. It should also set
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 476663b66aa..b3ce90ceaea 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -61,6 +61,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->index_delete_tuples != NULL);
+ Assert(routine->index_vischeck_tuples != NULL);
Assert(routine->tuple_insert != NULL);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f464cca9507..e02fc1652ff 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -121,6 +121,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
+ TMVC_Result vischeck = scandesc->xs_visrecheck;
CHECK_FOR_INTERRUPTS();
@@ -128,6 +129,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
+ * The index may have already pre-checked the visibility of the tuple
+ * for us, and stored the result in xs_visrecheck, in which case we
+ * can skip the call.
*
* Note on Memory Ordering Effects: visibilitymap_get_status does not
* lock the visibility map buffer, and therefore the result we read
@@ -157,37 +161,60 @@ IndexOnlyNext(IndexOnlyScanState *node)
*
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
+ *
+ * The index doing these checks for us doesn't materially change these
+ * considerations.
*/
- if (!VM_ALL_VISIBLE(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
- {
- /*
- * Rats, we have to visit the heap to check visibility.
- */
- InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
- continue; /* no visible tuple, try next index entry */
+ if (vischeck == TMVC_Unchecked)
+ vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
+ &node->ioss_VMBuffer,
+ tid);
- ExecClearTuple(node->ioss_TableSlot);
-
- /*
- * Only MVCC snapshots are supported here, so there should be no
- * need to keep following the HOT chain once a visible entry has
- * been found. If we did want to allow that, we'd need to keep
- * more state to remember not to call index_getnext_tid next time.
- */
- if (scandesc->xs_heap_continue)
- elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+ Assert(vischeck != TMVC_Unchecked);
- /*
- * Note: at this point we are holding a pin on the heap page, as
- * recorded in scandesc->xs_cbuf. We could release that pin now,
- * but it's not clear whether it's a win to do so. The next index
- * entry might require a visit to the same heap page.
- */
-
- tuple_from_heap = true;
+ switch (vischeck)
+ {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ /*
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't return, and which have -Wimplicit-fallthrough:
+ */
+ /* fallthrough */
+ case TMVC_MaybeVisible:
+ {
+ /*
+ * Rats, we have to visit the heap to check visibility.
+ */
+ InstrCountTuples2(node, 1);
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+ continue; /* no visible tuple, try next index entry */
+
+ ExecClearTuple(node->ioss_TableSlot);
+
+ /*
+ * Only MVCC snapshots are supported here, so there should be
+ * no need to keep following the HOT chain once a visible
+ * entry has been found. If we did want to allow that, we'd
+ * need to keep more state to remember not to call
+ * index_getnext_tid next time.
+ */
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
+ /*
+ * Note: at this point we are holding a pin on the heap page,
+ * as recorded in scandesc->xs_cbuf. We could release that
+ * pin now, but it's not clear whether it's a win to do so.
+ * The next index entry might require a visit to the same heap
+ * page.
+ */
+
+ tuple_from_heap = true;
+ break;
+ }
+ case TMVC_Visible:
+ break;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index a96b1b9c0bc..035bd7a82be 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6730,44 +6730,62 @@ get_actual_variable_endpoint(Relation heapRel,
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
BlockNumber block = ItemPointerGetBlockNumber(tid);
+ TMVC_Result visres = index_scan->xs_visrecheck;
- if (!VM_ALL_VISIBLE(heapRel,
- block,
- &vmbuffer))
+ if (visres == TMVC_Unchecked)
+ visres = table_index_vischeck_tuple(heapRel, &vmbuffer, tid);
+
+ Assert(visres != TMVC_Unchecked);
+
+ switch (visres)
{
- /* Rats, we have to visit the heap to check visibility */
- if (!index_fetch_heap(index_scan, tableslot))
- {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
/*
- * No visible tuple for this index entry, so we need to
- * advance to the next entry. Before doing so, count heap
- * page fetches and give up if we've done too many.
- *
- * We don't charge a page fetch if this is the same heap page
- * as the previous tuple. This is on the conservative side,
- * since other recently-accessed pages are probably still in
- * buffers too; but it's good enough for this heuristic.
+ * In case of compilers that don't understand that elog(ERROR)
+ * doesn't return, and which have -Wimplicit-fallthrough:
*/
+ /* fallthrough */
+ case TMVC_MaybeVisible:
+ {
+ /* Rats, we have to visit the heap to check visibility */
+ if (!index_fetch_heap(index_scan, tableslot))
+ {
+ /*
+ * No visible tuple for this index entry, so we need to
+ * advance to the next entry. Before doing so, count heap
+ * page fetches and give up if we've done too many.
+ *
+ * We don't charge a page fetch if this is the same heap
+ * page as the previous tuple. This is on the
+ * conservative side, since other recently-accessed pages
+ * are probably still in buffers too; but it's good enough
+ * for this heuristic.
+ */
#define VISITED_PAGES_LIMIT 100
- if (block != last_heap_block)
- {
- last_heap_block = block;
- n_visited_heap_pages++;
- if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
- break;
- }
+ if (block != last_heap_block)
+ {
+ last_heap_block = block;
+ n_visited_heap_pages++;
+ if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
+ break;
+ }
- continue; /* no visible tuple, try next index entry */
- }
+ continue; /* no visible tuple, try next index entry */
+ }
- /* We don't actually need the heap tuple for anything */
- ExecClearTuple(tableslot);
+ /* We don't actually need the heap tuple for anything */
+ ExecClearTuple(tableslot);
- /*
- * We don't care whether there's more than one visible tuple in
- * the HOT chain; if any are visible, that's good enough.
- */
+ /*
+ * We don't care whether there's more than one visible tuple in
+ * the HOT chain; if any are visible, that's good enough.
+ */
+ break;
+ }
+ case TMVC_Visible:
+ break;
}
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..1b66aa0bacc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -368,6 +368,8 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
+extern void heap_index_vischeck_tuples(Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
/* in heap/pruneheap.c */
struct GlobalVisState;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..93a6f65ab0e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -26,6 +26,9 @@
struct ParallelTableScanDescData;
+enum TMVC_Result;
+
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -176,6 +179,8 @@ typedef struct IndexScanDescData
bool xs_recheck; /* T means scan keys must be rechecked */
+ int xs_visrecheck; /* TMVC_Result from tableam.h */
+
/*
* When fetching with an ordering operator, the values of the ORDER BY
* expressions of the last returned tuple, according to the index. If
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..47666cf96ea 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -248,6 +248,63 @@ typedef struct TM_IndexDeleteOp
TM_IndexStatus *status;
} TM_IndexDeleteOp;
+/*
+ * State used when calling table_index_vischeck_tuples()
+ *
+ * Index-only scans need to know the visibility of the associated table tuples
+ * before they can return the index tuple. If the index tuple is known to be
+ * visible with a cheap check, we can return it directly without separately
+ * requesting the visibility info from the table AM.
+ *
+ * This table AM API exposes a cheap visibility check to index AMs, letting
+ * an index AM look up the visibility of many tuples at once and cache the
+ * results. That improves pinning ergonomics: a scan can keep index tuples
+ * in local memory without having to hold pins on those tuples' index pages
+ * until the tuples are returned.
+ *
+ * The AM is called with a list of TIDs, and its output will indicate the
+ * visibility state of each tuple: Unchecked, MaybeVisible, or Visible.
+ *
+ * HeapAM's visibility map only supports cheap checks for *definitely
+ * visible*; every other result is reported as *maybe visible*. There is no
+ * result for *definitely not visible* (i.e. dead), for lack of table AMs
+ * that could support such visibility lookups cheaply.
+ */
+typedef enum TMVC_Result
+{
+ TMVC_Unchecked,
+ TMVC_Visible,
+ TMVC_MaybeVisible,
+} TMVC_Result;
+
+typedef struct TM_VisCheck
+{
+ /* table TID from index tuple */
+ BlockNumber tidblkno;
+ uint16 tidoffset;
+ /* identifier for the TID in this visibility check operation context */
+ OffsetNumber idxoffnum;
+ /* the result of the visibility check operation */
+ TMVC_Result vischeckresult;
+} TM_VisCheck;
+
+static inline void
+PopulateTMVischeck(TM_VisCheck *check, ItemPointer tid, OffsetNumber idxoff)
+{
+ Assert(ItemPointerIsValid(tid));
+ check->tidblkno = ItemPointerGetBlockNumberNoCheck(tid);
+ check->tidoffset = ItemPointerGetOffsetNumberNoCheck(tid);
+ check->idxoffnum = idxoff;
+ check->vischeckresult = TMVC_Unchecked;
+}
+
+typedef struct TM_IndexVisibilityCheckOp
+{
+ int checkntids; /* number of TIDs to check */
+ Buffer *vmbuf; /* pointer to VM buffer to reuse across calls */
+ TM_VisCheck *checktids; /* the checks to execute */
+} TM_IndexVisibilityCheckOp;
+
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -494,6 +551,10 @@ typedef struct TableAmRoutine
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
+ /* see table_index_vischeck_tuples() */
+ void (*index_vischeck_tuples) (Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
+
/* ------------------------------------------------------------------------
* Manipulations of physical tuples.
@@ -1318,6 +1379,48 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
+/*
+ * Determine rough visibility information of index tuples based on each TID.
+ *
+ * Determines which entries from the index AM caller's TM_IndexVisibilityCheckOp
+ * state point to TMVC_Visible or TMVC_MaybeVisible table tuples, at low IO
+ * overhead. For the heap AM, the implementation is effectively a wrapper
+ * around VM_ALL_VISIBLE.
+ *
+ * On return, all TM_VisChecks indicated by checkop->checktids will have been
+ * updated with the correct visibility status.
+ *
+ * Note that there is no value for "definitely dead" tuples, as the Heap AM
+ * doesn't have an efficient method to determine that a tuple is dead to all
+ * users, as it would have to go into the heap. If and when AMs are built
+ * that would support VM checks with an equivalent to VM_ALL_DEAD this
+ * decision can be reconsidered.
+ */
+static inline void
+table_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ rel->rd_tableam->index_vischeck_tuples(rel, checkop);
+}
+
+static inline TMVC_Result
+table_index_vischeck_tuple(Relation rel, Buffer *vmbuffer, ItemPointer tid)
+{
+ TM_IndexVisibilityCheckOp checkOp;
+ TM_VisCheck op;
+
+ PopulateTMVischeck(&op, tid, 0);
+
+ checkOp.checktids = &op;
+ checkOp.checkntids = 1;
+ checkOp.vmbuf = vmbuffer;
+
+ rel->rd_tableam->index_vischeck_tuples(rel, &checkOp);
+
+ Assert(op.vischeckresult != TMVC_Unchecked);
+
+ return op.vischeckresult;
+}
+
/* ----------------------------------------------------------------------------
* Functions for manipulations of physical tuples.
diff --git a/src/include/access/visibilitymapdefs.h b/src/include/access/visibilitymapdefs.h
index 5ad5c020877..c75303f63fd 100644
--- a/src/include/access/visibilitymapdefs.h
+++ b/src/include/access/visibilitymapdefs.h
@@ -12,6 +12,7 @@
*/
#ifndef VISIBILITYMAPDEFS_H
#define VISIBILITYMAPDEFS_H
+#include "storage/bufpage.h"
/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2
@@ -31,4 +32,22 @@
#define VISIBILITYMAP_XLOG_CATALOG_REL 0x04
#define VISIBILITYMAP_XLOG_VALID_BITS (VISIBILITYMAP_VALID_BITS | VISIBILITYMAP_XLOG_CATALOG_REL)
+/*
+ * Size of the bitmap on each visibility map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define VM_MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/* Number of heap blocks we can represent in one byte */
+#define VM_HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
+
+/* Number of heap blocks we can represent in one visibility map page. */
+#define VM_HEAPBLOCKS_PER_PAGE (VM_MAPSIZE * VM_HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the visibility map */
+#define HEAPBLK_TO_VMBLOCK(x) ((x) / VM_HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_VMBYTE(x) (((x) % VM_HEAPBLOCKS_PER_PAGE) / VM_HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_VMOFFSET(x) (((x) % VM_HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
#endif /* VISIBILITYMAPDEFS_H */
--
2.48.1
Attachment: v12-0004-NBTree-Reduce-Index-Only-Scan-pin-duration.patch
From 045942933860b18a26e4c8239b3df350a23caab9 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 20 Mar 2025 23:12:25 +0100
Subject: [PATCH v12 4/5] NBTree: Reduce Index-Only Scan pin duration
Previously, we would keep a pin on every leaf page while we were returning
tuples to the scan. With this patch, we utilize the newly introduced
table_index_vischeck_tuples API to pre-check visibility of all TIDs, and
thus unpin the page well ahead of when we'd usually be ready with returning
and processing all index tuple results. This reduces the time VACUUM may
have to wait for a pin, and can increase performance with reduced redundant
VM checks.
---
src/backend/access/nbtree/nbtree.c | 21 +++++
src/backend/access/nbtree/nbtsearch.c | 125 ++++++++++++++++++++++++--
src/include/access/nbtree.h | 6 ++
3 files changed, 147 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index accc7fe8bbe..1f44d17e3b4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -360,6 +360,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->vmbuf = InvalidBuffer;
+ so->vischeckcap = 0;
+ so->vischecksbuf = NULL;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -400,6 +404,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/*
* Allocate tuple workspace arrays, if needed for an index-only scan and
* not already done in a previous rescan call. To save on palloc
@@ -451,6 +461,17 @@ btendscan(IndexScanDesc scan)
so->markItemIndex = -1;
BTScanPosUnpinIfPinned(so->markPos);
+ if (so->vischecksbuf)
+ pfree(so->vischecksbuf);
+ so->vischecksbuf = NULL;
+ so->vischeckcap = 0;
+
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/* No need to invalidate positions, the RAM is about to be freed. */
/* Release storage */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 77264ddeecb..ed173bf7246 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,7 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp, BTScanOpaque so);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -54,6 +54,12 @@ static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+/*
+ * Execute vischecks at the index level?
+ * Enabled by default.
+ */
+#define DEBUG_IOS_VISCHECKS_ENABLED true
+
/*
* _bt_drop_lock_and_maybe_pin()
*
@@ -64,13 +70,109 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* See nbtree/README section on making concurrent TID recycling safe.
*/
static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp, BTScanOpaque so)
{
_bt_unlockbuf(scan->indexRelation, sp->buf);
+ /*
+ * Do some visibility checks if this is an index-only scan; allowing us to
+ * drop the pin on this page before we have returned all tuples from this
+ * IOS to the executor.
+ */
+ if (scan->xs_want_itup && DEBUG_IOS_VISCHECKS_ENABLED)
+ {
+ int initOffset = sp->firstItem;
+ int ntids = 1 + sp->lastItem - initOffset;
+
+ if (ntids > 0)
+ {
+ TM_IndexVisibilityCheckOp visCheck;
+ Relation heaprel = scan->heapRelation;
+ TM_VisCheck *check;
+ BTScanPosItem *item;
+
+ visCheck.checkntids = ntids;
+
+ if (so->vischeckcap == 0)
+ {
+ so->vischecksbuf = palloc_array(TM_VisCheck, ntids);
+ so->vischeckcap = ntids;
+ }
+ else if (so->vischeckcap < visCheck.checkntids)
+ {
+ so->vischecksbuf = repalloc_array(so->vischecksbuf,
+ TM_VisCheck, ntids);
+ so->vischeckcap = ntids;
+ }
+
+ visCheck.checktids = so->vischecksbuf;
+ visCheck.vmbuf = &so->vmbuf;
+
+ check = so->vischecksbuf;
+ item = &so->currPos.items[initOffset];
+
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ Assert(item->visrecheck == TMVC_Unchecked);
+ Assert(ItemPointerIsValid(&item->heapTid));
+
+ PopulateTMVischeck(check, &item->heapTid, initOffset + i);
+
+ item++;
+ check++;
+ }
+
+ table_index_vischeck_tuples(heaprel, &visCheck);
+ check = so->vischecksbuf;
+
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ item = &so->currPos.items[check->idxoffnum];
+ /* We must have a valid visibility check result */
+ Assert(check->vischeckresult != TMVC_Unchecked);
+ /* The offset number should still indicate the right item */
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapTid));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapTid));
+
+ /* Store the visibility check result */
+ item->visrecheck = check->vischeckresult;
+ check++;
+ }
+ }
+ }
+
+ /*
+ * We may need to hold a pin on the page for one of several reasons:
+ *
+ * 1.) To safely apply kill_prior_tuple, we need to know that the tuples
+ * were not removed from the page (and subsequently re-inserted).
+ * A page's LSN can also allow us to detect modifications on the page,
+ * which then allows us to bail out of setting the hint bits, but that
+ * requires the index to be WAL-logged; so unless the index is WAL-logged
+ * we must hold a pin on the page to apply the kill_prior_tuple
+ * optimization.
+ *
+ * 2.) Non-MVCC scans need pin coupling to make sure the scan covers
+ * exactly the whole index keyspace.
+ *
+ * 3.) For Index-Only Scans, the scan needs to check the visibility of the
+ * table tuple while the relevant index tuple is guaranteed to still be
+ * contained in the index (so that vacuum hasn't yet marked any pages that
+ * could contain the value as ALL_VISIBLE after reclaiming a dead tuple
+ * that might be buffered in the scan). A pin must therefore be held
+ * at least while the basic visibility of the page's tuples is being
+ * checked.
+ *
+ * For cases 1 and 2, we must hold the pin after we've finished processing
+ * the index page.
+ *
+ * For case 3, we can release the pin if we first do the visibility checks
+ * of to-be-returned tuples using table_index_vischeck_tuples, which we've
+ * done just above.
+ */
if (IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ (!scan->xs_want_itup || DEBUG_IOS_VISCHECKS_ENABLED))
{
ReleaseBuffer(sp->buf);
sp->buf = InvalidBuffer;
@@ -2001,6 +2103,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
Size itupsz = IndexTupleSize(itup);
@@ -2031,6 +2135,8 @@ _bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
/* Save base IndexTuple (truncate posting list) */
@@ -2067,6 +2173,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
/*
* Have index-only scans return the same base IndexTuple for every TID
@@ -2092,6 +2199,14 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
/* Return next item, per amgettuple contract */
scan->xs_heaptid = currItem->heapTid;
+
+ if (scan->xs_want_itup)
+ {
+ scan->xs_visrecheck = currItem->visrecheck;
+ Assert(currItem->visrecheck != TMVC_Unchecked ||
+ BufferIsValid(so->currPos.buf));
+ }
+
if (so->currTuples)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
}
@@ -2250,7 +2365,7 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos, so);
return true;
}
@@ -2407,7 +2522,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos, so);
return true;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..f9ced4a1f0b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -957,6 +957,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
ItemPointerData heapTid; /* TID of referenced heap item */
OffsetNumber indexOffset; /* index item's location within page */
LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */
+ uint8 visrecheck; /* visibility recheck status, if any */
} BTScanPosItem;
typedef struct BTScanPosData
@@ -1071,6 +1072,11 @@ typedef struct BTScanOpaqueData
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ /* used for index-only scan visibility prechecks */
+ Buffer vmbuf; /* vm buffer */
+ int vischeckcap; /* capacity of vischeckbuf */
+ TM_VisCheck *vischecksbuf; /* single allocation to save on alloc overhead */
+
/*
* If we are doing an index-only scan, these are the tuple storage
* workspaces for the currPos and markPos respectively. Each is of size
--
2.48.1
Attachment: v12-0003-SP-GIST-Fix-visibility-issues-in-IOS.patch
From 00bb5187dba75601c8d3d37db9ce8b56e377a742 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 8 Mar 2025 01:15:08 +0100
Subject: [PATCH v12 3/5] SP-GIST: Fix visibility issues in IOS
Previously, SP-GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with SP-GIST vacuum, and we now do
preliminary visibility checks to be used by IOS so that the IOS
infrastructure knows to recheck the heap page even if that page is now
ALL_VISIBLE.
Idea from Heikki Linnakangas
---
src/backend/access/spgist/spgscan.c | 182 ++++++++++++++++++++++++--
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 9 +-
3 files changed, 179 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 25893050c58..42784fdc4ca 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ TMVC_Result visrecheck);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -142,6 +143,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->visrecheck = TMVC_Unchecked;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -386,6 +388,19 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
if (scankey && scan->numberOfKeys > 0)
memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+ /* prepare index-only scan requirements */
+ so->nReorderThisPage = 0;
+ if (scan->xs_want_itup)
+ {
+ if (so->visrecheck == NULL)
+ so->visrecheck = palloc(MaxIndexTuplesPerPage);
+
+ if (scan->numberOfOrderBys > 0 && so->items == NULL)
+ {
+ so->items = palloc_array(SpGistSearchItem *, MaxIndexTuplesPerPage);
+ }
+ }
+
/* initialize order-by data if needed */
if (orderbys && scan->numberOfOrderBys > 0)
{
@@ -453,6 +468,9 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
pfree(so);
}
@@ -502,6 +520,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
item->isLeaf = true;
item->recheck = recheck;
item->recheckDistances = recheckDistances;
+ item->visrecheck = TMVC_Unchecked;
return item;
}
@@ -584,6 +603,14 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ if (so->want_itup)
+ {
+ Assert(PointerIsValid(so->items));
+
+ so->items[so->nReorderThisPage] = heapItem;
+ so->nReorderThisPage++;
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -593,7 +620,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, TMVC_Unchecked);
*reportedSome = true;
}
}
@@ -806,6 +833,88 @@ spgTestLeafTuple(SpGistScanOpaque so,
return SGLT_GET_NEXTOFFSET(leafTuple);
}
+/*
+ * Populate so->visrecheck based on tuples which are cached for a currently
+ * pinned page.
+ */
+static void
+spgPopulateUnorderedVischecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ Assert(scan->numberOfOrderBys == 0);
+
+ if (so->nPtrs == 0)
+ return;
+
+ op.checkntids = so->nPtrs;
+ op.checktids = palloc_array(TM_VisCheck, so->nPtrs);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ Assert(ItemPointerIsValid(&so->heapPtrs[i]));
+
+ PopulateTMVischeck(&op.checktids[i], &so->heapPtrs[i], i);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->idxoffnum < op.checkntids);
+
+ so->visrecheck[check->idxoffnum] = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+}
+
+/* Populate so->visrecheck based on currently cached tuples */
+static void
+spgPopulateOrderedVisChecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ if (so->nReorderThisPage == 0)
+ return;
+
+ Assert(so->nReorderThisPage > 0);
+ Assert(scan->numberOfOrderBys > 0);
+ Assert(PointerIsValid(so->items));
+
+ op.checkntids = so->nReorderThisPage;
+ op.checktids = palloc_array(TM_VisCheck, so->nReorderThisPage);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ PopulateTMVischeck(&op.checktids[i], &so->items[i]->heapPtr, i);
+ Assert(ItemPointerIsValid(&so->items[i]->heapPtr));
+ Assert(so->items[i]->isLeaf);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+
+ so->items[check->idxoffnum]->visrecheck = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+ so->nReorderThisPage = 0;
+}
+
/*
* Walk the tree and report all tuples passing the scan quals to the storeRes
* subroutine.
@@ -814,8 +923,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
* next page boundary once we have reported at least one tuple.
*/
static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
- storeRes_func storeRes)
+spgWalk(IndexScanDesc scan, Relation index, SpGistScanOpaque so,
+ bool scanWholeIndex, storeRes_func storeRes)
{
Buffer buffer = InvalidBuffer;
bool reportedSome = false;
@@ -835,9 +944,23 @@ redirect:
{
/* We store heap items in the queue only in case of ordered search */
Assert(so->numberOfNonNullOrderBys > 0);
+
+ /*
+ * If an item we found on a page is retrieved immediately after
+ * processing that page, we won't yet have released the page pin,
+ * and thus won't yet have processed the visibility data of the
+ * page's (now) ordered tuples.
+ * Do that now, so that all tuples on the page we're about to
+ * unpin have been checked for visibility before we return any.
+ */
+ if (so->want_itup && so->nReorderThisPage)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ Assert(!so->want_itup || item->visrecheck != TMVC_Unchecked);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->visrecheck);
reportedSome = true;
}
else
@@ -854,7 +977,15 @@ redirect:
}
else if (blkno != BufferGetBlockNumber(buffer))
{
- UnlockReleaseBuffer(buffer);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ Assert(so->numberOfOrderBys >= 0);
+ if (so->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+
+ ReleaseBuffer(buffer);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
}
@@ -922,16 +1053,36 @@ redirect:
}
if (buffer != InvalidBuffer)
- UnlockReleaseBuffer(buffer);
-}
+ {
+ /* Unlock the buffer for concurrent accesses except VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ /*
+ * If we're in an index-only scan, pre-check visibility of the tuples,
+ * so we can drop the pin without causing visibility bugs.
+ */
+ if (so->want_itup)
+ {
+ Assert(scan->numberOfOrderBys >= 0);
+
+ if (scan->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+ }
+
+ /* Release the page */
+ ReleaseBuffer(buffer);
+ }
+}
/* storeRes subroutine for getbitmap case */
static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ TMVC_Result visres)
{
Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
@@ -949,7 +1100,7 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
so->tbm = tbm;
so->ntids = 0;
- spgWalk(scan->indexRelation, so, true, storeBitmap);
+ spgWalk(scan, scan->indexRelation, so, true, storeBitmap);
return so->ntids;
}
@@ -959,12 +1110,15 @@ static void
storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+ bool recheckDistances, double *nonNullDistances,
+ TMVC_Result visres)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
so->recheck[so->nPtrs] = recheck;
so->recheckDistances[so->nPtrs] = recheckDistances;
+ if (so->want_itup)
+ so->visrecheck[so->nPtrs] = visres;
if (so->numberOfOrderBys > 0)
{
@@ -1041,6 +1195,10 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_heaptid = so->heapPtrs[so->iPtr];
scan->xs_recheck = so->recheck[so->iPtr];
scan->xs_hitup = so->reconTups[so->iPtr];
+ if (so->want_itup)
+ scan->xs_visrecheck = so->visrecheck[so->iPtr];
+
+ Assert(!scan->xs_want_itup || scan->xs_visrecheck != TMVC_Unchecked);
if (so->numberOfOrderBys > 0)
index_store_float8_orderby_distances(scan, so->orderByTypes,
@@ -1070,7 +1228,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
}
so->iPtr = so->nPtrs = 0;
- spgWalk(scan->indexRelation, so, false, storeGettuple);
+ spgWalk(scan, scan->indexRelation, so, false, storeGettuple);
if (so->nPtrs == 0)
break; /* must have completed scan */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 81171f35451..04836a29304 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,7 @@ spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..63e970468c7 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "utils/geo_decls.h"
#include "utils/relcache.h"
+#include "access/tableam.h"
typedef struct SpGistOptions
@@ -175,7 +176,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
-
+ uint8 visrecheck; /* IOS: TMVC_Result of contained heap tuple */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
} SpGistSearchItem;
@@ -223,6 +224,7 @@ typedef struct SpGistScanOpaqueData
/* These fields are only used in amgettuple scans: */
bool want_itup; /* are we reconstructing tuples? */
+ Buffer vmbuf; /* IOS: used for table_index_vischeck_tuples */
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
@@ -235,6 +237,11 @@ typedef struct SpGistScanOpaqueData
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
+ /* support for IOS */
+ int nReorderThisPage;
+ uint8 *visrecheck; /* IOS vis check results, counted by nPtrs */
+ SpGistSearchItem **items; /* counted by nReorderThisPage */
+
/*
* Note: using MaxIndexTuplesPerPage above is a bit hokey since
* SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
--
2.48.1
On Thu, 24 Apr 2025 at 22:46, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Fri, 21 Mar 2025 at 17:14, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Attached is v10, which polishes the previous patches, and adds a patch
> > for nbtree to use the new visibility checking strategy so that it too
> > can release its index pages much earlier, and adds a similar
> > visibility check test to nbtree.
>
> And here's v12. v11 (skipped) would've been a rebase, but after
> finishing the rebase I noticed a severe regression in btree's IOS with
> the new code, so v12 here applies some optimizations which reduce the
> overhead of the new code.
Here's v13, which moves the changes around a bit:
v12's 0001 is split into 3 patches (1, 3, and 4), whilst v12's 2-5
were correspondingly renumbered 5-8. Patch 0002 is an otherwise
unrelated change in pg_visibility that updates it to use the new
vectorized API, reducing overhead. So, summary of the patches:
0001: Replaces visibilitymap_get_status with a vectorized variant that
touches each VM page at most once per call; reducing buffer churn and
enabling later patches
0002: update pg_visibility to use this newly vectorized API (instead
of the current model that checks each page at a time)
0003: Adds the table_index_vischeck_tuples API, requiring a table AM
to expose VM checks through an API.
0004: Adjusts Index-Only scan infrastructure to make use of 0003,
instead of using VM_ALL_VISIBLE(). It also adds the relevant
infrastructure for enabling index AMs to provide the pre-checked
visibility status (from table_index_vischeck_tuples) to an index-only
scan (a rough sketch of the consuming side appears at the end of the
problem summary below).
0005: Implement VM-checks in GIST's IOS code
0006: Same, but for SP-GIST
0007: Same, but for NBTREE
0008: Add tests which validate that we still get correct results from
our queries, even when we use cursors to block results from getting
returned, and cleaning up tuples from those index pages.
A big benefit of this patchset is that indexes no longer have a direct
reason to hold back a VACUUM scan -- the visibility of tuples
can be checked at page scan time, and any shared resources can be
released before returning tuples to higher nodes.
------------------------
Summary of the problem that I'm solving here:
An index that holds dead tuples could return those dead tuples in an
Index-Only Scan (IOS) to the scan node, as the index AM itself doesn't
have any information about the visibility of the tuples that it
contains. The IOS infrastructure prevents these tuples from being
exposed by doing visibility checks against the Visibility Map (VM) and
-if necessary- the underlying heap.
This, however, depends on an invariant: VACUUM MUST NOT remove a TID
that's being returned by an index scan, at least not before
that tuple has been checked for visibility in the VM; otherwise VACUUM
may get to clean up the dead tuple's page and mark it all-visible
before the visibility check occurs, incorrectly returning an
'all-visible' result for that dead tuple.
Btree indexes interlock with vacuum using a buffer cleanup lock and a
pin on pages they have yet to return results for, and that solution
works quite well [^1].
This same solution sadly doesn't work for GiST and SP-GiST, as in a
worst case scenario they may have to sort the whole index in memory
before they can return the first index tuple, and holding pins on
those pages would be extremely punishing and might even cause system
crashes due to a lack of available un-pinned shared buffers.
To solve this, we implement a mechanism that allows indexes to do rough
visibility checks on TIDs, the results of which can then be passed
through the IndexScanDesc to indicate what the VM state was while the
tuple was still in the index. This enables the index AMs to make the VM
check happen just after they've scanned a page, but before they release
their pin on that page, adding the interlock with VACUUM without
requiring an unreasonably large number of page pins.
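Concretely, the per-page flow on the index AM side looks roughly like the
sketch below. This is condensed from the GiST/SP-GiST/nbtree patches, not a
literal excerpt; ntids, tids[] and visres[] stand in for each AM's own
per-page state, and so->vmbuf is the VM buffer cached in the scan opaque:

    /* Still holding the pin on the just-scanned index page */
    TM_IndexVisibilityCheckOp op;

    op.checkntids = ntids;
    op.checktids = palloc_array(TM_VisCheck, ntids);
    op.vmbuf = &so->vmbuf;

    for (int i = 0; i < ntids; i++)
        PopulateTMVischeck(&op.checktids[i], &tids[i], i);

    table_index_vischeck_tuples(scan->heapRelation, &op);

    /* remember the result for each tuple we'll return later */
    for (int i = 0; i < ntids; i++)
        visres[op.checktids[i].idxoffnum] = op.checktids[i].vischeckresult;

    pfree(op.checktids);

    /* only now drop the pin; VACUUM is free to process the page */
    ReleaseBuffer(buffer);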
This new mechanism is safe for scans that use MVCC snapshots: tuples
which are all-visible can't be removed while the scan is ongoing;
possibly-dead, possibly-replaced TIDs are known to be visibility-checked
using transaction IDs; and any new TID would have to be inserted by a
different transaction, and is therefore definitely invisible to our
current transaction.
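On the consuming side (patch 0004), the index-only scan node can then decide
per returned tuple whether the VM and heap lookups are still needed. As a
rough sketch only, not the actual 0004 patch text: fetch_heap_tuple() stands
in for the pre-existing heap-fetch path, and vmbuffer for the node's cached
VM buffer:

    bool        tuple_visible;

    switch (scan->xs_visrecheck)
    {
        case TMVC_Visible:
            /* the AM saw the heap page all-visible while it still held
             * its index page pin, so no VM or heap access is needed */
            tuple_visible = true;
            break;

        case TMVC_MaybeVisible:
            /* not known all-visible at pre-check time; go to the heap */
            tuple_visible = fetch_heap_tuple(scan);
            break;

        case TMVC_Unchecked:
        default:
            /* the AM did no pre-check; keep the existing VM + heap logic */
            tuple_visible = VM_ALL_VISIBLE(scan->heapRelation,
                                           ItemPointerGetBlockNumber(&scan->xs_heaptid),
                                           &vmbuffer) ||
                            fetch_heap_tuple(scan);
            break;
    }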
------------------------
Open items: review that this doesn't have any further issues, and
commit once this has been considered good enough.
Note:
This patch changes TableAMRoutine and renames/changes exposed
functions, and as a result can't be backpatched as-is. I have a
separate thread over at [0] where I'm keeping track of a patchset that
is derived from this one and is focused on backpatching. That patchset
will contain patches 0004/0005/0006 and a reduced version of 0001+0003
to make it work in older branches without breaking external ABI
compatibility. I intend for the exposed table_index_vischeck_tuples()
API to remain consistent across the two patchsets.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
[0]: /messages/by-id/CAEze2WgH13m=MDST58KLo-NkZpbwBEt4xNWcgtghWBwRj3J0+A@mail.gmail.com
[^1]: Mostly fine, because it still holds VACUUM back when an
index-only scan holds a page pin and VACUUM needs to process that
page. If the index scan doesn't progress, then VACUUM can't progress
either, and that can cause vacuum to get stuck. That issue is solved
(for normal index scans) with patch 0007.
Attachments:
Attachment: v13-0001-Add-vectorized-API-for-visibility-map-lookup.patch
From 4867bf5e92d6118452e0875d3e642a1327ff1c48 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 21:57:54 +0100
Subject: [PATCH v13 1/8] Add vectorized API for visibility map lookup
This allows for faster VM lookups when you have a batch of
pages to check, and will be used by future visibility checks
in the Heap table access method, all to support more
efficient Index-Only scans.
---
src/backend/access/heap/visibilitymap.c | 149 ++++++++++++++++++++----
src/include/access/visibilitymap.h | 14 ++-
2 files changed, 142 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d14588e92ae..40cff906eeb 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,7 +16,7 @@
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set bit(s) in a previously pinned page and log
* visibilitymap_set_vmbits - set bit(s) in a pinned page
- * visibilitymap_get_status - get status of bits
+ * visibilitymap_get_status_v - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_prepare_truncate -
* prepare for truncation of the visibility map
@@ -119,6 +119,9 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+/* map VM blocks back to the first heap block on that page */
+#define MAPBLOCK_TO_HEAPBLK(x) ((x) * HEAPBLOCKS_PER_PAGE)
+
/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
@@ -391,7 +394,32 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
}
/*
- * visibilitymap_get_status - get status of bits
+ * Do a binary search over the provided array of BlockNumber, returning the
+ * index at which the provided key could be inserted while maintaining order.
+ */
+static int
+find_index_in_block_array(BlockNumber key, const BlockNumber *blknos, int nblocks)
+{
+ int low = 0,
+ high = nblocks - 1; /* last valid index, so mid never runs past the array */
+
+ /* standard binary search */
+ while (low != high)
+ {
+ int mid = low + (high - low + 1) / 2;
+ BlockNumber midpoint = blknos[mid];
+
+ if (midpoint > key)
+ high = mid - 1;
+ else
+ low = mid;
+ }
+
+ return low;
+}
+
+/*
+ * visibilitymap_get_status_v - get status of bits
*
* Are all tuples on heapBlk visible to all or are marked frozen, according
* to the visibility map?
@@ -402,6 +430,9 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
* releasing *vmbuf after it's done testing and setting bits.
*
+ * The caller is responsible for providing a sorted array of unique heap
+ * blocks, and providing sufficient space in *status.
+ *
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
* since we don't lock the visibility map page either, it's even possible that
@@ -409,45 +440,123 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-uint8
-visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
+void
+visibilitymap_get_statusv(Relation rel, const BlockNumber *heapBlks, uint8 *status,
+ int nblocks, Buffer *vmbuf)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
- char *map;
- uint8 result;
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[0]);
+ int startOff = 0;
+ int currblk;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_statusv %s %d", RelationGetRelationName(rel), heapBlks[0]);
#endif
/* Reuse the old pinned buffer if possible */
if (BufferIsValid(*vmbuf))
{
- if (BufferGetBlockNumber(*vmbuf) != mapBlock)
+ BlockNumber curMapBlock = BufferGetBlockNumber(*vmbuf);
+
+ /*
+ * If we have more than one block, but the head of the array isn't
+ * on the current VM page, it's still possible that the current VM
+ * page contains some other requested pages' visibility status. To
+ * figure out if we must swap the buffer now, we search the array to
+ * find the location where such a BlockNumber should be located.
+ *
+ * The index that's returned references the first BlockNumber >=
+ * firstHeapBlock, so it may reference a different VM page entirely.
+ * That's fine, we do have a later check which verifies whether that
+ * block belongs to the current VM buffer, and if not, we bail out.
+ */
+ if (nblocks > 1 && curMapBlock != mapBlock)
+ {
+ BlockNumber firstHeapBlk = MAPBLOCK_TO_HEAPBLK(curMapBlock);
+ startOff = find_index_in_block_array(firstHeapBlk, heapBlks, nblocks);
+ }
+
+ /*
+ * Bail if we still don't have pages for this VM buffer.
+ */
+ if (curMapBlock != HEAPBLK_TO_MAPBLOCK(heapBlks[startOff]))
{
+ startOff = 0;
ReleaseBuffer(*vmbuf);
*vmbuf = InvalidBuffer;
}
}
- if (!BufferIsValid(*vmbuf))
+ /* We jump back here if we started processing the array only partway through */
+restart:
+ currblk = startOff;
+ mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[currblk]);
+
+ /* grab the VM buffer for our mapBlock, if we didn't have it already */
+ if (*vmbuf == InvalidBuffer)
{
*vmbuf = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(*vmbuf))
- return (uint8) 0;
+
+ if (*vmbuf == InvalidBuffer)
+ goto endOfVisMap;
}
- map = PageGetContents(BufferGetPage(*vmbuf));
+ /* main loop */
+ while (1)
+ {
+ char *map = PageGetContents(BufferGetPage(*vmbuf));
+ int64 firstNext = MAPBLOCK_TO_HEAPBLK((int64) mapBlock) + (int64) HEAPBLOCKS_PER_PAGE;
+
+ /* Check the visibility status of all heap blocks on the current VM page */
+ for (;currblk < nblocks && ((int64) (heapBlks[currblk])) < firstNext; currblk++)
+ {
+ uint32 mapByte;
+ uint8 mapOffset;
+
+ mapByte = HEAPBLK_TO_MAPBYTE(heapBlks[currblk]);
+ mapOffset = HEAPBLK_TO_OFFSET(heapBlks[currblk]);
+
+ status[currblk] = (map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS;
+ }
+
+ /* end of the scan */
+ if (currblk >= nblocks)
+ break;
+
+ /* prepare the vm buffer for the next vm block */
+ ReleaseBuffer(*vmbuf);
+ mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[currblk]);
+ *vmbuf = vm_readbuf(rel, mapBlock, false);
+
+ if (*vmbuf == InvalidBuffer)
+ goto endOfVisMap;
+ }
+
+endOfVisMap:
+ /* set visibility map result to 0 for blocks past the end of the VM */
+ while (currblk < nblocks)
+ status[currblk++] = 0;
/*
- * A single byte read is atomic. There could be memory-ordering effects
- * here, but for performance reasons we make it the caller's job to worry
- * about that.
+ * If we started processing in the middle of the array to reduce buffer
+ * churn, we loop back to restart here
*/
- result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
- return result;
+ if (startOff > 0)
+ {
+ nblocks = startOff;
+ startOff = 0;
+
+ /*
+ * The next loop around will work on a different page, so we should
+ * release this buffer.
+ */
+ if (BufferIsValid(*vmbuf))
+ {
+ ReleaseBuffer(*vmbuf);
+ *vmbuf = InvalidBuffer;
+ }
+
+ goto restart;
+ }
}
/*
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index c6fa37be968..0353c41d14e 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -41,9 +41,21 @@ extern uint8 visibilitymap_set(Relation rel,
extern uint8 visibilitymap_set_vmbits(BlockNumber heapBlk,
Buffer vmBuf, uint8 flags,
const RelFileLocator rlocator);
-extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_get_statusv(Relation rel, const BlockNumber *heapBlks,
+ uint8 *statusv, int nblocks,
+ Buffer *vmbuf);
extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
BlockNumber nheapblocks);
+static inline uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
+{
+ uint8 status;
+
+ visibilitymap_get_statusv(rel, &heapBlk, &status, 1, vmbuf);
+
+ return status;
+}
+
#endif /* VISIBILITYMAP_H */
--
2.50.1 (Apple Git-155)
Attachment: v13-0008-Test-for-IOS-Vacuum-race-conditions-in-index-AMs.patch
From 2a8bc6256051e38989aad703cc7fc6ff875b04cb Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 21 Mar 2025 16:41:31 +0100
Subject: [PATCH v13 8/8] Test for IOS/Vacuum race conditions in index AMs
Add regression tests that demonstrate wrong results can occur with index-only
scans in GiST and SP-GiST indexes when encountering tuples being removed by a
concurrent VACUUM operation.
With these tests the index AMs are also expected to not block VACUUM even when
they're used inside a cursor.
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Co-authored-by: Peter Geoghegan <pg@bowt.ie>
Co-authored-by: Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-btree-vacuum.out | 59 +++++++++
.../expected/index-only-scan-gist-vacuum.out | 53 ++++++++
.../index-only-scan-spgist-vacuum.out | 53 ++++++++
src/test/isolation/isolation_schedule | 3 +
.../specs/index-only-scan-btree-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-gist-vacuum.spec | 112 +++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 112 +++++++++++++++++
7 files changed, 505 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-btree-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-btree-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-btree-vacuum.out b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
new file mode 100644
index 00000000000..9a9d94c86f6
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_asc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_asc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_desc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_desc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index f2e067b1fbc..6366ad23c0d 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -18,6 +18,9 @@ test: two-ids
test: multiple-row-versions
test: index-only-scan
test: index-only-bitmapscan
+test: index-only-scan-btree-vacuum
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-btree-vacuum.spec b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
new file mode 100644
index 00000000000..9a00804c2c5
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing correct results with btree even with concurrent
+# vacuum
+
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_btree_a ON ios_needs_cleanup_lock USING btree (a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted_asc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+}
+step s1_prepare_sorted_desc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1, nor 10, so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum
+{
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+}
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_asc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_desc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # If the index scan doesn't correctly interlock its visibility tests with
+ # concurrent VACUUM cleanup then VACUUM will mark pages as all-visible that
+ # the scan in the next steps may then consider all-visible, despite some of
+ # those rows having been removed.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..9d241b25920
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..cd621d4f7f2
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing wrong results with SPGiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.50.1 (Apple Git-155)
Attachment: v13-0005-GIST-Fix-visibility-issues-in-IOS.patch
From 6f81f8742b6eb384453dd4f187bbff50f775ba31 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 02:37:08 +0100
Subject: [PATCH v13 5/8] GIST: Fix visibility issues in IOS
Previously, GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with GIST vacuum, and we now do
preliminary visibility checks for IOS whilst holding the pin. This
allows us to return tuples to the scan after releasing the pin,
without breaking visibility rules.
Idea from Heikki Linnakangas
---
src/backend/access/gist/gistget.c | 125 ++++++++++++++++++++++++++-
src/backend/access/gist/gistscan.c | 6 ++
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 27 ++++--
4 files changed, 151 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 9ba45acfff3..cc193280f74 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -17,6 +17,7 @@
#include "access/genam.h"
#include "access/gist_private.h"
#include "access/relscan.h"
+#include "access/tableam.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -394,7 +395,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
return;
}
- so->nPageData = so->curPageData = 0;
+ if (scan->numberOfOrderBys)
+ so->nsortItems = 0;
+ else
+ so->nPageData = so->curPageData = 0;
+
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -498,10 +503,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
item->data.heap.recheckDistances = recheck_distances;
/*
- * In an index-only scan, also fetch the data from the tuple.
+ * In an index-only scan, also fetch the data from the tuple,
+ * and keep a reference to the tuple so we can run visibility
+ * checks on the tuples before we release the buffer.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ so->sortItems[so->nsortItems] = &item->data.heap;
+ so->nsortItems += 1;
+ }
}
else
{
@@ -526,7 +537,104 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ /* Allow writes to the buffer, but don't yet allow VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * If we're in an index-only scan, we need to do VM-like visibility checks
+ * before we release the pin. This way, VACUUM can't clean up dead tuples
+ * from this index page and mark the heap page ALL_VISIBLE before the tuple
+ * is returned; or at least not without the index-only scan knowing that it
+ * still has to look at the heap page to determine this tuple's visibility.
+ *
+ * See also docs section "Index Locking Considerations".
+ */
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp op;
+ op.vmbuf = &so->vmbuf;
+
+ /* get the number of TIDs we're about to check */
+ if (scan->numberOfOrderBys > 0)
+ op.checkntids = so->nsortItems;
+ else
+ op.checkntids = so->nPageData;
+
+ /* skip the rest of the vischeck code if nothing is to be done. */
+ if (op.checkntids == 0)
+ goto IOSVisChecksDone;
+
+ op.checktids = palloc_array(TM_VisCheck, op.checkntids);
+
+ /* Populate the visibility check items */
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->sortItems[off]->heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->sortItems[off]->heapPtr,
+ off);
+ }
+ }
+ else
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->pageData[off].heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->pageData[off].heapPtr,
+ off);
+ }
+ }
+
+ /* ask the table for the visibility status of these tids */
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ /* and copy the visibility status into the GISTSearchItems */
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = so->sortItems[check->idxoffnum];
+
+ /* sanity checks */
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+
+ /* reset the temporary state used for tracking IOS items */
+ so->nsortItems = 0;
+ }
+ else
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = &so->pageData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ }
+
+ /* finally, clean up the used resources */
+ pfree(op.checktids);
+ }
+
+IOSVisChecksDone:
+
+ /* Allow VACUUM to process the buffer again */
+ ReleaseBuffer(buffer);
}
/*
@@ -586,9 +694,15 @@ getNextNearest(IndexScanDesc scan)
item->distances,
item->data.heap.recheckDistances);
- /* in an index-only scan, also return the reconstructed tuple. */
+ /*
+ * In an index-only scan, also return the reconstructed tuple,
+ * and store the visibility check's result.
+ */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ scan->xs_visrecheck = item->data.heap.visrecheck;
+ }
res = true;
}
else
@@ -675,7 +789,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ scan->xs_visrecheck = so->pageData[so->curPageData].visrecheck;
+ }
so->curPageData++;
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 01b8ff0b6fa..bf6b1a82548 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -348,6 +348,12 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 7591ad4da1a..fc541ff5efa 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -326,10 +326,10 @@ restart:
recurse_to = InvalidBlockNumber;
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..272c18ea17d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/buffile.h"
#include "utils/hsearch.h"
#include "access/genam.h"
+#include "tableam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,8 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint8 visrecheck; /* Cached visibility check result for this
+ * heap pointer. */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -170,12 +173,24 @@ typedef struct GISTScanOpaqueData
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
- /* In a non-ordered search, returnable heap items are stored here: */
- GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
- OffsetNumber nPageData; /* number of valid items in array */
- OffsetNumber curPageData; /* next item to return */
- MemoryContext pageDataCxt; /* context holding the fetched tuples, for
- * index-only scans */
+ /* info used by Index-Only Scans */
+ Buffer vmbuf; /* reusable buffer for IOS' vm lookups */
+
+ union {
+ struct {
+ /* In a non-ordered search, returnable heap items are stored here: */
+ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nPageData; /* number of valid items in array */
+ OffsetNumber curPageData; /* next item to return */
+ MemoryContext pageDataCxt; /* context holding the fetched tuples,
+ * for index-only scans */
+ };
+ struct {
+ /* In an ordered search, we use this as scratch space for IOS */
+ GISTSearchHeapItem *sortItems[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nsortItems; /* number of items in sortItems */
+ };
+ };
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.50.1 (Apple Git-155)
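The same buffer-handling pattern recurs in the SP-GiST and nbtree patches below, so it may help to restate it once outside the diff. This is only a sketch of the ordering gistScanPage() ends up following for an index-only scan; collect_matching_tuples() and precheck_tid_visibility() are placeholder names standing in for the code in the hunks above, not functions from the patch:

/*
 * Sketch only: the visibility pre-check runs after the buffer lock is
 * dropped (so writers are not blocked any longer than before) but before
 * the pin is released.  Because GiST VACUUM now takes a cleanup lock (see
 * the gistvacuum.c hunk above), it cannot remove our TIDs and let the heap
 * page become all-visible until this pin is gone.
 */
static void
gist_scan_leaf_for_ios(IndexScanDesc scan, Buffer buffer)
{
    LockBuffer(buffer, GIST_SHARE);         /* read the leaf page */
    collect_matching_tuples(scan, buffer);  /* stash TIDs + reconstructed tuples */

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK); /* writers may proceed again */
    precheck_tid_visibility(scan);          /* VM checks, still holding the pin */

    ReleaseBuffer(buffer);                  /* only now can VACUUM's cleanup lock succeed */
}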
Attachment: v13-0007-nbtree-Reduce-Index-Only-Scan-pin-duration.patch (application/octet-stream)
From 7330fd56024acdc4fe8b10a18fedcc633b41ddec Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 22 Dec 2025 17:36:14 +0100
Subject: [PATCH v13 7/8] nbtree: Reduce Index-Only Scan pin duration
Previously, we would keep a pin on every leaf page while we were returning
tuples to the scan. With this patch, we utilize the newly introduced
table_index_vischeck_tuples API to pre-check visibility of all TIDs, and
thus unpin the page well before we have finished returning and processing
all index tuple results. This reduces the cases where VACUUM may have to
wait for a pin held by a stalled index scan, and can increase performance
by reducing the number of VM page accesses.
---
src/backend/access/nbtree/nbtreadpage.c | 5 ++
src/backend/access/nbtree/nbtree.c | 27 +++++-
src/backend/access/nbtree/nbtsearch.c | 115 ++++++++++++++++++++++--
src/include/access/nbtree.h | 8 ++
4 files changed, 145 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index b3b8b553411..b9c93ad29e6 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -1038,6 +1038,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
Size itupsz = IndexTupleSize(itup);
@@ -1068,6 +1070,8 @@ _bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
/* Save base IndexTuple (truncate posting list) */
@@ -1104,6 +1108,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
/*
* Have index-only scans return the same base IndexTuple for every TID
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b4425231935..04056647805 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -364,6 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->vmbuf = InvalidBuffer;
+ so->vischeckcap = 0;
+ so->vischecksbuf = NULL;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -420,10 +424,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* Note: so->dropPin should never change across rescans.
*/
- so->dropPin = (!scan->xs_want_itup &&
- IsMVCCSnapshot(scan->xs_snapshot) &&
+ so->dropPin = (IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
+ so->vischeck = scan->xs_want_itup;
+
+ Assert(!scan->xs_want_itup || so->vischeck || !so->dropPin);
so->markItemIndex = -1;
so->needPrimScan = false;
@@ -432,6 +438,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/*
* Allocate tuple workspace arrays, if needed for an index-only scan and
* not already done in a previous rescan call. To save on palloc
@@ -473,6 +485,17 @@ btendscan(IndexScanDesc scan)
so->markItemIndex = -1;
BTScanPosUnpinIfPinned(so->markPos);
+ if (so->vischecksbuf)
+ pfree(so->vischecksbuf);
+ so->vischecksbuf = NULL;
+ so->vischeckcap = 0;
+
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/* No need to invalidate positions, the RAM is about to be freed. */
/* Release storage */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7a416d60cea..bdcc2974c92 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, Relation heaprel,
+ BTScanOpaque so);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -51,12 +52,95 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
+_bt_drop_lock_and_maybe_pin(Relation rel, Relation heaprel, BTScanOpaque so)
{
+ if (so->dropPin)
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ _bt_unlockbuf(rel, so->currPos.buf);
+
+ /*
+ * Do some visibility checks if this is an index-only scan; allowing us to
+ * drop the pin on this page before we have returned all tuples from this
+ * IOS to the executor.
+ */
+ if (so->vischeck)
+ {
+ TM_IndexVisibilityCheckOp visCheck;
+ BTScanPos sp = &so->currPos;
+ BTScanPosItem *item;
+ int initOffset = sp->firstItem;
+
+ visCheck.checkntids = 1 + sp->lastItem - initOffset;
+
+ Assert(visCheck.checkntids > 0);
+
+ /* populate the visibility checking buffer */
+ if (so->vischeckcap == 0)
+ {
+ Assert(so->vischecksbuf == NULL);
+ so->vischecksbuf = palloc_array(TM_VisCheck,
+ visCheck.checkntids);
+ so->vischeckcap = visCheck.checkntids;
+ }
+ else if (so->vischeckcap < visCheck.checkntids)
+ {
+ so->vischecksbuf = repalloc_array(so->vischecksbuf,
+ TM_VisCheck,
+ visCheck.checkntids);
+ so->vischeckcap = visCheck.checkntids;
+ }
+
+ /* populate the visibility check data */
+ visCheck.checktids = so->vischecksbuf;
+ visCheck.vmbuf = &so->vmbuf;
+
+ item = &so->currPos.items[initOffset];
+
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ TM_VisCheck *check = &visCheck.checktids[i];
+ Assert(item->visrecheck == TMVC_Unchecked);
+ Assert(ItemPointerIsValid(&item->heapTid));
+
+ PopulateTMVischeck(check, &item->heapTid, initOffset);
+
+ item++;
+ initOffset++;
+ }
+
+ /* do the visibility check */
+ table_index_vischeck_tuples(heaprel, &visCheck);
+
+ /* ... and put the results into the BTScanPosItems */
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ TM_VisCheck *check = &visCheck.checktids[i];
+ TMVC_Result result = check->vischeckresult;
+ /* We must have a valid visibility check result */
+ Assert(result != TMVC_Unchecked && result <= TMVC_MAX);
+
+ /* The idxoffnum should be in the expected range */
+ Assert(check->idxoffnum >= so->currPos.firstItem &&
+ check->idxoffnum <= so->currPos.lastItem);
+
+ item = &so->currPos.items[check->idxoffnum];
+
+ /* Ensure we don't visit the same item twice */
+ Assert(item->visrecheck == TMVC_Unchecked);
+
+ /* The offset number should still indicate the right item */
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapTid));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapTid));
+
+ /* Store the visibility check result */
+ item->visrecheck = result;
+ }
+ }
+
if (!so->dropPin)
{
- /* Just drop the lock (not the pin) */
- _bt_unlockbuf(rel, so->currPos.buf);
+ /* Only drop the lock (not the pin) */
return;
}
@@ -67,8 +151,7 @@ _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
* when concurrent heap TID recycling by VACUUM might have taken place.
*/
Assert(RelationNeedsWAL(rel));
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
- _bt_relbuf(rel, so->currPos.buf);
+ ReleaseBuffer(so->currPos.buf);
so->currPos.buf = InvalidBuffer;
}
@@ -1626,9 +1709,25 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
Assert(BTScanPosIsValid(so->currPos));
Assert(so->currPos.itemIndex >= so->currPos.firstItem);
Assert(so->currPos.itemIndex <= so->currPos.lastItem);
+ Assert(!scan->xs_want_itup || so->vischeck || !so->dropPin);
/* Return next item, per amgettuple contract */
scan->xs_heaptid = currItem->heapTid;
+
+ if (scan->xs_want_itup)
+ {
+ /*
+ * If we've already dropped the buffer, we better have already
+ * checked the visibility state of the tuple: Without the
+ * buffer pinned, vacuum may have already cleaned up the tuple
+ * and marked the page as ALL_VISIBLE.
+ */
+ Assert(BufferIsValid(so->currPos.buf) ||
+ currItem->visrecheck != TMVC_Unchecked);
+
+ scan->xs_visrecheck = currItem->visrecheck;
+ }
+
if (so->currTuples)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
}
@@ -1785,7 +1884,7 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(rel, so);
+ _bt_drop_lock_and_maybe_pin(rel, scan->heapRelation, so);
return true;
}
@@ -1945,7 +2044,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(rel, so);
+ _bt_drop_lock_and_maybe_pin(rel, scan->heapRelation, so);
return true;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7a3efd209bc..b6ff85c9e61 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -17,6 +17,7 @@
#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
+#include "access/tableam.h"
#include "catalog/pg_am_d.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
@@ -957,6 +958,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
ItemPointerData heapTid; /* TID of referenced heap item */
OffsetNumber indexOffset; /* index item's location within page */
LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */
+ uint8 visrecheck; /* visibility recheck status, if any */
} BTScanPosItem;
typedef struct BTScanPosData
@@ -1072,6 +1074,12 @@ typedef struct BTScanOpaqueData
int numKilled; /* number of currently stored items */
bool dropPin; /* drop leaf pin before btgettuple returns? */
+ /* used for index-only scan visibility prechecks */
+ bool vischeck; /* check visibility of scanned tuples */
+ Buffer vmbuf; /* vm buffer */
+ int vischeckcap; /* capacity of vischeckbuf */
+ TM_VisCheck *vischecksbuf; /* single allocation to save on alloc overhead */
+
/*
* If we are doing an index-only scan, these are the tuple storage
* workspaces for the currPos and markPos respectively. Each is of size
--
2.50.1 (Apple Git-155)
Attachment: v13-0006-SP-GIST-Fix-visibility-issues-in-IOS.patch (application/octet-stream)
From 478257aefd4f747102d3cdf9e4021a95a5038f84 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 02:49:13 +0100
Subject: [PATCH v13 6/8] SP-GIST: Fix visibility issues in IOS
Previously, SP-GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with SP-GIST vacuum, and we now do
preliminary visibility checks for IOS whilst holding the pin. This
allows us to return tuples to the scan after releasing the pin,
without breaking visibility rules.
Idea from Heikki Linnakangas
---
src/backend/access/spgist/spgscan.c | 182 ++++++++++++++++++++++++--
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 9 +-
3 files changed, 179 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 946772f3957..8537b4d87ce 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ TMVC_Result visrecheck);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -142,6 +143,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->visrecheck = TMVC_Unchecked;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -380,6 +382,19 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
if (scankey && scan->numberOfKeys > 0)
memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+ /* prepare index-only scan requirements */
+ so->nReorderThisPage = 0;
+ if (scan->xs_want_itup)
+ {
+ if (so->visrecheck == NULL)
+ so->visrecheck = palloc(MaxIndexTuplesPerPage);
+
+ if (scan->numberOfOrderBys > 0 && so->items == NULL)
+ {
+ so->items = palloc_array(SpGistSearchItem *, MaxIndexTuplesPerPage);
+ }
+ }
+
/* initialize order-by data if needed */
if (orderbys && scan->numberOfOrderBys > 0)
{
@@ -447,6 +462,9 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
pfree(so);
}
@@ -496,6 +514,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
item->isLeaf = true;
item->recheck = recheck;
item->recheckDistances = recheckDistances;
+ item->visrecheck = TMVC_Unchecked;
return item;
}
@@ -578,6 +597,14 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ if (so->want_itup)
+ {
+ Assert(so->items != NULL);
+
+ so->items[so->nReorderThisPage] = heapItem;
+ so->nReorderThisPage++;
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -587,7 +614,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, TMVC_Unchecked);
*reportedSome = true;
}
}
@@ -800,6 +827,88 @@ spgTestLeafTuple(SpGistScanOpaque so,
return SGLT_GET_NEXTOFFSET(leafTuple);
}
+/*
+ * Populate so->visrecheck based on the tuples cached from the currently
+ * pinned page.
+ */
+static void
+spgPopulateUnorderedVischecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ Assert(scan->numberOfOrderBys == 0);
+
+ if (so->nPtrs == 0)
+ return;
+
+ op.checkntids = so->nPtrs;
+ op.checktids = palloc_array(TM_VisCheck, so->nPtrs);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ Assert(ItemPointerIsValid(&so->heapPtrs[i]));
+
+ PopulateTMVischeck(&op.checktids[i], &so->heapPtrs[i], i);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->idxoffnum < op.checkntids);
+
+ so->visrecheck[check->idxoffnum] = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+}
+
+/* Populate so->visrecheck based on the currently cached tuples */
+static void
+spgPopulateOrderedVisChecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ if (so->nReorderThisPage == 0)
+ return;
+
+ Assert(so->nReorderThisPage > 0);
+ Assert(scan->numberOfOrderBys > 0);
+ Assert(so->items != NULL);
+
+ op.checkntids = so->nReorderThisPage;
+ op.checktids = palloc_array(TM_VisCheck, so->nReorderThisPage);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ PopulateTMVischeck(&op.checktids[i], &so->items[i]->heapPtr, i);
+ Assert(ItemPointerIsValid(&so->items[i]->heapPtr));
+ Assert(so->items[i]->isLeaf);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+
+ so->items[check->idxoffnum]->visrecheck = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+ so->nReorderThisPage = 0;
+}
+
/*
* Walk the tree and report all tuples passing the scan quals to the storeRes
* subroutine.
@@ -808,8 +917,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
* next page boundary once we have reported at least one tuple.
*/
static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
- storeRes_func storeRes)
+spgWalk(IndexScanDesc scan, Relation index, SpGistScanOpaque so,
+ bool scanWholeIndex, storeRes_func storeRes)
{
Buffer buffer = InvalidBuffer;
bool reportedSome = false;
@@ -829,9 +938,23 @@ redirect:
{
/* We store heap items in the queue only in case of ordered search */
Assert(so->numberOfNonNullOrderBys > 0);
+
+ /*
+ * If an item we found on a page is retrieved immediately after
+ * processing that page, we won't yet have released the page pin,
+ * and thus won't yet have processed the visibility data of the
+ * page's (now) ordered tuples.
+ * Do that now, so that all tuples on the page we're about to
+ * unpin have been checked for visibility before we return any.
+ */
+ if (so->want_itup && so->nReorderThisPage)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ Assert(!so->want_itup || item->visrecheck != TMVC_Unchecked);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->visrecheck);
reportedSome = true;
}
else
@@ -848,7 +971,15 @@ redirect:
}
else if (blkno != BufferGetBlockNumber(buffer))
{
- UnlockReleaseBuffer(buffer);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ Assert(so->numberOfOrderBys >= 0);
+ if (so->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+
+ ReleaseBuffer(buffer);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
}
@@ -916,16 +1047,36 @@ redirect:
}
if (buffer != InvalidBuffer)
- UnlockReleaseBuffer(buffer);
-}
+ {
+ /* Unlock the buffer for concurrent accesses except VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ /*
+ * If we're in an index-only scan, pre-check visibility of the tuples,
+ * so we can drop the pin without causing visibility bugs.
+ */
+ if (so->want_itup)
+ {
+ Assert(scan->numberOfOrderBys >= 0);
+
+ if (scan->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+ }
+
+ /* Release the page */
+ ReleaseBuffer(buffer);
+ }
+}
/* storeRes subroutine for getbitmap case */
static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ TMVC_Result visres)
{
Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
@@ -943,7 +1094,7 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
so->tbm = tbm;
so->ntids = 0;
- spgWalk(scan->indexRelation, so, true, storeBitmap);
+ spgWalk(scan, scan->indexRelation, so, true, storeBitmap);
return so->ntids;
}
@@ -953,12 +1104,15 @@ static void
storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+ bool recheckDistances, double *nonNullDistances,
+ TMVC_Result visres)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
so->recheck[so->nPtrs] = recheck;
so->recheckDistances[so->nPtrs] = recheckDistances;
+ if (so->want_itup)
+ so->visrecheck[so->nPtrs] = visres;
if (so->numberOfOrderBys > 0)
{
@@ -1035,6 +1189,10 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_heaptid = so->heapPtrs[so->iPtr];
scan->xs_recheck = so->recheck[so->iPtr];
scan->xs_hitup = so->reconTups[so->iPtr];
+ if (so->want_itup)
+ scan->xs_visrecheck = so->visrecheck[so->iPtr];
+
+ Assert(!scan->xs_want_itup || scan->xs_visrecheck != TMVC_Unchecked);
if (so->numberOfOrderBys > 0)
index_store_float8_orderby_distances(scan, so->orderByTypes,
@@ -1064,7 +1222,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
}
so->iPtr = so->nPtrs = 0;
- spgWalk(scan->indexRelation, so, false, storeGettuple);
+ spgWalk(scan, scan->indexRelation, so, false, storeGettuple);
if (so->nPtrs == 0)
break; /* must have completed scan */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cb5671c1a4e..ef1dd59049f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,7 @@ spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index 083af5962a8..d7590259604 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "utils/geo_decls.h"
#include "utils/relcache.h"
+#include "tableam.h"
typedef struct SpGistOptions
@@ -175,7 +176,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
-
+ uint8 visrecheck; /* IOS: TMVC_Result of contained heap tuple */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
} SpGistSearchItem;
@@ -223,6 +224,7 @@ typedef struct SpGistScanOpaqueData
/* These fields are only used in amgettuple scans: */
bool want_itup; /* are we reconstructing tuples? */
+ Buffer vmbuf; /* IOS: used for table_index_vischeck_tuples */
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
@@ -235,6 +237,11 @@ typedef struct SpGistScanOpaqueData
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
+ /* support for IOS */
+ int nReorderThisPage;
+ uint8 *visrecheck; /* IOS vis check results, counted by nPtrs */
+ SpGistSearchItem **items; /* counted by nReorderThisPage */
+
/*
* Note: using MaxIndexTuplesPerPage above is a bit hokey since
* SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
--
2.50.1 (Apple Git-155)
Attachment: v13-0004-IOS-Support-tableAM-powered-prechecked-visibilit.patch (application/octet-stream)
From 0eff87b15593fc9d86dba25dad83ebb8a9925bfe Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 00:39:50 +0100
Subject: [PATCH v13 4/8] IOS: Support tableAM-powered prechecked visibility
statuses
Previously, we assumed VM_ALL_VISIBLE(...) is universal across all
AMs. Now, we use the new table_index_vischeck_tuples API, and allow
index AMs to do this check as well. This removes the need for index
AMs to hold on to pinned pages in index-only scans, as long as they
do the visibility checks on the tuples they pull from the pages
whilst they still have a pin on the page.
selfuncs.c's get_actual_variable_endpoint() is similarly updated
to use this new visibility checking infrastructure.
Future commits will implement this new infrastructure in gist,
sp-gist, and nbtree.
---
src/backend/access/index/indexam.c | 6 ++
src/backend/executor/nodeIndexonlyscan.c | 80 +++++++++++++++---------
src/backend/utils/adt/selfuncs.c | 74 +++++++++++++---------
src/include/access/relscan.h | 2 +
4 files changed, 104 insertions(+), 58 deletions(-)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0492d92d23b..c9817e6004c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -638,6 +638,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* XXX: we should assert that a snapshot is pushed or registered */
Assert(TransactionIdIsValid(RecentXmin));
+ /*
+ * Reset xs_visrecheck, so we don't confuse the next tuple's visibility
+ * state with that of the previous.
+ */
+ scan->xs_visrecheck = TMVC_Unchecked;
+
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_heaptid. It should also set
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 6bea42f128f..e8811f9a3c1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -121,6 +121,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
+ TMVC_Result vischeck = scandesc->xs_visrecheck;
CHECK_FOR_INTERRUPTS();
@@ -128,6 +129,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
+ * The index may have already pre-checked the visibility of the tuple
+ * for us and stored the result in xs_visrecheck. In those cases, we
+ * can skip checking the visibility status.
*
* Note on Memory Ordering Effects: visibilitymap_get_status does not
* lock the visibility map buffer, and therefore the result we read
@@ -157,37 +161,57 @@ IndexOnlyNext(IndexOnlyScanState *node)
*
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
+ *
+ * The index doing these checks for us doesn't materially change these
+ * considerations.
*/
- if (!VM_ALL_VISIBLE(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
- {
- /*
- * Rats, we have to visit the heap to check visibility.
- */
- InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
- continue; /* no visible tuple, try next index entry */
+ if (vischeck == TMVC_Unchecked)
+ vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
+ &node->ioss_VMBuffer,
+ tid);
- ExecClearTuple(node->ioss_TableSlot);
-
- /*
- * Only MVCC snapshots are supported here, so there should be no
- * need to keep following the HOT chain once a visible entry has
- * been found. If we did want to allow that, we'd need to keep
- * more state to remember not to call index_getnext_tid next time.
- */
- if (scandesc->xs_heap_continue)
- elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+ Assert(vischeck != TMVC_Unchecked);
- /*
- * Note: at this point we are holding a pin on the heap page, as
- * recorded in scandesc->xs_cbuf. We could release that pin now,
- * but it's not clear whether it's a win to do so. The next index
- * entry might require a visit to the same heap page.
- */
-
- tuple_from_heap = true;
+ switch (vischeck)
+ {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ break;
+ case TMVC_Visible:
+ /* no further checks required here */
+ break;
+ case TMVC_MaybeVisible:
+ {
+ /*
+ * Rats, we have to visit the heap to check visibility.
+ */
+ InstrCountTuples2(node, 1);
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+ continue; /* no visible tuple, try next index entry */
+
+ ExecClearTuple(node->ioss_TableSlot);
+
+ /*
+ * Only MVCC snapshots are supported here, so there should be
+ * no need to keep following the HOT chain once a visible
+ * entry has been found. If we did want to allow that, we'd
+ * need to keep more state to remember not to call
+ * index_getnext_tid next time.
+ */
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
+ /*
+ * Note: at this point we are holding a pin on the heap page,
+ * as recorded in scandesc->xs_cbuf. We could release that
+ * pin now, but it's not clear whether it's a win to do so.
+ * The next index entry might require a visit to the same heap
+ * page.
+ */
+
+ tuple_from_heap = true;
+ break;
+ }
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c760b19db55..f01adcd6f71 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7109,44 +7109,58 @@ get_actual_variable_endpoint(Relation heapRel,
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
BlockNumber block = ItemPointerGetBlockNumber(tid);
+ TMVC_Result visres = index_scan->xs_visrecheck;
- if (!VM_ALL_VISIBLE(heapRel,
- block,
- &vmbuffer))
+ if (visres == TMVC_Unchecked)
+ visres = table_index_vischeck_tuple(heapRel, &vmbuffer, tid);
+
+ Assert(visres != TMVC_Unchecked);
+
+ switch (visres)
{
- /* Rats, we have to visit the heap to check visibility */
- if (!index_fetch_heap(index_scan, tableslot))
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ case TMVC_Visible:
+ /* no further checks required here */
+ break;
+ case TMVC_MaybeVisible:
{
- /*
- * No visible tuple for this index entry, so we need to
- * advance to the next entry. Before doing so, count heap
- * page fetches and give up if we've done too many.
- *
- * We don't charge a page fetch if this is the same heap page
- * as the previous tuple. This is on the conservative side,
- * since other recently-accessed pages are probably still in
- * buffers too; but it's good enough for this heuristic.
- */
+ /* Rats, we have to visit the heap to check visibility */
+ if (!index_fetch_heap(index_scan, tableslot))
+ {
+ /*
+ * No visible tuple for this index entry, so we need to
+ * advance to the next entry. Before doing so, count heap
+ * page fetches and give up if we've done too many.
+ *
+ * We don't charge a page fetch if this is the same heap
+ * page as the previous tuple. This is on the
+ * conservative side, since other recently-accessed pages
+ * are probably still in buffers too; but it's good enough
+ * for this heuristic.
+ */
#define VISITED_PAGES_LIMIT 100
- if (block != last_heap_block)
- {
- last_heap_block = block;
- n_visited_heap_pages++;
- if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
- break;
- }
+ if (block != last_heap_block)
+ {
+ last_heap_block = block;
+ n_visited_heap_pages++;
+ if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
+ break;
+ }
- continue; /* no visible tuple, try next index entry */
- }
+ continue; /* no visible tuple, try next index entry */
+ }
- /* We don't actually need the heap tuple for anything */
- ExecClearTuple(tableslot);
+ /* We don't actually need the heap tuple for anything */
+ ExecClearTuple(tableslot);
- /*
- * We don't care whether there's more than one visible tuple in
- * the HOT chain; if any are visible, that's good enough.
- */
+ /*
+ * We don't care whether there's more than one visible tuple in
+ * the HOT chain; if any are visible, that's good enough.
+ */
+ break;
+ }
}
/*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 78989a959d4..76cf8d02795 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -178,6 +178,8 @@ typedef struct IndexScanDescData
bool xs_recheck; /* T means scan keys must be rechecked */
+ int xs_visrecheck; /* TM_VisCheckResult from tableam.h */
+
/*
* When fetching with an ordering operator, the values of the ORDER BY
* expressions of the last returned tuple, according to the index. If
--
2.50.1 (Apple Git-155)
Attachment: v13-0003-TableAM-Support-AM-specific-fast-visibility-test.patch (application/octet-stream)
From e7432523481b9b6184d25848a3a722a3cce296cd Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 23:58:40 +0100
Subject: [PATCH v13 3/8] TableAM: Support AM-specific fast visibility tests
Previously, we assumed VM_ALL_VISIBLE(...) is universal across all
AMs. This is probably not the case, so we introduce a new table
method called "table_index_vischeck_tuples" which allows anyone to
ask the AM whether a tuple (or list of tuples) is definitely visible
to us, or might be deleted or otherwise invisible.
We implement that method directly for HeapAM; usage of the facility
will follow in later commits.
---
src/backend/access/heap/heapam.c | 124 ++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/table/tableamapi.c | 1 +
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 125 +++++++++++++++++++++++
5 files changed, 253 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6daf4a87dec..d29346a2fee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -106,6 +106,20 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+/* sort template definitions for index visibility checks */
+#define ST_SORT heap_ivc_sortby_tidheapblk
+#define ST_ELEMENT_TYPE TM_VisCheck
+#define ST_DECLARE
+#define ST_DEFINE
+#define ST_SCOPE static inline
+#define ST_COMPARE(a, b) ( \
+ a->tidblkno < b->tidblkno ? -1 : ( \
+ a->tidblkno > b->tidblkno ? 1 : 0 \
+ ) \
+)
+
+#include "lib/sort_template.h"
+
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -8813,6 +8827,116 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
+/*
+ * heapam implementation of tableam's index_vischeck_tuples interface.
+ *
+ * This helper function is called by index AMs during index-only scans,
+ * to do VM-based visibility checks on individual tuples, so that the AM
+ * can hold the tuple in memory for e.g. reordering for extended periods of
+ * time without holding thousands of pins that would conflict with VACUUM.
+ *
+ * It's possible for this to generate a fair amount of I/O, since we may be
+ * checking hundreds of tuples from a single index block, but that is
+ * preferred over holding thousands of pins.
+ *
+ * We use heuristics to balance the costs of sorting TIDs with VM page
+ * lookups.
+ */
+void
+heap_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ TM_VisCheck *checks = checkop->checktids;
+ int checkntids = checkop->checkntids;
+ int nblocks = 1;
+ BlockNumber *blknos;
+ uint8 *status;
+ TMVC_Result res;
+
+ if (checkntids == 0)
+ return;
+
+ /*
+ * Order the TIDs to heap order, so that we will only need to visit every
+ * VM page at most once.
+ */
+ heap_ivc_sortby_tidheapblk(checks, checkntids);
+
+ for (int i = 0; i < checkntids - 1; i++)
+ {
+ if (checks[i].tidblkno != checks[i + 1].tidblkno)
+ {
+ Assert(checks[i].tidblkno < checks[i + 1].tidblkno);
+ nblocks++;
+ }
+ }
+
+ /*
+ * No need to allocate arrays or do other (comparatively expensive)
+ * bookkeeping when we have only one block to check.
+ */
+ if (nblocks == 1)
+ {
+ if (VM_ALL_VISIBLE(rel, checks[0].tidblkno, checkop->vmbuf))
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+
+ for (int i = 0; i < checkntids; i++)
+ checks[i].vischeckresult = res;
+
+ return;
+ }
+
+ blknos = palloc_array(BlockNumber, nblocks);
+ status = palloc_array(uint8, nblocks);
+
+ blknos[0] = checks[0].tidblkno;
+
+ /* fill in the rest of the blknos array with unique block numbers */
+ for (int i = 0, j = 0; i < checkntids; i++)
+ {
+ Assert(BlockNumberIsValid(checks[i].tidblkno));
+
+ if (checks[i].tidblkno != blknos[j])
+ blknos[++j] = checks[i].tidblkno;
+ }
+
+ /* do the actual visibility checks */
+ visibilitymap_get_statusv(rel, blknos, status, nblocks, checkop->vmbuf);
+
+ /*
+ * 'res' is the current TMVC value for blknos[j] below. It is updated
+ * inside the loop, but only when j is updated, so we must initialize it
+ * here, or we'll store uninitialized data instead of a TMVC value for
+ * the first block's result.
+ */
+ if (status[0] & VISIBILITYMAP_ALL_VISIBLE)
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+
+ /* copy the results of blknos into the TM_VisChecks */
+ for (int i = 0, j = 0; i < checkntids; i++)
+ {
+ if (checks[i].tidblkno != blknos[j])
+ {
+ j += 1;
+ Assert(checks[i].tidblkno == blknos[j]);
+
+ if (status[j] & VISIBILITYMAP_ALL_VISIBLE)
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+ }
+
+ checks[i].vischeckresult = res;
+ }
+
+ /* and clean up the resources we'd used */
+ pfree(status);
+ pfree(blknos);
+}
+
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dd4fe6bf62f..6189557cbbb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2648,6 +2648,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.index_delete_tuples = heap_index_delete_tuples,
+ .index_vischeck_tuples = heap_index_vischeck_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 476663b66aa..b3ce90ceaea 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -61,6 +61,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->index_delete_tuples != NULL);
+ Assert(routine->index_vischeck_tuples != NULL);
Assert(routine->tuple_insert != NULL);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f7e4ae3843c..faf4f3a585a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -407,6 +407,8 @@ extern void simple_heap_update(Relation relation, const ItemPointerData *otid,
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
+extern void heap_index_vischeck_tuples(Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2fa790b6bf5..52acf8c1985 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -254,6 +254,69 @@ typedef struct TM_IndexDeleteOp
TM_IndexStatus *status;
} TM_IndexDeleteOp;
+/*
+ * State used when calling table_index_delete_tuples()
+ *
+ * Index-only scans need to know the visibility of the associated table tuples
+ * before they can return the index tuple. If the index tuple is known to be
+ * visible with a cheap check, we can return it directly without requesting
+ * the visibility info from the table AM directly.
+ *
+ * This AM API exposes a cheap bulk visibility checking API to indexes,
+ * allowing these indexes to check multiple tuples worth of visibility info at
+ * once, and allows the AM to store these checks. This improves the pinning
+ * ergonomics of index AMs by allowing a scan to cache index tuples in memory
+ * without holding pins on these index tuple pages until the index tuples are
+ * returned.
+ *
+ * The method is called with a list of TIDs, and its output will indicate the
+ * visibility state of each tuple: Unchecked, Dead, MaybeVisible, or Visible.
+ *
+ * HeapAM's implementation of visibility maps only allows for cheap checks of
+ * *definitely visible*; all other results are *maybe visible*. A result for
+ * *definitely not visible* (i.e. dead) is currently not provided, for lack
+ * of table AMs that support such visibility lookups cheaply. However, if a
+ * Table AM were to implement this, it could be used to quickly skip the
+ * current tuple in index scans, without having to ask the Table AM for that
+ * TID's data.
+ */
+typedef enum TMVC_Result
+{
+ TMVC_Unchecked = 0,
+ TMVC_Visible = 1,
+ TMVC_MaybeVisible = 2,
+
+#define TMVC_MAX TMVC_MaybeVisible
+} TMVC_Result;
+
+typedef struct TM_VisCheck
+{
+ /* TID from index tuple; deformed to not waste time during sort ops */
+ BlockNumber tidblkno;
+ uint16 tidoffset;
+ /* identifier for the TID in this visibility check operation context */
+ OffsetNumber idxoffnum;
+ /* the result of the visibility check operation */
+ TMVC_Result vischeckresult;
+} TM_VisCheck;
+
+static inline void
+PopulateTMVischeck(TM_VisCheck *check, ItemPointer tid, OffsetNumber idxoff)
+{
+ Assert(ItemPointerIsValid(tid));
+ check->tidblkno = ItemPointerGetBlockNumberNoCheck(tid);
+ check->tidoffset = ItemPointerGetOffsetNumberNoCheck(tid);
+ check->idxoffnum = idxoff;
+ check->vischeckresult = TMVC_Unchecked;
+}
+
+typedef struct TM_IndexVisibilityCheckOp
+{
+ int checkntids; /* number of TIDs to check */
+ Buffer *vmbuf; /* pointer to VM buffer to reuse across calls */
+ TM_VisCheck *checktids; /* the checks to execute */
+} TM_IndexVisibilityCheckOp;
+
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -500,6 +563,10 @@ typedef struct TableAmRoutine
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
+ /* see table_index_vischeck_tuples() */
+ void (*index_vischeck_tuples) (Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
+
/* ------------------------------------------------------------------------
* Manipulations of physical tuples.
@@ -1333,6 +1400,64 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
+/*
+ * Determine rough visibility information of index tuples based on each TID.
+ *
+ * Determines which entries from index AM caller's TM_IndexVisibilityCheckOp
+ * state point to TMVC_Visible or TMVC_MaybeVisible table tuples, at low IO
+ * overhead. For the heap AM, the implementation is effectively a wrapper
+ * around VM_ALL_VISIBLE.
+ *
+ * On return, all TM_VisChecks indicated by checkop->checktids will have been
+ * updated with the correct visibility status.
+ *
+ * Note that there is no value for "definitely dead" tuples, as the Heap AM
+ * doesn't have an efficient method to determine that a tuple is dead to all
+ * users, as it would have to go into the heap. If and when AMs are built
+ * that would support VM checks with an equivalent to VM_ALL_DEAD this
+ * decision can be reconsidered.
+ */
+static inline void
+table_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ rel->rd_tableam->index_vischeck_tuples(rel, checkop);
+
+#if USE_ASSERT_CHECKING
+ for (int i = 0; i < checkop->checkntids; i++)
+ {
+ TMVC_Result res = checkop->checktids[i].vischeckresult;
+
+ if (res <= TMVC_Unchecked || res > TMVC_MAX)
+ {
+ elog(PANIC, "Unexpected vischeckresult %d at offset %d/%d, expected value between %d and %d inclusive",
+ checkop->checktids[i].vischeckresult,
+ i, checkop->checkntids,
+ TMVC_Visible,
+ TMVC_MaybeVisible);
+ }
+ }
+#endif
+}
+
+static inline TMVC_Result
+table_index_vischeck_tuple(Relation rel, Buffer *vmbuffer, ItemPointer tid)
+{
+ TM_IndexVisibilityCheckOp checkOp;
+ TM_VisCheck op;
+
+ PopulateTMVischeck(&op, tid, 0);
+
+ checkOp.checktids = &op;
+ checkOp.checkntids = 1;
+ checkOp.vmbuf = vmbuffer;
+
+ rel->rd_tableam->index_vischeck_tuples(rel, &checkOp);
+
+ Assert(op.vischeckresult != TMVC_Unchecked);
+
+ return op.vischeckresult;
+}
+
/* ----------------------------------------------------------------------------
* Functions for manipulations of physical tuples.
--
2.50.1 (Apple Git-155)
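To make the calling convention concrete, here is a minimal sketch (not part of the patch set) of how an index AM's scan code might batch-check the TIDs it has buffered from a leaf page, using only the types and helpers 0003 introduces. tids, ntids, results, and vmbuf are stand-ins for whatever the AM keeps in its scan-opaque struct:

/* Assumes the usual AM includes plus "access/tableam.h". */
static void
precheck_buffered_tids(Relation heapRel, ItemPointer tids, int ntids,
                       uint8 *results, Buffer *vmbuf)
{
    TM_IndexVisibilityCheckOp op;

    if (ntids == 0)
        return;

    op.checkntids = ntids;
    op.checktids = palloc_array(TM_VisCheck, ntids);
    op.vmbuf = vmbuf;           /* reused across calls, released at endscan */

    /* idxoffnum records which caller slot each check belongs to */
    for (int i = 0; i < ntids; i++)
        PopulateTMVischeck(&op.checktids[i], &tids[i], i);

    table_index_vischeck_tuples(heapRel, &op);

    /* the table AM may have sorted checktids, so map results back via idxoffnum */
    for (int i = 0; i < ntids; i++)
        results[op.checktids[i].idxoffnum] = op.checktids[i].vischeckresult;

    pfree(op.checktids);
}

The idxoffnum round trip matters because heap_index_vischeck_tuples() is free to sort op.checktids into heap block order before consulting the visibility map, as the heapam.c hunk above does.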
Attachment: v13-0002-pg_visibility-vectorize-collect_visibility_data.patch (application/octet-stream)
From 18eea37d9f5d46d2e2c1295528de2154ed23456d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 22:27:04 +0100
Subject: [PATCH v13 2/8] pg_visibility: vectorize collect_visibility_data
---
contrib/pg_visibility/pg_visibility.c | 40 +++++++++++++++++++++------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 7046c1b5f8e..8c76b0bf7f6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -510,6 +510,9 @@ collect_visibility_data(Oid relid, bool include_pd)
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
BlockRangeReadStreamPrivate p;
ReadStream *stream = NULL;
+#define VM_BATCHSIZE 1024
+ BlockNumber *blknos;
+ uint8 *status;
rel = relation_open(relid, AccessShareLock);
@@ -521,6 +524,9 @@ collect_visibility_data(Oid relid, bool include_pd)
info->next = 0;
info->count = nblocks;
+ blknos = palloc0_array(BlockNumber, VM_BATCHSIZE);
+ status = palloc0_array(uint8, VM_BATCHSIZE);
+
/* Create a stream if reading main fork. */
if (include_pd)
{
@@ -541,30 +547,44 @@ collect_visibility_data(Oid relid, bool include_pd)
0);
}
- for (blkno = 0; blkno < nblocks; ++blkno)
+ for (blkno = 0; blkno < nblocks;)
{
- int32 mapbits;
+ int batchsize = 0;
+
+ for (BlockNumber bno = blkno; batchsize < VM_BATCHSIZE && bno < nblocks;)
+ blknos[batchsize++] = bno++;
/* Make sure we are interruptible. */
CHECK_FOR_INTERRUPTS();
- /* Get map info. */
- mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
- info->bits[blkno] |= (1 << 0);
- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
- info->bits[blkno] |= (1 << 1);
+ /* Get map info in bulk. */
+ visibilitymap_get_statusv(rel, blknos, status, batchsize, &vmbuffer);
+
+ /* move the status bits */
+ for (int i = 0; i < batchsize; i++)
+ {
+ uint32 mapbits = status[i];
+ BlockNumber bno = blknos[i];
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ info->bits[bno] |= (1 << 0);
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ info->bits[bno] |= (1 << 1);
+ }
/*
* Page-level data requires reading every block, so only get it if the
* caller needs it. Use a buffer access strategy, too, to prevent
* cache-trashing.
*/
- if (include_pd)
+ for (int i = 0; include_pd && i < batchsize; i++)
{
Buffer buffer;
Page page;
+ /* This subloop should be interruptable, it does IO */
+ CHECK_FOR_INTERRUPTS();
+
buffer = read_stream_next_buffer(stream, NULL);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
@@ -574,6 +594,8 @@ collect_visibility_data(Oid relid, bool include_pd)
UnlockReleaseBuffer(buffer);
}
+
+ blkno += batchsize;
}
if (include_pd)
--
2.50.1 (Apple Git-155)
On Mon, 22 Dec 2025 at 23:23, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Thu, 24 Apr 2025 at 22:46, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > On Fri, 21 Mar 2025 at 17:14, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > Attached is v10, which polishes the previous patches, and adds a patch
> > > for nbtree to use the new visibility checking strategy so that it too
> > > can release its index pages much earlier, and adds a similar
> > > visibility check test to nbtree.
> >
> > And here's v12. v11 (skipped) would've been a rebase, but after
> > finishing the rebase I noticed a severe regression in btree's IOS with
> > the new code, so v12 here applies some optimizations which reduce the
> > overhead of the new code.
>
> Here's v13, which moves the changes around a bit:

CFBot reported failures, which appeared to be due to an oversight in
patch 0001, where visibilitymap_get_status was missing a static
modifier to accompany its inline nature.

Apart from that fix v14 is identical to v13.
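Patch 0001 isn't quoted here, but from that description the oversight is presumably of this shape -- a hypothetical reconstruction of a header-level wrapper around the batched visibilitymap_get_statusv() that the later patches call. A plain C99 "inline" function has external linkage, so any translation unit in which the compiler emits an out-of-line call can fail at link time unless the function is also declared static:

/* Hypothetical reconstruction, not the actual 0001 hunk: */
static inline uint8             /* "static" is the part that was missing */
visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
{
    uint8       status;

    visibilitymap_get_statusv(rel, &heapBlk, &status, 1, vmbuf);
    return status;
}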
Kind regards,
Matthias van de Meent
Attachments:
Attachment: v14-0004-IOS-Support-tableAM-powered-prechecked-visibilit.patch (application/octet-stream)
From 346eb3fb59ba39dd36d88bc5239ec14d2435412b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 00:39:50 +0100
Subject: [PATCH v14 4/8] IOS: Support tableAM-powered prechecked visibility
statuses
Previously, we assumed VM_ALL_VISIBLE(...) works for every table AM. Now
we use the new table_index_vischeck_tuples API, and allow index AMs to do
this check as well. This removes the need for index AMs to hold on to
pinned pages in index-only scans, as long as they perform the visibility
checks on the tuples they pull from a page whilst they still hold a pin
on it.
selfuncs.c's get_actual_variable_endpoint() is similarly updated
to use this new visibility checking infrastructure.
Future commits will implement this new infrastructure in gist,
sp-gist, and nbtree.
---
src/backend/access/index/indexam.c | 6 ++
src/backend/executor/nodeIndexonlyscan.c | 80 +++++++++++++++---------
src/backend/utils/adt/selfuncs.c | 74 +++++++++++++---------
src/include/access/relscan.h | 2 +
4 files changed, 104 insertions(+), 58 deletions(-)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0492d92d23b..c9817e6004c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -638,6 +638,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* XXX: we should assert that a snapshot is pushed or registered */
Assert(TransactionIdIsValid(RecentXmin));
+ /*
+ * Reset xs_visrecheck, so we don't confuse the next tuple's visibility
+ * state with that of the previous.
+ */
+ scan->xs_visrecheck = TMVC_Unchecked;
+
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_heaptid. It should also set
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 6bea42f128f..e8811f9a3c1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -121,6 +121,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
+ TMVC_Result vischeck = scandesc->xs_visrecheck;
CHECK_FOR_INTERRUPTS();
@@ -128,6 +129,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
+ * The index may have already pre-checked the visibility of the tuple
+ * for us and stored the result in xs_visrecheck. In those cases, we
+ * can skip checking the visibility status.
*
* Note on Memory Ordering Effects: visibilitymap_get_status does not
* lock the visibility map buffer, and therefore the result we read
@@ -157,37 +161,57 @@ IndexOnlyNext(IndexOnlyScanState *node)
*
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
+ *
+ * The index doing these checks for us doesn't materially change these
+ * considerations.
*/
- if (!VM_ALL_VISIBLE(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
- {
- /*
- * Rats, we have to visit the heap to check visibility.
- */
- InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
- continue; /* no visible tuple, try next index entry */
+ if (vischeck == TMVC_Unchecked)
+ vischeck = table_index_vischeck_tuple(scandesc->heapRelation,
+ &node->ioss_VMBuffer,
+ tid);
- ExecClearTuple(node->ioss_TableSlot);
-
- /*
- * Only MVCC snapshots are supported here, so there should be no
- * need to keep following the HOT chain once a visible entry has
- * been found. If we did want to allow that, we'd need to keep
- * more state to remember not to call index_getnext_tid next time.
- */
- if (scandesc->xs_heap_continue)
- elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+ Assert(vischeck != TMVC_Unchecked);
- /*
- * Note: at this point we are holding a pin on the heap page, as
- * recorded in scandesc->xs_cbuf. We could release that pin now,
- * but it's not clear whether it's a win to do so. The next index
- * entry might require a visit to the same heap page.
- */
-
- tuple_from_heap = true;
+ switch (vischeck)
+ {
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ break;
+ case TMVC_Visible:
+ /* no further checks required here */
+ break;
+ case TMVC_MaybeVisible:
+ {
+ /*
+ * Rats, we have to visit the heap to check visibility.
+ */
+ InstrCountTuples2(node, 1);
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+ continue; /* no visible tuple, try next index entry */
+
+ ExecClearTuple(node->ioss_TableSlot);
+
+ /*
+ * Only MVCC snapshots are supported here, so there should be
+ * no need to keep following the HOT chain once a visible
+ * entry has been found. If we did want to allow that, we'd
+ * need to keep more state to remember not to call
+ * index_getnext_tid next time.
+ */
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
+ /*
+ * Note: at this point we are holding a pin on the heap page,
+ * as recorded in scandesc->xs_cbuf. We could release that
+ * pin now, but it's not clear whether it's a win to do so.
+ * The next index entry might require a visit to the same heap
+ * page.
+ */
+
+ tuple_from_heap = true;
+ break;
+ }
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c760b19db55..f01adcd6f71 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7109,44 +7109,58 @@ get_actual_variable_endpoint(Relation heapRel,
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
BlockNumber block = ItemPointerGetBlockNumber(tid);
+ TMVC_Result visres = index_scan->xs_visrecheck;
- if (!VM_ALL_VISIBLE(heapRel,
- block,
- &vmbuffer))
+ if (visres == TMVC_Unchecked)
+ visres = table_index_vischeck_tuple(heapRel, &vmbuffer, tid);
+
+ Assert(visres != TMVC_Unchecked);
+
+ switch (visres)
{
- /* Rats, we have to visit the heap to check visibility */
- if (!index_fetch_heap(index_scan, tableslot))
+ case TMVC_Unchecked:
+ elog(ERROR, "Failed to check visibility for tuple");
+ case TMVC_Visible:
+ /* no further checks required here */
+ break;
+ case TMVC_MaybeVisible:
{
- /*
- * No visible tuple for this index entry, so we need to
- * advance to the next entry. Before doing so, count heap
- * page fetches and give up if we've done too many.
- *
- * We don't charge a page fetch if this is the same heap page
- * as the previous tuple. This is on the conservative side,
- * since other recently-accessed pages are probably still in
- * buffers too; but it's good enough for this heuristic.
- */
+ /* Rats, we have to visit the heap to check visibility */
+ if (!index_fetch_heap(index_scan, tableslot))
+ {
+ /*
+ * No visible tuple for this index entry, so we need to
+ * advance to the next entry. Before doing so, count heap
+ * page fetches and give up if we've done too many.
+ *
+ * We don't charge a page fetch if this is the same heap
+ * page as the previous tuple. This is on the
+ * conservative side, since other recently-accessed pages
+ * are probably still in buffers too; but it's good enough
+ * for this heuristic.
+ */
#define VISITED_PAGES_LIMIT 100
- if (block != last_heap_block)
- {
- last_heap_block = block;
- n_visited_heap_pages++;
- if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
- break;
- }
+ if (block != last_heap_block)
+ {
+ last_heap_block = block;
+ n_visited_heap_pages++;
+ if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
+ break;
+ }
- continue; /* no visible tuple, try next index entry */
- }
+ continue; /* no visible tuple, try next index entry */
+ }
- /* We don't actually need the heap tuple for anything */
- ExecClearTuple(tableslot);
+ /* We don't actually need the heap tuple for anything */
+ ExecClearTuple(tableslot);
- /*
- * We don't care whether there's more than one visible tuple in
- * the HOT chain; if any are visible, that's good enough.
- */
+ /*
+ * We don't care whether there's more than one visible tuple in
+ * the HOT chain; if any are visible, that's good enough.
+ */
+ break;
+ }
}
/*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 78989a959d4..76cf8d02795 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -178,6 +178,8 @@ typedef struct IndexScanDescData
bool xs_recheck; /* T means scan keys must be rechecked */
+ int xs_visrecheck; /* TM_VisCheckResult from tableam.h */
+
/*
* When fetching with an ordering operator, the values of the ORDER BY
* expressions of the last returned tuple, according to the index. If
--
2.50.1 (Apple Git-155)
v14-0005-GIST-Fix-visibility-issues-in-IOS.patch (application/octet-stream)
From e866d587c6ef9d2fa9aa4a1fdfaedae3ddcfcd55 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 02:37:08 +0100
Subject: [PATCH v14 5/8] GIST: Fix visibility issues in IOS
Previously, GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with GIST vacuum, and we now do
preliminary visibility checks for IOS whilst holding the pin. This
allows us to return tuples to the scan after releasing the pin,
without breaking visibility rules.
Idea from Heikki Linnakangas
---
src/backend/access/gist/gistget.c | 125 ++++++++++++++++++++++++++-
src/backend/access/gist/gistscan.c | 6 ++
src/backend/access/gist/gistvacuum.c | 6 +-
src/include/access/gist_private.h | 27 ++++--
4 files changed, 151 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 9ba45acfff3..cc193280f74 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -17,6 +17,7 @@
#include "access/genam.h"
#include "access/gist_private.h"
#include "access/relscan.h"
+#include "access/tableam.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -394,7 +395,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
return;
}
- so->nPageData = so->curPageData = 0;
+ if (scan->numberOfOrderBys)
+ so->nsortItems = 0;
+ else
+ so->nPageData = so->curPageData = 0;
+
scan->xs_hitup = NULL; /* might point into pageDataCxt */
if (so->pageDataCxt)
MemoryContextReset(so->pageDataCxt);
@@ -498,10 +503,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
item->data.heap.recheckDistances = recheck_distances;
/*
- * In an index-only scan, also fetch the data from the tuple.
+ * In an index-only scan, also fetch the data from the tuple,
+ * and keep a reference to the tuple so we can run visibility
+ * checks on the tuples before we release the buffer.
*/
if (scan->xs_want_itup)
+ {
item->data.heap.recontup = gistFetchTuple(giststate, r, it);
+ so->sortItems[so->nsortItems] = &item->data.heap;
+ so->nsortItems += 1;
+ }
}
else
{
@@ -526,7 +537,104 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
}
- UnlockReleaseBuffer(buffer);
+ /* Allow writes to the buffer, but don't yet allow VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * If we're in an index-only scan, we need to do VM-like visibility checks
+ * before we release the pin. This way, VACUUM can't clean up dead tuples
+ * from this index page and mark the heap page ALL_VISIBLE before the tuple
+ * was returned; or at least not without the index-only scan knowing to
+ * still look at the heap page for the visibility of this tuple.
+ *
+ * See also docs section "Index Locking Considerations".
+ */
+ if (scan->xs_want_itup)
+ {
+ TM_IndexVisibilityCheckOp op;
+ op.vmbuf = &so->vmbuf;
+
+ /* get the number of TIDs we're about to check */
+ if (scan->numberOfOrderBys > 0)
+ op.checkntids = so->nsortItems;
+ else
+ op.checkntids = so->nPageData;
+
+ /* skip the rest of the vischeck code if nothing is to be done. */
+ if (op.checkntids == 0)
+ goto IOSVisChecksDone;
+
+ op.checktids = palloc_array(TM_VisCheck, op.checkntids);
+
+ /* Populate the visibility check items */
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->sortItems[off]->heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->sortItems[off]->heapPtr,
+ off);
+ }
+ }
+ else
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ Assert(ItemPointerIsValid(&so->pageData[off].heapPtr));
+
+ PopulateTMVischeck(&op.checktids[off],
+ &so->pageData[off].heapPtr,
+ off);
+ }
+ }
+
+ /* ask the table for the visibility status of these tids */
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ /* and copy the visibility status into the GISTSearchItems */
+ if (scan->numberOfOrderBys > 0)
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = so->sortItems[check->idxoffnum];
+
+ /* sanity checks */
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+
+ /* reset the temporary state used for tracking IOS items */
+ so->nsortItems = 0;
+ }
+ else
+ {
+ for (int off = 0; off < op.checkntids; off++)
+ {
+ TM_VisCheck *check = &op.checktids[off];
+ GISTSearchHeapItem *item = &so->pageData[check->idxoffnum];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapPtr));
+
+ item->visrecheck = check->vischeckresult;
+ }
+ }
+
+ /* finally, clean up the used resources */
+ pfree(op.checktids);
+ }
+
+IOSVisChecksDone:
+
+ /* Allow VACUUM to process the buffer again */
+ ReleaseBuffer(buffer);
}
/*
@@ -586,9 +694,15 @@ getNextNearest(IndexScanDesc scan)
item->distances,
item->data.heap.recheckDistances);
- /* in an index-only scan, also return the reconstructed tuple. */
+ /*
+ * In an index-only scan, also return the reconstructed tuple,
+ * and store the visibility check's result.
+ */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = item->data.heap.recontup;
+ scan->xs_visrecheck = item->data.heap.visrecheck;
+ }
res = true;
}
else
@@ -675,7 +789,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
/* in an index-only scan, also return the reconstructed tuple */
if (scan->xs_want_itup)
+ {
scan->xs_hitup = so->pageData[so->curPageData].recontup;
+ scan->xs_visrecheck = so->pageData[so->curPageData].visrecheck;
+ }
so->curPageData++;
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 01b8ff0b6fa..bf6b1a82548 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -348,6 +348,12 @@ gistendscan(IndexScanDesc scan)
{
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/*
* freeGISTstate is enough to clean up everything made by gistbeginscan,
* as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 7591ad4da1a..fc541ff5efa 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -326,10 +326,10 @@ restart:
recurse_to = InvalidBlockNumber;
/*
- * We are not going to stay here for a long time, aggressively grab an
- * exclusive lock.
+ * We are not going to stay here for a long time, aggressively grab a
+ * cleanup lock.
*/
- LockBuffer(buffer, GIST_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = BufferGetPage(buffer);
if (gistPageRecyclable(page))
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..272c18ea17d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/buffile.h"
#include "utils/hsearch.h"
#include "access/genam.h"
+#include "tableam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,8 @@ typedef struct GISTSearchHeapItem
* index-only scans */
OffsetNumber offnum; /* track offset in page to mark tuple as
* LP_DEAD */
+ uint8 visrecheck; /* Cached visibility check result for this
+ * heap pointer. */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -170,12 +173,24 @@ typedef struct GISTScanOpaqueData
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
- /* In a non-ordered search, returnable heap items are stored here: */
- GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
- OffsetNumber nPageData; /* number of valid items in array */
- OffsetNumber curPageData; /* next item to return */
- MemoryContext pageDataCxt; /* context holding the fetched tuples, for
- * index-only scans */
+ /* info used by Index-Only Scans */
+ Buffer vmbuf; /* reusable buffer for IOS' vm lookups */
+
+ union {
+ struct {
+ /* In a non-ordered search, returnable heap items are stored here: */
+ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nPageData; /* number of valid items in array */
+ OffsetNumber curPageData; /* next item to return */
+ MemoryContext pageDataCxt; /* context holding the fetched tuples,
+ * for index-only scans */
+ };
+ struct {
+ /* In an ordered search, we use this as scratch space for IOS */
+ GISTSearchHeapItem *sortItems[BLCKSZ / sizeof(IndexTupleData)];
+ OffsetNumber nsortItems; /* number of items in sortData */
+ };
+ };
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
--
2.50.1 (Apple Git-155)
v14-0007-nbtree-Reduce-Index-Only-Scan-pin-duration.patch (application/octet-stream)
From 53a8d936f44be7c767692afe4415351e09ca7803 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 22 Dec 2025 17:36:14 +0100
Subject: [PATCH v14 7/8] nbtree: Reduce Index-Only Scan pin duration
Previously, we would keep a pin on every leaf page while returning its
tuples to the scan. With this patch, we use the newly introduced
table_index_vischeck_tuples API to pre-check the visibility of all TIDs,
and can thus unpin the page well before all index tuple results have been
returned and processed. This reduces the cases where VACUUM has to wait
for a pin held by a stalled index scan, and can improve performance by
reducing the number of VM page accesses.
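Internally the patch collects one visibility check per matching TID on the page into a per-scan scratch array that only grows on demand (vischecksbuf/vischeckcap below). A standalone sketch of that allocation pattern follows; it uses malloc/realloc purely for illustration where the patch uses palloc_array/repalloc_array, and omits error handling:

/* growbuf_sketch.c -- illustrative grow-on-demand scratch buffer */
#include <stdlib.h>

typedef struct
{
    int    *buf;    /* reusable scratch space, kept across pages */
    size_t  cap;    /* current capacity, in elements */
} ScratchBuf;

/* Make sure the scratch buffer can hold at least "need" elements. */
static void
scratch_reserve(ScratchBuf *sb, size_t need)
{
    if (sb->cap == 0)
    {
        sb->buf = malloc(need * sizeof(int));
        sb->cap = need;
    }
    else if (sb->cap < need)
    {
        sb->buf = realloc(sb->buf, need * sizeof(int));
        sb->cap = need;
    }
}

int
main(void)
{
    ScratchBuf sb = {NULL, 0};

    scratch_reserve(&sb, 64);   /* first page: allocate */
    scratch_reserve(&sb, 32);   /* smaller page: reuse as-is */
    scratch_reserve(&sb, 128);  /* larger page: grow once */
    free(sb.buf);
    return 0;
}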
---
src/backend/access/nbtree/nbtreadpage.c | 5 ++
src/backend/access/nbtree/nbtree.c | 27 +++++-
src/backend/access/nbtree/nbtsearch.c | 115 ++++++++++++++++++++++--
src/include/access/nbtree.h | 8 ++
4 files changed, 145 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index b3b8b553411..b9c93ad29e6 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -1038,6 +1038,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
Size itupsz = IndexTupleSize(itup);
@@ -1068,6 +1070,8 @@ _bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
+
if (so->currTuples)
{
/* Save base IndexTuple (truncate posting list) */
@@ -1104,6 +1108,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->heapTid = *heapTid;
currItem->indexOffset = offnum;
+ currItem->visrecheck = TMVC_Unchecked;
/*
* Have index-only scans return the same base IndexTuple for every TID
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b4425231935..04056647805 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -364,6 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->vmbuf = InvalidBuffer;
+ so->vischeckcap = 0;
+ so->vischecksbuf = NULL;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -420,10 +424,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* Note: so->dropPin should never change across rescans.
*/
- so->dropPin = (!scan->xs_want_itup &&
- IsMVCCSnapshot(scan->xs_snapshot) &&
+ so->dropPin = (IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
+ so->vischeck = scan->xs_want_itup;
+
+ Assert(!scan->xs_want_itup || so->vischeck || !so->dropPin);
so->markItemIndex = -1;
so->needPrimScan = false;
@@ -432,6 +438,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/*
* Allocate tuple workspace arrays, if needed for an index-only scan and
* not already done in a previous rescan call. To save on palloc
@@ -473,6 +485,17 @@ btendscan(IndexScanDesc scan)
so->markItemIndex = -1;
BTScanPosUnpinIfPinned(so->markPos);
+ if (so->vischecksbuf)
+ pfree(so->vischecksbuf);
+ so->vischecksbuf = NULL;
+ so->vischeckcap = 0;
+
+ if (BufferIsValid(so->vmbuf))
+ {
+ ReleaseBuffer(so->vmbuf);
+ so->vmbuf = InvalidBuffer;
+ }
+
/* No need to invalidate positions, the RAM is about to be freed. */
/* Release storage */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7a416d60cea..bdcc2974c92 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, Relation heaprel,
+ BTScanOpaque so);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -51,12 +52,95 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
+_bt_drop_lock_and_maybe_pin(Relation rel, Relation heaprel, BTScanOpaque so)
{
+ if (so->dropPin)
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ _bt_unlockbuf(rel, so->currPos.buf);
+
+ /*
+ * Do some visibility checks if this is an index-only scan; allowing us to
+ * drop the pin on this page before we have returned all tuples from this
+ * IOS to the executor.
+ */
+ if (so->vischeck)
+ {
+ TM_IndexVisibilityCheckOp visCheck;
+ BTScanPos sp = &so->currPos;
+ BTScanPosItem *item;
+ int initOffset = sp->firstItem;
+
+ visCheck.checkntids = 1 + sp->lastItem - initOffset;
+
+ Assert(visCheck.checkntids > 0);
+
+ /* populate the visibility checking buffer */
+ if (so->vischeckcap == 0)
+ {
+ Assert(so->vischecksbuf == NULL);
+ so->vischecksbuf = palloc_array(TM_VisCheck,
+ visCheck.checkntids);
+ so->vischeckcap = visCheck.checkntids;
+ }
+ else if (so->vischeckcap < visCheck.checkntids)
+ {
+ so->vischecksbuf = repalloc_array(so->vischecksbuf,
+ TM_VisCheck,
+ visCheck.checkntids);
+ so->vischeckcap = visCheck.checkntids;
+ }
+
+ /* populate the visibility check data */
+ visCheck.checktids = so->vischecksbuf;
+ visCheck.vmbuf = &so->vmbuf;
+
+ item = &so->currPos.items[initOffset];
+
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ TM_VisCheck *check = &visCheck.checktids[i];
+ Assert(item->visrecheck == TMVC_Unchecked);
+ Assert(ItemPointerIsValid(&item->heapTid));
+
+ PopulateTMVischeck(check, &item->heapTid, initOffset);
+
+ item++;
+ initOffset++;
+ }
+
+ /* do the visibility check */
+ table_index_vischeck_tuples(heaprel, &visCheck);
+
+ /* ... and put the results into the BTScanPosItems */
+ for (int i = 0; i < visCheck.checkntids; i++)
+ {
+ TM_VisCheck *check = &visCheck.checktids[i];
+ TMVC_Result result = check->vischeckresult;
+ /* We must have a valid visibility check result */
+ Assert(result != TMVC_Unchecked && result <= TMVC_MAX);
+
+ /* The idxoffnum should be in the expected range */
+ Assert(check->idxoffnum >= so->currPos.firstItem &&
+ check->idxoffnum <= so->currPos.lastItem);
+
+ item = &so->currPos.items[check->idxoffnum];
+
+ /* Ensure we don't visit the same item twice */
+ Assert(item->visrecheck == TMVC_Unchecked);
+
+ /* The offset number should still indicate the right item */
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&item->heapTid));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&item->heapTid));
+
+ /* Store the visibility check result */
+ item->visrecheck = result;
+ }
+ }
+
if (!so->dropPin)
{
- /* Just drop the lock (not the pin) */
- _bt_unlockbuf(rel, so->currPos.buf);
+ /* Only drop the lock (not the pin) */
return;
}
@@ -67,8 +151,7 @@ _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
* when concurrent heap TID recycling by VACUUM might have taken place.
*/
Assert(RelationNeedsWAL(rel));
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
- _bt_relbuf(rel, so->currPos.buf);
+ ReleaseBuffer(so->currPos.buf);
so->currPos.buf = InvalidBuffer;
}
@@ -1626,9 +1709,25 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
Assert(BTScanPosIsValid(so->currPos));
Assert(so->currPos.itemIndex >= so->currPos.firstItem);
Assert(so->currPos.itemIndex <= so->currPos.lastItem);
+ Assert(!scan->xs_want_itup || so->vischeck || !so->dropPin);
/* Return next item, per amgettuple contract */
scan->xs_heaptid = currItem->heapTid;
+
+ if (scan->xs_want_itup)
+ {
+ /*
+ * If we've already dropped the buffer, we better have already
+ * checked the visibility state of the tuple: Without the
+ * buffer pinned, vacuum may have already cleaned up the tuple
+ * and marked the page as ALL_VISIBLE.
+ */
+ Assert(BufferIsValid(so->currPos.buf) ||
+ currItem->visrecheck != TMVC_Unchecked);
+
+ scan->xs_visrecheck = currItem->visrecheck;
+ }
+
if (so->currTuples)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
}
@@ -1785,7 +1884,7 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(rel, so);
+ _bt_drop_lock_and_maybe_pin(rel, scan->heapRelation, so);
return true;
}
@@ -1945,7 +2044,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(rel, so);
+ _bt_drop_lock_and_maybe_pin(rel, scan->heapRelation, so);
return true;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7a3efd209bc..b6ff85c9e61 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -17,6 +17,7 @@
#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
+#include "access/tableam.h"
#include "catalog/pg_am_d.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
@@ -957,6 +958,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
ItemPointerData heapTid; /* TID of referenced heap item */
OffsetNumber indexOffset; /* index item's location within page */
LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */
+ uint8 visrecheck; /* visibility recheck status, if any */
} BTScanPosItem;
typedef struct BTScanPosData
@@ -1072,6 +1074,12 @@ typedef struct BTScanOpaqueData
int numKilled; /* number of currently stored items */
bool dropPin; /* drop leaf pin before btgettuple returns? */
+ /* used for index-only scan visibility prechecks */
+ bool vischeck; /* check visibility of scanned tuples */
+ Buffer vmbuf; /* vm buffer */
+ int vischeckcap; /* capacity of vischeckbuf */
+ TM_VisCheck *vischecksbuf; /* single allocation to save on alloc overhead */
+
/*
* If we are doing an index-only scan, these are the tuple storage
* workspaces for the currPos and markPos respectively. Each is of size
--
2.50.1 (Apple Git-155)
v14-0008-Test-for-IOS-Vacuum-race-conditions-in-index-AMs.patch (application/octet-stream)
From 7ed82faee6f1d4bd69d7df56cf6765fffaf3a071 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 21 Mar 2025 16:41:31 +0100
Subject: [PATCH v14 8/8] Test for IOS/Vacuum race conditions in index AMs
Add regression tests that demonstrate wrong results can occur with index-only
scans in GiST and SP-GiST indexes when encountering tuples being removed by a
concurrent VACUUM operation.
With these tests the index AMs are also expected to not block VACUUM even when
they're used inside a cursor.
Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Co-authored-by: Peter Geoghegan <pg@bowt.ie>
Co-authored-by: Michail Nikolaev <michail.nikolaev@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CANtu0oi0rkR%2BFsgyLXnGZ-uW2950-urApAWLhy-%2BV1WJD%3D_ZXA%40mail.gmail.com
---
.../expected/index-only-scan-btree-vacuum.out | 59 +++++++++
.../expected/index-only-scan-gist-vacuum.out | 53 ++++++++
.../index-only-scan-spgist-vacuum.out | 53 ++++++++
src/test/isolation/isolation_schedule | 3 +
.../specs/index-only-scan-btree-vacuum.spec | 113 ++++++++++++++++++
.../specs/index-only-scan-gist-vacuum.spec | 112 +++++++++++++++++
.../specs/index-only-scan-spgist-vacuum.spec | 112 +++++++++++++++++
7 files changed, 505 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-scan-btree-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-gist-vacuum.out
create mode 100644 src/test/isolation/expected/index-only-scan-spgist-vacuum.out
create mode 100644 src/test/isolation/specs/index-only-scan-btree-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-gist-vacuum.spec
create mode 100644 src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
diff --git a/src/test/isolation/expected/index-only-scan-btree-vacuum.out b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
new file mode 100644
index 00000000000..9a9d94c86f6
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-btree-vacuum.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_asc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_asc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted_desc s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted_desc:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+--
+10
+(1 row)
+
+step s2_vacuum:
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+1
+(1 row)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-gist-vacuum.out b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-gist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/expected/index-only-scan-spgist-vacuum.out b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
new file mode 100644
index 00000000000..b7c02ee9529
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-spgist-vacuum.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_mod s1_begin s1_prepare_sorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_sorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+ x
+------------------
+1.4142135623730951
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+x
+-
+(0 rows)
+
+step s1_commit: COMMIT;
+
+starting permutation: s2_mod s1_begin s1_prepare_unsorted s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_mod:
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+
+step s1_begin: BEGIN;
+step s1_prepare_unsorted:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-----
+(1,1)
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index f2e067b1fbc..6366ad23c0d 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -18,6 +18,9 @@ test: two-ids
test: multiple-row-versions
test: index-only-scan
test: index-only-bitmapscan
+test: index-only-scan-btree-vacuum
+test: index-only-scan-gist-vacuum
+test: index-only-scan-spgist-vacuum
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-scan-btree-vacuum.spec b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
new file mode 100644
index 00000000000..9a00804c2c5
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-btree-vacuum.spec
@@ -0,0 +1,113 @@
+# index-only-scan test showing correct results with btree even with concurrent
+# vacuum
+
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a int NOT NULL, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_btree_a ON ios_needs_cleanup_lock USING btree (a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted_asc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a ASC;
+}
+step s1_prepare_sorted_desc {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a as x FROM ios_needs_cleanup_lock ORDER BY a DESC;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete rows 1 or 10, so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a BETWEEN 2 AND 9;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum
+{
+ VACUUM (TRUNCATE false) ios_needs_cleanup_lock;
+}
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_asc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted_desc
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # If the index scan doesn't correctly interlock its visibility tests with
+ # concurrent VACUUM cleanup then VACUUM will mark pages as all-visible that
+ # the scan in the next steps may then consider all-visible, despite some of
+ # those rows having been removed.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-gist-vacuum.spec b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
new file mode 100644
index 00000000000..9d241b25920
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-gist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with GiST even with concurrent
+# vacuum
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_needs_cleanup_lock USING gist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
diff --git a/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
new file mode 100644
index 00000000000..cd621d4f7f2
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-spgist-vacuum.spec
@@ -0,0 +1,112 @@
+# index-only-scan test showing correct results with SPGiST even with
+# concurrent vacuum
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_needs_cleanup_lock (a point NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_needs_cleanup_lock SELECT point(g.i, g.i), g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_spgist_a ON ios_needs_cleanup_lock USING spgist(a);
+}
+setup
+{
+ VACUUM (ANALYZE) ios_needs_cleanup_lock;
+}
+
+teardown
+{
+ DROP TABLE ios_needs_cleanup_lock;
+}
+
+
+session s1
+
+# Force an index-only scan, where possible:
+setup {
+ SET enable_bitmapscan = false;
+ SET enable_indexonlyscan = true;
+ SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare_sorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a <-> point '(0,0)' as x FROM ios_needs_cleanup_lock ORDER BY a <-> point '(0,0)';
+}
+
+step s1_prepare_unsorted {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_needs_cleanup_lock WHERE box '((-100,-100),(100,100))' @> a;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_needs_cleanup_lock WHERE a != point '(1,1)';
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_needs_cleanup_lock; }
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_sorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
+
+permutation
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare_unsorted
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ # Because this should block on buffer-level locks, this won't ever be
+ # considered "blocked" by isolation tester, and so we only have a single
+ # step we can work with concurrently.
+ s2_vacuum
+
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.50.1 (Apple Git-155)
v14-0006-SP-GIST-Fix-visibility-issues-in-IOS.patch (application/octet-stream)
From f6865e3fafc686c7770a6674cbd58b611fb06792 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sat, 20 Dec 2025 02:49:13 +0100
Subject: [PATCH v14 6/8] SP-GIST: Fix visibility issues in IOS
Previously, SP-GIST IOS could buffer tuples from pages while VACUUM came
along and cleaned up an ALL_DEAD tuple, marking the tuple's page
ALL_VISIBLE again and making IOS mistakenly believe the tuple is indeed
visible.
With this patch, pins now conflict with SP-GIST vacuum, and we now do
preliminary visibility checks for IOS whilst holding the pin. This
allows us to return tuples to the scan after releasing the pin,
without breaking visibility rules.
Idea from Heikki Linnakangas
---
src/backend/access/spgist/spgscan.c | 182 ++++++++++++++++++++++++--
src/backend/access/spgist/spgvacuum.c | 2 +-
src/include/access/spgist_private.h | 9 +-
3 files changed, 179 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 946772f3957..8537b4d87ce 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -30,7 +30,8 @@
typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isNull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances);
+ bool recheckDistances, double *distances,
+ TMVC_Result visrecheck);
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
@@ -142,6 +143,7 @@ spgAddStartItem(SpGistScanOpaque so, bool isnull)
startEntry->traversalValue = NULL;
startEntry->recheck = false;
startEntry->recheckDistances = false;
+ startEntry->visrecheck = TMVC_Unchecked;
spgAddSearchItemToQueue(so, startEntry);
}
@@ -380,6 +382,19 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
if (scankey && scan->numberOfKeys > 0)
memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+ /* prepare index-only scan requirements */
+ so->nReorderThisPage = 0;
+ if (scan->xs_want_itup)
+ {
+ if (so->visrecheck == NULL)
+ so->visrecheck = palloc(MaxIndexTuplesPerPage);
+
+ if (scan->numberOfOrderBys > 0 && so->items == NULL)
+ {
+ so->items = palloc_array(SpGistSearchItem *, MaxIndexTuplesPerPage);
+ }
+ }
+
/* initialize order-by data if needed */
if (orderbys && scan->numberOfOrderBys > 0)
{
@@ -447,6 +462,9 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ if (BufferIsValid(so->vmbuf))
+ ReleaseBuffer(so->vmbuf);
+
pfree(so);
}
@@ -496,6 +514,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
item->isLeaf = true;
item->recheck = recheck;
item->recheckDistances = recheckDistances;
+ item->visrecheck = TMVC_Unchecked;
return item;
}
@@ -578,6 +597,14 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ if (so->want_itup)
+ {
+ Assert(so->items != NULL);
+
+ so->items[so->nReorderThisPage] = heapItem;
+ so->nReorderThisPage++;
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -587,7 +614,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
/* non-ordered scan, so report the item right away */
Assert(!recheckDistances);
storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
- leafTuple, recheck, false, NULL);
+ leafTuple, recheck, false, NULL, TMVC_Unchecked);
*reportedSome = true;
}
}
@@ -800,6 +827,88 @@ spgTestLeafTuple(SpGistScanOpaque so,
return SGLT_GET_NEXTOFFSET(leafTuple);
}
+/*
+ * Populate so->visrecheck based on tuples that are cached for the currently
+ * pinned page.
+ */
+static void
+spgPopulateUnorderedVischecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ Assert(scan->numberOfOrderBys == 0);
+
+ if (so->nPtrs == 0)
+ return;
+
+ op.checkntids = so->nPtrs;
+ op.checktids = palloc_array(TM_VisCheck, so->nPtrs);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ Assert(ItemPointerIsValid(&so->heapPtrs[i]));
+
+ PopulateTMVischeck(&op.checktids[i], &so->heapPtrs[i], i);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->heapPtrs[check->idxoffnum]));
+ Assert(check->idxoffnum < op.checkntids);
+
+ so->visrecheck[check->idxoffnum] = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+}
+
+/* Populate so->visrecheck based on currently cached tuples */
+static void
+spgPopulateOrderedVisChecks(IndexScanDesc scan, SpGistScanOpaqueData *so)
+{
+ TM_IndexVisibilityCheckOp op;
+
+ if (so->nReorderThisPage == 0)
+ return;
+
+ Assert(so->nReorderThisPage > 0);
+ Assert(scan->numberOfOrderBys > 0);
+ Assert(so->items != NULL);
+
+ op.checkntids = so->nReorderThisPage;
+ op.checktids = palloc_array(TM_VisCheck, so->nReorderThisPage);
+ op.vmbuf = &so->vmbuf;
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ PopulateTMVischeck(&op.checktids[i], &so->items[i]->heapPtr, i);
+ Assert(ItemPointerIsValid(&so->items[i]->heapPtr));
+ Assert(so->items[i]->isLeaf);
+ }
+
+ table_index_vischeck_tuples(scan->heapRelation, &op);
+
+ for (int i = 0; i < op.checkntids; i++)
+ {
+ TM_VisCheck *check = &op.checktids[i];
+
+ Assert(check->idxoffnum < op.checkntids);
+ Assert(check->tidblkno == ItemPointerGetBlockNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+ Assert(check->tidoffset == ItemPointerGetOffsetNumberNoCheck(&so->items[check->idxoffnum]->heapPtr));
+
+ so->items[check->idxoffnum]->visrecheck = check->vischeckresult;
+ }
+
+ pfree(op.checktids);
+ so->nReorderThisPage = 0;
+}
+
/*
* Walk the tree and report all tuples passing the scan quals to the storeRes
* subroutine.
@@ -808,8 +917,8 @@ spgTestLeafTuple(SpGistScanOpaque so,
* next page boundary once we have reported at least one tuple.
*/
static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
- storeRes_func storeRes)
+spgWalk(IndexScanDesc scan, Relation index, SpGistScanOpaque so,
+ bool scanWholeIndex, storeRes_func storeRes)
{
Buffer buffer = InvalidBuffer;
bool reportedSome = false;
@@ -829,9 +938,23 @@ redirect:
{
/* We store heap items in the queue only in case of ordered search */
Assert(so->numberOfNonNullOrderBys > 0);
+
+ /*
+ * If an item we found on a page is retrieved immediately after
+ * processing that page, we won't yet have released the page pin,
+ * and thus won't yet have processed the visibility data of the
+ * page's (now) ordered tuples.
+ * Do that now, so that all tuples on the page we're about to
+ * unpin were checked for visibility before we returned any.
+ */
+ if (so->want_itup && so->nReorderThisPage)
+ spgPopulateOrderedVisChecks(scan, so);
+
+ Assert(!so->want_itup || item->visrecheck != TMVC_Unchecked);
storeRes(so, &item->heapPtr, item->value, item->isNull,
item->leafTuple, item->recheck,
- item->recheckDistances, item->distances);
+ item->recheckDistances, item->distances,
+ item->visrecheck);
reportedSome = true;
}
else
@@ -848,7 +971,15 @@ redirect:
}
else if (blkno != BufferGetBlockNumber(buffer))
{
- UnlockReleaseBuffer(buffer);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ Assert(so->numberOfOrderBys >= 0);
+ if (so->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+
+ ReleaseBuffer(buffer);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
}
@@ -916,16 +1047,36 @@ redirect:
}
if (buffer != InvalidBuffer)
- UnlockReleaseBuffer(buffer);
-}
+ {
+ /* Unlock the buffer for concurrent accesses except VACUUM */
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ /*
+ * If we're in an index-only scan, pre-check visibility of the tuples,
+ * so we can drop the pin without causing visibility bugs.
+ */
+ if (so->want_itup)
+ {
+ Assert(scan->numberOfOrderBys >= 0);
+
+ if (scan->numberOfOrderBys == 0)
+ spgPopulateUnorderedVischecks(scan, so);
+ else
+ spgPopulateOrderedVisChecks(scan, so);
+ }
+
+ /* Release the page */
+ ReleaseBuffer(buffer);
+ }
+}
/* storeRes subroutine for getbitmap case */
static void
storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *distances)
+ bool recheckDistances, double *distances,
+ TMVC_Result visres)
{
Assert(!recheckDistances && !distances);
tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
@@ -943,7 +1094,7 @@ spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
so->tbm = tbm;
so->ntids = 0;
- spgWalk(scan->indexRelation, so, true, storeBitmap);
+ spgWalk(scan, scan->indexRelation, so, true, storeBitmap);
return so->ntids;
}
@@ -953,12 +1104,15 @@ static void
storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
Datum leafValue, bool isnull,
SpGistLeafTuple leafTuple, bool recheck,
- bool recheckDistances, double *nonNullDistances)
+ bool recheckDistances, double *nonNullDistances,
+ TMVC_Result visres)
{
Assert(so->nPtrs < MaxIndexTuplesPerPage);
so->heapPtrs[so->nPtrs] = *heapPtr;
so->recheck[so->nPtrs] = recheck;
so->recheckDistances[so->nPtrs] = recheckDistances;
+ if (so->want_itup)
+ so->visrecheck[so->nPtrs] = visres;
if (so->numberOfOrderBys > 0)
{
@@ -1035,6 +1189,10 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_heaptid = so->heapPtrs[so->iPtr];
scan->xs_recheck = so->recheck[so->iPtr];
scan->xs_hitup = so->reconTups[so->iPtr];
+ if (so->want_itup)
+ scan->xs_visrecheck = so->visrecheck[so->iPtr];
+
+ Assert(!scan->xs_want_itup || scan->xs_visrecheck != TMVC_Unchecked);
if (so->numberOfOrderBys > 0)
index_store_float8_orderby_distances(scan, so->orderByTypes,
@@ -1064,7 +1222,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
}
so->iPtr = so->nPtrs = 0;
- spgWalk(scan->indexRelation, so, false, storeGettuple);
+ spgWalk(scan, scan->indexRelation, so, false, storeGettuple);
if (so->nPtrs == 0)
break; /* must have completed scan */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cb5671c1a4e..ef1dd59049f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,7 @@ spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBufferForCleanup(buffer);
page = BufferGetPage(buffer);
if (PageIsNew(page))
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index 083af5962a8..d7590259604 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "utils/geo_decls.h"
#include "utils/relcache.h"
+#include "tableam.h"
typedef struct SpGistOptions
@@ -175,7 +176,7 @@ typedef struct SpGistSearchItem
bool isLeaf; /* SearchItem is heap item */
bool recheck; /* qual recheck is needed */
bool recheckDistances; /* distance recheck is needed */
-
+ uint8 visrecheck; /* IOS: TMVC_Result of contained heap tuple */
/* array with numberOfOrderBys entries */
double distances[FLEXIBLE_ARRAY_MEMBER];
} SpGistSearchItem;
@@ -223,6 +224,7 @@ typedef struct SpGistScanOpaqueData
/* These fields are only used in amgettuple scans: */
bool want_itup; /* are we reconstructing tuples? */
+ Buffer vmbuf; /* IOS: used for table_index_vischeck_tuples */
TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */
int nPtrs; /* number of TIDs found on current page */
int iPtr; /* index for scanning through same */
@@ -235,6 +237,11 @@ typedef struct SpGistScanOpaqueData
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
+ /* support for IOS */
+ int nReorderThisPage;
+ uint8 *visrecheck; /* IOS vis check results, counted by nPtrs */
+ SpGistSearchItem **items; /* counted by nReorderThisPage */
+
/*
* Note: using MaxIndexTuplesPerPage above is a bit hokey since
* SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
--
2.50.1 (Apple Git-155)
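For context on how the SP-GiST changes above are meant to be consumed: the visibility result that storeGettuple() stashes in xs_visrecheck lets the index-only scan skip its own visibility-map lookup when the AM already proved the heap page all-visible while it still held the leaf-page pin. The executor-side change is not part of the hunks quoted above, so the following is only a rough sketch of that control flow; IndexScanDesc, VM_ALL_VISIBLE and the TMVC_* values are real names from core PostgreSQL or this series, while ios_fetch_heap_tuple() is a placeholder for the existing fetch-from-heap-and-test-snapshot path.

#include "postgres.h"

#include "access/relscan.h"
#include "access/tableam.h"
#include "access/visibilitymap.h"
#include "storage/itemptr.h"

/* placeholder for the executor's existing "fetch and test against snapshot" path */
extern bool ios_fetch_heap_tuple(IndexScanDesc scandesc, ItemPointer tid);

/*
 * Sketch only: how an index-only scan could consult the visibility result
 * cached by the index AM before touching the VM or the heap.
 */
static bool
ios_tuple_visible(IndexScanDesc scandesc, ItemPointer tid, Buffer *vmbuf)
{
	switch ((TMVC_Result) scandesc->xs_visrecheck)
	{
		case TMVC_Visible:
			/* the AM checked the VM while it still held the leaf page pin */
			return true;

		case TMVC_MaybeVisible:
		case TMVC_Unchecked:
			/* fall back to the pre-patch VM-then-heap logic */
			if (VM_ALL_VISIBLE(scandesc->heapRelation,
							   ItemPointerGetBlockNumber(tid), vmbuf))
				return true;
			return ios_fetch_heap_tuple(scandesc, tid);
	}

	return false;				/* unreachable; keeps the compiler happy */
}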
Attachment: v14-0002-pg_visibility-vectorize-collect_visibility_data.patch
From 7afeedb39ea1a08c2a9cbaf412aaa32f70b96bd5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 22:27:04 +0100
Subject: [PATCH v14 2/8] pg_visibility: vectorize collect_visibility_data
---
contrib/pg_visibility/pg_visibility.c | 40 +++++++++++++++++++++------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 7046c1b5f8e..8c76b0bf7f6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -510,6 +510,9 @@ collect_visibility_data(Oid relid, bool include_pd)
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
BlockRangeReadStreamPrivate p;
ReadStream *stream = NULL;
+#define VM_BATCHSIZE 1024
+ BlockNumber *blknos;
+ uint8 *status;
rel = relation_open(relid, AccessShareLock);
@@ -521,6 +524,9 @@ collect_visibility_data(Oid relid, bool include_pd)
info->next = 0;
info->count = nblocks;
+ blknos = palloc0_array(BlockNumber, VM_BATCHSIZE);
+ status = palloc0_array(uint8, VM_BATCHSIZE);
+
/* Create a stream if reading main fork. */
if (include_pd)
{
@@ -541,30 +547,44 @@ collect_visibility_data(Oid relid, bool include_pd)
0);
}
- for (blkno = 0; blkno < nblocks; ++blkno)
+ for (blkno = 0; blkno < nblocks;)
{
- int32 mapbits;
+ int batchsize = 0;
+
+ for (BlockNumber bno = blkno; batchsize < VM_BATCHSIZE && bno < nblocks;)
+ blknos[batchsize++] = bno++;
/* Make sure we are interruptible. */
CHECK_FOR_INTERRUPTS();
- /* Get map info. */
- mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
- info->bits[blkno] |= (1 << 0);
- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
- info->bits[blkno] |= (1 << 1);
+ /* Get map info in bulk. */
+ visibilitymap_get_statusv(rel, blknos, status, batchsize, &vmbuffer);
+
+ /* move the status bits */
+ for (int i = 0; i < batchsize; i++)
+ {
+ uint32 mapbits = status[i];
+ BlockNumber bno = blknos[i];
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ info->bits[bno] |= (1 << 0);
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ info->bits[bno] |= (1 << 1);
+ }
/*
* Page-level data requires reading every block, so only get it if the
* caller needs it. Use a buffer access strategy, too, to prevent
* cache-trashing.
*/
- if (include_pd)
+ for (int i = 0; include_pd && i < batchsize; i++)
{
Buffer buffer;
Page page;
+ /* This subloop should be interruptible, since it does I/O */
+ CHECK_FOR_INTERRUPTS();
+
buffer = read_stream_next_buffer(stream, NULL);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
@@ -574,6 +594,8 @@ collect_visibility_data(Oid relid, bool include_pd)
UnlockReleaseBuffer(buffer);
}
+
+ blkno += batchsize;
}
if (include_pd)
--
2.50.1 (Apple Git-155)
Attachment: v14-0001-Add-vectorized-API-for-visibility-map-lookup.patch
From bfe23a47e1af52e327247920fe829e690a889497 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 21:57:54 +0100
Subject: [PATCH v14 1/8] Add vectorized API for visibility map lookup
This allows for faster VM lookups when you have a batch of
pages to check, and will be used by future visibility checks
in the Heap table access method, all to support more
efficient Index-Only scans.
---
src/backend/access/heap/visibilitymap.c | 149 ++++++++++++++++++++----
src/include/access/visibilitymap.h | 14 ++-
2 files changed, 142 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d14588e92ae..40cff906eeb 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,7 +16,7 @@
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set bit(s) in a previously pinned page and log
* visibilitymap_set_vmbits - set bit(s) in a pinned page
- * visibilitymap_get_status - get status of bits
+ * visibilitymap_get_statusv - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_prepare_truncate -
* prepare for truncation of the visibility map
@@ -119,6 +119,9 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+/* map VM blocks back to the first heap block on that page */
+#define MAPBLOCK_TO_HEAPBLK(x) ((x) * HEAPBLOCKS_PER_PAGE)
+
/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
@@ -391,7 +394,32 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
}
/*
- * visibilitymap_get_status - get status of bits
+ * Do a binary search over the provided sorted array of BlockNumber, returning
+ * the index of the first entry >= the provided key, clamped to the last valid
+ * index so that the result can always be used to address the array.
+ */
+static int
+find_index_in_block_array(BlockNumber key, const BlockNumber *blknos, int nblocks)
+{
+ int low = 0,
+ high = nblocks - 1;
+
+ /* lower-bound binary search, clamped to the last element */
+ while (low != high)
+ {
+ int mid = low + (high - low) / 2;
+
+ if (blknos[mid] >= key)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * visibilitymap_get_statusv - get status of bits
*
* Are all tuples on heapBlk visible to all or are marked frozen, according
* to the visibility map?
@@ -402,6 +430,9 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
* releasing *vmbuf after it's done testing and setting bits.
*
+ * The caller is responsible for providing a sorted array of unique heap
+ * blocks, and providing sufficient space in *status.
+ *
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
* since we don't lock the visibility map page either, it's even possible that
@@ -409,45 +440,123 @@ visibilitymap_set_vmbits(BlockNumber heapBlk,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-uint8
-visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
+void
+visibilitymap_get_statusv(Relation rel, const BlockNumber *heapBlks, uint8 *status,
+ int nblocks, Buffer *vmbuf)
{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
- char *map;
- uint8 result;
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[0]);
+ int startOff = 0;
+ int currblk;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_statusv %s %d", RelationGetRelationName(rel), heapBlks[0]);
#endif
/* Reuse the old pinned buffer if possible */
if (BufferIsValid(*vmbuf))
{
- if (BufferGetBlockNumber(*vmbuf) != mapBlock)
+ BlockNumber curMapBlock = BufferGetBlockNumber(*vmbuf);
+
+ /*
+ * If we have more than one block, but the head of the array isn't
+ * on the current VM page, it's still possible that the current VM
+ * page contains some other requested pages' visibility status. To
+ * figure out if we must swap the buffer now, we search the array to
+ * find the location where such a BlockNumber should be located.
+ *
+ * The index that's returned references the first BlockNumber >=
+ * firstHeapBlock, so it may reference a different VM page entirely.
+ * That's fine, we do have a later check which verifies whether that
+ * block belongs to the current VM buffer, and if not, we bail out.
+ */
+ if (nblocks > 1 && curMapBlock != mapBlock)
+ {
+ BlockNumber firstHeapBlk = MAPBLOCK_TO_HEAPBLK(curMapBlock);
+ startOff = find_index_in_block_array(firstHeapBlk, heapBlks, nblocks);
+ }
+
+ /*
+ * Bail if we still don't have pages for this VM buffer.
+ */
+ if (curMapBlock != HEAPBLK_TO_MAPBLOCK(heapBlks[startOff]))
{
+ startOff = 0;
ReleaseBuffer(*vmbuf);
*vmbuf = InvalidBuffer;
}
}
- if (!BufferIsValid(*vmbuf))
+ /* We may jump back here if we started processing the array only partway through */
+restart:
+ currblk = startOff;
+ mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[currblk]);
+
+ /* grab the VM buffer for our mapBlock, if we didn't have it already */
+ if (*vmbuf == InvalidBuffer)
{
*vmbuf = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(*vmbuf))
- return (uint8) 0;
+
+ if (*vmbuf == InvalidBuffer)
+ goto endOfVisMap;
}
- map = PageGetContents(BufferGetPage(*vmbuf));
+ /* main loop */
+ while (1)
+ {
+ char *map = PageGetContents(BufferGetPage(*vmbuf));
+ int64 firstNext = MAPBLOCK_TO_HEAPBLK((int64) mapBlock) + (int64) HEAPBLOCKS_PER_PAGE;
+
+ /* Check the visibility status of all heap blocks on the current VM page */
+ for (;currblk < nblocks && ((int64) (heapBlks[currblk])) < firstNext; currblk++)
+ {
+ uint32 mapByte;
+ uint8 mapOffset;
+
+ mapByte = HEAPBLK_TO_MAPBYTE(heapBlks[currblk]);
+ mapOffset = HEAPBLK_TO_OFFSET(heapBlks[currblk]);
+
+ status[currblk] = (map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS;
+ }
+
+ /* end of the scan */
+ if (currblk >= nblocks)
+ break;
+
+ /* prepare the vm buffer for the next vm block */
+ ReleaseBuffer(*vmbuf);
+ mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlks[currblk]);
+ *vmbuf = vm_readbuf(rel, mapBlock, false);
+
+ if (*vmbuf == InvalidBuffer)
+ goto endOfVisMap;
+ }
+
+endOfVisMap:
+ /* set visibility map result to 0 for blocks past the end of the VM */
+ while (currblk < nblocks)
+ status[currblk++] = 0;
/*
- * A single byte read is atomic. There could be memory-ordering effects
- * here, but for performance reasons we make it the caller's job to worry
- * about that.
+ * If we started processing in the middle of the array to reduce buffer
+ * churn, we loop back to restart here
*/
- result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
- return result;
+ if (startOff > 0)
+ {
+ nblocks = startOff;
+ startOff = 0;
+
+ /*
+ * The next loop around will work on a different page, so we should
+ * release this buffer.
+ */
+ if (BufferIsValid(*vmbuf))
+ {
+ ReleaseBuffer(*vmbuf);
+ *vmbuf = InvalidBuffer;
+ }
+
+ goto restart;
+ }
}
/*
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index c6fa37be968..1fce032a48c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -41,9 +41,21 @@ extern uint8 visibilitymap_set(Relation rel,
extern uint8 visibilitymap_set_vmbits(BlockNumber heapBlk,
Buffer vmBuf, uint8 flags,
const RelFileLocator rlocator);
-extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_get_statusv(Relation rel, const BlockNumber *heapBlks,
+ uint8 *statusv, int nblocks,
+ Buffer *vmbuf);
extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
BlockNumber nheapblocks);
+static inline uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
+{
+ uint8 status;
+
+ visibilitymap_get_statusv(rel, &heapBlk, &status, 1, vmbuf);
+
+ return status;
+}
+
#endif /* VISIBILITYMAP_H */
--
2.50.1 (Apple Git-155)
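To make the calling convention of the new batched lookup concrete, here is a small hypothetical caller (not part of the patch). It relies only on what 0001 declares: heapBlks must be sorted and duplicate-free, status must have room for nblocks results, and the caller releases the VM buffer when done.

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical helper: count how many of the given heap blocks are marked
 * all-visible, using one pass over the visibility map.  'heapBlks' must be
 * sorted and unique, per the contract of visibilitymap_get_statusv().
 */
static int
count_all_visible(Relation rel, const BlockNumber *heapBlks, int nblocks)
{
	uint8	   *status = palloc(nblocks);
	Buffer		vmbuf = InvalidBuffer;
	int			nvisible = 0;

	visibilitymap_get_statusv(rel, heapBlks, status, nblocks, &vmbuf);

	for (int i = 0; i < nblocks; i++)
	{
		if (status[i] & VISIBILITYMAP_ALL_VISIBLE)
			nvisible++;
	}

	/* the caller of the batched API is responsible for the VM buffer */
	if (BufferIsValid(vmbuf))
		ReleaseBuffer(vmbuf);
	pfree(status);

	return nvisible;
}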
Attachment: v14-0003-TableAM-Support-AM-specific-fast-visibility-test.patch
From 57b2f54e3b6aa2a4719bfd036dc350d0f2431fa7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 19 Dec 2025 23:58:40 +0100
Subject: [PATCH v14 3/8] TableAM: Support AM-specific fast visibility tests
Previously, we assumed VM_ALL_VISIBLE(...) is universal across all
AMs. This is probably not the case, so we introduce a new table
method called "table_index_vischeck_tuples" which allows anyone to
ask the AM whether a tuple (or list of tuples) is definitely visible
to us, or might be deleted or otherwise invisible.
We implement that method directly for HeapAM; usage of the facility
will follow in later commits.
---
src/backend/access/heap/heapam.c | 124 ++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/table/tableamapi.c | 1 +
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 125 +++++++++++++++++++++++
5 files changed, 253 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6daf4a87dec..d29346a2fee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -106,6 +106,20 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+/* sort template definitions for index visibility checks */
+#define ST_SORT heap_ivc_sortby_tidheapblk
+#define ST_ELEMENT_TYPE TM_VisCheck
+#define ST_DECLARE
+#define ST_DEFINE
+#define ST_SCOPE static inline
+#define ST_COMPARE(a, b) ( \
+ a->tidblkno < b->tidblkno ? -1 : ( \
+ a->tidblkno > b->tidblkno ? 1 : 0 \
+ ) \
+)
+
+#include "lib/sort_template.h"
+
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -8813,6 +8827,116 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
+/*
+ * heapam implementation of tableam's index_vischeck_tuples interface.
+ *
+ * This helper function is called by index AMs during index-only scans,
+ * to do VM-based visibility checks on individual tuples, so that the AM
+ * can hold the tuple in memory (e.g. for reordering) for extended periods of
+ * time without holding thousands of pins that would conflict with VACUUM.
+ *
+ * It's possible for this to generate a fair amount of I/O, since we may be
+ * checking hundreds of tuples from a single index block, but that is
+ * preferred over holding thousands of pins.
+ *
+ * We use heuristics to balance the costs of sorting TIDs with VM page
+ * lookups.
+ */
+void
+heap_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ TM_VisCheck *checks = checkop->checktids;
+ int checkntids = checkop->checkntids;
+ int nblocks = 1;
+ BlockNumber *blknos;
+ uint8 *status;
+ TMVC_Result res;
+
+ if (checkntids == 0)
+ return;
+
+ /*
+ * Order the TIDs to heap order, so that we will only need to visit every
+ * VM page at most once.
+ */
+ heap_ivc_sortby_tidheapblk(checks, checkntids);
+
+ for (int i = 0; i < checkntids - 1; i++)
+ {
+ if (checks[i].tidblkno != checks[i + 1].tidblkno)
+ {
+ Assert(checks[i].tidblkno < checks[i + 1].tidblkno);
+ nblocks++;
+ }
+ }
+
+ /*
+ * No need to allocate arrays or do other (comparatively expensive)
+ * bookkeeping when we have only one block to check.
+ */
+ if (nblocks == 1)
+ {
+ if (VM_ALL_VISIBLE(rel, checks[0].tidblkno, checkop->vmbuf))
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+
+ for (int i = 0; i < checkntids; i++)
+ checks[i].vischeckresult = res;
+
+ return;
+ }
+
+ blknos = palloc_array(BlockNumber, nblocks);
+ status = palloc_array(uint8, nblocks);
+
+ blknos[0] = checks[0].tidblkno;
+
+ /* fill in the rest of the blknos array with unique block numbers */
+ for (int i = 0, j = 0; i < checkntids; i++)
+ {
+ Assert(BlockNumberIsValid(checks[i].tidblkno));
+
+ if (checks[i].tidblkno != blknos[j])
+ blknos[++j] = checks[i].tidblkno;
+ }
+
+ /* do the actual visibility checks */
+ visibilitymap_get_statusv(rel, blknos, status, nblocks, checkop->vmbuf);
+
+ /*
+ * 'res' is the current TMVC value for blknos[j] below. It is updated
+ * inside the loop, but only when j is updated, so we must initialize it
+ * here, or we'd store uninitialized data instead of a TMVC_Result value for
+ * the first block's result.
+ */
+ if (status[0] & VISIBILITYMAP_ALL_VISIBLE)
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+
+ /* copy the results of blknos into the TM_VisChecks */
+ for (int i = 0, j = 0; i < checkntids; i++)
+ {
+ if (checks[i].tidblkno != blknos[j])
+ {
+ j += 1;
+ Assert(checks[i].tidblkno == blknos[j]);
+
+ if (status[j] & VISIBILITYMAP_ALL_VISIBLE)
+ res = TMVC_Visible;
+ else
+ res = TMVC_MaybeVisible;
+ }
+
+ checks[i].vischeckresult = res;
+ }
+
+ /* and clean up the resources we'd used */
+ pfree(status);
+ pfree(blknos);
+}
+
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dd4fe6bf62f..6189557cbbb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2648,6 +2648,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.index_delete_tuples = heap_index_delete_tuples,
+ .index_vischeck_tuples = heap_index_vischeck_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 476663b66aa..b3ce90ceaea 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -61,6 +61,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->index_delete_tuples != NULL);
+ Assert(routine->index_vischeck_tuples != NULL);
Assert(routine->tuple_insert != NULL);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f7e4ae3843c..faf4f3a585a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -407,6 +407,8 @@ extern void simple_heap_update(Relation relation, const ItemPointerData *otid,
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
+extern void heap_index_vischeck_tuples(Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2fa790b6bf5..52acf8c1985 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -254,6 +254,69 @@ typedef struct TM_IndexDeleteOp
TM_IndexStatus *status;
} TM_IndexDeleteOp;
+/*
+ * State used when calling table_index_vischeck_tuples()
+ *
+ * Index-only scans need to know the visibility of the associated table tuples
+ * before they can return the index tuple. If the index tuple is known to be
+ * visible with a cheap check, we can return it directly without requesting
+ * the visibility info from the table AM itself.
+ *
+ * This AM API exposes a cheap bulk visibility checking API to indexes,
+ * allowing these indexes to check multiple tuples' worth of visibility info at
+ * once, and allowing the index AM to store the results. This improves the pinning
+ * ergonomics of index AMs by allowing a scan to cache index tuples in memory
+ * without holding pins on these index tuple pages until the index tuples are
+ * returned.
+ *
+ * The method is called with a list of TIDs, and its output will indicate the
+ * visibility state of each tuple: Unchecked, Dead, MaybeVisible, or Visible.
+ *
+ * HeapAM's implementation of visibility maps only allows for cheap checks of
+ * *definitely visible*; all other results are *maybe visible*. A result for
+ * *definitely not visible* (i.e. dead) is currently not provided, for lack of
+ * table AMs that can support such visibility lookups cheaply. However, if a
+ * Table AM were to implement this, it could be used to quickly skip the
+ * current tuple in index scans, without having to ask the Table AM for that
+ * TID's data.
+ */
+typedef enum TMVC_Result
+{
+ TMVC_Unchecked = 0,
+ TMVC_Visible = 1,
+ TMVC_MaybeVisible = 2,
+
+#define TMVC_MAX TMVC_MaybeVisible
+} TMVC_Result;
+
+typedef struct TM_VisCheck
+{
+ /* TID from index tuple; deformed to not waste time during sort ops */
+ BlockNumber tidblkno;
+ uint16 tidoffset;
+ /* identifier for the TID in this visibility check operation context */
+ OffsetNumber idxoffnum;
+ /* the result of the visibility check operation */
+ TMVC_Result vischeckresult;
+} TM_VisCheck;
+
+static inline void
+PopulateTMVischeck(TM_VisCheck *check, ItemPointer tid, OffsetNumber idxoff)
+{
+ Assert(ItemPointerIsValid(tid));
+ check->tidblkno = ItemPointerGetBlockNumberNoCheck(tid);
+ check->tidoffset = ItemPointerGetOffsetNumberNoCheck(tid);
+ check->idxoffnum = idxoff;
+ check->vischeckresult = TMVC_Unchecked;
+}
+
+typedef struct TM_IndexVisibilityCheckOp
+{
+ int checkntids; /* number of TIDs to check */
+ Buffer *vmbuf; /* pointer to VM buffer to reuse across calls */
+ TM_VisCheck *checktids; /* the checks to execute */
+} TM_IndexVisibilityCheckOp;
+
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -500,6 +563,10 @@ typedef struct TableAmRoutine
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
+ /* see table_index_vischeck_tuples() */
+ void (*index_vischeck_tuples) (Relation rel,
+ TM_IndexVisibilityCheckOp *checkop);
+
/* ------------------------------------------------------------------------
* Manipulations of physical tuples.
@@ -1333,6 +1400,64 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
+/*
+ * Determine rough visibility information of index tuples based on each TID.
+ *
+ * Determines which entries from the index AM caller's TM_IndexVisibilityCheckOp
+ * state point to TMVC_Visible or TMVC_MaybeVisible table tuples, at low I/O
+ * overhead. For the heap AM, the implementation is effectively a wrapper
+ * around VM_ALL_VISIBLE.
+ *
+ * On return, all TM_VisChecks indicated by checkop->checktids will have been
+ * updated with the correct visibility status.
+ *
+ * Note that there is no value for "definitely dead" tuples, as the Heap AM
+ * doesn't have an efficient method to determine that a tuple is dead to all
+ * users, as it would have to go into the heap. If and when AMs are built
+ * that would support VM checks with an equivalent to VM_ALL_DEAD this
+ * decision can be reconsidered.
+ */
+static inline void
+table_index_vischeck_tuples(Relation rel, TM_IndexVisibilityCheckOp *checkop)
+{
+ rel->rd_tableam->index_vischeck_tuples(rel, checkop);
+
+#if USE_ASSERT_CHECKING
+ for (int i = 0; i < checkop->checkntids; i++)
+ {
+ TMVC_Result res = checkop->checktids[i].vischeckresult;
+
+ if (res <= TMVC_Unchecked || res > TMVC_MAX)
+ {
+ elog(PANIC, "Unexpected vischeckresult %d at offset %d/%d, expected value between %d and %d inclusive",
+ checkop->checktids[i].vischeckresult,
+ i, checkop->checkntids,
+ TMVC_Visible,
+ TMVC_MaybeVisible);
+ }
+ }
+#endif
+}
+
+static inline TMVC_Result
+table_index_vischeck_tuple(Relation rel, Buffer *vmbuffer, ItemPointer tid)
+{
+ TM_IndexVisibilityCheckOp checkOp;
+ TM_VisCheck op;
+
+ PopulateTMVischeck(&op, tid, 0);
+
+ checkOp.checktids = &op;
+ checkOp.checkntids = 1;
+ checkOp.vmbuf = vmbuffer;
+
+ rel->rd_tableam->index_vischeck_tuples(rel, &checkOp);
+
+ Assert(op.vischeckresult != TMVC_Unchecked);
+
+ return op.vischeckresult;
+}
+
/* ----------------------------------------------------------------------------
* Functions for manipulations of physical tuples.
--
2.50.1 (Apple Git-155)
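Finally, a sketch of how an index AM might drive the new table AM callback for a page's worth of buffered TIDs. This is not taken from the patches (the real index AM callers live in later patches of the series), and the tids/results arrays plus batch_vischeck_page() itself are hypothetical scan-local state and a hypothetical helper. The one contract worth highlighting is that the implementation may sort checktids in place, so idxoffnum is what maps each result back to the caller's own ordering.

#include "postgres.h"

#include "access/tableam.h"
#include "storage/itemptr.h"

/*
 * Hypothetical caller: batch-check the visibility of the TIDs buffered for
 * one leaf page while we still hold the pin, storing one TMVC_Result per
 * slot in 'results' (indexed the same way as 'tids').
 */
static void
batch_vischeck_page(Relation heaprel, Buffer *vmbuf,
					ItemPointerData *tids, uint8 *results, int ntids)
{
	TM_IndexVisibilityCheckOp checkop;
	TM_VisCheck *checks = palloc(sizeof(TM_VisCheck) * ntids);

	for (int i = 0; i < ntids; i++)
		PopulateTMVischeck(&checks[i], &tids[i], (OffsetNumber) i);

	checkop.checkntids = ntids;
	checkop.vmbuf = vmbuf;		/* reused across calls by the scan */
	checkop.checktids = checks;

	table_index_vischeck_tuples(heaprel, &checkop);

	/* the AM may have re-sorted checks[]; idxoffnum maps results back */
	for (int i = 0; i < ntids; i++)
		results[checks[i].idxoffnum] = checks[i].vischeckresult;

	pfree(checks);
}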