Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
The code in gistvacuum.c is closely based on similar code in nbtree.c,
except that it only acquires an exclusive lock -- not a
super-exclusive lock. I suspect that that's because it seemed
unnecessary; nbtree plain index scans have their own special reasons
for this, that don't apply to GiST. Namely: nbtree plain index scans
that don't use an MVCC snapshot clearly need some other interlock to
protect against concurrent recycling of pointed-to-by-leaf-page TIDs.
And so as a general rule nbtree VACUUM needs a full
super-exclusive/cleanup lock, just in case there is a plain index scan
that uses some other kind of snapshot (logical replication, say).
To say the same thing another way: nbtree follows "the third rule"
described by "62.4. Index Locking Considerations" in the docs [1], but
GiST does not. The idea that GiST's behavior is okay here does seem
consistent with what the same docs go on to say about it: "When using
an MVCC-compliant snapshot, there is no problem because the new
occupant of the slot is certain to be too new to pass the snapshot
test".
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
The (admittedly undocumented) reason why we can never drop the leaf
page pin for an index-only scan in nbtree (but can do so for plain
index scans) also relates to heap interlocking -- but with a twist.
Here's the twist: the second heap pass by VACUUM can set visibility
map bits independently of the first (once LP_DEAD items from the first
pass over the heap are set to LP_UNUSED, which renders the page
all-visible) -- this all happens at the end of
lazy_vacuum_heap_page(). That's why index-only scans can't just assume
that VACUUM won't have deleted the TID from the leaf page they're
scanning immediately after they're done reading it. VACUUM could even
manage to set the visibility map bit for a relevant heap page inside
lazy_vacuum_heap_page(), before the index-only scan can read the
visibility map. If that is allowed to happen, the index-only scan would
give wrong answers if one of the TID references held in local memory
by the index-only scan happens to be marked LP_UNUSED inside
lazy_vacuum_heap_page(). IOW, it looks like we run the risk of a
concurrently recycled dead-to-everybody TID becoming visible during
GiST index-only scans, just because we have no interlock.
In summary:
IIUC this is only safe for nbtree because 1.) It acquires a
super-exclusive lock when vacuuming leaf pages, and 2.) Index-only
scans never drop their pin on the leaf page when accessing the
visibility map "in-sync" with the scan (of course we hope not to
access the heap proper at all for index-only scans). These precautions
are both necessary to make the race condition I describe impossible,
because they ensure that VACUUM cannot reach lazy_vacuum_heap_page()
until after our index-only scan reads the visibility map (and then has
to read the heap for at least that one dead-to-all TID, discovering
that the TID is dead to its snapshot). Why wouldn't GiST need to take
the same precautions, though?
[1]: https://www.postgresql.org/docs/devel/index-locking.html
--
Peter Geoghegan
04.11.2021, 04:33, "Peter Geoghegan" <pg@bowt.ie>:
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
Let's enumerate steps how things can go wrong.
Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
Backend2: Remove index_tuple1 and heap_tuple1
Backend3: Mark page1 all-visible
Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do LockBufferForCleanup(index_page1). Do I get things right?
Best regards, Andrey Borodin.
On Thu, Nov 4, 2021 at 8:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Let's enumerate steps how things can go wrong.
Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
Backend2: Remove index_tuple1 and heap_tuple1
Backend3: Mark page1 all-visible
Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do LockBufferForCleanup(index_page1). Do I get things right?
Almost. Backend3 is actually Backend2 here (there is no 3) -- it runs
VACUUM throughout.
Note that it's not particularly likely that Backend2/VACUUM will "win"
this race, because it typically has to do much more work than
Backend1. It has to actually remove the index tuples from the leaf
page, then any other index work (for this and other indexes). Then it
has to arrive back in vacuumlazy.c to set the VM bit in
lazy_vacuum_heap_page(). That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
--
Peter Geoghegan
4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.
Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer too?
Best regards, Andrey Borodin.
On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?
I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.
I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.
Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer too?
No, because there is no heap vacuuming involved (because that doesn't
happen outside vacuumlazy.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work, and index-only scans. Making LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.
I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated. But
AFAICT it's always some kind of "interlock with heapam VACUUM" issue,
with TID recycling, with no protection from our MVCC snapshot. I would
say that that's the "real problem" here, when I get to first principles.
Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.
We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan
Attachments:
v1-0001-nbtree-README-Improve-VACUUM-interlock-section.patch (+75, -80)
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated.
Minor correction: we actually also have to worry about plain index
scans that don't use an MVCC snapshot, which is possible within
nbtree. It's quite likely when using logical replication, actually.
See the patch for more.
Like with the index-only scan case, a non-MVCC snapshot + plain nbtree
index scan cannot rely on heap access within the index scan node -- it
won't reliably notice that any newer heap tuples (that are really the
result of concurrent TID recycling) are not actually visible to its
MVCC snapshot -- because there isn't an MVCC snapshot. The only
difference in the index-only scan scenario is that we use the
visibility map (not the heap) -- which is racy in a way that makes
our MVCC snapshot (IOSs always have an MVCC snapshot) an ineffective
protection.
In summary, to be safe against confusion from concurrent TID recycling
during index/index-only scans, we can do either of the following
things:
1. Hold a pin of our leaf page while accessing the heap -- that'll
definitely conflict with the cleanup lock that nbtree VACUUM will
inevitably try to acquire on our leaf page.
OR:
2. Hold an MVCC snapshot, AND do an actual heap page access during the
plain index scan -- do both together.
With approach 2, our plain index scan must determine visibility using
real XIDs (against something like a dirty snapshot), rather than using
a visibility map bit. That is also necessary because the VM might
become invalid or ambiguous, in a way that's clearly not possible when
looking at full heap tuple headers with XIDs -- concurrent recycling
becomes safe if we know that we'll reliably notice it and not give
wrong answers.
Does that make sense?
--
Peter Geoghegan
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.
I pushed a commit that described how this works for nbtree, in the README file.
I think that there might be an even more subtle race condition in
nbtree itself, though, during recovery. We no longer do a "pin scan"
during recovery these days (see commits 9f83468b, 3e4b7d87, and
687f2cd7 for full information). I think that it might be necessary to
do that, just for the benefit of index-only scans -- if it's necessary
during original execution, then why not during recovery?
The work to remove "pin scans" was justified by pointing out that we
no longer use various kinds of snapshots during recovery, but it said
nothing about index-only scans, which need the TID recycling interlock
(i.e. need to hold onto a leaf page while accessing the heap in sync)
even with an MVCC snapshot. It's easy to imagine how it might have
been missed: nobody ever documented the general issue with index-only
scans, until now. Commit 2ed5b87f recognized they were unsafe for the
optimization that it added (to avoid blocking VACUUM), but never
explained why they were unsafe.
Going back to doing pin scans during recovery seems deeply
unappealing, especially to fix a totally narrow race condition.
--
Peter Geoghegan
On Wed, Nov 3, 2021 at 7:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.
(Reviving this old thread after 3 years)
I was reminded of this old thread during today's discussion of a
tangentially related TID-recycle-safety bug that affects bitmap index
scans that use the visibility map as an optimization [1]. Turns out I
was right to be concerned.
This GiST bug is causally unrelated to that other bug, so I thought it
would be helpful to move discussion of the GiST bug to this old
thread.
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
Offhand, I think that the only way to fix this is to bring GiST in
line with nbtree: we ought to teach GiST VACUUM to start acquiring
cleanup locks (previously known as super-exclusive locks), and then
teach GiST index-only scans to hold onto a leaf page buffer pin as the
visibility map (or the heap proper) is accessed for the TIDs to be
returned from the leaf page. Arguably, GiST isn't obeying the current
contract for amgettuple index AMs at all right now (whether or not it
violates the contract as written is open to interpretation, I suppose,
but either way the current behavior is wrong).
We probably shouldn't hold onto a buffer pin during plain GiST
index-only scans, though -- plain GiST index scans *aren't* broken,
and so we should change as little as possible there. More concretely,
we should probably copy more nbtree scan related code into GiST to
deal with all this: we could copy nbtree's _bt_drop_lock_and_maybe_pin
into GiST to fix this bug, while avoiding changing the performance
characteristics of GiST plain index scans. This will also entail
adding a new buffer field to GiST's GISTScanOpaqueData struct --
something similar to nbtree's BTScanOpaqueData.currPos.buf field
(it'll go next to the current GISTScanOpaqueData.curBlkno field, just
like the nbtree equivalent goes next to its own currPage blkno field).
Long term, code like nbtree's _bt_drop_lock_and_maybe_pin should be
generalized and removed from every individual index AM, nbtree
included -- I think that the concepts generalize to every index AM
that supports amgettuple (the race condition in question has
essentially nothing to do with individual index AM requirements). I've
discussed this kind of approach with Tomas Vondra (CC'd) recently.
That's not going to be possible within the scope of a backpatchable
fix, though.
[1]: /messages/by-id/873c33c5-ef9e-41f6-80b2-2f5e11869f1c@garret.ru
[2]: /messages/by-id/CAH2-Wzm6gBqc99iEKO6540ynwpjOqWESt5yjg-bHbt0hc8DPsw@mail.gmail.com
--
Peter Geoghegan
Attachments:
v2-0001-isolationtester-showing-broken-index-only-scans-w.patch (+119, -1)
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.
I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
--
Peter Geoghegan
On Tue, 3 Dec 2024 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.
I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.
I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
I think I have a fix for GiST which can be found attached in patch 0002.
As to how this works: the patch tracks (for IOS) the pages for which
there are some entries yet to be returned by gistgettuple(), and keeps
a pin on those pages using a new AMscan tracking mechanism that
utilizes buffer refcounts. Even if it might not be a very elegant
solution and IMV still has rough edges, it works, and fixes the issue
with incorrect results from the GiST index.
One side effect of this change to keep pins in GiST IOS is that
this could realistically keep pins on a huge portion of the index,
thus exhausting shared buffers and increasing prevalence of "no
unpinned buffers"-related errors.
I haven't looked much at SP-GiST yet, so I don't have anything for the
VACUUM+IOS bug there.
0001 is a slight modification of your v2-0001, a version which now
(critically) doesn't expect VACUUM to run to completion before
s1_fetch_all starts; this is important for 0002 as that causes vacuum
to block and wait for the cursor to return more tuples, which the
isolation tester doesn't (can't?) detect. With only 0001, the new test
fails with incorrect results, with 0002 applied the test succeeds.
I'm looking forward to any feedback.
Kind regards,
Matthias van de Meent
Attachments:
v3-0001-isolationtester-showing-broken-index-only-scans-w.patch (+197, -1)
v3-0002-RFC-Extend-buffer-pin-lifetime-for-GIST-IOS.patch (+192, -5)
On Sat, 4 Jan 2025 at 02:00, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 3 Dec 2024 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a refined version of a test case I posted earlier on [2],
decisively proving that GiST index-only scans are in fact subtly
broken. Right now it fails, showing a wrong answer to a query. The
patch adds an isolationtest test case to btree_gist, based on a test
case of Andres'.I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.
I think I have a fix for GiST which can be found attached in patch 0002.
As to how this works: the patch tracks (for IOS) the pages for which
there are some entries yet to be returned by gistgettuple(), and keeps
a pin on those pages using a new AMscan tracking mechanism that
utilizes buffer refcounts. Even if it might not be a very elegant
solution and IMV still has rough edges, it works, and fixes the issue
with incorrect results from the GiST index.
In the attached v4 of the fix I've opted to replace the bespoke pin
tracker of my previous fix with the default buffer pin tracking
mechanism, which I realised has been designed for the same purpose and
doesn't require additional memory management on the AM side.
It massively simplifies the code, reduces allocation overhead, and
allowed me to port the fix to SP-GiST much quicker.
I haven't looked much at SP-GiST yet, so I don't have anything for the
VACUUM+IOS bug there.
I've attached a fix that uses the same approach as GiST in v4-0003. I
couldn't find any spgist extensions in contrib to copy-paste the tests
to, but manual testing did show vacuum now does wait for the index
scan to finish page processing.
I'm looking forward to any feedback.
Like patch 0001, the status of that has not changed.
Kind regards,
Matthias van de Meent
Attachments:
v4-0001-isolationtester-showing-broken-index-only-scans-w.patch (+197, -1)
v4-0003-RFC-Extend-buffer-pinning-for-SP-GIST-IOS.patch (+100, -18)
v4-0002-RFC-Extend-buffer-pinning-for-GIST-IOS.patch (+159, -15)
Hello.
One thing I think we could add to the patches is to adapt the 10-year-old
comment below with notice about IOS:
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
* the pin for MVCC scans, which allows vacuum to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
Also, I think it is a good idea to add "Assert(!scan->xs_want_itup);"
to gistkillitems.
Best regards,
Mikhail.
On Thu, 9 Jan 2025 at 22:00, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Hello.
One thing I think we could add to the patches is to adapt the 10-year-old comment below with a notice about IOS:
/*
* We save the LSN of the page as we read it, so that we know whether it
* safe to apply LP_DEAD hints to the page later. This allows us to drop
* the pin for MVCC scans, which allows vacuum to avoid blocking.
*/
so->curPageLSN = BufferGetLSNAtomic(buffer);
I don't quite read it as covering IOS. To me, the comment is more
along the lines of (extensively extended):
"""
We'd like to use kill_prior_tuple, but that requires us to apply
changes to the page when we're already done with it for all intents
and purposes (because we scan the page once and buffer results). We
can either keep a pin on the buffer, or re-acquire that page after
finishing producing the tuples from this page.
Pinning the page blocks vacuum [^1], so instead we drop the pin, then
collect LP_DEAD marks, and then later we re-acquire the page to mark
the tuples dead. However, in the meantime the page may have changed;
by keeping a tab on changes in LSN we have a cheap method of detecting
changes in the page itself. [^2]
"""
... and that doesn't seem to cover much of IOS. MVCC index scans
aren't that special, practically every user query has MVCC. I think
this "MVCC scan" even means non-IOS scan, as I can't think of a reason
why dropping a pin would otherwise be a valid behaviour (see this
thread's main issue).
[^1] Well, it should. In practice, the code in HEAD doesn't, but this
patchset fixes that disagreement.
[^2] If the page changed, i.e. the LSN changed, GIST accepts that it
can't use the collected LP_DEAD marks. We may be able to improve on
that (or not) by matching LP_DEAD offsets and TIDs with the on-page
TIDs, but that's far outside the scope of this patch; we'd first have
to build an understanding about why it's correct to assume vacuum
hasn't finished reaping old tuples and that other sessions haven't
inserted new ones with the same TIDs in the meantime.
Also, I think it is a good idea to add "Assert(!scan->xs_want_itup);" to gistkillitems.
Why would it be incorrect or invalid to kill items in an index-only scan?
If we hit the heap (due to ! VM_ALL_VISIBLE) and detected the heap
tuple was dead, why couldn't we mark it as dead in the index? IOS
assumes a high rate of all-visible pages, but it's hardly unheard of
to access pages with dead tuples.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
Sorry, I should have expressed my thoughts in more detail, as they don't
matter as much as the time you took to answer.
I don't quite read it as covering IOS. To me, the comment is more
along the lines of (extensively extended):
My idea was just to add a few more details about the locking rule, such as:
* safe to apply LP_DEAD hints to the page later. This allows us to drop
* the pin for MVCC scans (except in cases of index-only scans due to XXX),
which allows vacuum to avoid blocking.
I think this "MVCC scan" even means non-IOS scan
Maybe, but I think it’s better to clarify that, since IOS scans still use
the MVCC snapshot. For me, a non-MVCC scan is something like SnapshotSelf
or SnapshotDirty.
Why would it be incorrect or invalid to kill items in an index-only scan?
Oh, I was comparing the logic to that of btree and somehow made a logical
error in my conclusions. But at least I hope I got some useful thoughts out
of it - since we have a pin during gistkillitems in the case of IOS, we can
ignore the "if (BufferGetLSNAtomic(buffer) != so->curPageLSN)" check in
that case because vacuum is locked out.
It does not compensate for the performance penalty caused by the buffer pin
during IOS, but it is at least something.
I hope this time my conclusions are correct :)
Thanks,
Mikhail.
Hello, Matthias!
Updated patches attached.
Changes:
* cleanup test logic a little bit
* resolve issue with rescan in GIST (item->blkno == InvalidBlockNumber)
* move test to the main isolation suite
* add a test for SP-GiST
* update comment I mentioned before
* allow GiST to set LP_DEAD in cases where it is safe even if the LSN is updated
Also, it seems like the SP-GiST version is broken; it fails like this:
TRAP: failed Assert("BufferIsValid(so->pagePin)"), File:
"../src/backend/access/spgist/spgscan.c", Line: 1139, PID: 612214
FETCH(ExceptionalCondition+0xbe)[0x644a0b9dfdbc]
FETCH(spggettuple+0x289)[0x644a0b3743c6]
FETCH(index_getnext_tid+0x166)[0x644a0b3382f7]
FETCH(+0x3b392b)[0x644a0b56d92b]
FETCH(+0x3887df)[0x644a0b5427df]
FETCH(ExecScan+0x77)[0x644a0b542858]
FETCH(+0x3b3b9b)[0x644a0b56db9b]
FETCH(+0x376d8b)[0x644a0b530d8b]
FETCH(+0x379bd9)[0x644a0b533bd9]
FETCH(standard_ExecutorRun+0x19f)[0x644a0b531457]
FETCH(ExecutorRun+0x5a)[0x644a0b5312b5]
FETCH(+0x6276dc)[0x644a0b7e16dc]
FETCH(+0x628936)[0x644a0b7e2936]
FETCH(PortalRunFetch+0x1a0)[0x644a0b7e229c]
FETCH(PerformPortalFetch+0x13b)[0x644a0b49d7e5]
FETCH(standard_ProcessUtility+0x5f0)[0x644a0b7e3aab]
FETCH(ProcessUtility+0x140)[0x644a0b7e34b4]
FETCH(+0x627ceb)[0x644a0b7e1ceb]
FETCH(+0x627a28)[0x644a0b7e1a28]
FETCH(PortalRun+0x273)[0x644a0b7e12bb]
FETCH(+0x61fae1)[0x644a0b7d9ae1]
FETCH(PostgresMain+0x9eb)[0x644a0b7df170]
FETCH(+0x61b3e2)[0x644a0b7d53e2]
FETCH(postmaster_child_launch+0x137)[0x644a0b6e6e2d]
FETCH(+0x53384b)[0x644a0b6ed84b]
FETCH(+0x530f31)[0x644a0b6eaf31]
FETCH(PostmasterMain+0x161f)[0x644a0b6ea812]
FETCH(main+0x3a1)[0x644a0b5c29cf]
Best regards,
Mikhail.
Attachments:
v5-0002-RFC-Extend-buffer-pinning-for-GIST-IOS.patch (+166, -20)
v5-0001-isolation-tester-showing-broken-index-only-scans-.patch (+362, -1)
v5-0003-RFC-Extend-buffer-pinning-for-SP-GIST-IOS.patch (+100, -18)
Hello everyone, and Matthias!
I have fixed the SP-GiST-related crash and a few issues in the implementation.
Now it passes tests and (in my opinion) feels simpler.
I'll register this thread in the commitfest to honor the bureaucracy.
Best regards,
Mikhail.
Attachments:
v6-0003-This-should-fix-issues-with-incorrect-results-whe.patch (+99, -17)
v6-0002-Also-add-ability-to-set-LP_DEAD-bits-in-more-case.patch (+14, -6)
v6-0001-This-should-fix-issues-with-incorrect-results-whe.patch (+158, -14)
Oops, I missed one commit -- fixed (the logic related to LP_DEAD in GiST
is extracted to a separate commit).
Also, commitfest entry is here - https://commitfest.postgresql.org/52/5542/
Attachments:
v7-0004-This-should-fix-issues-with-incorrect-results-whe.patch (+99, -17)
v7-0002-This-should-fix-issues-with-incorrect-results-whe.patch (+158, -14)
v7-0003-Also-add-ability-to-set-LP_DEAD-bits-in-more-case.patch (+14, -6)
v7-0001-isolation-tester-showing-broken-index-only-scans-.patch (+362, -1)
Hello, everyone!
Just some commit messages + few cleanups.
Best regards,
Mikhail.
Attachments:
v8-0002-Fix-index-only-scan-race-condition-in-GiST-implem.patch (+158, -14)
v8-0003-Improve-buffer-handling-for-killed-items-in-GiST-.patch (+14, -6)
v8-0004-Fix-index-only-scan-race-condition-in-SP-GiST-imp.patch (+99, -17)
v8-0001-Tests-for-index-only-scan-race-condition-with-con.patch (+362, -1)
On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Just some commit messages + few cleanups.
I'm worried about this:
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.
I think that we should have some kind of upper bound on the number of
pins that can be acquired at any one time, in order to completely
avoid these problems. Solving that problem will probably require GiST
expertise that I don't have right now.
--
Peter Geoghegan
On 28/02/2025 03:53, Peter Geoghegan wrote:
On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Just some commit messages + few cleanups.
I'm worried about this:
+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.
I think that we should have some kind of upper bound on the number of
pins that can be acquired at any one time, in order to completely
avoid these problems. Solving that problem will probably require GiST
expertise that I don't have right now.
+1. With no limit, it seems pretty easy to hold thousands of buffer pins
with this.
The index can set IndexScanDesc->xs_recheck to indicate that the quals
must be rechecked. Perhaps we should have a similar flag to indicate
that the visibility must be rechecked.
Matthias's earlier patch
(/messages/by-id/CAEze2Wg1kbpo_Q1=9X68JRsgfkyPCk4T0QN+qKz10+FVzCAoGA@mail.gmail.com)
had a more complicated mechanism to track the pinned buffers. A later
patch got rid of that, which simplified things a lot. I wonder if we
need something like that, after all.
Here's a completely different line of attack: Instead of holding buffer
pins for longer, what if we checked the visibility map earlier? We could
check the visibility map already when we construct the
GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
IndexOnlyNext() that we have already done that. IndexOnlyNext() would
have three cases:
1. The index AM has not checked the visibility map. Check it in
IndexOnlyNext(), and fetch the tuple if it's not set. This is what it
always does today.
2. The index AM has checked the visibility map, and the VM bit was set.
IndexOnlyNext() can skip the VM check and use the tuple directly.
3. The index AM has checked the visibility map, and the VM bit was not
set. IndexOnlyNext() will fetch the tuple to check its visibility.
--
Heikki Linnakangas
Neon (https://neon.tech)