Incorrect result of bitmap heap scan.

Started by Konstantin Knizhnik, over 1 year ago, 46 messages
#1 Konstantin Knizhnik
k.knizhnik@postgrespro.ru

Hi hackers,

The attached script reproduces a problem with incorrect results of `select
count(*)`: it returns a larger number of records than are actually present
in the table.
It is not always reproduced, so you may need to repeat it multiple times;
on my system it failed 3 times out of 10.

The problem occurs on pg16/17/18 (I have not checked other versions).

The test is called `test_ios` (index-only scan), but that name is not
accurate: an index-only scan is not actually used in this case.
And this is actually the first question about PG17/18: IOS is not used when
the number of records is less than 100k (for this particular table):

postgres=# create table t(pk integer primary key);
CREATE TABLE
postgres=# set enable_seqscan = off;
SET
postgres=# set enable_indexscan = off;
SET
postgres=# insert into t values (generate_series(1,1000));
INSERT 0 1000
postgres=# vacuum t;
VACUUM
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=43.02..43.03 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=25.52..40.52 rows=1000 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..25.27 rows=1000 width=0)
(3 rows)
postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                         QUERY PLAN
------------------------------------------------------------
 Aggregate  (cost=17.50..17.51 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..15.00 rows=1000 width=0)
         Disabled: true
(3 rows)

So, as you can see, Postgres prefers the disabled seqscan over IOS. This
is different from pg16, where disabling the bitmap scan makes the
optimizer choose an index-only scan:

postgres=# explain select count(*) from t;
                         QUERY PLAN
-----------------------------------------------------------
 Aggregate  (cost=41.88..41.88 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=0)
(2 rows)
postgres=# set enable_seqscan = off;
SET
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=75.54..75.55 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=33.67..69.17 rows=2550 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..33.03 rows=2550 width=0)
(3 rows)
postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=45.77..45.78 rows=1 width=8)
   ->  Index Only Scan using t_pkey on t  (cost=0.28..43.27 rows=1000 width=0)
(2 rows)

This is strange behavior of pg17, which for some reason rejects IOS
(though IOS is used if the number of records in the table is 100k or
more). But the main problem is that the chosen plan, Bitmap Heap Scan +
Bitmap Index Scan, may return an incorrect result.

Replacing `select count(*)` with `select count(pk)` eliminates the
problem, as does disabling autovacuum. It seems clear that the problem
involves the visibility map.

We have the following code in the heap bitmap scan:

    /*
     * We can skip fetching the heap page if we don't need any fields from the
     * heap, the bitmap entries don't need rechecking, and all tuples on the
     * page are visible to our transaction.
     */
    if (!(scan->rs_flags & SO_NEED_TUPLES) &&
        !tbmres->recheck &&
        VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
    {
        /* can't be lossy in the skip_fetch case */
        Assert(tbmres->ntuples >= 0);
        Assert(hscan->rs_empty_tuples_pending >= 0);

        hscan->rs_empty_tuples_pending += tbmres->ntuples;

        return true;
    }

So if we do not need tuples (the `count(*)` case) and the page is marked
as all-visible in the VM, then we just count `tbmres->ntuples` elements
without extra checks.
I am not that familiar with the internals of the executor, but it is not
clear to me how we avoid a race condition between the VM update and the
heap bitmap scan.

Assume the bitmap index scan marks all TIDs present in the index. Some
entries in this bitmap can refer to old (invisible) row versions. Then
vacuum comes along, removes the dead entries, and marks the page as
all-visible. After that, we start the heap bitmap scan, see that the page
is all-visible, and count all marked entries on this page, including dead
ones (which are no longer present on the page).
Which lock or check should prevent such a scenario?

Attachments:

test_ios.py
#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: Incorrect result of bitmap heap scan.

On Mon, 2 Dec 2024 at 15:25, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Hi hackers,

Attached script reproduces the problem with incorrect results of `select count(*)` (it returns larger number of records than really available in the table).
It is not always reproduced, so you may need to repeat it multiple times - at my system it failed 3 times from 10.

The problem takes place with pg16/17/18 (other versions I have not checked).

I suspect all back branches are affected. As I partially also mentioned offline:

The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.

Concurrency timeline:

Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.
Session 2. VACUUM runs on the heap, removes TIDs for P1 from the
index, deletes those TIDs from the heap pages, and finally sets heap
pages' VM bits to ALL_VISIBLE, including the now cleaned page P1
Session 1. Executes the bitmap heap scan that uses the bitmap without
checking tuples on P1 due to ALL_VISIBLE and a lack of output columns.
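This timeline can be condensed into a toy model (plain Python, all names
hypothetical; this is not PostgreSQL code) showing how the VM-only path
ends up overcounting:

```python
# Heap page P1 modeled as TID -> state; TID 3 is dead to *everyone*.
heap_p1 = {1: "live", 2: "live", 3: "dead"}
vm_all_visible = False

# Session 1: amgetbitmap() snapshots the index; the bitmap keeps all 3 TIDs.
bitmap = set(heap_p1)

# Session 2: VACUUM removes the dead TID from index and heap, then sets
# the VM bit -- but TID 3 is still sitting in session 1's bitmap.
del heap_p1[3]
vm_all_visible = True

# Session 1: the skip-fetch path counts every bitmap entry on an
# all-visible page without touching the heap.
count_skip_fetch = len(bitmap) if vm_all_visible else None

# Actually visiting the heap gives the right answer.
count_checked = sum(1 for tid in bitmap if tid in heap_p1)

print(count_skip_fetch, count_checked)  # 3 vs. 2: one phantom row
```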

I think this might be an oversight when the feature was originally
committed in 7c70996e (PG11): we don't know when the VM bit was set,
and the bitmap we're scanning may thus be out-of-date (and should've
had TIDs removed if it had been an index), so I propose disabling this
optimization for now, as attached. Patch v1 is against a recent HEAD,
PG17 applies to the 17 branch, and PG16- should work on all (or at
least, most) active backbranches older than PG17's.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS.
I don't think the optimization itself is completely impossible, and we
can probably re-enable an optimization like that if (or when) we find
a way to reliably keep track of when to stop using the optimization. I
don't think that's an easy fix, though, and definitely not for
backbranches.

The solution I could think to keep most of this optimization requires
the heap bitmap scan to notice that a concurrent process started
removing TIDs from the heap after amgetbitmap was called; i.e., a
"vacuum generation counter" incremented every time heap starts the
cleanup run. This is quite non-trivial, however, as we don't have much
in place regarding per-relation shared structures which we could put
such a value into, nor a good place to signal changes of the value to
bitmap heap-scanning backends.
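A minimal sketch of that generation-counter idea (hypothetical names;
nothing like this exists in PostgreSQL today): remember the counter when
the bitmap is built, and refuse the VM-only path if cleanup ran since.

```python
class Relation:
    def __init__(self):
        # Bumped every time heap cleanup (vacuum) starts on this relation.
        self.vacuum_generation = 0

def build_bitmap(rel):
    # amgetbitmap() analogue: snapshot TIDs plus the current generation.
    return {"tids": {1, 2, 3}, "generation": rel.vacuum_generation}

def may_skip_heap_fetch(rel, bitmap, vm_all_visible):
    # Only trust the VM bit if no vacuum started since the bitmap was built.
    return vm_all_visible and bitmap["generation"] == rel.vacuum_generation

rel = Relation()
bm = build_bitmap(rel)
rel.vacuum_generation += 1  # a concurrent cleanup begins
print(may_skip_heap_fetch(rel, bm, vm_all_visible=True))  # False
```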

Attachments:

PG17-Disable-BitmapScan-s-skip_fetch-optimization.patch.txt (+8 -88)
v1-0001-Remove-BitmapScan-s-skip_fetch-optimization.patch (+4 -99)
PG16-_Disable-BitmapScan-s-skip_fetch-optimization.patch.txt (+4 -61)
#3 Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#2)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.

This theory seems very believable.

We hold onto a leaf page buffer pin for index-only scans as an
interlock against concurrent TID recycling. If we assume for the sake
of argument that the optimization from commit 7c70996e is correct,
then why do we even bother with holding onto the pin during index-only
scans?

In theory we should either do the "buffer pin interlock against TID
recycling" thing everywhere, or nowhere -- how could bitmap scans
possibly be different here?

I think this might be an oversight when the feature was originally
committed in 7c70996e (PG11): we don't know when the VM bit was set,
and the bitmap we're scanning may thus be out-of-date (and should've
had TIDs removed if it had been an index), so I propose disabling this
optimization for now, as attached.

I have a hard time imagining any alternative fix that is suitable for
backpatch. Can we save the optimization on the master branch?

Clearly it would be wildly impractical to do the "buffer pin interlock
against TID recycling" thing in the context of bitmap scans. The only
thing that I can think of that might work is a scheme that establishes
a "safe LSN" for a given MVCC snapshot. If the VM page's LSN is later
than the "safe LSN", it's not okay to trust its all-visible bits. At
least not in the context of bitmap index scans that use the
optimization from 7c70996e.
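A toy model of that "safe LSN" check (hypothetical; LSNs modeled as plain
integers): a VM page stamped with an LSN newer than the snapshot's safe
LSN may reflect vacuum activity that postdates the bitmap, so its
all-visible bits cannot be trusted.

```python
def vm_bit_trustworthy(vm_page_lsn, safe_lsn):
    # The VM page has not changed since the safe LSN was established,
    # so any all-visible bit on it predates our bitmap.
    return vm_page_lsn <= safe_lsn

safe_lsn = 1000  # established when the MVCC snapshot was taken

print(vm_bit_trustworthy(900, safe_lsn))   # VM unchanged: bit is usable
print(vm_bit_trustworthy(1500, safe_lsn))  # VM changed: do real heap checks
```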

--
Peter Geoghegan

#4 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#2)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:

Concurrency timeline:

Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.

IIUC, an unstated important aspect here is that these dead tuples are *not*
visible to S1. Otherwise the VACUUM in the next step could not remove the dead
tuples.

Session 2. VACUUM runs on the heap, removes TIDs for P1 from the
index, deletes those TIDs from the heap pages, and finally sets heap
pages' VM bits to ALL_VISIBLE, including the now cleaned page P1
Session 1. Executes the bitmap heap scan that uses the bitmap without
checking tuples on P1 due to ALL_VISIBLE and a lack of output columns.

Ugh :/

Pretty depressing that we've only found this now, ~6 years after it's been
released. We're lacking some tooling to find this stuff in a more automated
fashion.

PS.
I don't think the optimization itself is completely impossible, and we
can probably re-enable an optimization like that if (or when) we find
a way to reliably keep track of when to stop using the optimization. I
don't think that's an easy fix, though, and definitely not for
backbranches.

One way to make the optimization safe could be to move it into the indexam. If
an indexam would check the VM bit while blocking removal of the index entries,
it should make it safe. Of course that has the disadvantage of needing more VM
lookups than before, because it would not yet have been deduplicated...

Another issue with bitmap index scans is that it currently can't use
killtuples. I've seen several production outages where performance would
degrade horribly over time whenever estimates lead to important queries to
switch to bitmap scans, because lots of dead tuples would get accessed over
and over.

It seems pretty much impossible to fix that with the current interaction
between nodeBitmap* and indexam. I wonder if we could, at least in some cases,
and likely with some heuristics / limits, address this by performing some
visibility checks during the bitmap build. I'm bringing it up here because I
suspect that some of the necessary changes would overlap with what's needed to
repair bitmap index-only scans.

The solution I could think to keep most of this optimization requires
the heap bitmap scan to notice that a concurrent process started
removing TIDs from the heap after amgetbitmap was called; i.e.. a
"vacuum generation counter" incremented every time heap starts the
cleanup run.

I'm not sure this is a good path. We can already clean up pages during index
accesses and we are working on being able to set all-visible during "hot
pruning"/page-level-vacuuming. That'd probably actually be safe (because we
couldn't remove dead items without a real vacuum), but it's starting to get
pretty complicated...

This is quite non-trivial, however, as we don't have much in place regarding
per-relation shared structures which we could put such a value into, nor a
good place to signal changes of the value to bitmap heap-scanning backends.

It doesn't have to be driven off table state, it could be state in
indexes. Most (all?) already have a metapage.

We could also add that state to pg_class? We already update pg_class after
most vacuums, so I don't think that'd be an issue.

Thomas had a WIP patch to add a shared-memory table of all open
relations. Which we need for quite a few features by now (e.g. more efficient
buffer mapping table, avoiding the relation size lookup during planning /
execution, more efficient truncation of relations, ...).

From 07739e5af83664b67ea02d3db7a461a106d74040 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:59:35 +0100
Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization

The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.

The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.

Unfortunately I don't see a better path either.

I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...

Greetings,

Andres Freund

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#3)
Re: Incorrect result of bitmap heap scan.

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.

This theory seems very believable.

I'm not convinced. I think there are two assumptions underlying
bitmap scan:

1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see. So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).

2. If the page's all-visible bit is set, there has been no recent
change in its contents, so we don't have to look at the page.
"Recent" is a bit squishily defined, but again it should surely
cover outdating or removal of a tuple that an open transaction
could see.

If this is not working, I am suspicious that somebody made page
freezing too aggressive somewhere along the line.

Whether that's true or not, it seems like it'd be worth bisecting
to see if we can finger a commit where the behavior changed (and
the same goes for the question of why-isnt-it-an-IOS-scan). However,
the reproducer seems to have quite a low failure probability for me,
which makes it hard to do bisection testing with much confidence.
Can we do anything to make the test more reliable? If I'm right
to suspect autovacuum, maybe triggering lots of manual vacuums
would improve the odds?

regards, tom lane

#6 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#3)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 11:07:38 -0500, Peter Geoghegan wrote:

Clearly it would be wildly impractical to do the "buffer pin interlock
against TID recycling" thing in the context of bitmap scans.

That's certainly true if we do the VM check at the time of the bitmap heap
scan.

What if we did it whenever we first enter a block into the TID bitmap? There's
sufficient padding space in PagetableEntry to store such state without
increasing memory usage.

That'd probably need some API evolution, because we'd only know after entering
into the tidbitmap that we need to check the VM, we'd need to index pins
during the VM checking and then update PagetableEntry with the result of the
VM probe.

I think, but am not certain, that it'd be sufficient to do the VM check the
first time an index scan encounters a block. If a concurrent vacuum would mark
the heap page all-visible after we encountered it first, we'd do "real"
visibility checks, because the first VM lookup won't have it as all
visible. And in the opposite situation, where we find a page all-visible in
the VM, which then subsequently gets marked not-all-visible, normal snapshot
semantics would make it safe to still consider the contents all-visible,
because update/delete can't be visible to "us".
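A toy model of that first-encounter VM check (hypothetical names, not the
actual tidbitmap API): the VM is probed once when a block first enters the
bitmap, and the cached answer is what the heap scan later consults, so a
vacuum flipping the bit afterwards cannot retroactively justify skipping
the fetch.

```python
def tbm_add_block(tbm, blockno, vm):
    if blockno not in tbm:
        # First encounter: probe the VM once -- while, in the real system,
        # index pins would still block TID recycling -- and cache the result
        # in the per-page entry (PagetableEntry analogue).
        tbm[blockno] = {"tids": set(), "all_visible": vm.get(blockno, False)}
    return tbm[blockno]

tbm, vm = {}, {10: False}
entry = tbm_add_block(tbm, 10, vm)
vm[10] = True                   # concurrent vacuum sets the bit later...
print(tbm[10]["all_visible"])   # ...but we still do real visibility checks
```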

Greetings,

Andres Freund

#7 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 11:43:42 -0500, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

This theory seems very believable.

I'm not convinced. I think there are two assumptions underlying
bitmap scan:

1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see. So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed. Then, during the
bitmap heap scan, we do a VM lookup and see the page being all-visible and
thus assume there's a visible tuple pointed to by the tid.

No snapshot semantics protect against this, as the tuple is *not* visible to
anyone.

Greetings,

Andres Freund

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#4)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 11:32 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:

Concurrency timeline:

Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.

IIUC, an unstated important aspect here is that these dead tuples are *not*
visible to S1. Otherwise the VACUUM in the next step could not remove the dead
tuples.

I would state the same thing slightly differently, FWIW: I would say
that the assumption that has been violated is that a TID is a stable
identifier for a given index tuple/logical row/row version (stable for
the duration of the scan).

This emphasis/definition seems slightly more useful, because it makes
it clear why this is hard to hit: you have to be fairly unlucky for a
dead-to-everyone TID to be returned by your scan (no LP_DEAD bit can
be set) and set in the scan's bitmap, only to later be concurrently
set LP_UNUSED in the heap -- all without VACUUM randomly being
prevented from setting the same page all-visible due to some unrelated
not-all-visible heap item making it unsafe.

Ugh :/

Pretty depressing that we've only found this now, ~6 years after it's been
released. We're lacking some tooling to find this stuff in a more automated
fashion.

FWIW I have suspicions about similar problems with index-only scans
that run in hot standby, and about all GiST index-only scans:

/messages/by-id/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com

In general there seems to be a lack of rigor in this area. I'm hoping
that Tomas Vondra's work can tighten things up here in passing.

It seems pretty much impossible to fix that with the current interaction
between nodeBitmap* and indexam. I wonder if we could, at least in some cases,
and likely with some heuristics / limits, address this by performing some
visibility checks during the bitmap build. I'm bringing it up here because I
suspect that some of the necessary changes would overlap with what's needed to
repair bitmap index-only scans.

This seems like it plays into some of the stuff I've discussed with
Tomas, about caching visibility information in local state, as a means
to avoiding holding onto pins in the index AM. For things like
mark/restore support.

This is quite non-trivial, however, as we don't have much in place regarding
per-relation shared structures which we could put such a value into, nor a
good place to signal changes of the value to bitmap heap-scanning backends.

It doesn't have to be driven of table state, it could be state in
indexes. Most (all?) already have a metapage.

FWIW GiST doesn't have a metapage (a historical oversight).

Unfortunately I don't see a better path either.

I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...

Maybe a good use for injection points?

--
Peter Geoghegan

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: Incorrect result of bitmap heap scan.

Andres Freund <andres@anarazel.de> writes:

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.

regards, tom lane

#10 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#9)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.

Freezing a page, and setting a page all-visible are orthogonal.
Nothing has changed about when or how we set pages all-visible in a
long time -- VACUUM has always done that to the maximum extent that
its OldestXmin cutoff will allow. (The changes to freezing made
freezing work a little bit more like that, the general idea being to
batch-up work.)

--
Peter Geoghegan

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#10)
Re: Incorrect result of bitmap heap scan.

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.

Freezing a page, and setting a page all-visible are orthogonal.

Sorry, sloppy wording on my part.

regards, tom lane

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#7)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 11:52 AM Andres Freund <andres@anarazel.de> wrote:

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though.

Right, exactly.

FWIW, this same issue is why it is safe for nbtree to drop its pin
early during plain index scans, but not during index-only scans -- see
_bt_drop_lock_and_maybe_pin, and the nbtree/README section on making
concurrent TID recycling safe. Weirdly, nbtree is specifically aware
that it needs to *not* drop its pin in the context of index-only scans
(to make sure that VACUUM cannot do unsafe concurrent TID recycling)
-- even though an equivalent index scan would be able to drop its pin
like this.

The underlying reason why nbtree can discriminate like this is that it
"knows" that plain index scans will always visit the heap proper. If a
TID points to an LP_UNUSED item, then it is considered dead to the
scan (even though in general the heap page itself might be marked
all-visible). If some completely unrelated, newly inserted heap tuple
is found instead, then it cannot be visible to the plain index scan's
MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
dropped like this).

--
Peter Geoghegan

#13 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 12:02:39 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.

How? This basically would mean we could never set all-visible if there is
*any* concurrent scan on the current relation, because any concurrent scan
could have an outdated view of all-visible. Afaict this isn't an issue of
"too-aggressive setting of the all-visible bit", it's an issue of setting it
at all.

Greetings,

Andres Freund

#14 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#11)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Freezing a page, and setting a page all-visible are orthogonal.

Sorry, sloppy wording on my part.

Freezing doesn't affect the contents of the visibility map in any way
that seems relevant. The executor only cares about the all-visible bit
(and never the all-frozen bit), and the rules around when and how
VACUUM sets the all-visible bit (and how everybody else unsets the
all-visible bit) haven't changed in forever. So I just can't see it.

I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But the race condition that I suspect is at play
here is very narrow.

It's pretty unlikely that there'll be a dead-to-all TID returned to a
scan (not just dead to our MVCC snapshot, dead to everybody's) that is
subsequently concurrently removed from the index, and then set
LP_UNUSED in the heap. It's probably impossible if you don't have a
small table -- VACUUM just isn't going to be fast enough to get to the
leaf page after the bitmap index scan, but still be able to get to the
heap before its corresponding bitmap heap scan (that uses the VM as an
optimization) can do the relevant visibility checks (while it could
happen with a large table and a slow bitmap scan, the chances of the
VACUUM being precisely aligned with the bitmap scan, in just the wrong
way, seem remote in the extreme). Finally, none of this will happen if
some other factor hinders VACUUM from setting the relevant heap page
all-visible.

AFAICT this is only a problem because of the involvement of the VM,
specifically -- an MVCC snapshot *is* generally sufficient to make
bitmap index scans safe from the dangers of concurrent TID recycling,
as explained in "62.4. Index Locking Considerations". That only ceases
to be true when the visibility map becomes involved (the VM lacks the
granular visibility information required to make all this safe). This
is essentially the same VM race issue that nbtree's
_bt_drop_lock_and_maybe_pin protects against during conventional
index-only scans.

--
Peter Geoghegan

#15 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 11:31:48 -0500, Andres Freund wrote:

I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...

Attached is an isolationtest that reliably shows wrong query results.

Greetings,

Andres Freund

Attachments:

v1-0001-isolationtester-showing-broken-index-only-bitmap-.patch (+114 -1)
#16 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 11:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Whether that's true or not, it seems like it'd be worth bisecting
to see if we can finger a commit where the behavior changed (and
the same goes for the question of why-isnt-it-an-IOS-scan). However,
the reproducer seems to have quite a low failure probability for me,
which makes it hard to do bisection testing with much confidence.
Can we do anything to make the test more reliable? If I'm right
to suspect autovacuum, maybe triggering lots of manual vacuums
would improve the odds?

I agree that autovacuum (actually, VACUUM) is important here.

I find that the test becomes much more reliable if I create the test
table "with (autovacuum_analyze_scale_factor=0.99,
vacuum_truncate=off)". More importantly, rather than relying on
autovacuum, I just run VACUUM manually from psql. I find it convenient
to use "\watch 0.01" to run VACUUM repeatedly.

--
Peter Geoghegan

#17 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#14)
Re: Incorrect result of bitmap heap scan.

Hi,

On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:

I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But I the race condition that I suspect is at play
here is very narrow.

FWIW, the test I just posted shows the issue down to 11 (although for 11 one
has to remove the "(TRUNCATE false)"). 10 returns correct results.

I don't think the race is particularly narrow. Having a vacuum complete
between the start of the bitmap index scan and the end of the bitmap heap
scan leaves a lot of time with an expensive query.

I suspect one contributor to this avoiding attention till now was that the
optimization is fairly narrow:

    /*
     * We can potentially skip fetching heap pages if we do not need
     * any columns of the table, either for checking non-indexable
     * quals or for returning data.  This test is a bit simplistic, as
     * it checks the stronger condition that there's no qual or return
     * tlist at all.  But in most cases it's probably not worth working
     * harder than that.
     */
    need_tuples = (node->ss.ps.plan->qual != NIL ||
                   node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row
disables the optimization.
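
To illustrate (these queries are my own, not from the thread): only a scan
node with no qual and an empty targetlist, such as the count(*) plan from
the original report, can take the skip-fetch path.

```sql
-- Bitmap Heap Scan with width=0: no qual, empty tlist, so heap fetches
-- can be skipped for all-visible pages.
EXPLAIN SELECT count(*) FROM t;

-- Returning a column (or adding a non-indexable qual) makes need_tuples
-- true, so every matching heap tuple is actually fetched.
EXPLAIN SELECT pk FROM t WHERE pk > 10;
```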

Due to not being able to return any content for those rows, it's also somewhat
hard to actually notice that the results are wrong...

It's pretty unlikely that there'll be a dead-to-all TID returned to a
scan (not just dead to our MVCC snapshot, dead to everybody's) that is
subsequently concurrently removed from the index, and then set
LP_UNUSED in the heap. It's probably impossible if you don't have a
small table -- VACUUM just isn't going to be fast enough to get to the
leaf page after the bitmap index scan, but still be able to get to the
heap before its corresponding bitmap heap scan (that uses the VM as an
optimization) can do the relevant visibility checks (while it could
happen with a large table and a slow bitmap scan, the chances of the
VACUUM being precisely aligned with the bitmap scan, in just the wrong
way, seem remote in the extreme).

A cursor, waiting for IO, waiting for other parts of the query, ... can all
make this window almost arbitrarily large.

Greetings,

Andres Freund

#18Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#15)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

Attached is an isolationtest that reliably shows wrong query results.

Nice approach with the cursor!

I took what you wrote, and repurposed it to prove my old theory about
GiST index-only scans being broken due to the lack of an appropriate
interlock against concurrent TID recycling. See the attached patch.

--
Peter Geoghegan

Attachments:

v1-0001-isolationtester-showing-broken-index-only-scans-w.patch (application/octet-stream, +108 −1)
#19Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#18)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan <pg@bowt.ie> wrote:

I took what you wrote, and repurposed it to prove my old theory about
GiST index-only scans being broken due to the lack of an appropriate
interlock against concurrent TID recycling. See the attached patch.

BTW, if you change the test case to use the default B-Tree index AM
(by removing "USING GIST"), you'll see that VACUUM blocks on acquiring
a cleanup lock (and so the test just times out). The problem is that
GiST VACUUM just doesn't care about cleanup locks/TID recycling safety
-- though clearly it should.
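
A quick way to see the contrast (index and table names here are assumed for
illustration, not taken from the posted spec file): build the test's index
both ways and watch whether VACUUM blocks.

```sql
-- GiST: VACUUM recycles TIDs without the cleanup-lock interlock, so a
-- concurrent index-only scan can return rows for recycled TIDs.
CREATE INDEX ON tab USING gist (col);

-- B-Tree: VACUUM blocks on a cleanup lock held off by the scan's pin,
-- so the same isolation test simply times out instead of returning
-- wrong results.
CREATE INDEX ON tab (col);
```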

--
Peter Geoghegan

#20Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#17)
Re: Incorrect result of bitmap heap scan.

On Mon, Dec 2, 2024 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:

I suspect one contributor to this avoiding attention till now was that the
optimization is fairly narrow:

    /*
     * We can potentially skip fetching heap pages if we do not need
     * any columns of the table, either for checking non-indexable
     * quals or for returning data.  This test is a bit simplistic, as
     * it checks the stronger condition that there's no qual or return
     * tlist at all.  But in most cases it's probably not worth working
     * harder than that.
     */
    need_tuples = (node->ss.ps.plan->qual != NIL ||
                   node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row
disables the optimization.

Good point. I agree that that factor is likely to have masked the
problem over the past 6 years.

--
Peter Geoghegan
