Incorrect result of bitmap heap scan.
Hi hackers,
The attached script reproduces a problem with incorrect results from `select
count(*)`: it returns a larger number of records than are actually present in
the table.
It does not always reproduce, so you may need to repeat it multiple times;
on my system it failed 3 times out of 10.
The problem occurs with pg16/17/18 (I have not checked other versions).
The test is called `test_ios` (index-only scan), but that name is not accurate:
an index-only scan is not actually used in this case.
And this is actually the first question about PG17/18: IOS is not used when
the number of records is less than 100k (for this particular table):
postgres=# create table t(pk integer primary key);
CREATE TABLE
postgres=# set enable_seqscan = off;
SET
postgres=# set enable_indexscan = off;
SET
postgres=# insert into t values (generate_series(1,1000));
INSERT 0 1000
postgres=# vacuum t;
VACUUM
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=43.02..43.03 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=25.52..40.52 rows=1000 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..25.27 rows=1000 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                         QUERY PLAN
------------------------------------------------------------
 Aggregate  (cost=17.50..17.51 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..15.00 rows=1000 width=0)
         Disabled: true
(3 rows)
So, as you can see, Postgres prefers to use the disabled seqscan rather than
IOS. This is different from pg16, where disabling bitmap scan makes the
optimizer choose an index-only scan:
postgres=# explain select count(*) from t;
                        QUERY PLAN
-----------------------------------------------------------
 Aggregate  (cost=41.88..41.88 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=0)
(2 rows)

postgres=# set enable_seqscan = off;
SET
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=75.54..75.55 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=33.67..69.17 rows=2550 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..33.03 rows=2550 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                                 QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=45.77..45.78 rows=1 width=8)
   ->  Index Only Scan using t_pkey on t  (cost=0.28..43.27 rows=1000 width=0)
(2 rows)
This is strange behavior of pg17, which for some reason rejects IOS (although
IOS is used if the number of records in the table is 100k or more). But the
main problem is that the chosen plan, Bitmap Heap Scan + Bitmap Index Scan,
may return an incorrect result.
Replacing `select count(*)` with `select count(pk)` eliminates the problem,
as does disabling autovacuum. It seems clear that the problem is related to
the visibility map.
We have the following code in heap bitmap scan:

    /*
     * We can skip fetching the heap page if we don't need any fields from the
     * heap, the bitmap entries don't need rechecking, and all tuples on the
     * page are visible to our transaction.
     */
    if (!(scan->rs_flags & SO_NEED_TUPLES) &&
        !tbmres->recheck &&
        VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
    {
        /* can't be lossy in the skip_fetch case */
        Assert(tbmres->ntuples >= 0);
        Assert(hscan->rs_empty_tuples_pending >= 0);

        hscan->rs_empty_tuples_pending += tbmres->ntuples;

        return true;
    }
So if we do not need tuples (the `count(*)` case) and the page is marked as
all-visible in the VM, then we just count `tbmres->ntuples` elements without
any extra checks.
I am not very familiar with the internals of the executor, but it is not clear
to me how we avoid a race condition between the VM update and the heap bitmap scan.
Assume that the bitmap index scan marks all TIDs available in the index. Some
elements in this bitmap can refer to old (invisible) versions. Then vacuum
comes along, removes the dead elements, and marks the page as all-visible.
After that we start the heap bitmap scan, see that the page is all-visible,
and count all marked elements on this page, including dead ones (which are
no longer present on the page).
Which lock or check should prevent such a scenario?
Attachments:
On Mon, 2 Dec 2024 at 15:25, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
Hi hackers,
The attached script reproduces a problem with incorrect results from `select count(*)`: it returns a larger number of records than are actually present in the table.
It does not always reproduce, so you may need to repeat it multiple times; on my system it failed 3 times out of 10. The problem occurs with pg16/17/18 (I have not checked other versions).
I suspect all back branches are affected. As I partially also mentioned offline:
The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.
Concurrency timeline:
Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.
Session 2. VACUUM runs on the heap, removes TIDs for P1 from the
index, deletes those TIDs from the heap pages, and finally sets heap
pages' VM bits to ALL_VISIBLE, including the now cleaned page P1
Session 1. Executes the bitmap heap scan that uses the bitmap without
checking tuples on P1 due to ALL_VISIBLE and a lack of output columns.
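To make that timeline concrete, here is a rough two-session sketch at the SQL level (a full isolation test appears later in this thread; the table t(pk) and cursor name are made up for illustration, the cursor is only used to hold the scan open between the bitmap index scan and the rest of the bitmap heap scan, and the enable_* settings merely steer the planner towards a bitmap plan):

    -- Session 2: delete most rows; the dead versions stay in the index and
    -- heap until vacuum removes them
    delete from t where pk > 1;

    -- Session 1: build the bitmap while the dead TIDs are still in the index
    set enable_seqscan = off;
    set enable_indexscan = off;
    set enable_indexonlyscan = off;
    begin;
    declare c no scroll cursor for
        select row_number() over () from t where pk > 0;  -- bitmap heap scan, empty scan targetlist
    fetch 1 from c;   -- the bitmap index scan has run; the bitmap may contain dead TIDs

    -- Session 2: remove those TIDs from index and heap, then set the heap
    -- pages' all-visible bits in the VM
    vacuum t;

    -- Session 1: the remaining fetches trust the VM for ALL_VISIBLE pages and
    -- return the stale bitmap entries without visiting the heap
    fetch all from c;
    commit;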
I think this might be an oversight when the feature was originally
committed in 7c70996e (PG11): we don't know when the VM bit was set,
and the bitmap we're scanning may thus be out-of-date (and should've
had TIDs removed if it had been an index), so I propose disabling this
optimization for now, as attached. Patch v1 is against a recent HEAD,
PG17 applies to the 17 branch, and PG16- should work on all (or at
least, most) active backbranches older than PG17's.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
PS.
I don't think the optimization itself is completely impossible, and we
can probably re-enable an optimization like that if (or when) we find
a way to reliably keep track of when to stop using the optimization. I
don't think that's an easy fix, though, and definitely not for
backbranches.
The solution I could think to keep most of this optimization requires
the heap bitmap scan to notice that a concurrent process started
removing TIDs from the heap after amgetbitmap was called; i.e., a
"vacuum generation counter" incremented every time heap starts the
cleanup run. This is quite non-trivial, however, as we don't have much
in place regarding per-relation shared structures which we could put
such a value into, nor a good place to signal changes of the value to
bitmap heap-scanning backends.
Attachments:
PG17-Disable-BitmapScan-s-skip_fetch-optimization.patch.txt
From 07739e5af83664b67ea02d3db7a461a106d74040 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:59:35 +0100
Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 9 +++--
src/include/access/tableam.h | 3 +-
src/backend/access/heap/heapam.c | 13 -------
src/backend/access/heap/heapam_handler.c | 28 ---------------
src/backend/executor/nodeBitmapHeapscan.c | 42 ++---------------------
5 files changed, 8 insertions(+), 87 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 65999dd64e..42f28109e2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -92,11 +92,10 @@ typedef struct HeapScanDescData
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
/*
- * These fields are only used for bitmap scans for the "skip fetch"
- * optimization. Bitmap scans needing no fields from the heap may skip
- * fetching an all visible block, instead using the number of tuples per
- * block reported by the bitmap to determine how many NULL-filled tuples
- * to return.
+ * These fields were kept around to guarantee ABI stability, but are
+ * otherwise unused. They were only used for bitmap scans for the
+ * "skip fetch" optimization, which turned out to be unsound when the
+ * second phase of vacuum ran concurrently.
*/
Buffer rs_vmbuffer;
int rs_empty_tuples_pending;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..5d54b0a18b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -955,8 +955,7 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
{
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- if (need_tuple)
- flags |= SO_NEED_TUPLES;
+ flags |= SO_NEED_TUPLES;
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bbe64b1e53..ca2dbb0b75 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1188,19 +1188,6 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (BufferIsValid(scan->rs_vmbuffer))
- {
- ReleaseBuffer(scan->rs_vmbuffer);
- scan->rs_vmbuffer = InvalidBuffer;
- }
-
- /*
- * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous scan
- * on rescan.
- */
- scan->rs_empty_tuples_pending = 0;
-
/*
* The read stream is reset on rescan. This must be done before
* initscan(), as some state referred to by read_stream_reset() is reset
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..93980be98a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2131,24 +2131,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, the bitmap entries don't need rechecking, and all tuples on the
- * page are visible to our transaction.
- */
- if (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
-
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
-
- return true;
- }
-
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2261,16 +2243,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
- if (hscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- hscan->rs_empty_tuples_pending--;
- return true;
- }
-
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 6b48a6d835..82402c0ac0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -185,24 +185,11 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!scan)
{
- bool need_tuples = false;
-
- /*
- * We can potentially skip fetching heap pages if we do not need
- * any columns of the table, either for checking non-indexable
- * quals or for returning data. This test is a bit simplistic, as
- * it checks the stronger condition that there's no qual or return
- * tlist at all. But in most cases it's probably not worth working
- * harder than that.
- */
- need_tuples = (node->ss.ps.plan->qual != NIL ||
- node->ss.ps.plan->targetlist != NIL);
-
scan = table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot,
0,
NULL,
- need_tuples);
+ true);
node->ss.ss_currentScanDesc = scan;
}
@@ -459,7 +446,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -470,20 +456,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
node->prefetch_pages++;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -500,7 +473,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -526,15 +498,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
break;
}
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
--
2.45.2
v1-0001-Remove-BitmapScan-s-skip_fetch-optimization.patch
From 43144d7511c93ed153b4ab5b30bf59ea3af20fbd Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:38:14 +0100
Subject: [PATCH v1] Remove BitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 1 -
src/include/access/tableam.h | 12 +------
src/backend/access/heap/heapam.c | 18 ----------
src/backend/access/heap/heapam_handler.c | 28 ---------------
src/backend/executor/nodeBitmapHeapscan.c | 43 ++---------------------
5 files changed, 4 insertions(+), 98 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 96cf82f97b..6dd233dc17 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,7 +99,6 @@ typedef struct HeapScanDescData
* block reported by the bitmap to determine how many NULL-filled tuples
* to return.
*/
- Buffer rs_vmbuffer;
int rs_empty_tuples_pending;
/* these fields only used in page-at-a-time mode and for bitmap scans */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93c..283a19babe 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,13 +62,6 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
-
- /*
- * At the discretion of the table AM, bitmap table scans may be able to
- * skip fetching a block from the table if none of the table data is
- * needed. If table data may be needed, set SO_NEED_TUPLES.
- */
- SO_NEED_TUPLES = 1 << 10,
} ScanOptions;
/*
@@ -955,14 +948,11 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key, bool need_tuple)
+ int nkeys, struct ScanKeyData *key)
{
TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- if (need_tuple)
- flags |= SO_NEED_TUPLES;
-
result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key,
NULL, flags);
result->st.bitmap.rs_shared_iterator = NULL;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dc..9ea9dec863 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1053,8 +1053,6 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
- scan->rs_vmbuffer = InvalidBuffer;
- scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1170,19 +1168,6 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (BufferIsValid(scan->rs_vmbuffer))
- {
- ReleaseBuffer(scan->rs_vmbuffer);
- scan->rs_vmbuffer = InvalidBuffer;
- }
-
- /*
- * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous scan
- * on rescan.
- */
- scan->rs_empty_tuples_pending = 0;
-
/*
* The read stream is reset on rescan. This must be done before
* initscan(), as some state referred to by read_stream_reset() is reset
@@ -1210,9 +1195,6 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (BufferIsValid(scan->rs_vmbuffer))
- ReleaseBuffer(scan->rs_vmbuffer);
-
/*
* Must free the read stream before freeing the BufferAccessStrategy.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c..b432eb8140 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2158,24 +2158,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*blockno = tbmres->blockno;
*recheck = tbmres->recheck;
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, the bitmap entries don't need rechecking, and all tuples on the
- * page are visible to our transaction.
- */
- if (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
-
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
-
- return true;
- }
-
block = tbmres->blockno;
/*
@@ -2290,16 +2272,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
- if (hscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- hscan->rs_empty_tuples_pending--;
- return true;
- }
-
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 689bde16dd..9bf01bb0c3 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -176,24 +176,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!scan)
{
- bool need_tuples = false;
-
- /*
- * We can potentially skip fetching heap pages if we do not need
- * any columns of the table, either for checking non-indexable
- * quals or for returning data. This test is a bit simplistic, as
- * it checks the stronger condition that there's no qual or return
- * tlist at all. But in most cases it's probably not worth working
- * harder than that.
- */
- need_tuples = (node->ss.ps.plan->qual != NIL ||
- node->ss.ps.plan->targetlist != NIL);
-
scan = table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot,
0,
- NULL,
- need_tuples);
+ NULL);
node->ss.ss_currentScanDesc = scan;
}
@@ -453,7 +439,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -465,20 +450,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_pages++;
node->prefetch_blockno = tbmpre->blockno;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -495,7 +467,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -523,15 +494,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_blockno = tbmpre->blockno;
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
--
2.45.2
PG16-_Disable-BitmapScan-s-skip_fetch-optimization.patch.txt
From eed8c4b613b5c44113e55bc6917ef3564d4632f8 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 16:05:09 +0100
Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we still have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/backend/executor/nodeBitmapHeapscan.c | 64 ++---------------------
1 file changed, 4 insertions(+), 60 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 1cf0bbddd4..214c129472 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -186,8 +186,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -212,32 +210,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
else
node->lossy_pages++;
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -475,7 +448,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -486,25 +458,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
node->prefetch_pages++;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
- */
- skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -521,7 +475,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -547,15 +500,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
break;
}
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
@@ -749,8 +694,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
* stronger condition that there's no qual or return tlist at all. But in
* most cases it's probably not worth working harder than that.
*/
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
/*
* Miscellaneous initialization
--
2.45.2
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.
This theory seems very believable.
We hold onto a leaf page buffer pin for index-only scans as an
interlock against concurrent TID recycling. If we assume for the sake
of argument that the optimization from commit 7c70996e is correct,
then why do we even bother with holding onto the pin during index-only
scans?
In theory we should either do the "buffer pin interlock against TID
recycling" thing everywhere, or nowhere -- how could bitmap scans
possibly be different here?
I think this might be an oversight when the feature was originally
committed in 7c70996e (PG11): we don't know when the VM bit was set,
and the bitmap we're scanning may thus be out-of-date (and should've
had TIDs removed if it had been an index), so I propose disabling this
optimization for now, as attached.
I have a hard time imagining any alternative fix that is suitable for
backpatch. Can we save the optimization on the master branch?
Clearly it would be wildly impractical to do the "buffer pin interlock
against TID recycling" thing in the context of bitmap scans. The only
thing that I can think of that might work is a scheme that establishes
a "safe LSN" for a given MVCC snapshot. If the VM page's LSN is later
than the "safe LSN", it's not okay to trust its all-visible bits. At
least not in the context of bitmap index scans that use the
optimization from 7c70996e.
--
Peter Geoghegan
Hi,
On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:
Concurrency timeline:
Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.
IIUC, an unstated important aspect here is that these dead tuples are *not*
visible to S1. Otherwise the VACUUM in the next step could not remove the dead
tuples.
Session 2. VACUUM runs on the heap, removes TIDs for P1 from the
index, deletes those TIDs from the heap pages, and finally sets heap
pages' VM bits to ALL_VISIBLE, including the now cleaned page P1
Session 1. Executes the bitmap heap scan that uses the bitmap without
checking tuples on P1 due to ALL_VISIBLE and a lack of output columns.
Ugh :/
Pretty depressing that we've only found this now, ~6 years after it's been
released. We're lacking some tooling to find this stuff in a more automated
fashion.
PS.
I don't think the optimization itself is completely impossible, and we
can probably re-enable an optimization like that if (or when) we find
a way to reliably keep track of when to stop using the optimization. I
don't think that's an easy fix, though, and definitely not for
backbranches.
One way to make the optimization safe could be to move it into the indexam. If
an indexam would check the VM bit while blocking removal of the index entries,
it should make it safe. Of course that has the disadvantage of needing more VM
lookups than before, because it would not yet have been deduplicated...
Another issue with bitmap index scans is that it currently can't use
killtuples. I've seen several production outages where performance would
degrade horribly over time whenever estimates led important queries to
switch to bitmap scans, because lots of dead tuples would get accessed over
and over.
It seems pretty much impossible to fix that with the current interaction
between nodeBitmap* and indexam. I wonder if we could, at least in some cases,
and likely with some heuristics / limits, address this by performing some
visibility checks during the bitmap build. I'm bringing it up here because I
suspect that some of the necessary changes would overlap with what's needed to
repair bitmap index-only scans.
The solution I could think to keep most of this optimization requires
the heap bitmap scan to notice that a concurrent process started
removing TIDs from the heap after amgetbitmap was called; i.e., a
"vacuum generation counter" incremented every time heap starts the
cleanup run.
I'm not sure this is a good path. We can already clean up pages during index
accesses and we are working on being able to set all-visible during "hot
pruning"/page-level-vacuuming. That'd probably actually be safe (because we
couldn't remove dead items without a real vacuum), but it's starting to get
pretty complicated...
This is quite non-trivial, however, as we don't have much in place regarding
per-relation shared structures which we could put such a value into, nor a
good place to signal changes of the value to bitmap heap-scanning backends.
It doesn't have to be driven off table state, it could be state in
indexes. Most (all?) already have a metapage.
We could also add that state to pg_class? We already update pg_class after
most vacuums, so I don't think that'd be an issue.
Thomas had a WIP patch to add a shared-memory table of all open
relations. Which we need for quite a few features by now (e.g. more efficient
buffer mapping table, avoiding the relation size lookup during planning /
execution, more efficient truncation of relations, ...).
From 07739e5af83664b67ea02d3db7a461a106d74040 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:59:35 +0100
Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Unfortunately I don't see a better path either.
I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...
Greetings,
Andres Freund
Peter Geoghegan <pg@bowt.ie> writes:
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
The running theory is that bitmap executor nodes incorrectly assume
that the rows contained in the bitmap all are still present in the
index, and thus assume they're allowed to only check the visibility
map to see if the reference contained in the bitmap is visible.
However, this seems incorrect: Note that index AMs must hold at least
pins on the index pages that contain their results when those results
are returned by amgettuple() [0], and that amgetbitmap() doesn't do
that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
from the index (and later, heap) that are still present in the bitmap
used in the scan.
This theory seems very believable.
I'm not convinced. I think there are two assumptions underlying
bitmap scan:
1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see. So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).
2. If the page's all-visible bit is set, there has been no recent
change in its contents, so we don't have to look at the page.
"Recent" is a bit squishily defined, but again it should surely
cover outdating or removal of a tuple that an open transaction
could see.
If this is not working, I am suspicious that somebody made page
freezing too aggressive somewhere along the line.
Whether that's true or not, it seems like it'd be worth bisecting
to see if we can finger a commit where the behavior changed (and
the same goes for the question of why-isnt-it-an-IOS-scan). However,
the reproducer seems to have quite a low failure probability for me,
which makes it hard to do bisection testing with much confidence.
Can we do anything to make the test more reliable? If I'm right
to suspect autovacuum, maybe triggering lots of manual vacuums
would improve the odds?
regards, tom lane
Hi,
On 2024-12-02 11:07:38 -0500, Peter Geoghegan wrote:
Clearly it would be wildly impractical to do the "buffer pin interlock
against TID recycling" thing in the context of bitmap scans.
That's certainly true if we do the VM check at the time of the bitmap heap
scan.
What if we did it whenever we first enter a block into the TID bitmap? There's
sufficient padding space in PagetableEntry to store such state without
increasing memory usage.
That'd probably need some API evolution, because we'd only know after entering
into the tidbitmap that we need to check the VM, we'd need to hold index pins
during the VM checking and then update PagetableEntry with the result of the
VM probe.
I think, but am not certain, that it'd be sufficient to do the VM check the
first time an index scan encounters a block. If a concurrent vacuum would mark
the heap page all-visible after we encountered it first, we'd do "real"
visibility checks, because the first VM lookup won't have it as all
visible. And in the opposite situation, where we find a page all-visible in
the VM, which then subsequently gets marked not-all-visible, normal snapshot
semantics would make it safe to still consider the contents all-visible,
because update/delete can't be visible to "us".
Greetings,
Andres Freund
Hi,
On 2024-12-02 11:43:42 -0500, Tom Lane wrote:
Peter Geoghegan <pg@bowt.ie> writes:
This theory seems very believable.
I'm not convinced. I think there are two assumptions underlying
bitmap scan:
1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see. So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed. Then, during the
bitmap heap scan, we do a VM lookup and see the page being all-visible and
thus assume there's a visible tuple pointed to by the tid.
No snapshot semantics protect against this, as the tuple is *not* visible to
anyone.
Greetings,
Andres Freund
On Mon, Dec 2, 2024 at 11:32 AM Andres Freund <andres@anarazel.de> wrote:
On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:
Concurrency timeline:
Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.
IIUC, an unstated important aspect here is that these dead tuples are *not*
visible to S1. Otherwise the VACUUM in the next step could not remove the dead
tuples.
I would state the same thing slightly differently, FWIW: I would say
that the assumption that has been violated is that a TID is a stable
identifier for a given index tuple/logical row/row version (stable for
the duration of the scan).
This emphasis/definition seems slightly more useful, because it makes
it clear why this is hard to hit: you have to be fairly unlucky for a
dead-to-everyone TID to be returned by your scan (no LP_DEAD bit can
be set) and set in the scan's bitmap, only to later be concurrently
set LP_UNUSED in the heap -- all without VACUUM randomly being
prevented from setting the same page all-visible due to some unrelated
not-all-visible heap item making it unsafe.
Ugh :/
Pretty depressing that we've only found this now, ~6 years after it's been
released. We're lacking some tooling to find this stuff in a more automated
fashion.
FWIW I have suspicions about similar problems with index-only scans
that run in hot standby, and about all GiST index-only scans:
/messages/by-id/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com
In general there seems to be a lack of rigor in this area. I'm hoping
that Tomas Vondra's work can tighten things up here in passing.
It seems pretty much impossible to fix that with the current interaction
between nodeBitmap* and indexam. I wonder if we could, at least in some cases,
and likely with some heuristics / limits, address this by performing some
visibility checks during the bitmap build. I'm bringing it up here because I
suspect that some of the necessary changes would overlap with what's needed to
repair bitmap index-only scans.
This seems like it plays into some of the stuff I've discussed with
Tomas, about caching visibility information in local state, as a means
to avoiding holding onto pins in the index AM. For things like
mark/restore support.
This is quite non-trivial, however, as we don't have much in place regarding
per-relation shared structures which we could put such a value into, nor a
good place to signal changes of the value to bitmap heap-scanning backends.
It doesn't have to be driven off table state, it could be state in
indexes. Most (all?) already have a metapage.
FWIW GiST doesn't have a metapage (a historical oversight).
Unfortunately I don't see a better path either.
I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...
Maybe a good use for injection points?
--
Peter Geoghegan
Andres Freund <andres@anarazel.de> writes:
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.
Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.
regards, tom lane
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.
Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.
Freezing a page, and setting a page all-visible are orthogonal.
Nothing has changed about when or how we set pages all-visible in a
long time -- VACUUM has always done that to the maximum extent that
its OldestXmin cutoff will allow. (The changes to freezing made
freezing work a little bit more like that, the general idea being to
batch-up work.)
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.
Freezing a page, and setting a page all-visible are orthogonal.
Sorry, sloppy wording on my part.
regards, tom lane
On Mon, Dec 2, 2024 at 11:52 AM Andres Freund <andres@anarazel.de> wrote:
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though.
Right, exactly.
FWIW, this same issue is why it is safe for nbtree to drop its pin
early during plain index scans, but not during index-only scans -- see
_bt_drop_lock_and_maybe_pin, and the nbtree/README section on making
concurrent TID recycling safe. Weirdly, nbtree is specifically aware
that it needs to *not* drop its pin in the context of index-only scans
(to make sure that VACUUM cannot do unsafe concurrent TID recycling)
-- even though an equivalent index scan would be able to drop its pin
like this.
The underlying reason why nbtree can discriminate like this is that it
"knows" that plain index scans will always visit the heap proper. If a
TID points to an LP_UNUSED item, then it is considered dead to the
scan (even though in general the heap page itself might be marked
all-visible). If some completely unrelated, newly inserted heap tuple
is found instead, then it cannot be visible to the plain index scan's
MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
dropped like this).
--
Peter Geoghegan
Hi,
On 2024-12-02 12:02:39 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.
Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.
How? This basically would mean we could never set all-visible if there is
*any* concurrent scan on the current relation, because any concurrent scan
could have an outdated view of all-visible. Afaict this isn't an issue of
"too-aggressive setting of the all-visible bit", it's an issue of setting it
at all.
Greetings,
Andres Freund
On Mon, Dec 2, 2024 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Freezing a page, and setting a page all-visible are orthogonal.
Sorry, sloppy wording on my part.
Freezing doesn't affect the contents of the visibility map in any way
that seems relevant. The executor only cares about the all-visible bit
(and never the all-frozen bit), and the rules around when and how
VACUUM sets the all-visible bit (and how everybody else unsets the
all-visible bit) haven't changed in forever. So I just can't see it.
I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But I think the race condition that I suspect is at play
here is very narrow.
It's pretty unlikely that there'll be a dead-to-all TID returned to a
scan (not just dead to our MVCC snapshot, dead to everybody's) that is
subsequently concurrently removed from the index, and then set
LP_UNUSED in the heap. It's probably impossible if you don't have a
small table -- VACUUM just isn't going to be fast enough to get to the
leaf page after the bitmap index scan, but still be able to get to the
heap before its corresponding bitmap heap scan (that uses the VM as an
optimization) can do the relevant visibility checks (while it could
happen with a large table and a slow bitmap scan, the chances of the
VACUUM being precisely aligned with the bitmap scan, in just the wrong
way, seem remote in the extreme). Finally, none of this will happen if
some other factor hinders VACUUM from setting the relevant heap page
all-visible.
AFAICT this is only a problem because of the involvement of the VM,
specifically -- an MVCC snapshot *is* generally sufficient to make
bitmap index scans safe from the dangers of concurrent TID recycling,
as explained in "62.4. Index Locking Considerations". That only ceases
to be true when the visibility map becomes involved (the VM lacks the
granular visibility information required to make all this safe). This
is essentially the same VM race issue that nbtree's
_bt_drop_lock_and_maybe_pin protects against during conventional
index-only scans.
--
Peter Geoghegan
Hi,
On 2024-12-02 11:31:48 -0500, Andres Freund wrote:
I think it'd be good if we added a test that shows the failure mechanism so
that we don't re-introduce this in the future. Evidently this failure isn't
immediately obvious...
Attached is an isolationtest that reliably shows wrong query results.
Greetings,
Andres Freund
Attachments:
v1-0001-isolationtester-showing-broken-index-only-bitmap-.patch
From a666e6da7af9a0af31ae506de8c2c229b713f8a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v1] isolationtester showing broken index-only bitmap heap scan
---
.../expected/index-only-bitmapscan.out | 28 ++++++
src/test/isolation/isolation_schedule | 1 +
.../specs/index-only-bitmapscan.spec | 85 +++++++++++++++++++
3 files changed, 114 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec
diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 00000000000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+row_number
+----------
+ 1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-bitmapscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 00000000000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+ CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to ensure make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2.746.g06e570c0df.dirty
On Mon, Dec 2, 2024 at 11:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Whether that's true or not, it seems like it'd be worth bisecting
to see if we can finger a commit where the behavior changed (and
the same goes for the question of why-isnt-it-an-IOS-scan). However,
the reproducer seems to have quite a low failure probability for me,
which makes it hard to do bisection testing with much confidence.
Can we do anything to make the test more reliable? If I'm right
to suspect autovacuum, maybe triggering lots of manual vacuums
would improve the odds?
I agree that autovacuum (actually, VACUUM) is important here.
I find that the test becomes much more reliable if I create the test
table "with (autovacuum_analyze_scale_factor=0.99,
vacuum_truncate=off)". More importantly, rather than relying on
autovacuum, I just run VACUUM manually from psql. I find it convenient
to use "\watch 0.01" to run VACUUM repeatedly.
--
Peter Geoghegan
Hi,
On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:
I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But I think the race condition that I suspect is at play
here is very narrow.
FWIW, the test I just posted shows the issue down to 11 (although for 11 one
has to remove the (TRUNCATE false)). 10 returns correct results.
I don't think the race is particularly narrow. Having a vacuum complete
between the start of the bitmap index scan and the end of the bitmap heap scan
leaves a lot of time with an expensive query.
I suspect one contributor to this avoiding attention till now was that the
optimization is fairly narrow:
/*
* We can potentially skip fetching heap pages if we do not need
* any columns of the table, either for checking non-indexable
* quals or for returning data. This test is a bit simplistic, as
* it checks the stronger condition that there's no qual or return
* tlist at all. But in most cases it's probably not worth working
* harder than that.
*/
need_tuples = (node->ss.ps.plan->qual != NIL ||
node->ss.ps.plan->targetlist != NIL);
Even an entry in the targetlist that does not depend on the current row
disables the optimization.
Due to not being able to return any content for those rows, it's also somewhat
hard to actually notice that the results are wrong...
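That narrowness matches the report at the top of the thread, where `select count(pk)` did not reproduce the problem but `select count(*)` did. A hedged illustration, reusing the toy table t(pk) from the first message:

    -- the scan needs no columns at all, so the skip_fetch path can be taken
    select count(*) from t;

    -- pk has to be fetched from the heap, so the scan's targetlist is
    -- non-empty and the optimization (and with it the bug) is avoided
    select count(pk) from t;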
It's pretty unlikely that there'll be a dead-to-all TID returned to a
scan (not just dead to our MVCC snapshot, dead to everybody's) that is
subsequently concurrently removed from the index, and then set
LP_UNUSED in the heap. It's probably impossible if you don't have a
small table -- VACUUM just isn't going to be fast enough to get to the
leaf page after the bitmap index scan, but still be able to get to the
heap before its corresponding bitmap heap scan (that uses the VM as an
optimization) can do the relevant visibility checks (while it could
happen with a large table and a slow bitmap scan, the chances of the
VACUUM being precisely aligned with the bitmap scan, in just the wrong
way, seem remote in the extreme).
A cursor, waiting for IO, waiting for other parts of the query, ... can all
make this window almost arbitrarily large.
Greetings,
Andres Freund
On Mon, Dec 2, 2024 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
Attached is an isolationtest that reliably shows wrong query results.
Nice approach with the cursor!
I took what you wrote, and repurposed it to prove my old theory about
GiST index-only scans being broken due to the lack of an appropriate
interlock against concurrent TID recycling. See the attached patch.
--
Peter Geoghegan
Attachments:
v1-0001-isolationtester-showing-broken-index-only-scans-w.patch
From 3b189a59b2e59220ed192d8dde4080659f76fec9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 2 Dec 2024 15:50:04 -0500
Subject: [PATCH v1] isolationtester showing broken index-only scans with GiST
---
.../expected/index-only-gistscan.out | 28 +++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/index-only-gistscan.spec | 79 +++++++++++++++++++
3 files changed, 108 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-gistscan.out
create mode 100644 src/test/isolation/specs/index-only-gistscan.spec
diff --git a/src/test/isolation/expected/index-only-gistscan.out b/src/test/isolation/expected/index-only-gistscan.out
new file mode 100644
index 000000000..5c97d0f27
--- /dev/null
+++ b/src/test/isolation/expected/index-only-gistscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_bitmap WHERE a > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4..e3c669a29 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-gistscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-gistscan.spec b/src/test/isolation/specs/index-only-gistscan.spec
new file mode 100644
index 000000000..f42bcd80f
--- /dev/null
+++ b/src/test/isolation/specs/index-only-gistscan.spec
@@ -0,0 +1,79 @@
+# index-only-scan test showing wrong results with GiST
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ create extension btree_gist;
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_gist_a ON ios_bitmap USING GIST (a);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+ SET enable_bitmapscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_bitmap WHERE a > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
I took what you wrote, and repurposed it to prove my old theory about
GiST index-only scans being broken due to the lack of an appropriate
interlock against concurrent TID recycling. See the attached patch.
BTW, if you change the test case to use the default B-Tree index AM
(by removing "USING GIST"), you'll see that VACUUM blocks on acquiring
a cleanup lock (and so the test just times out). The problem is that
GiST VACUUM just doesn't care about cleanup locks/TID recycling safety
-- though clearly it should.
--
Peter Geoghegan
On Mon, Dec 2, 2024 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
I suspect one contributor to this avoiding attention till now was that the
optimization is fairly narrow:

/*
 * We can potentially skip fetching heap pages if we do not need
 * any columns of the table, either for checking non-indexable
 * quals or for returning data. This test is a bit simplistic, as
 * it checks the stronger condition that there's no qual or return
 * tlist at all. But in most cases it's probably not worth working
 * harder than that.
 */
need_tuples = (node->ss.ps.plan->qual != NIL ||
               node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row
disables the optimization.
Good point. I agree that that factor is likely to have masked the
problem over the past 6 years.
--
Peter Geoghegan
On Mon, Dec 2, 2024 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
The underlying reason why nbtree can discriminate like this is that it
"knows" that plain index scans will always visit the heap proper. If a
TID points to an LP_UNUSED item, then it is considered dead to the
scan (even though in general the heap page itself might be marked
all-visible). If some completely unrelated, newly inserted heap tuple
is found instead, then it cannot be visible to the plain index scan's
MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
dropped like this).
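For concreteness, that mechanism boils down to roughly the following
paraphrased sketch (not verbatim PostgreSQL source; PageGetItemId,
ItemIdIsNormal and HeapTupleSatisfiesVisibility are the real routines, the
wrapper around them is only illustrative):

static bool
tid_yields_visible_tuple(Relation rel, Buffer buffer, ItemPointer tid,
                         Snapshot snapshot)
{
    /* caller is assumed to hold a pin and share lock on the buffer */
    Page            page = BufferGetPage(buffer);
    OffsetNumber    offnum = ItemPointerGetOffsetNumber(tid);
    ItemId          lp = PageGetItemId(page, offnum);
    HeapTupleData   tuple;

    /* LP_UNUSED/LP_DEAD (e.g. recycled by VACUUM): dead to the scan */
    if (!ItemIdIsNormal(lp))
        return false;

    tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
    tuple.t_len = ItemIdGetLength(lp);
    tuple.t_tableOid = RelationGetRelid(rel);
    ItemPointerCopy(tid, &tuple.t_self);

    /*
     * A completely unrelated, newly inserted tuple that reused the slot
     * cannot be visible to the scan's (older) MVCC snapshot.
     */
    return HeapTupleSatisfiesVisibility(&tuple, snapshot, buffer);
}

(HOT/redirect handling is deliberately omitted here; the point is only that
the heap item is always inspected, so recycled TIDs fall out naturally.)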
If I add "SET enable_indexonlyscan = false" to the "setup" section of
the v1-0001-isolationtester-showing-broken-index-only-scans-w.patch
isolationtester test case I posted earlier today, the test will pass.
There is no reason to think that plain GiST index scans are broken.
The fact that GiST VACUUM won't acquire cleanup locks is a problem
*only* because GiST supports index-only scans (AFAIK no other thing
within GiST is broken).
My point is that index-only scans are generally distinct from plain
index scans from an interlocking point of view -- they're not
equivalent (not quite). And yet the "62.4. Index Locking
Considerations" docs nevertheless say nothing about index-only scans
in particular (only about bitmap scans). The docs do say that an MVCC
snapshot is protective, though -- which makes it sound like GiST can't
be doing anything wrong here (GiST only uses MVCC snapshots).
Actually, I don't think that GiST would be broken at all if we'd
simply never added support for index-only scans to GiST. I'm fairly
sure that index-only scans didn't exist when the bulk of this part of
the docs was originally written. ISTM that we ought to do something
about these obsolete docs, after we've fixed the bugs themselves.
--
Peter Geoghegan
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
I took what you wrote, and repurposed it to prove my old theory about
GiST index-only scans being broken due to the lack of an appropriate
interlock against concurrent TID recycling. See the attached patch.
I've moved discussion of this GiST bug over to the old 2021 thread
where I first raised concerns about the issue:
/messages/by-id/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com
The GiST bug is actually causally unrelated to the bitmap index scan
bug under discussion, despite the high-level similarity. Seems best to
keep discussion of GiST on its own thread, for that reason.
--
Peter Geoghegan
On Mon, 2 Dec 2024 at 18:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.
I don't think it's new page locking rules, but rather a feature that
forgot to apply page locking rules after bypassing MVCC's snapshot
rules. IOS is only allowed exactly when it complies with the index AM's
locking rule "only return a TID that's on a page that can't
concurrently be processed by vacuum" - why would this be different for
the bitmap equivalent?
By saying we're too aggressive with setting the all-visible bit you
seem to suggest we should add a rule to vacuum that, to remove LP_DEAD
items, we don't just have to wait for tuples to be dead to all
transactions, but also for all transactions that might've gotten those
all_dead TIDs from indexes to have committed or rolled back, so that
no references to those TIDs exist that might consider them "possibly
visible".
We could achieve that by adding a waitpoint to vacuum (between the
index scan and the second heap scan for LP cleanup) which waits for
all concurrent transactions accessing the table to commit (thus all
bitmaps would be dropped), similar to REINDEX CONCURRENTLY's wait
phase, but that would slow down vacuum's ability to clean up old data
significantly, and change overall vacuum behaviour in a fundamental
way. I'm quite opposed to such a change.
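For concreteness, roughly what such a waitpoint would amount to (placement
and the surrounding vacuum phase names are approximate; WaitForLockers() is
the existing lmgr facility that REINDEX CONCURRENTLY uses, its use inside
vacuum here is purely hypothetical):

/*
 * Hypothetical: between vacuum's index pass (lazy_vacuum_all_indexes) and
 * its second heap pass (lazy_vacuum_heap_rel), wait out every transaction
 * holding any lock on the table, so that no bitmap built from the old
 * index contents can still be in flight when LP_DEAD items become
 * LP_UNUSED.
 */
LOCKTAG     heaplocktag;

SET_LOCKTAG_RELATION(heaplocktag, MyDatabaseId, RelationGetRelid(rel));
WaitForLockers(heaplocktag, AccessExclusiveLock, true);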
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, 2 Dec 2024 at 17:31, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:
Concurrency timeline:
Session 1. amgetbitmap() gets snapshot of index contents, containing
references to dead tuples on heap P1.

IIUC, an unstated but important aspect here is that these dead tuples are *not*
visible to S1. Otherwise the VACUUM in the next step could not remove the dead
tuples.
Correct, this is about TIDs that are dead to all transactions, which
might already be LP_DEAD in the heap.
PS.
I don't think the optimization itself is completely impossible, and we
can probably re-enable an optimization like that if (or when) we find
a way to reliably keep track of when to stop using the optimization. I
don't think that's an easy fix, though, and definitely not for
backbranches.

One way to make the optimization safe could be to move it into the indexam. If
an indexam would check the VM bit while blocking removal of the index entries,
it should make it safe. Of course that has the disadvantage of needing more VM
lookups than before, because it would not yet have been deduplicated...
I'm -0.5 on adding visibility checking to index AM's contract. The
most basic contract that an index AM must implement is currently:
1. Store the index tuples provided by aminsert()
2.a With amgettuple, return at least all stored TIDs that might match
the scankey (optionally marked with recheck when the AM isn't sure
about the TID matching the scankeys, or when returning TIDs not
provided by aminsert()), or
2.b With amgetbitmap, return a bitmap containing at least the TIDs
that match the description of 2.a (i.e. allows an index to have an
optimized approach to adding outputs of 2.a into a bitmap)
3. Remove all the bulkdelete-provided tids from its internal structures
Note that visibility checking is absent. Any visibility or dead tuple
hints (through e.g. kill_prior_tuple, or calling
table_index_delete_tuples) are optimizations which are not required
for basic index AM operations, and indeed they are frequently not
implemented in index AMs. Adding a requirement for index AMs to do
visibility checks would IMO significantly change the current
API/layering contracts.
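For reference, the callbacks behind that contract look roughly like this in
src/include/access/amapi.h (exact parameter lists vary a bit across
branches); note that heap visibility appears nowhere in them:

/* 1. store an index entry for a given heap TID */
typedef bool (*aminsert_function) (Relation indexRelation,
                                   Datum *values, bool *isnull,
                                   ItemPointer heap_tid,
                                   Relation heapRelation,
                                   IndexUniqueCheck checkUnique,
                                   bool indexUnchanged,
                                   struct IndexInfo *indexInfo);

/* 2.a return matching TIDs one at a time (xs_heaptid / xs_recheck) */
typedef bool (*amgettuple_function) (IndexScanDesc scan,
                                     ScanDirection direction);

/* 2.b add all matching TIDs to a bitmap, returning how many were added */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
                                       TIDBitmap *tbm);

/* 3. remove the TIDs identified by the callback from the index */
typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
                                                         IndexBulkDeleteResult *stats,
                                                         IndexBulkDeleteCallback callback,
                                                         void *callback_state);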
Another issue with bitmap index scans is that they currently can't use
killtuples. I've seen several production outages where performance would
degrade horribly over time whenever estimates led important queries to
switch to bitmap scans, because lots of dead tuples would get accessed over
and over.

It seems pretty much impossible to fix that with the current interaction
between nodeBitmap* and indexam. I wonder if we could, at least in some cases,
and likely with some heuristics / limits, address this by performing some
visibility checks during the bitmap build.
It's an interesting approach worth exploring. I'm a bit concerned
about the IO patterns this would create, though, especially when this
relates to BitmapAnd nodes: This node would create VM check IOs on the
order of O(sum_matching_tuple_pages), instead of
O(intersect_matching_tuple_pages). Additionally, it's wasted IO if
we're planning to go to the heap anyway, so this VM precheck would
have to be conditional on the bitmap scan not wanting any table data
(e.g. row_number, count()).
I'm bringing it up here because I
suspect that some of the necessary changes would overlap with what's needed to
repair bitmap index-only scans.
I'd call this "bitmap-only scans" instead, as there might be multiple
indexes involved, but indeed, this also does sound like a viable
approach.
The solution I could think of to keep most of this optimization requires
the heap bitmap scan to notice that a concurrent process started
removing TIDs from the heap after amgetbitmap was called; i.e. a
"vacuum generation counter" incremented every time heap starts the
cleanup run.

I'm not sure this is a good path. We can already clean up pages during index
accesses and we are working on being able to set all-visible during "hot
pruning"/page-level-vacuuming. That'd probably actually be safe (because we
couldn't remove dead items without a real vacuum), but it's starting to get
pretty complicated...
I imagine this solution to happen in the executor/heapAM bitmapscan
code, not in the index AM's amgetbitmap. It'd note down the 'vacuum
generation counter' before extracting the index's bitmap, and then,
after every VM lookup during the bitmap heap scan, it validates that
the generation counter hasn't changed (and thus that we can use that
VM bit as authoritative for the visibility of the TIDs we got). I don't
think that this interaction specifically is very complicated, but it
would indeed add to the overall complexity of the system.
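Purely to illustrate where such a check could sit (the counter and its
accessor are hypothetical -- nothing like pg_vacuum_generation() exists in
the tree today):

uint64      bitmap_vacuum_gen;

/* before the bitmap is built, i.e. before calling amgetbitmap() */
bitmap_vacuum_gen = pg_vacuum_generation(rel);              /* hypothetical */

/* ... bitmap index scan(s) fill the TIDBitmap ... */

/* later, for each block the bitmap heap scan would like to skip: */
if (!need_tuples &&
    !tbmres->recheck &&
    VM_ALL_VISIBLE(rel, tbmres->blockno, &vmbuffer) &&
    pg_vacuum_generation(rel) == bitmap_vacuum_gen)          /* hypothetical */
{
    /* no vacuum removed TIDs since the bitmap was built: safe to skip */
}
else
{
    /* fall back to fetching the heap page and checking each TID */
}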
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hi,
I've rebased my earlier patch to fix some minor conflicts with the
work done on bitmaps in December last year. I've also included Andres'
cursor-based isolation test as 0002; which now passes.
This should take care of cfbot's misidentification of which patch to
test, and thus get CFBot to succeed again.
The patches for the back-branches didn't need updating, as those
branches have not diverged enough for those patches to have gotten
stale. They're still available in my initial mail over at [0].
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CAEze2Wg1Q4gWzm9RZ0yXydm23Mj3iScu8LA__Zz3JJEgpnoGPQ@mail.gmail.com
Attachments:
v3-0001-Remove-BitmapScan-s-skip_fetch-optimization.patch (application/octet-stream)
From f5e90fc3ac68b8606cc14bba42d72d062c25eab1 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:38:14 +0100
Subject: [PATCH v3 1/2] Remove BitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 1 -
src/include/access/tableam.h | 12 +------
src/backend/access/heap/heapam.c | 18 ----------
src/backend/access/heap/heapam_handler.c | 28 ---------------
src/backend/executor/nodeBitmapHeapscan.c | 43 ++---------------------
5 files changed, 4 insertions(+), 98 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7d06dad83fc..84a1f30aecb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,7 +99,6 @@ typedef struct HeapScanDescData
* block reported by the bitmap to determine how many NULL-filled tuples
* to return.
*/
- Buffer rs_vmbuffer;
int rs_empty_tuples_pending;
/* these fields only used in page-at-a-time mode and for bitmap scans */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..74c8befef6b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,13 +62,6 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
-
- /*
- * At the discretion of the table AM, bitmap table scans may be able to
- * skip fetching a block from the table if none of the table data is
- * needed. If table data may be needed, set SO_NEED_TUPLES.
- */
- SO_NEED_TUPLES = 1 << 10,
} ScanOptions;
/*
@@ -955,13 +948,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key, bool need_tuple)
+ int nkeys, struct ScanKeyData *key)
{
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- if (need_tuple)
- flags |= SO_NEED_TUPLES;
-
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key,
NULL, flags);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d64..905dd9d04a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1056,8 +1056,6 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
- scan->rs_vmbuffer = InvalidBuffer;
- scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1173,19 +1171,6 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (BufferIsValid(scan->rs_vmbuffer))
- {
- ReleaseBuffer(scan->rs_vmbuffer);
- scan->rs_vmbuffer = InvalidBuffer;
- }
-
- /*
- * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous scan
- * on rescan.
- */
- scan->rs_empty_tuples_pending = 0;
-
/*
* The read stream is reset on rescan. This must be done before
* initscan(), as some state referred to by read_stream_reset() is reset
@@ -1213,9 +1198,6 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (BufferIsValid(scan->rs_vmbuffer))
- ReleaseBuffer(scan->rs_vmbuffer);
-
/*
* Must free the read stream before freeing the BufferAccessStrategy.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..0ea4f029793 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2155,24 +2155,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*blockno = tbmres->blockno;
*recheck = tbmres->recheck;
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, the bitmap entries don't need rechecking, and all tuples on the
- * page are visible to our transaction.
- */
- if (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
-
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
-
- return true;
- }
-
block = tbmres->blockno;
/*
@@ -2287,16 +2269,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
- if (hscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- hscan->rs_empty_tuples_pending--;
- return true;
- }
-
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index be616683f98..e5d7fab4dbe 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -158,24 +158,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!scan)
{
- bool need_tuples = false;
-
- /*
- * We can potentially skip fetching heap pages if we do not need
- * any columns of the table, either for checking non-indexable
- * quals or for returning data. This test is a bit simplistic, as
- * it checks the stronger condition that there's no qual or return
- * tlist at all. But in most cases it's probably not worth working
- * harder than that.
- */
- need_tuples = (node->ss.ps.plan->qual != NIL ||
- node->ss.ps.plan->targetlist != NIL);
-
scan = table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot,
0,
- NULL,
- need_tuples);
+ NULL);
node->ss.ss_currentScanDesc = scan;
}
@@ -434,7 +420,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -445,20 +430,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_pages++;
node->prefetch_blockno = tbmpre->blockno;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -475,7 +447,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -502,15 +473,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_blockno = tbmpre->blockno;
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
--
2.45.2
v3-0002-isolationtester-showing-broken-index-only-bitmap-.patch (application/octet-stream)
From 11c89a393ebf194d5d5f6617d6bd70f753f6da74 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v3 2/2] isolationtester showing broken index-only bitmap heap
scan
---
.../expected/index-only-bitmapscan.out | 28 ++++++
src/test/isolation/isolation_schedule | 1 +
.../specs/index-only-bitmapscan.spec | 85 +++++++++++++++++++
3 files changed, 114 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec
diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 00000000000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+row_number
+----------
+ 1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-bitmapscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 00000000000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+ CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
On Tue, 7 Jan 2025 at 18:46, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Hi,
I've rebased my earlier patch to fix some minor conflicts with the
work done on bitmaps in December last year. I've also included Andres'
cursor-based isolation test as 0002; which now passes.
Rebased again, now on top of head due to f7a8fc10.
The patches for the back-branches didn't need updating, as those
branches have not diverged enough for those patches to have gotten
stale. They're still available in my initial mail over at [0].
Same again.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)

[0] /messages/by-id/CAEze2Wg1Q4gWzm9RZ0yXydm23Mj3iScu8LA__Zz3JJEgpnoGPQ@mail.gmail.com
Attachments:
v4-0002-isolationtester-showing-broken-index-only-bitmap-.patch (application/octet-stream)
From 0eb1060da55d488717445878b364d1a255089657 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v4 2/2] isolationtester showing broken index-only bitmap heap
scan
---
.../expected/index-only-bitmapscan.out | 28 ++++++
src/test/isolation/isolation_schedule | 1 +
.../specs/index-only-bitmapscan.spec | 85 +++++++++++++++++++
3 files changed, 114 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec
diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 00000000000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+row_number
+----------
+ 1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-bitmapscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 00000000000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+ CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v4-0001-Remove-HeapBitmapScan-s-skip_fetch-optimization.patch (application/octet-stream)
From 46a2670a20ed2e55f2c6c8dce6c211496123e544 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:38:14 +0100
Subject: [PATCH v4 1/2] Remove HeapBitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 12 +---------
src/backend/access/heap/heapam.c | 29 -----------------------
src/backend/access/heap/heapam_handler.c | 28 ----------------------
src/backend/executor/nodeBitmapHeapscan.c | 27 ++-------------------
4 files changed, 3 insertions(+), 93 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..e48fe434cd3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,17 +103,7 @@ typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
- /*
- * These fields are only used for bitmap scans for the "skip fetch"
- * optimization. Bitmap scans needing no fields from the heap may skip
- * fetching an all visible block, instead using the number of tuples per
- * block reported by the bitmap to determine how many NULL-filled tuples
- * to return. They are common to parallel and serial BitmapHeapScans
- */
-
- /* page of VM containing info for current block */
- Buffer rs_vmbuffer;
- int rs_empty_tuples_pending;
+ /* Holds no data */
} BitmapHeapScanDescData;
typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..4714e2bfaea 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1054,8 +1054,6 @@ heap_beginscan(Relation relation, Snapshot snapshot,
{
BitmapHeapScanDesc bscan = palloc(sizeof(BitmapHeapScanDescData));
- bscan->rs_vmbuffer = InvalidBuffer;
- bscan->rs_empty_tuples_pending = 0;
scan = (HeapScanDesc) bscan;
}
else
@@ -1182,24 +1180,6 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
-
- /*
- * Reset empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous
- * scan on rescan.
- */
- bscan->rs_empty_tuples_pending = 0;
-
- if (BufferIsValid(bscan->rs_vmbuffer))
- {
- ReleaseBuffer(bscan->rs_vmbuffer);
- bscan->rs_vmbuffer = InvalidBuffer;
- }
- }
-
/*
* The read stream is reset on rescan. This must be done before
* initscan(), as some state referred to by read_stream_reset() is reset
@@ -1227,15 +1207,6 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) sscan;
-
- bscan->rs_empty_tuples_pending = 0;
- if (BufferIsValid(bscan->rs_vmbuffer))
- ReleaseBuffer(bscan->rs_vmbuffer);
- }
-
/*
* Must free the read stream before freeing the BufferAccessStrategy.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c0bec014154..00503fd3c82 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2160,24 +2160,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*blockno = tbmres->blockno;
*recheck = tbmres->recheck;
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, the bitmap entries don't need rechecking, and all tuples on the
- * page are visible to our transaction.
- */
- if (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &bscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
- Assert(bscan->rs_empty_tuples_pending >= 0);
-
- bscan->rs_empty_tuples_pending += tbmres->ntuples;
-
- return true;
- }
-
block = tbmres->blockno;
/*
@@ -2293,16 +2275,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
- if (bscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- bscan->rs_empty_tuples_pending--;
- return true;
- }
-
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index be0d24d901b..49270634d54 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -442,7 +442,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -453,20 +452,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_pages++;
node->prefetch_blockno = tbmpre->blockno;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -483,7 +469,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -510,15 +495,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_blockno = tbmpre->blockno;
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
--
2.45.2
On Fri, Feb 28, 2025 at 12:53 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Rebased again, now on top of head due to f7a8fc10.
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.
If you don't see a clear path forward, then "disable" is the only
reasonable choice for the back branches. Maybe we'll find a fix
in future, but it seems unlikely that it'd be back-patchable.
regards, tom lane
On Sun, 2 Mar 2025 at 01:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.

If you don't see a clear path forward, then "disable" is the only
reasonable choice for the back branches. Maybe we'll find a fix
in future, but it seems unlikely that it'd be back-patchable.
Agreed.
Here's patch v5 for the master branch (now up to f4694e0f), with no
interesting changes other than fixing apply conflicts caused by
bfe56cdf.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v5-0002-isolationtester-showing-broken-index-only-bitmap-.patch (application/octet-stream)
From e664d1033a2866ff5dd126a62186fb646e63c54b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v5 2/2] isolationtester showing broken index-only bitmap heap
scan
---
.../expected/index-only-bitmapscan.out | 28 ++++++
src/test/isolation/isolation_schedule | 1 +
.../specs/index-only-bitmapscan.spec | 85 +++++++++++++++++++
3 files changed, 114 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec
diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 00000000000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+row_number
+----------
+ 1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-bitmapscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 00000000000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+ CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v5-0001-Remove-HeapBitmapScan-s-skip_fetch-optimization.patch (application/octet-stream)
From 71c770623725f16e66d4a302415e8c9c66bae2bc Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:38:14 +0100
Subject: [PATCH v5 1/2] Remove HeapBitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 12 +---------
src/backend/access/heap/heapam.c | 29 -----------------------
src/backend/access/heap/heapam_handler.c | 29 -----------------------
src/backend/executor/nodeBitmapHeapscan.c | 27 ++-------------------
4 files changed, 3 insertions(+), 94 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..e48fe434cd3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,17 +103,7 @@ typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
- /*
- * These fields are only used for bitmap scans for the "skip fetch"
- * optimization. Bitmap scans needing no fields from the heap may skip
- * fetching an all visible block, instead using the number of tuples per
- * block reported by the bitmap to determine how many NULL-filled tuples
- * to return. They are common to parallel and serial BitmapHeapScans
- */
-
- /* page of VM containing info for current block */
- Buffer rs_vmbuffer;
- int rs_empty_tuples_pending;
+ /* Holds no data */
} BitmapHeapScanDescData;
typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..4714e2bfaea 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1054,8 +1054,6 @@ heap_beginscan(Relation relation, Snapshot snapshot,
{
BitmapHeapScanDesc bscan = palloc(sizeof(BitmapHeapScanDescData));
- bscan->rs_vmbuffer = InvalidBuffer;
- bscan->rs_empty_tuples_pending = 0;
scan = (HeapScanDesc) bscan;
}
else
@@ -1182,24 +1180,6 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
-
- /*
- * Reset empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous
- * scan on rescan.
- */
- bscan->rs_empty_tuples_pending = 0;
-
- if (BufferIsValid(bscan->rs_vmbuffer))
- {
- ReleaseBuffer(bscan->rs_vmbuffer);
- bscan->rs_vmbuffer = InvalidBuffer;
- }
- }
-
/*
* The read stream is reset on rescan. This must be done before
* initscan(), as some state referred to by read_stream_reset() is reset
@@ -1227,15 +1207,6 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) sscan;
-
- bscan->rs_empty_tuples_pending = 0;
- if (BufferIsValid(bscan->rs_vmbuffer))
- ReleaseBuffer(bscan->rs_vmbuffer);
- }
-
/*
* Must free the read stream before freeing the BufferAccessStrategy.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3cef..014aecca51a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2167,25 +2167,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*blockno = tbmres->blockno;
*recheck = tbmres->recheck;
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, the bitmap entries don't need rechecking, and all tuples on the
- * page are visible to our transaction.
- */
- if (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &bscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(!tbmres->lossy);
- Assert(bscan->rs_empty_tuples_pending >= 0);
- Assert(noffsets > -1);
-
- bscan->rs_empty_tuples_pending += noffsets;
-
- return true;
- }
-
block = tbmres->blockno;
/*
@@ -2304,16 +2285,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
- if (bscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- bscan->rs_empty_tuples_pending--;
- return true;
- }
-
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index be0d24d901b..49270634d54 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -442,7 +442,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
if (tbmpre == NULL)
{
@@ -453,20 +452,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_pages++;
node->prefetch_blockno = tbmpre->blockno;
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
@@ -483,7 +469,6 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;
- bool skip_fetch;
/*
* Recheck under the mutex. If some other process has already
@@ -510,15 +495,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
node->prefetch_blockno = tbmpre->blockno;
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (!(scan->rs_flags & SO_NEED_TUPLES) &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}
}
}
--
2.45.2
Hi,
On 2025-03-01 19:35:15 -0500, Tom Lane wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.

If you don't see a clear path forward, then "disable" is the only
reasonable choice for the back branches. Maybe we'll find a fix
in future, but it seems unlikely that it'd be back-patchable.
I don't think there's a realistic way to fix it in the backbranches. And even
on master, I doubt that much of the current code would survive.
Greetings,
Andres Freund
On Wed, 5 Mar 2025 at 16:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Sun, 2 Mar 2025 at 01:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.

If you don't see a clear path forward, then "disable" is the only
reasonable choice for the back branches. Maybe we'll find a fix
in future, but it seems unlikely that it'd be back-patchable.

Agreed.
Here's patch v5 for the master branch (now up to f4694e0f), with no
interesting changes other than fixing apply conflicts caused by
bfe56cdf.
I noticed that Andres's comment from [1] is not yet addressed,
changing the commitfest entry status to Waiting on Author, please
address the comment and change it back to Needs review.
[1]: /messages/by-id/tf5pp2o2a5x5qjcseq354bd26ya4o7p2vjzm5z4w57ca3vy6bc@ep7enrljvdkr
Regards,
Vignesh
On Sun, 16 Mar 2025 at 13:55, vignesh C <vignesh21@gmail.com> wrote:
On Wed, 5 Mar 2025 at 16:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Sun, 2 Mar 2025 at 01:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Is everybody in agreement about committing and back patching this fix,
which simply disables the optimization altogether?
I myself don't see a better way, but thought I'd ask before proceeding
with review and commit.

If you don't see a clear path forward, then "disable" is the only
reasonable choice for the back branches. Maybe we'll find a fix
in future, but it seems unlikely that it'd be back-patchable.

Agreed.
Here's patch v5 for the master branch (now up to f4694e0f), with no
interesting changes other than fixing apply conflicts caused by
bfe56cdf.

I noticed that Andres's comment from [1] is not yet addressed,
changing the commitfest entry status to Waiting on Author, please
address the comment and change it back to Needs review.
[1] - /messages/by-id/tf5pp2o2a5x5qjcseq354bd26ya4o7p2vjzm5z4w57ca3vy6bc@ep7enrljvdkr
As I understand it, Andres agrees that the feature is unsalvageable in
backbranches ("don't think there's a realistic way to fix it"). Andres
then continues to elaborate that even if the feature were salvageable,
it wouldn't contain much of the current code.
My patch removes the feature altogether in master and disables the
feature in the backbranches in the patches that were built against the
backbranches, which seems to match Andres' comments.
@vignesh, could you elaborate on what about Andres' comments I need to
address in my patch?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hi,
On 2025-03-17 16:17:03 +0100, Matthias van de Meent wrote:
As I understand it, Andres agrees that the feature is unsalvageable in
backbranches ("don't think there's a realistic way to fix it"). Andres
then continues to elaborate that even if the feature were salvageable,
it wouldn't contain much of the current code.
Correct.
My patch removes the feature altogether in master and disables the
feature in the backbranches in the patches that were built against the
backbranches, which seems to match Andres' comments.
Agreed.
@vignesh, could you elaborate on what about Andres' comments I need to
address in my patch?
I don't think there are any...
Greetings,
Andres Freund
Hi,
Matthias, any chance you can provide a rebased version for master? If not,
I'll do the rebase. Either way I'm planning to push this fairly soon.
Greetings,
Andres Freund
On Wed, 2 Apr 2025 at 17:37, Andres Freund <andres@anarazel.de> wrote:
Hi,
Matthias, any chance you can provide a rebased version for master?
Sure, I'll try to have it in your inbox later today CEST.
Either way I'm planning to push this fairly soon.
OK, thanks!
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 2 Apr 2025 at 18:20, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 2 Apr 2025 at 17:37, Andres Freund <andres@anarazel.de> wrote:
Hi,
Matthias, any chance you can provide a rebased version for master?
Sure, I'll try to have it in your inbox later today CEST.
And here it is, v6 of the patchset, rebased up to master @ 764d501d.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v6-0002-isolationtester-showing-broken-index-only-bitmap-.patch (application/octet-stream)
From 14dbf4bd4c03d8610c92cbe7020ec979bd55dd74 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v6 2/2] isolationtester showing broken index-only bitmap heap
scan
---
.../expected/index-only-bitmapscan.out | 28 ++++++
src/test/isolation/isolation_schedule | 1 +
.../specs/index-only-bitmapscan.spec | 85 +++++++++++++++++++
3 files changed, 114 insertions(+)
create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec
diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 00000000000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+ DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+ FETCH FROM foo;
+
+row_number
+----------
+ 1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+ FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
+test: index-only-bitmapscan
test: predicate-lock-hot-tuple
test: update-conflict-out
test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 00000000000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+ -- by using a low fillfactor and a wide tuple we can get multiple blocks
+ -- with just few rows
+ CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+ WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+ INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+ CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+ CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+ DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+ SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+ DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+ FETCH FROM foo;
+}
+
+step s1_fetch_all {
+ FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+ DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+ # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+ # VM to be size 0, due to caching. Can't do that in setup because
+ s2_vacuum
+
+ # delete nearly all rows, to make issue visible
+ s2_mod
+ # create a cursor
+ s1_begin
+ s1_prepare
+
+ # fetch one row from the cursor, that ensures the index scan portion is done
+ # before the vacuum in the next step
+ s1_fetch_1
+
+ # with the bug this vacuum will mark pages as all-visible that the scan in
+ # the next step then considers all-visible, despite all rows from those
+ # pages having been removed.
+ s2_vacuum
+ # if this returns any rows, we're busted
+ s1_fetch_all
+
+ s1_commit
--
2.45.2
v6-0001-Remove-HeapBitmapScan-s-skip_fetch-optimization.patch (application/octet-stream)
From 7544de9eb5d5122651f00878be8847c1c14a07a3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 2 Dec 2024 15:38:14 +0100
Subject: [PATCH v6 1/2] Remove HeapBitmapScan's skip_fetch optimization
The optimization doesn't take concurrent vacuum's removal of TIDs into
account, which can remove dead TIDs and make ALL_VISIBLE pages for which
we have pointers to those dead TIDs in the bitmap.
The optimization itself isn't impossible, but would require more work
figuring out that vacuum started removing TIDs and then stop using the
optimization. However, we currently don't have such a system in place,
making the optimization unsound to keep around.
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Backpatch-through: 13
---
src/include/access/heapam.h | 12 +----
src/include/access/tableam.h | 12 +----
src/backend/access/heap/heapam.c | 62 +++--------------------
src/backend/access/heap/heapam_handler.c | 44 +---------------
src/backend/executor/nodeBitmapHeapscan.c | 15 +-----
5 files changed, 13 insertions(+), 132 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..e48fe434cd3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,17 +103,7 @@ typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
- /*
- * These fields are only used for bitmap scans for the "skip fetch"
- * optimization. Bitmap scans needing no fields from the heap may skip
- * fetching an all visible block, instead using the number of tuples per
- * block reported by the bitmap to determine how many NULL-filled tuples
- * to return. They are common to parallel and serial BitmapHeapScans
- */
-
- /* page of VM containing info for current block */
- Buffer rs_vmbuffer;
- int rs_empty_tuples_pending;
+ /* Holds no data */
} BitmapHeapScanDescData;
typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..8713e12cbfb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,13 +62,6 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
-
- /*
- * At the discretion of the table AM, bitmap table scans may be able to
- * skip fetching a block from the table if none of the table data is
- * needed. If table data may be needed, set SO_NEED_TUPLES.
- */
- SO_NEED_TUPLES = 1 << 10,
} ScanOptions;
/*
@@ -920,13 +913,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key, bool need_tuple)
+ int nkeys, struct ScanKeyData *key)
{
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- if (need_tuple)
- flags |= SO_NEED_TUPLES;
-
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key,
NULL, flags);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cedaa195cb6..c13288d83b7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -314,31 +314,6 @@ bitmapheap_stream_read_next(ReadStream *pgsr, void *private_data,
tbmres->blockno >= hscan->rs_nblocks)
continue;
- /*
- * We can skip fetching the heap page if we don't need any fields from
- * the heap, the bitmap entries don't need rechecking, and all tuples
- * on the page are visible to our transaction.
- */
- if (!(sscan->rs_flags & SO_NEED_TUPLES) &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(sscan->rs_rd, tbmres->blockno, &bscan->rs_vmbuffer))
- {
- OffsetNumber offsets[TBM_MAX_TUPLES_PER_PAGE];
- int noffsets;
-
- /* can't be lossy in the skip_fetch case */
- Assert(!tbmres->lossy);
- Assert(bscan->rs_empty_tuples_pending >= 0);
-
- /*
- * We throw away the offsets, but this is the easiest way to get a
- * count of tuples.
- */
- noffsets = tbm_extract_page_tuple(tbmres, offsets, TBM_MAX_TUPLES_PER_PAGE);
- bscan->rs_empty_tuples_pending += noffsets;
- continue;
- }
-
return tbmres->blockno;
}
@@ -1123,9 +1098,10 @@ heap_beginscan(Relation relation, Snapshot snapshot,
if (flags & SO_TYPE_BITMAPSCAN)
{
BitmapHeapScanDesc bscan = palloc(sizeof(BitmapHeapScanDescData));
-
- bscan->rs_vmbuffer = InvalidBuffer;
- bscan->rs_empty_tuples_pending = 0;
+ /*
+ * Bitmap Heap scans do not have any new fields that a normal Heap
+ * Scan does not have, so no special initializations required here.
+ */
scan = (HeapScanDesc) bscan;
}
else
@@ -1280,23 +1256,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
scan->rs_cbuf = InvalidBuffer;
}
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
-
- /*
- * Reset empty_tuples_pending, a field only used by bitmap heap scan,
- * to avoid incorrectly emitting NULL-filled tuples from a previous
- * scan on rescan.
- */
- bscan->rs_empty_tuples_pending = 0;
-
- if (BufferIsValid(bscan->rs_vmbuffer))
- {
- ReleaseBuffer(bscan->rs_vmbuffer);
- bscan->rs_vmbuffer = InvalidBuffer;
- }
- }
+ /*
+ * SO_TYPE_BITMAPSCAN would be cleaned up here, but it does not hold any
+ * additional data vs a normal HeapScan
+ */
/*
* The read stream is reset on rescan. This must be done before
@@ -1325,15 +1288,6 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
- if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
- {
- BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) sscan;
-
- bscan->rs_empty_tuples_pending = 0;
- if (BufferIsValid(bscan->rs_vmbuffer))
- ReleaseBuffer(bscan->rs_vmbuffer);
- }
-
/*
* Must free the read stream before freeing the BufferAccessStrategy.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..453470e5aee 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2137,32 +2137,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
*/
while (hscan->rs_cindex >= hscan->rs_ntuples)
{
- /*
- * Emit empty tuples before advancing to the next block
- */
- if (bscan->rs_empty_tuples_pending > 0)
- {
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
- bscan->rs_empty_tuples_pending--;
-
- /*
- * We do not recheck all NULL tuples. Because the streaming read
- * API only yields TBMIterateResults for blocks actually fetched
- * from the heap, we must unset `recheck` ourselves here to ensure
- * correct results.
- *
- * Our read stream callback accrues a count of empty tuples to
- * emit and then emits them after emitting tuples from the next
- * fetched block. If no blocks need fetching, we'll emit the
- * accrued count at the end of the scan.
- */
- *recheck = false;
- return true;
- }
-
/*
* Returns false if the bitmap is exhausted and there are no further
* blocks we need to scan.
@@ -2516,24 +2490,10 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
if (BufferIsInvalid(hscan->rs_cbuf))
{
- if (BufferIsValid(bscan->rs_vmbuffer))
- {
- ReleaseBuffer(bscan->rs_vmbuffer);
- bscan->rs_vmbuffer = InvalidBuffer;
- }
-
/*
- * The bitmap is exhausted. Now emit any remaining empty tuples. The
- * read stream API only returns TBMIterateResults for blocks actually
- * fetched from the heap. Our callback will accrue a count of empty
- * tuples to emit for all blocks we skipped fetching. So, if we skip
- * fetching heap blocks at the end of the relation (or no heap blocks
- * are fetched) we need to ensure we emit empty tuples before ending
- * the scan. We don't recheck empty tuples so ensure `recheck` is
- * unset.
+ * The bitmap is exhausted.
*/
- *recheck = false;
- return bscan->rs_empty_tuples_pending > 0;
+ return false;
}
Assert(per_buffer_data);
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3e33360c0fc..bf24f3d7fe0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -105,24 +105,11 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
*/
if (!node->ss.ss_currentScanDesc)
{
- bool need_tuples = false;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or
- * for returning data. This test is a bit simplistic, as it checks
- * the stronger condition that there's no qual or return tlist at all.
- * But in most cases it's probably not worth working harder than that.
- */
- need_tuples = (node->ss.ps.plan->qual != NIL ||
- node->ss.ps.plan->targetlist != NIL);
-
node->ss.ss_currentScanDesc =
table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot,
0,
- NULL,
- need_tuples);
+ NULL);
}
node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
--
2.45.2
Hi,
On 2025-04-02 18:58:11 +0200, Matthias van de Meent wrote:
And here it is, v6 of the patchset, rebased up to master @ 764d501d.
Thanks!
Does anybody have an opinion about how non-invasive to be in the
back-branches? The minimal version is something like this diff:
diff --git i/src/backend/executor/nodeBitmapHeapscan.c w/src/backend/executor/nodeBitmapHeapscan.c
index 6b48a6d8350..0bd8b3404c1 100644
--- i/src/backend/executor/nodeBitmapHeapscan.c
+++ w/src/backend/executor/nodeBitmapHeapscan.c
@@ -187,6 +187,19 @@ BitmapHeapNext(BitmapHeapScanState *node)
{
bool need_tuples = false;
+ /*
+ * Unfortunately it turns out that the below optimization does not
+ * take the removal of TIDs by a concurrent vacuum into
+ * account. The concurrent vacuum can remove dead TIDs and make
+ * pages ALL_VISIBLE while those dead TIDs are referenced in the
+ * bitmap. This would lead to a !need_tuples scan returning too
+ * many tuples.
+ *
+ * In the back-branches, we therefore simply disable the
+ * optimization. Removing all the relevant code would be too
+ * invasive (and a major backpatching pain).
+ */
+#ifdef NOT_ANYMORE
/*
* We can potentially skip fetching heap pages if we do not need
* any columns of the table, either for checking non-indexable
@@ -197,6 +210,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
need_tuples = (node->ss.ps.plan->qual != NIL ||
node->ss.ps.plan->targetlist != NIL);
+#endif
But it seems a bit weird to continue checking SO_NEED_TUPLES (which is what
need_tuples ends up as) in other parts of the code. But it's less work to
backpatch that way. Obviously we can't remove the relevant struct fields in
the backbranches.
Other notes:
- Should we commit the test showing that the naive implementation of
index-only-bitmap-heapscan is broken, in case somebody wants to re-introduce
it?
If so, I think we should not backpatch the test. If it turns out to not be
stable, it's a pain to fix test stability issues over multiple branches.
And the patch would need a bit of comment editing to explain that, easy
enough.
- We have some superfluous includes in nodeBitmapHeapscan.c - but I think
that's not actually the fault of this patch. Seems the read-stream patch
should have removed at least the includes of visibilitymap.h, bufmgr.h
and spccache.h? And b09ff53667f math.h...
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Does anybody have an opinion about how non-invasive to be in the
back-branches? The minimal version is something like this diff:
Minimal is good -- less chance of breaking anything.
- Should we commit the test showing that the naive implementation of
index-only-bitmap-heapscan is broken, in case somebody wants to re-introduce
it?
Seems like a good idea. Agreed on HEAD-only for that.
- We have some superfluous includes in nodeBitmapHeapscan.c - but I think
that's not actually the fault of this patch. Seems the read-stream patch
should have removed the at least the includes of visibilitymap.h, bufmgr.h
and spccache.h? And b09ff53667f math.h...
Meh, let's leave that for the next round of IWYU hacking.
regards, tom lane
On Wed, 2 Apr 2025 at 19:48, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-04-02 18:58:11 +0200, Matthias van de Meent wrote:
And here it is, v6 of the patchset, rebased up to master @ 764d501d.
Thanks!
Does anybody have an opinion about how non-invasive to be in the
back-branches? The minimal version is something like this diff:
[...]
But it seems a bit weird to continue checking SO_NEED_TUPLES (which is what
need_tuples ends up as) in other parts of the code. But it's less work to
backpatch that way. Obviously we can't remove the relevant struct fields in
the backbranches.
A minimal version seems fine to me.
Other notes:
- Should we commit the test showing that the naive implementation of
index-only-bitmap-heapscan is broken, in case somebody wants to re-introduce
it?
I'd prefer that, yes.
If so, I think we should not backpatch the test. If it turns out to not be
stable, it's a pain to fix test stability issues over multiple branches.
That's fair. But if the reason for not adding a test is potential
instability we could just as well add it now and remove it if it
actually proves to be unstable.
- We have some superfluous includes in nodeBitmapHeapscan.c - but I think
that's not actually the fault of this patch. Seems the read-stream patch
should have removed the at least the includes of visibilitymap.h, bufmgr.h
and spccache.h? And b09ff53667f math.h...
I have no strong opinion about this.
Thank you for picking this up!
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hi,
On 2025-04-02 14:17:01 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Does anybody have an opinion about how non-invasive to be in the
back-branches? The minimal version is something like this diff:
Minimal is good -- less chance of breaking anything.
- Should we commit the test showing that the naive implementation of
index-only-bitmap-heapscan is broken, in case somebody wants to re-introduce
it?
Seems like a good idea. Agreed on HEAD-only for that.
Pushed that way.
Thanks for the report and the fix!
Greetings,
Andres Freund
Hello,
We noticed a sustained increase in IO Wait of read queries after
upgrading from 13.13 to 13.21. Eventually, we narrowed it down to a
spike in index_blocks_read of a certain table where Bitmap Heap Scans
happen.
Do you think that this change (i.e. removing the optimization) could be
what caused this regression?
Thanks
Hi,
On September 16, 2025 7:57:54 AM EDT, "Core Studios Inc." <corestudiosinc@gmail.com> wrote:
Hello,
We noticed a sustained increased in IO Wait of read queries after upgrading from 13.13 to 13.21. Eventually, we narrowed it down to a spike in index_blocks_read of a certain table where Bitmap Heap Scans do happen.
Do you think that this change (i.e. removing the optimization) could be what caused this regression?
You're not providing enough details for us to answer that question. We'd need an explain verbose for the query.
Greetings,
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
We could not pin this down to a specific query, but here's one frequent
query that runs on that table:
piers=> EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM events
WHERE priority = 'high' and ( events.fire_at <= now() + interval '1 hour' ) AND ( events.fire_at >= now() )
ORDER BY fire_at ASC
LIMIT 1000;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.80..2.81 rows=1 width=453) (actual time=0.598..0.618 rows=222 loops=1)
   Buffers: shared hit=224
   ->  Sort  (cost=2.80..2.81 rows=1 width=453) (actual time=0.598..0.606 rows=222 loops=1)
         Sort Key: fire_at
         Sort Method: quicksort  Memory: 241kB
         Buffers: shared hit=224
         ->  Bitmap Heap Scan on events  (cost=1.67..2.79 rows=1 width=453) (actual time=0.073..0.465 rows=222 loops=1)
               Recheck Cond: (((priority)::text = 'high'::text) AND (fire_at <= (now() + '01:00:00'::interval)) AND (fire_at >= now()))
               Heap Blocks: exact=219
               Buffers: shared hit=224
               ->  Bitmap Index Scan on index_events_on_priority_fire_at  (cost=0.00..1.67 rows=1 width=0) (actual time=0.052..0.052 rows=222 loops=1)
                     Index Cond: (((queue)::text = 'high'::text) AND (fire_at <= (now() + '01:00:00'::interval)) AND (fire_at >= now()))
                     Buffers: shared hit=5
 Planning Time: 0.067 ms
 Execution Time: 0.646 ms
(15 rows)
Interestingly, before the upgrade the majority of query plans for this
query were not executing any Bitmap Heap Scans:
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..2.78 rows=1 width=468) (actual time=0.028..0.489 rows=217 loops=1)
   Buffers: shared hit=222
   ->  Index Scan using index_events_on_priority_fire_at on events  (cost=0.56..2.78 rows=1 width=468) (actual time=0.027..0.473 rows=217 loops=1)
         Index Cond: (((priority)::text = 'high'::text) AND (fire_at <= (now() + '01:00:00'::interval)) AND (fire_at >= now()))
         Buffers: shared hit=222
 Planning:
   Buffers: shared hit=64
 Planning Time: 0.312 ms
 Execution Time: 0.512 ms
(9 rows)
Thanks in advance
Following up to correct my previous email: the query plan did not really
change - we see Bitmap Heap Scans even before the upgrade. Therefore,
the query plan doesn't seem to have changed after the upgrade.
However after the upgrade it seems that those queries are doing more
disk IO (increased index_blocks_read in that table).
Thanks in advance
On Tue, 16 Sept 2025 at 13:57, Core Studios Inc.
<corestudiosinc@gmail.com> wrote:
Hello,
We noticed a sustained increased in IO Wait of read queries after
upgrading from 13.13 to 13.21. Eventually, we narrowed it down to a
spike in index_blocks_read of a certain table where Bitmap Heap Scans do
happen.
Do you think that this change (i.e. removing the optimization) could be
what caused this regression?
I'm quite sure that the IOs for a Bitmap Heap Scan (which is the node
that may now see increased IO) count towards the table's IO statistics
(e.g. pg_statio_all_tables.heap_blks_read), not the index's (e.g.
pg_statio_all_indexes.idx_blks_read). Only "Index Scan" (and Bitmap
Index Scan, and Index Only Scan) nodes count their IO toward their
respective indexes, and the executor code for those nodes was not
modified in this bugfix.
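For illustration, that distinction can be checked directly in the
cumulative statistics views; a minimal sketch, assuming (hypothetically)
that the affected table is named events:

-- Hypothetical table name "events"; compare heap vs. index block reads
-- to see on which side the extra IO is accumulating.
SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
WHERE relname = 'events';

-- Per-index breakdown for the same table.
SELECT indexrelname, idx_blks_read, idx_blks_hit
FROM pg_statio_user_indexes
WHERE relname = 'events';

If the fix were the cause, the growth would be expected in
heap_blks_read rather than idx_blks_read, since the removed optimization
only skipped heap fetches.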
Note that before the removal of the VM-only option, these "bitmap
only" nodes could return incorrect results, so the "bitmap-only scan"
wasn't really a valid optimization.
Also note that neither the "optimization" nor the fix modified any
part or dependency of the planner, so any changes in plans must have
been caused by other changes in the system as a whole (such as updated
table statistics, and (unlikely) other backported bugfixes).
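As a hedged illustration of that point (again assuming, hypothetically,
that the table is named events), the timestamps in pg_stat_user_tables
show whether statistics were refreshed around the time the plans changed:

-- Hypothetical table name "events".
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'events';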
Kind regards,
Matthias van de Meent
Databricks
Hi,
On 2025-09-16 15:55:49 +0300, Core Studios Inc. wrote:
Following up to correct my previous email: the query plan did not really
change - we see Bitmap Heap Scans even before the upgrade. Therefore, the
query plan doesn't seem to have changed after the upgrade.
However after the upgrade it seems that those queries are doing more disk IO
(increased index_blocks_read in that table).
Without the query + plan I can't judge whether the query could even have
benefited from the, fairly narrow, optimization...
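For reference, a sketch of the kind of invocation that would provide
that, using the earlier query only as a hypothetical stand-in (table and
predicate are placeholders, not the actual workload):

-- Hypothetical example; substitute the real query.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT *
FROM events
WHERE priority = 'high'
  AND fire_at >= now()
  AND fire_at <= now() + interval '1 hour'
ORDER BY fire_at ASC
LIMIT 1000;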
Greetings,
Andres