BitmapHeapScan streaming read user and prelim refactoring
Hi,
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
patches 0001-0002 are assorted cleanup needed later in the set.
patch 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.
patches 0001-0009 apply on top of master but 0010 and 0011 must be
applied on top of a commit before 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
A TBMIterateResult used to be threaded through both of these functions
and used in BitmapHeapNext(). This patch set removes all references to
TBMIterateResults from BitmapHeapNext. Because the streaming read API
requires the callback to specify the next block, BitmapHeapNext() can
no longer pass a TBMIterateResult to table_scan_bitmap_next_block().
More subtly, table_scan_bitmap_next_block() used to return false if
there were no more visible tuples on the page or if the block that was
requested was not valid. With these changes,
table_scan_bitmap_next_block() will only return false when the bitmap
has been exhausted and the scan can end. In order to use the streaming
read API, the user must be able to request the blocks it needs without
requiring synchronous feedback per block. Thus, this table AM function
must change its meaning.
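To make the new contract concrete, here is a toy model (stand-in types and names, not the real TBM or table AM API) of the revised behavior: invalid blocks are consumed inside the function, and a false return means only that the bitmap is exhausted.

```c
/* Toy model of the revised table_scan_bitmap_next_block() contract.
 * BlockIter, next_valid_block(), and the int block numbers are
 * stand-ins, not PostgreSQL code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct BlockIter
{
	const int  *blocks;			/* bitmap page numbers, ascending */
	size_t		nentries;
	size_t		pos;
} BlockIter;

static bool
next_valid_block(BlockIter *it, int nblocks, int *blockno)
{
	while (it->pos < it->nentries)
	{
		int			blk = it->blocks[it->pos++];

		/* skip blocks past the relation size at scan start */
		if (blk >= nblocks)
			continue;
		*blockno = blk;
		return true;
	}
	return false;				/* bitmap exhausted: the scan can end */
}
```

Under the old contract the caller had to distinguish "no visible tuples / invalid block" from "done"; here the callback-style function only ever reports usable blocks or exhaustion, which is what an asynchronous block-requesting API needs.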
I think the way the patches are split up could be improved. I will
think more about this. There are also probably a few mistakes with
which comments are updated in which patches in the set.
- Melanie
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
[2]: /messages/by-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ+opUrXDRXdcfwFZGA@mail.gmail.com
Attachments:
v1-0001-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (+1/-10)
v1-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (+11/-11)
v1-0003-BitmapHeapScan-begin-scan-after-bitmap-setup.patch (+30/-18)
v1-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (+2/-3)
v1-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (+4/-5)
v1-0006-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (+74/-91)
v1-0007-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch (+28/-6)
v1-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (+8/-12)
v1-0009-Make-table_scan_bitmap_next_block-async-friendly.patch (+121/-117)
v1-0010-Streaming-Read-API.patch (+986/-239)
v1-0011-BitmapHeapScan-uses-streaming-read-API.patch (+178/-445)
On Feb 13, 2024, at 3:11 PM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Thanks for the patch...
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.

patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.

patches 0001-0009 apply on top of master but 0010 and 0011 must be
applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).
I followed your lead and applied them to 6a8ffe812d194ba6f4f26791b6388a4837d17d6c. `make check` worked fine, though I expect you know that already.
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
You might want an independent perspective on how much of a hassle those breaking changes are, so I took a stab at that. Having written a custom proprietary TAM for postgresql 15 here at EDB, and having ported it and released it for postgresql 16, I thought I'd try porting it to the above commit with your patches. Even without your patches, I already see breaking changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which creates a similar amount of breakage for me as do your patches. Dealing with the combined breakage might amount to a day of work, including testing, half of which I think I've already finished. In other words, it doesn't seem like a big deal.
Were postgresql 17 shaping up to be compatible with TAMs written for 16, your patch would change that qualitatively, but since things are already incompatible, I think you're in the clear.
A TBMIterateResult used to be threaded through both of these functions
and used in BitmapHeapNext(). This patch set removes all references to
TBMIterateResults from BitmapHeapNext. Because the streaming read API
requires the callback to specify the next block, BitmapHeapNext() can
no longer pass a TBMIterateResult to table_scan_bitmap_next_block().

More subtly, table_scan_bitmap_next_block() used to return false if
there were no more visible tuples on the page or if the block that was
requested was not valid. With these changes,
table_scan_bitmap_next_block() will only return false when the bitmap
has been exhausted and the scan can end. In order to use the streaming
read API, the user must be able to request the blocks it needs without
requiring synchronous feedback per block. Thus, this table AM function
must change its meaning.

I think the way the patches are split up could be improved. I will
think more about this. There are also probably a few mistakes with
which comments are updated in which patches in the set.
I look forward to the next version of the patch set. Thanks again for working on this.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 13, 2024 at 11:34 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
On Feb 13, 2024, at 3:11 PM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Thanks for the patch...
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.

patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.

patches 0001-0009 apply on top of master but 0010 and 0011 must be
applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).

I followed your lead and applied them to 6a8ffe812d194ba6f4f26791b6388a4837d17d6c. `make check` worked fine, though I expect you know that already.
Thanks for taking a look!
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().

You might want an independent perspective on how much of a hassle those breaking changes are, so I took a stab at that. Having written a custom proprietary TAM for postgresql 15 here at EDB, and having ported it and released it for postgresql 16, I thought I'd try porting it to the above commit with your patches. Even without your patches, I already see breaking changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which creates a similar amount of breakage for me as do your patches. Dealing with the combined breakage might amount to a day of work, including testing, half of which I think I've already finished. In other words, it doesn't seem like a big deal.
Were postgresql 17 shaping up to be compatible with TAMs written for 16, your patch would change that qualitatively, but since things are already incompatible, I think you're in the clear.
Oh, good to know! I'm very happy to have the perspective of a table AM
author. Just curious, did your table AM implement
table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
and, if so, did you use the TBMIterateResult? Since it is not used in
BitmapHeapNext() in my version, table AMs would have to change how
they use TBMIterateResults anyway. But I assume they could add it to a
table AM specific scan descriptor if they want access to a
TBMIterateResult of their own making in both
table_scan_bitmap_next_block() and next_tuple()?
- Melanie
On Feb 14, 2024, at 6:47 AM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Just curious, did your table AM implement
table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
and, if so, did you use the TBMIterateResult? Since it is not used in
BitmapHeapNext() in my version, table AMs would have to change how
they use TBMIterateResults anyway. But I assume they could add it to a
table AM specific scan descriptor if they want access to a
TBMIterateResult of their own making in both
table_scan_bitmap_next_block() and next_tuple()?
My table AM does implement those two functions and does use the TBMIterateResult *tbmres argument, yes. I would deal with the issue in very much the same way that your patches modify heapam. I don't really have any additional comments about that.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
Thanks for working on this! While I have some quibbles with details, I think
this is quite a bit of progress in the right direction.
patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
I'd not quite call it a bugfix, it's not like it leads to wrong
behaviour. Seems more like an optimization. But whatever :)
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
That's to be expected, I don't think it's worth worrying about. Right now a
bunch of TAMs can't implement bitmap scans, this goes a fair bit towards
allowing that...
From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup

There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.

The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().

By saving that worker snapshot as a member in the BitmapHeapScanState
during initialization, it can be restored in table_beginscan_bm() after
returning from the table AM specific begin scan function.
I don't understand what the point of passing two different snapshots to
table_beginscan_bm() is. What does that even mean? Why can't we just use the
correct snapshot initially?
From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
 fetch

Previously BitmapPrefetch() used the recheck flag for the current block
to determine whether or not it could skip prefetching the proposed
prefetch block. It makes more sense for it to use the recheck flag from
the TBMIterateResult for the prefetch block instead.
I'd mention the commit that introduced the current logic and link to
the thread that you started about this.
From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber

BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
Hm - I'm not convinced this is a good direction - doesn't that arguably
*increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
heap scans in a TAM without blocks, but still.
From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
table AM

This resolves the long-standing FIXME in BitmapHeapNext() which said that
the optimization to skip fetching blocks of the underlying table when
none of the column data was needed should be pushed into the table AM
specific code.
Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
enough to drink, at least in some countries. :)
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.
---
src/backend/access/heap/heapam.c | 10 +++
src/backend/access/heap/heapam_handler.c | 29 +++++++
src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 17 ++--
src/include/nodes/execnodes.h | 6 --
6 files changed, 74 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..7aae1ecf0a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_flags = flags;
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
These don't follow the existing naming pattern for HeapScanDescData. While I
explicitly dislike the practice of adding prefixes to struct members, I don't
think mixing conventions within a single struct improves things.
I also think it'd be good to note in comments that the vm buffer currently is
only used for bitmap heap scans, otherwise one might think they'd also be used
for normal scans, where we don't need them, because of the page level flag.
Also, perhaps worth renaming "empty_tuples" to something indicating that it's
the number of empty tuples to be returned later? num_empty_tuples_pending or
such? Or the current "return_empty_tuples".
@@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
		ReleaseBuffer(scan->rs_cbuf);

+	if (BufferIsValid(scan->vmbuffer))
+		ReleaseBuffer(scan->vmbuffer);
+	scan->vmbuffer = InvalidBuffer;
It does not matter one iota here, but personally I prefer moving the write
inside the if, as dirtying the cacheline after we just figured out whe
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9372b49bfaa..c0fb06c9688 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	 */
 	if (!node->initialized)
 	{
+		bool		can_skip_fetch;
 		/*
 		 * We can potentially skip fetching heap pages if we do not need any
 		 * columns of the table, either for checking non-indexable quals or
Pretty sure pgindent will move this around.
+++ b/src/include/access/tableam.h
@@ -62,6 +62,7 @@ typedef enum ScanOptions
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	SO_CAN_SKIP_FETCH = 1 << 10,
 } ScanOptions;
Would be nice to add a comment explaining what this flag means.
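Something along these lines, perhaps (the comment wording is my guess, not text from the patch):

```c
/* Sketch of the requested comment on the new ScanOptions flag; only the
 * two flags relevant here are reproduced. */
typedef enum ScanOptions
{
	SO_TEMP_SNAPSHOT = 1 << 9,	/* unregister snapshot at scan end? */

	/*
	 * bitmap heap scan only: ok to skip fetching pages whose tuples are
	 * known all-visible and whose column data is not needed by the query
	 */
	SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
```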
From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
pages

Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam.c | 2 ++
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
src/include/access/relscan.h | 4 ++++
4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7aae1ecf0a9..88b4aad5820 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->vmbuffer = InvalidBuffer;
 	scan->empty_tuples = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.exact_pages = 0;

 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index baba09c87c0..6e85ef7a946 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;

+	/* Only count exact and lossy pages with visible tuples */
+	if (ntup > 0)
+	{
+		if (tbmres->ntuples >= 0)
+			scan->exact_pages++;
+		else
+			scan->lossy_pages++;
+	}
+
 	return ntup > 0;
 }

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c0fb06c9688..19d115de06f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
 #include "utils/spccache.h"

 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+									   TableScanDesc scan);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
 static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
 												BlockNumber blockno);
@@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				continue;
 			}

-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
@@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* if we get here it means we are at the end of the scan..
*/
+ BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
 }

+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+					TableScanDesc scan)
+{
+	node->exact_pages += scan->exact_pages;
+	scan->exact_pages = 0;
+	node->lossy_pages += scan->lossy_pages;
+	scan->lossy_pages = 0;
+}
+
I don't think this is quite right - you're calling BitmapAccumCounters() only
when the scan doesn't return anything anymore, but there's no guarantee
that'll ever be reached. E.g. a bitmap heap scan below a limit node. I think
this needs to be in a) ExecEndBitmapHeapScan() b) ExecReScanBitmapHeapScan()
 /*
  * BitmapDoneInitializingSharedState - Shared state is initialized
  *

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;

+	/* Only used for Bitmap table scans */
+	long		exact_pages;
+	long		lossy_pages;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
I wonder if this really is the best place for the data to be accumulated. This
requires the accounting to be implemented in each AM, which doesn't obviously
seem required. Why can't the accounting continue to live in
nodeBitmapHeapscan.c, to be done after each table_scan_bitmap_next_block()
call?
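Roughly along these lines, i.e. with stand-in types rather than the real executor structs, the caller-side version of the accounting is trivial and AM-independent:

```c
/* Sketch: keep the exact/lossy page counters in the node (caller), updated
 * after each table_scan_bitmap_next_block()-style call, so individual AMs
 * need not implement the accounting.  NodeStats is a stand-in type. */
#include <assert.h>

typedef struct NodeStats
{
	long		exact_pages;
	long		lossy_pages;
} NodeStats;

/* page_ntuples >= 0 means the bitmap page was exact; < 0 means lossy */
static void
accum_page_counters(NodeStats *node, int page_ntuples)
{
	if (page_ntuples >= 0)
		node->exact_pages++;
	else
		node->lossy_pages++;
}
```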
From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly

table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the known end of the table.
This sounds a bit like the block is actually past the end of the table,
but in reality this happens if the block is past the end of the table as it
was when the scan was started. Somehow that feels significant, but I don't
really know why I think that.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 88b4aad5820..d8569373987 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot, scan->empty_tuples = 0; scan->rs_base.lossy_pages = 0; scan->rs_base.exact_pages = 0; + scan->rs_base.shared_tbmiterator = NULL; + scan->rs_base.tbmiterator = NULL;/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
ReleaseBuffer(scan->vmbuffer);
 	scan->vmbuffer = InvalidBuffer;

+	if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		if (scan->rs_base.shared_tbmiterator)
+			tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
+
+		if (scan->rs_base.tbmiterator)
+			tbm_end_iterate(scan->rs_base.tbmiterator);
+	}
+
+	scan->rs_base.shared_tbmiterator = NULL;
+	scan->rs_base.tbmiterator = NULL;
+
 	/*
 	 * reinitialize scan descriptor
 	 */
If every AM would need to implement this, perhaps this shouldn't be done here,
but in generic code?
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,

 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+							  bool *recheck, BlockNumber *blockno)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	BlockNumber block;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
+	TBMIterateResult *tbmres;

 	hscan->rs_cindex = 0;
 	hscan->rs_ntuples = 0;

+	*blockno = InvalidBlockNumber;
+	*recheck = true;
+
+	do
+	{
+		if (scan->shared_tbmiterator)
+			tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+		else
+			tbmres = tbm_iterate(scan->tbmiterator);
+
+		if (tbmres == NULL)
+		{
+			/* no more entries in the bitmap */
+			Assert(hscan->empty_tuples == 0);
+			return false;
+		}
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend). We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+	} while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
Hm. Isn't it a problem that we have no CHECK_FOR_INTERRUPTS() in this loop?
@@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			scan->lossy_pages++;
 	}

-	return ntup > 0;
+	/*
+	 * Return true to indicate that a valid block was found and the bitmap is
+	 * not exhausted. If there are no visible tuples on this page,
+	 * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+	 * return false returning control to this function to advance to the next
+	 * block in the bitmap.
+	 */
+	return true;
 }
Why can't we fetch the next block immediately?
@@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
 	}

-	node->tbmiterator = tbmiterator;
-	node->shared_tbmiterator = shared_tbmiterator;
+	scan->tbmiterator = tbmiterator;
+	scan->shared_tbmiterator = shared_tbmiterator;
It seems a bit odd that this code modifies the scan descriptor, instead of
passing the iterator, or perhaps better the bitmap itself, to
table_beginscan_bm()?
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..bf7ee044268 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@

 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
I'd like to avoid exposing this to everything including relscan.h. I think we
could just forward declare the structs and use them here to avoid that?
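For illustration, the forward-declaration approach would look something like this (names mirror the real ones, but the struct here is a stand-in, and in the real header the members sit inside TableScanDescData):

```c
/* Sketch: pointers to incomplete struct types are enough for the scan
 * descriptor, so the header would not need to include nodes/tidbitmap.h. */
#include <assert.h>
#include <stddef.h>

struct TBMIterator;				/* forward declarations only; the full */
struct TBMSharedIterator;		/* definitions live in tidbitmap.h/.c */

typedef struct TableScanDescStub
{
	struct TBMIterator *tbmiterator;
	struct TBMSharedIterator *shared_tbmiterator;
} TableScanDescStub;
```

This compiles without ever seeing the struct bodies; only code that dereferences the iterators (the AM and tidbitmap.c) needs the real header.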
From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:27:57 -0500
Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API

Remove all of the code to do prefetching from BitmapHeapScan code and
rely on the streaming read API prefetching. Heap table AM implements a
streaming read callback which uses the iterator to get the next valid
block that needs to be fetched for the streaming read API.
---
src/backend/access/gin/ginget.c | 15 +-
src/backend/access/gin/ginscan.c | 7 +
src/backend/access/heap/heapam.c | 71 +++++
src/backend/access/heap/heapam_handler.c | 78 +++--
src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
src/backend/nodes/tidbitmap.c | 80 +++---
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
src/include/nodes/tidbitmap.h | 8 +-
10 files changed, 178 insertions(+), 444 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3ce28078a6f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 	if (entry->matchBitmap)
 	{
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			pfree(entry->matchResult);
+		}
 		entry->matchIterator = NULL;
 		tbm_free(entry->matchBitmap);
 		entry->matchBitmap = NULL;
@@ -386,6 +389,7 @@ restartScanEntry:
 	if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 	{
 		entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+		entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);
Do we actually have to use palloc0? TBM_ITERATE_RESULT_SIZE ain't small, so
zeroing all of it isn't free.
+static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+											   void *per_buffer_data);
Is it correct to have _single in the name here? Aren't we also using for
parallel scans?
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres = per_buffer_data;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		/* no more entries in the bitmap */
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend). We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+
+	/* not reachable */
+	Assert(false);
+}
Need to check for interrupts somewhere here.
@@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
There's still a comment in BitmapHeapNext talking about prefetching with two
iterators etc. That seems outdated now.
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan.  If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
Hm - it seems a tad odd that we later have to find out if the scan is done
iterating by checking if blockno is valid, when tbm_iterate already knew. But
I guess the code would be a bit uglier if we needed the result of
tbm_[shared_]iterate(), due to the two functions.
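The out-parameter convention being discussed can be sketched in a few lines of standalone C. These are toy Mock* types, not the real TBMIterator/TBMIterateResult: the iterator fills a caller-provided result, and exhaustion is signaled by an invalid block number that the caller must check.

```c
#include <assert.h>

#define MOCK_INVALID_BLOCK ((unsigned) 0xFFFFFFFF)

/* Toy stand-ins for TBMIterateResult/TBMIterator, not the real structs. */
typedef struct MockIterateResult
{
    unsigned    blockno;
    int         ntuples;
} MockIterateResult;

typedef struct MockIterator
{
    unsigned    next;
    unsigned    nblocks;
} MockIterator;

/* New-style iterate: fill the caller's result; signal exhaustion via blockno. */
static void
mock_iterate(MockIterator *it, MockIterateResult *res)
{
    if (it->next >= it->nblocks)
    {
        res->blockno = MOCK_INVALID_BLOCK; /* caller must check this */
        return;
    }
    res->blockno = it->next++;
    res->ntuples = 1;
}

/* Caller-side loop: the "is the scan done?" test lives on blockno. */
static unsigned
count_pages(MockIterator *it)
{
    MockIterateResult res;
    unsigned    n = 0;

    for (;;)
    {
        mock_iterate(it, &res);
        if (res.blockno == MOCK_INVALID_BLOCK)
            break;              /* bitmap exhausted */
        n++;
    }
    return n;
}
```

This is exactly the "slightly uglier caller" trade-off noted above: the void-returning iterate functions keep one signature for both the private and shared variants, at the cost of the caller re-deriving the done condition from the filled-in result.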
Right now ExecEndBitmapHeapScan() frees the tbm before it does table_endscan()
- which seems problematic, as heap_endscan() will do stuff like
tbm_end_iterate(), which imo shouldn't be called after the tbm has been freed,
even if that works today.
It seems a bit confusing that your changes seem to treat
BitmapHeapScanState->initialized as separate from ->scan, even though afaict
scan should be NULL iff initialized is false and vice versa.
Independent of your patches, but brr, it's ugly that
BitmapShouldInitializeSharedState() blocks.
Greetings,
Andres Freund
Thank you so much for this thorough review!!!!
On Wed, Feb 14, 2024 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup

There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.

The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().

By saving that worker snapshot as a member in the BitmapHeapScanState
during initialization, it can be restored in table_beginscan_bm() after
returning from the table AM specific begin scan function.

I don't understand what the point of passing two different snapshots to
table_beginscan_bm() is. What does that even mean? Why can't we just use the
correct snapshot initially?
Indeed. Honestly, it was an unlabeled TODO for me. I wasn't quite sure
how to get the same behavior as in master. Fixed in attached v2.
Now the parallel worker still restores and registers that snapshot in
ExecBitmapHeapInitializeWorker() and then saves it in the
BitmapHeapScanState. We then pass SO_TEMP_SNAPSHOT as an extra flag
(to set rs_flags) to table_beginscan_bm() if there is a parallel
worker snapshot saved in the BitmapHeapScanState.
From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
fetch

Previously BitmapPrefetch() used the recheck flag for the current block
to determine whether or not it could skip prefetching the proposed
prefetch block. It makes more sense for it to use the recheck flag from
the TBMIterateResult for the prefetch block instead.

I'd mention the commit that introduced the current logic and link to the
thread that you started about this.
Done
From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber

BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.

Hm - I'm not convinced this is a good direction - doesn't that arguably
*increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
heap scans in a TAM without blocks, but still.
This is removed in later commits and is an intermediate state to try
and move the TBMIterateResult out of BitmapHeapNext(). I can find
another way to achieve this if it is important.
From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
table AM

This resolves the long-standing FIXME in BitmapHeapNext() which said that
the optimization to skip fetching blocks of the underlying table when
none of the column data was needed should be pushed into the table AM
specific code.

Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
enough to drink, at least in some countries. :)
;) I've updated the commit message. Though it is longstanding in that
it predates Melanie + Postgres.
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.

---
src/backend/access/heap/heapam.c | 10 +++
src/backend/access/heap/heapam_handler.c | 29 +++++++
src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 17 ++--
src/include/nodes/execnodes.h | 6 --
6 files changed, 74 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..7aae1ecf0a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
     scan->rs_base.rs_flags = flags;
     scan->rs_base.rs_parallel = parallel_scan;
     scan->rs_strategy = NULL;   /* set in initscan */
+    scan->vmbuffer = InvalidBuffer;
+    scan->empty_tuples = 0;

These don't follow the existing naming pattern for HeapScanDescData. While I
explicitly dislike the practice of adding prefixes to struct members, I don't
think mixing conventions within a single struct improves things.
I've updated the names. What does rs even stand for?
I also think it'd be good to note in comments that the vm buffer currently is
only used for bitmap heap scans, otherwise one might think they'd also be used
for normal scans, where we don't need them, because of the page level flag.
Done.
Also, perhaps worth renaming "empty_tuples" to something indicating that it's
the number of empty tuples to be returned later? num_empty_tuples_pending or
such? Or the current "return_empty_tuples".
Done.
@@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
        ReleaseBuffer(scan->rs_cbuf);

+    if (BufferIsValid(scan->vmbuffer))
+        ReleaseBuffer(scan->vmbuffer);
+    scan->vmbuffer = InvalidBuffer;

It does not matter one iota here, but personally I prefer moving the write
inside the if, as dirtying the cacheline after we just figured out whe
I've now followed this convention throughout my patchset in the places
where I noticed it.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9372b49bfaa..c0fb06c9688 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
      */
     if (!node->initialized)
     {
+        bool        can_skip_fetch;

         /*
          * We can potentially skip fetching heap pages if we do not need any
          * columns of the table, either for checking non-indexable quals or

Pretty sure pgindent will move this around.
This is gone now, but I have pgindented all the commits so it
shouldn't be a problem again.
+++ b/src/include/access/tableam.h
@@ -62,6 +62,7 @@ typedef enum ScanOptions

     /* unregister snapshot at scan end? */
     SO_TEMP_SNAPSHOT = 1 << 9,
+    SO_CAN_SKIP_FETCH = 1 << 10,
 } ScanOptions;

Would be nice to add a comment explaining what this flag means.
Done.
From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
pages

Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.

---
src/backend/access/heap/heapam.c | 2 ++
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
src/include/access/relscan.h | 4 ++++
4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7aae1ecf0a9..88b4aad5820 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
     scan->rs_strategy = NULL;   /* set in initscan */
     scan->vmbuffer = InvalidBuffer;
     scan->empty_tuples = 0;
+    scan->rs_base.lossy_pages = 0;
+    scan->rs_base.exact_pages = 0;

     /*
      * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index baba09c87c0..6e85ef7a946 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
     Assert(ntup <= MaxHeapTuplesPerPage);
     hscan->rs_ntuples = ntup;

+    /* Only count exact and lossy pages with visible tuples */
+    if (ntup > 0)
+    {
+        if (tbmres->ntuples >= 0)
+            scan->exact_pages++;
+        else
+            scan->lossy_pages++;
+    }
+
     return ntup > 0;
 }

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c0fb06c9688..19d115de06f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
 #include "utils/spccache.h"

 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+                                       TableScanDesc scan);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
 static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
                                                 BlockNumber blockno);
@@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
                 continue;
             }

-            if (tbmres->ntuples >= 0)
-                node->exact_pages++;
-            else
-                node->lossy_pages++;
-
             /* Adjust the prefetch target */
             BitmapAdjustPrefetchTarget(node);
         }
@@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
     /*
      * if we get here it means we are at the end of the scan..
      */
+    BitmapAccumCounters(node, scan);
     return ExecClearTuple(slot);
 }

+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+                    TableScanDesc scan)
+{
+    node->exact_pages += scan->exact_pages;
+    scan->exact_pages = 0;
+    node->lossy_pages += scan->lossy_pages;
+    scan->lossy_pages = 0;
+}
+

I don't think this is quite right - you're calling BitmapAccumCounters() only
when the scan doesn't return anything anymore, but there's no guarantee
that'll ever be reached. E.g. a bitmap heap scan below a limit node. I think
this needs to be in a) ExecEndBitmapHeapScan() b) ExecReScanBitmapHeapScan()
The scan descriptor isn't available in ExecEnd/ReScanBitmapHeapScan().
So, if we count in the scan descriptor we can't accumulate into the
BitmapHeapScanState there. The reason to count in the scan descriptor
is that it is in the table AM where we know if we have a lossy or
exact page -- and we only have the scan descriptor not the
BitmapHeapScanState in the table AM.
I added a call to BitmapAccumCounters before the tuple is returned for
correctness in this version (not ideal, I realize). See below for
thoughts about what we could do instead.
 /*
  * BitmapDoneInitializingSharedState - Shared state is initialized
  *

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
     ItemPointerData rs_mintid;
     ItemPointerData rs_maxtid;

+    /* Only used for Bitmap table scans */
+    long        exact_pages;
+    long        lossy_pages;
+
     /*
      * Information about type and behaviour of the scan, a bitmask of members
      * of the ScanOptions enum (see tableam.h).

I wonder if this really is the best place for the data to be accumulated. This
requires the accounting to be implemented in each AM, which doesn't obviously
seem required. Why can't the accounting continue to live in
nodeBitmapHeapscan.c, to be done after each table_scan_bitmap_next_block()
call?
Yes, I would really prefer not to do it in the table AM. But, we only
count exact and lossy pages for which at least one or more tuples were
visible (change this and you'll see tests fail). So, we need to decide
if we are going to increment the counters somewhere where we have
access to that information. In the case of heap, that is really only
once I have the value of ntup in heapam_scan_bitmap_next_block(). To
get that information back out to BitmapHeapNext(), I considered adding
another parameter to heapam_scan_bitmap_next_block() -- maybe an enum
like this:
/*
* BitmapHeapScans's bitmaps can choose to store per page information in a
* lossy or exact way. Exact pages in the bitmap have the individual tuple
* offsets that need to be visited while lossy pages in the bitmap have only the
* block number of the page.
*/
typedef enum BitmapBlockResolution
{
    BITMAP_BLOCK_NO_VISIBLE,
    BITMAP_BLOCK_LOSSY,
    BITMAP_BLOCK_EXACT,
} BitmapBlockResolution;
which we then use to increment the counter. But while I was writing
this code, I found myself narrating in the comment that the reason
this had to be set inside of the table AM is that only the table AM
knows if it wants to count the block as lossy, exact, or not count it.
So, that made me question if it really should be in the
BitmapHeapScanState.
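To make the trade-off concrete, here is a standalone sketch of how such an enum could carry the table AM's per-page knowledge out to the executor. The names mirror the hypothetical BitmapBlockResolution above; the two functions are mocks, not the actual heapam/executor code.

```c
#include <assert.h>

/* Mirrors the enum sketched above (hypothetical, not in the tree). */
typedef enum BitmapBlockResolution
{
    BITMAP_BLOCK_NO_VISIBLE,
    BITMAP_BLOCK_LOSSY,
    BITMAP_BLOCK_EXACT,
} BitmapBlockResolution;

/*
 * What only the table AM can compute: whether the page had any visible
 * tuples (ntup) and whether the bitmap entry was lossy (tbm_ntuples < 0).
 */
static BitmapBlockResolution
classify_block(int ntup, int tbm_ntuples)
{
    if (ntup == 0)
        return BITMAP_BLOCK_NO_VISIBLE;
    return (tbm_ntuples < 0) ? BITMAP_BLOCK_LOSSY : BITMAP_BLOCK_EXACT;
}

/* Executor-side accounting, given the AM's answer. */
static void
accum(BitmapBlockResolution res, long *exact_pages, long *lossy_pages)
{
    if (res == BITMAP_BLOCK_EXACT)
        (*exact_pages)++;
    else if (res == BITMAP_BLOCK_LOSSY)
        (*lossy_pages)++;
    /* BITMAP_BLOCK_NO_VISIBLE is deliberately not counted */
}
```

The point of contention is visible in classify_block(): the ntup test is AM-internal knowledge, so either the counters live in the scan descriptor (as in the patch) or an extra out parameter like this has to cross the table AM boundary.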
I also explored passing the table scan descriptor to
show_tidbitmap_info() -- but that had its own problems.
From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly

table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the known end of the table.

This sounds a bit like the block is actually past the end of the table,
but in reality this happens if the block is past the end of the table as it
was when the scan was started. Somehow that feels significant, but I don't
really know why I think that.
I have tried to update the commit message to make it clearer. I was
actually wondering: now that we do table_beginscan_bm() in
BitmapHeapNext() instead of ExecInitBitmapHeapScan(), have we reduced
or eliminated the opportunity for this to be true? initscan() sets
rs_nblocks and that now happens much later.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 88b4aad5820..d8569373987 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot, scan->empty_tuples = 0; scan->rs_base.lossy_pages = 0; scan->rs_base.exact_pages = 0; + scan->rs_base.shared_tbmiterator = NULL; + scan->rs_base.tbmiterator = NULL;/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
ReleaseBuffer(scan->vmbuffer);
    scan->vmbuffer = InvalidBuffer;

+    if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+    {
+        if (scan->rs_base.shared_tbmiterator)
+            tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
+
+        if (scan->rs_base.tbmiterator)
+            tbm_end_iterate(scan->rs_base.tbmiterator);
+    }
+
+    scan->rs_base.shared_tbmiterator = NULL;
+    scan->rs_base.tbmiterator = NULL;
+
     /*
      * reinitialize scan descriptor
      */

If every AM would need to implement this, perhaps this shouldn't be done here,
but in generic code?
Fixed.
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,

 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-                              TBMIterateResult *tbmres)
+                              bool *recheck, BlockNumber *blockno)
 {
     HeapScanDesc hscan = (HeapScanDesc) scan;
-    BlockNumber block = tbmres->blockno;
+    BlockNumber block;
     Buffer      buffer;
     Snapshot    snapshot;
     int         ntup;
+    TBMIterateResult *tbmres;

     hscan->rs_cindex = 0;
     hscan->rs_ntuples = 0;

+    *blockno = InvalidBlockNumber;
+    *recheck = true;
+
+    do
+    {
+        if (scan->shared_tbmiterator)
+            tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+        else
+            tbmres = tbm_iterate(scan->tbmiterator);
+
+        if (tbmres == NULL)
+        {
+            /* no more entries in the bitmap */
+            Assert(hscan->empty_tuples == 0);
+            return false;
+        }
+
+        /*
+         * Ignore any claimed entries past what we think is the end of the
+         * relation. It may have been extended after the start of our scan (we
+         * only hold an AccessShareLock, and it could be inserts from this
+         * backend). We don't take this optimization in SERIALIZABLE
+         * isolation though, as we need to examine all invisible tuples
+         * reachable by the index.
+         */
+    } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);

Hm. Isn't it a problem that we have no CHECK_FOR_INTERRUPTS() in this loop?
Yes. fixed.
@@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
    }

-    return ntup > 0;
+    /*
+     * Return true to indicate that a valid block was found and the bitmap is
+     * not exhausted. If there are no visible tuples on this page,
+     * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+     * return false returning control to this function to advance to the next
+     * block in the bitmap.
+     */
+    return true;
 }

Why can't we fetch the next block immediately?
We don't know that we want another block until we've gone through this
page and seen there were no visible tuples, so we'd somehow have to
jump back up to the top of the function to get the next block -- which
is basically what is happening in my revised control flow. We call
heapam_scan_bitmap_next_tuple() and rs_ntuples is 0, so we end up
calling heapam_scan_bitmap_next_block() right away.
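That revised control flow can be modeled in a standalone mock (toy types and tuple counts, nothing from the real table AM interface): next_block() returns false only when the bitmap is exhausted, and a page with zero visible tuples simply causes an immediate loop back to next_block().

```c
#include <assert.h>
#include <stdbool.h>

/* Per-page visible-tuple counts for a toy scan; 0 models an empty page. */
static const int page_ntuples[] = {2, 0, 3, 0, 1};
#define NPAGES 5

typedef struct MockScan
{
    int         cur_page;       /* -1 before the first next_block() call */
    int         cur_tuple;      /* index of next tuple on the current page */
} MockScan;

/* Returns false only when the bitmap is exhausted, even for empty pages. */
static bool
mock_next_block(MockScan *scan)
{
    if (scan->cur_page + 1 >= NPAGES)
        return false;
    scan->cur_page++;
    scan->cur_tuple = 0;
    return true;
}

/* Returns false when the current page has no more visible tuples. */
static bool
mock_next_tuple(MockScan *scan)
{
    if (scan->cur_tuple >= page_ntuples[scan->cur_page])
        return false;
    scan->cur_tuple++;
    return true;
}

/* The driving loop shaped like the revised BitmapHeapNext(). */
static int
drive_scan(void)
{
    MockScan    scan = {-1, 0};
    int         returned = 0;
    bool        have_block = false;

    for (;;)
    {
        if (!have_block)
        {
            if (!mock_next_block(&scan))
                break;          /* bitmap exhausted: scan is done */
            have_block = true;
        }
        if (!mock_next_tuple(&scan))
        {
            /* page exhausted (possibly zero visible tuples): next block */
            have_block = false;
            continue;
        }
        returned++;             /* would return the tuple here */
    }
    return returned;
}
```

The empty pages (ntuples == 0) never surface to the caller; they just cost one extra trip around the loop, which is the "end up calling next_block() right away" behavior described above.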
@@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
-    node->tbmiterator = tbmiterator;
-    node->shared_tbmiterator = shared_tbmiterator;
+    scan->tbmiterator = tbmiterator;
+    scan->shared_tbmiterator = shared_tbmiterator;

It seems a bit odd that this code modifies the scan descriptor, instead of
passing the iterator, or perhaps better the bitmap itself, to
table_beginscan_bm()?
On rescan we actually will have initialized = false and make new
iterators but have the old scan descriptor. So, we need to be able to
set the iterator in the scan to the new iterator.
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..bf7ee044268 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@

 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"

I'd like to avoid exposing this to everything including relscan.h. I think we
could just forward declare the structs and use them here to avoid that?
Done
From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:27:57 -0500
Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API

Remove all of the code to do prefetching from BitmapHeapScan code and
rely on the streaming read API prefetching. Heap table AM implements a
streaming read callback which uses the iterator to get the next valid
block that needs to be fetched for the streaming read API.
---
src/backend/access/gin/ginget.c | 15 +-
src/backend/access/gin/ginscan.c | 7 +
src/backend/access/heap/heapam.c | 71 +++++
src/backend/access/heap/heapam_handler.c | 78 +++--
src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
src/backend/nodes/tidbitmap.c | 80 +++---
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
src/include/nodes/tidbitmap.h | 8 +-
10 files changed, 178 insertions(+), 444 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3ce28078a6f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
     if (entry->matchBitmap)
     {
         if (entry->matchIterator)
+        {
             tbm_end_iterate(entry->matchIterator);
+            pfree(entry->matchResult);
+        }
         entry->matchIterator = NULL;
         tbm_free(entry->matchBitmap);
         entry->matchBitmap = NULL;
@@ -386,6 +389,7 @@ restartScanEntry:
     if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
     {
         entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+        entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);

Do we actually have to use palloc0? TBM_ITERATE_RESULT_SIZE ain't small, so
zeroing all of it isn't free.
Tests actually did fail when I didn't use palloc0.
This code is different now though. There are a few new patches in v2
that 1) make the offsets array in the TBMIterateResult fixed size and
then this makes it possible to 2) make matchResult an inline member of
the GinScanEntry. I have a TODO in the code asking if setting blockno
in the TBMIterateResult to InvalidBlockNumber is sufficient
"resetting".
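The "reset by marking blockno invalid" idea behind that TODO can be sketched with mock types (not the real TBMIterateResult): once every reader checks blockno before touching anything else, resetting is a single store, and the fixed-size offsets array never needs to be zeroed.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_INVALID_BLOCK ((unsigned) 0xFFFFFFFF)
#define MOCK_MAX_OFFSETS 4

/* Toy inline result with a fixed-size offsets array, as in the v2 patches. */
typedef struct MockResult
{
    unsigned        blockno;
    int             ntuples;
    unsigned short  offsets[MOCK_MAX_OFFSETS];
} MockResult;

/*
 * "Reset" is just marking the block invalid; stale ntuples/offsets are
 * harmless because every reader must check mock_is_valid() first.
 */
static void
mock_reset(MockResult *res)
{
    res->blockno = MOCK_INVALID_BLOCK;
}

static bool
mock_is_valid(const MockResult *res)
{
    return res->blockno != MOCK_INVALID_BLOCK;
}
```

This only works if the "check blockno first" discipline holds everywhere the result is consumed, which is presumably what the TODO is asking to be verified.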
+static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr,
+                                               void *pgsr_private,
+                                               void *per_buffer_data);

Is it correct to have _single in the name here? Aren't we also using it for
parallel scans?
Right. I had a separate parallel version and then deleted it. This is now fixed.
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+                            void *per_buffer_data)
+{
+    TBMIterateResult *tbmres = per_buffer_data;
+    HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+    for (;;)
+    {
+        if (hdesc->rs_base.shared_tbmiterator)
+            tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+        else
+            tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+        /* no more entries in the bitmap */
+        if (!BlockNumberIsValid(tbmres->blockno))
+            return InvalidBlockNumber;
+
+        /*
+         * Ignore any claimed entries past what we think is the end of the
+         * relation. It may have been extended after the start of our scan (we
+         * only hold an AccessShareLock, and it could be inserts from this
+         * backend). We don't take this optimization in SERIALIZABLE
+         * isolation though, as we need to examine all invisible tuples
+         * reachable by the index.
+         */
+        if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+            continue;
+
+        if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+            !tbmres->recheck &&
+            VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+        {
+            hdesc->empty_tuples += tbmres->ntuples;
+            continue;
+        }
+
+        return tbmres->blockno;
+    }
+
+    /* not reachable */
+    Assert(false);
+}

Need to check for interrupts somewhere here.
Done.
@@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
There's still a comment in BitmapHeapNext talking about prefetching with two
iterators etc. That seems outdated now.
Fixed.
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan.  If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)

Hm - it seems a tad odd that we later have to find out if the scan is done
iterating by checking if blockno is valid, when tbm_iterate already knew. But
I guess the code would be a bit uglier if we needed the result of
tbm_[shared_]iterate(), due to the two functions.
Yes.
Right now ExecEndBitmapHeapScan() frees the tbm before it does table_endscan()
- which seems problematic, as heap_endscan() will do stuff like
tbm_end_iterate(), which imo shouldn't be called after the tbm has been freed,
even if that works today.
I've flipped the order -- I end the scan then free the bitmap.
It seems a bit confusing that your changes seem to treat
BitmapHeapScanState->initialized as separate from ->scan, even though afaict
scan should be NULL iff initialized is false and vice versa.
I thought so too, but it seems on rescan that the node->initialized is
set to false but the scan is reused. So, we want to only make a new
scan descriptor if it is truly the beginning of a new scan.
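A minimal mock of that initialization rule (hypothetical names, not the executor code): the scan descriptor outlives rescans, while initialized only tracks whether the current scan has started, so the two are not redundant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy scan descriptor standing in for TableScanDesc. */
typedef struct MockScanDesc
{
    int         begun;
    int         rescans;
} MockScanDesc;

typedef struct MockNode
{
    bool        initialized;    /* reset to false by rescan */
    MockScanDesc *scan;         /* survives rescans */
} MockNode;

static MockScanDesc *
mock_beginscan(void)
{
    MockScanDesc *s = calloc(1, sizeof(MockScanDesc));

    s->begun = 1;
    return s;
}

static void
mock_rescan(MockScanDesc *s)
{
    s->rescans++;               /* reuse descriptor, rebuild iterators */
}

/* First fetch after init or rescan: reuse the descriptor when it exists. */
static void
mock_first_fetch(MockNode *node)
{
    if (!node->initialized)
    {
        if (node->scan == NULL)
            node->scan = mock_beginscan();  /* truly the first scan */
        else
            mock_rescan(node->scan);        /* rescan: keep descriptor */
        node->initialized = true;
    }
}
```

So initialized == false with scan != NULL is exactly the rescan case, which is why the two cannot be collapsed into a single NULL check.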
- Melanie
Attachments:
v2-0011-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (+107/-88)
v2-0012-Streaming-Read-API.patch (+986/-239)
v2-0010-Hard-code-TBMIterateResult-offsets-array-size.patch (+16/-24)
v2-0013-BitmapHeapScan-uses-streaming-read-API.patch (+117/-420)
v2-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (+9/-12)
v2-0009-Make-table_scan_bitmap_next_block-async-friendly.patch (+145/-125)
v2-0003-BitmapHeapScan-begin-scan-after-bitmap-setup.patch (+38/-29)
v2-0001-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (+1/-10)
v2-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (+2/-3)
v2-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (+11/-11)
v2-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (+4/-5)
v2-0007-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch (+32/-7)
v2-0006-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (+95/-98)
In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.
I've also made some updates to the TIDBitmap API that seem like a
clarity improvement to the API in general. These also reduce the diff
for GIN when separating the TBMIterateResult from the
TBM[Shared]Iterator. And these TIDBitmap API changes are now all in
their own commits (previously those were in the same commit as adding
the BitmapHeapScan streaming read user).
The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous
email
2) the TODO in the TIDBitmap API changes about being sure that setting
TBMIterateResult->blockno to InvalidBlockNumber is sufficient for
indicating an invalid TBMIterateResult (and an exhausted bitmap)
3) the streaming read API is not committed yet, so the last two patches
are not "done"
- Melanie
Attachments:
v3-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (+42/-29)
v3-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (+11/-11)
v3-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (+94/-93)
v3-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (+2/-9)
v3-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (+4/-5)
v3-0006-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch (+32/-7)
v3-0007-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (+9/-12)
v3-0008-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (+2/-14)
v3-0009-Make-table_scan_bitmap_next_block-async-friendly.patch (+150/-129)
v3-0010-Hard-code-TBMIterateResult-offsets-array-size.patch (+17/-25)
v3-0011-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (+107/-88)
v3-0012-Streaming-Read-API.patch (+986/-239)
v3-0013-BitmapHeapScan-uses-streaming-read-API.patch (+117/-422)
On Fri, Feb 16, 2024 at 12:35:59PM -0500, Melanie Plageman wrote:
In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.

I've also made some updates to the TIDBitmap API that seem like a
clarity improvement to the API in general. These also reduce the diff
for GIN when separating the TBMIterateResult from the
TBM[Shared]Iterator. And these TIDBitmap API changes are now all in
their own commits (previously those were in the same commit as adding
the BitmapHeapScan streaming read user).The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous
I've resolved this. I added a new patch to the set which starts counting
even pages with no visible tuples toward lossy and exact pages. After an
off-list conversation with Andres, it seems that this omission in master
may not have been intentional.
Once we have only two types of pages to differentiate between (lossy and
exact [no longer have to care about "has no visible tuples"]), it is
easy enough to pass a "lossy" boolean parameter to
table_scan_bitmap_next_block(). I've done this in the attached v4.
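A standalone mock of that v4 shape (toy functions, not the real table AM API): next_block() reports via an out parameter whether the bitmap entry for the returned page was lossy, and the caller does all the counting, so the accounting stays out of the table AM.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy next_block: tbm_ntuples[i] < 0 models a lossy bitmap entry, mirroring
 * the TBMIterateResult convention.  Returns false when the bitmap is
 * exhausted; otherwise reports lossiness through the out parameter.
 */
static bool
mock_next_block(const int *tbm_ntuples, int npages, int *pos, bool *lossy)
{
    if (*pos >= npages)
        return false;           /* bitmap exhausted */
    *lossy = (tbm_ntuples[*pos] < 0);
    (*pos)++;
    return true;
}

/* Caller-side counting: every returned page counts, visible tuples or not. */
static void
count_pages(const int *tbm_ntuples, int npages,
            long *exact_pages, long *lossy_pages)
{
    int         pos = 0;
    bool        lossy;

    while (mock_next_block(tbm_ntuples, npages, &pos, &lossy))
    {
        if (lossy)
            (*lossy_pages)++;
        else
            (*exact_pages)++;
    }
}
```

Counting every page (rather than only those with visible tuples) is what makes the single boolean sufficient; with the old three-way distinction an enum-style answer would have been needed instead.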
- Melanie
Attachments:
v4-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (+42/-29)
v4-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (+11/-11)
v4-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (+94/-93)
v4-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (+2/-9)
v4-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (+4/-5)
v4-0006-EXPLAIN-Bitmap-table-scan-also-count-no-visible-t.patch (+13/-7)
v4-0007-table_scan_bitmap_next_block-returns-lossy-or-exa.patch (+19/-11)
v4-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (+9/-12)
v4-0009-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (+2/-14)
v4-0010-Make-table_scan_bitmap_next_block-async-friendly.patch (+168/-143)
v4-0011-Hard-code-TBMIterateResult-offsets-array-size.patch (+17/-25)
v4-0012-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (+107/-88)
v4-0013-Streaming-Read-API.patch (+986/-239)
v4-0014-BitmapHeapScan-uses-streaming-read-API.patch (+117/-421)
On Mon, Feb 26, 2024 at 08:50:28PM -0500, Melanie Plageman wrote:
> [...]
Thomas posted a new version of the Streaming Read API [1], so here is a
rebased v5. This should make it easier to review as it can be applied on
top of master.
- Melanie
[1]: /messages/by-id/CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com
Attachments:
v5-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (+42/-29)
v5-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (+11/-11)
v5-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (+94/-93)
v5-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (+2/-9)
v5-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (+4/-5)
v5-0006-EXPLAIN-Bitmap-table-scan-also-count-no-visible-t.patch (+13/-7)
v5-0007-table_scan_bitmap_next_block-returns-lossy-or-exa.patch (+19/-11)
v5-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (+9/-12)
v5-0009-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (+2/-14)
v5-0010-Make-table_scan_bitmap_next_block-async-friendly.patch (+168/-143)
v5-0011-Hard-code-TBMIterateResult-offsets-array-size.patch (+17/-25)
v5-0012-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (+107/-88)
v5-0013-Streaming-Read-API.patch (+1218/-212)
v5-0014-BitmapHeapScan-uses-streaming-read-API.patch (+117/-421)
Hi,
I haven't looked at the code very closely yet, but I decided to do some
basic benchmarks to see if/how this refactoring affects behavior.
Attached is a simple .sh script that
1) creates a table with one of a couple basic data distributions
(uniform, linear, ...), with an index on top
2) runs a simple query with a where condition matching a known fraction
of the table (0 - 100%), and measures duration
3) the query is forced to use bitmapscan by disabling other options
4) there's a couple parameters the script varies (work_mem, parallel
workers, ...), the script drops caches etc.
5) I only have results for a table with 1M rows, which is ~320MB, so not
huge. I'm running this for larger data sets, but that will take time.
I did this on my two "usual" machines - i5 and xeon. Both have flash
storage, although i5 is SATA and xeon has NVMe. I won't share the raw
results, because the CSV is like 5MB - ping me off-list if you need the
file, ofc.
Attached is PDF summarizing the results as a pivot table, with results
for "master" and "patched" builds. The interesting bit is the last
column, which shows whether the patch makes it faster (green) or slower
(red).
The results seem pretty mixed, on both machines. If you focus on the
uncached results (pages 4 and 8-9), there's both runs that are much
faster (by a factor of 2-5x) and slower (similar factor).
Of course, these results are with forced bitmap scans, so the question
is if those regressions even matter - maybe we'd use a different scan
type, making these changes less severe. So I logged "optimal plan" for
each run, tracking the scan type the optimizer would really pick without
all the enable_* GUCs. And the -optimal.pdf shows only results for the
runs where the optimal plan uses the bitmap scan. And yes, while the
impact of the changes (in either direction) is reduced, it's still very
much there.
What's a bit surprising to me is that these regressions affect runs with
effective_io_concurrency=0 in particular, which traditionally meant to
not do any prefetching / async stuff. I've perceived the patch mostly as
refactoring, so have not really expected such massive impact on these cases.
So I wonder if the refactoring means that we're actually doing some
amount of prefetching even with e_i_c=0. I'm not sure that'd be great; I
assume people have valid reasons to disable prefetching ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 28, 2024 at 8:22 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> [...]
I haven't looked at your results in detail yet. I plan to dig into
this more later today. But, I was wondering if it was easy for you to
run the shorter tests on just the commits before the last one:
https://github.com/melanieplageman/postgres/tree/bhs_pgsr
i.e. patches 0001-0013. Patch 0014 implements the streaming read user
and removes all of the existing prefetch code. I would be interested
to know if the behavior with just the preliminary refactoring differs
at all.
- Melanie
On 2/28/24 15:38, Melanie Plageman wrote:
> [...]
Sure, I can do that. It'll take a couple hours to get the results, I'll
share them when I have them.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/28/24 15:56, Tomas Vondra wrote:
> ...
> Sure, I can do that. It'll take a couple hours to get the results, I'll
> share them when I have them.
Here are the results with only patches 0001 - 0012 applied (i.e. without
the patch introducing the streaming read API, and the patch switching
the bitmap heap scan to use it).
The changes in performance don't disappear entirely, but the scale is
certainly much smaller - both in the complete results for all runs, and
for the "optimal" runs that would actually pick bitmapscan.
FWIW I'm not implying the patch must 100% maintain the current behavior,
or anything like that. At this point I'm more motivated to understand if
this change in behavior is expected and/or what this means for users.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> [...]
>
> Here are the results with only patches 0001 - 0012 applied (i.e. without
> the patch introducing the streaming read API, and the patch switching
> the bitmap heap scan to use it).
>
> The changes in performance don't disappear entirely, but the scale is
> certainly much smaller - both in the complete results for all runs, and
> for the "optimal" runs that would actually pick bitmapscan.
Hmm. I'm trying to think how my refactor could have had this impact.
It seems like all the most notable regressions are with 4 parallel
workers. What do the numeric column labels mean across the top
(2,4,8,16...) -- are they related to "matches"? And if so, what does
that mean?
- Melanie
On 2/28/24 21:06, Melanie Plageman wrote:
> [...]
>
> What do the numeric column labels mean across the top
> (2,4,8,16...) -- are they related to "matches"? And if so, what does
> that mean?
That's the number of distinct values matched by the query, which should
be an approximation of the number of matching rows. The number of
distinct values in the data set differs by data set, but for 1M rows
it's roughly like this:
uniform: 10k
linear: 10k
cyclic: 100
So for example matches=128 means ~1% of rows for uniform/linear, and
100% for cyclic data sets.
As for the possible cause, I think it's clear most of the difference
comes from the last patch that actually switches bitmap heap scan to the
streaming read API. That's mostly expected/understandable, although we
probably need to look into the regressions or cases with e_i_c=0.
To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for
individual patches. I can try doing that tomorrow. It'll have to be a
limited set of tests, to reduce the time, but might tell us whether it's
due to a single patch or multiple patches.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> [...]
>
> That's the number of distinct values matched by the query, which should
> be an approximation of the number of matching rows. The number of
> distinct values in the data set differs by data set, but for 1M rows
> it's roughly like this:
>
> uniform: 10k
> linear: 10k
> cyclic: 100
>
> So for example matches=128 means ~1% of rows for uniform/linear, and
> 100% for cyclic data sets.
Ah, thank you for the explanation. I also looked at your script after
having sent this email and saw that it is clear in your script what
"matches" is.
> As for the possible cause, I think it's clear most of the difference
> comes from the last patch that actually switches bitmap heap scan to the
> streaming read API. That's mostly expected/understandable, although we
> probably need to look into the regressions or cases with e_i_c=0.
Right, I'm mostly surprised about the regressions for patches 0001-0012.
Re eic 0: Thomas Munro and I chatted off-list, and you bring up a
great point about eic 0. In the old bitmapheapscan code, eic 0 basically
disabled prefetching, but with the streaming read API it will still
issue fadvises when eic is 0. That is an easy one-line fix. Thomas
prefers to fix it by always avoiding an fadvise for the last buffer in
a range before issuing a read (since we are about to read it anyway,
it's best not to fadvise it too). This will fix eic 0 and also cut one
system call from each invocation of the streaming read machinery.
> To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for
> individual patches. I can try doing that tomorrow. It'll have to be a
> limited set of tests, to reduce the time, but might tell us whether it's
> due to a single patch or multiple patches.
Yes, tomorrow I planned to start trying to repro some of the "red"
cases myself. Any one of the commits could cause a slight regression
but a 3.5x regression is quite surprising, so I might focus on trying
to repro that locally and then narrow down which patch causes it.
For the non-cached regressions, perhaps the commit to use the correct
recheck flag (0004) when prefetching could be the culprit. And for the
cached regressions, my money is on the commit which changes the whole
control flow of BitmapHeapNext() and the next_block() and next_tuple()
functions (0010).
- Melanie
On 2/29/24 00:40, Melanie Plageman wrote:
> [...]
>
> For the non-cached regressions, perhaps the commit to use the correct
> recheck flag (0004) when prefetching could be the culprit. And for the
> cached regressions, my money is on the commit which changes the whole
> control flow of BitmapHeapNext() and the next_block() and next_tuple()
> functions (0010).
I do have some partial results, comparing the patches. I only ran one of
the more affected workloads (cyclic) on the xeon, attached is a PDF
comparing master and the 0001-0014 patches. The percentages are timing
vs. the preceding patch (green - faster, red - slower).
This suggests only patches 0010 and 0014 affect performance, the rest is
just noise. I'll see if I can do more runs and get data from the other
machine (seems it's more significant on old SATA SSDs).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> [...]
>
> I do have some partial results, comparing the patches. I only ran one of
> the more affected workloads (cyclic) on the xeon, attached is a PDF
> comparing master and the 0001-0014 patches. The percentages are timing
> vs. the preceding patch (green - faster, red - slower).
Just confirming: the results are for uncached?
- Melanie
On 2/29/24 22:19, Melanie Plageman wrote:
> [...]
>
> Just confirming: the results are for uncached?
Yes, cyclic data set, uncached case. I picked this because it seemed
like one of the most affected cases. Do you want me to test some other
cases too?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/29/24 23:44, Tomas Vondra wrote:
> ...
>
> > Just confirming: the results are for uncached?
>
> Yes, cyclic data set, uncached case. I picked this because it seemed
> like one of the most affected cases. Do you want me to test some other
> cases too?
BTW I decided to look at the data from a slightly different angle and
compare the behavior with increasing effective_io_concurrency. Attached
are charts for three "uncached" cases:
* uniform, work_mem=4MB, workers_per_gather=0
* linear-fuzz, work_mem=4MB, workers_per_gather=0
* uniform, work_mem=4MB, workers_per_gather=4
Each page has charts for master and patched build (with all patches). I
think there's a pretty obvious difference in how increasing e_i_c
affects the two builds:
1) On master there's a clear difference between the eic=0 and eic=1
cases, but on the patched build there's literally no difference - for
example the "uniform" distribution is clearly not great for prefetching,
yet with the patches eic=0 regresses to the same poor behavior as eic=1.
Note: This is where the "red bands" in the charts come from.
2) For some reason, prefetching with eic>1 performs much better with
the patches, except at very low selectivity values (close to 0%).
Not sure why this is happening - either the overhead is much lower
(which would matter on these "adversarial" data distributions, but how
could that be when fadvise is not free), or it ends up not doing any
prefetching (but then what about (1)?).
3) I'm not sure about the linear-fuzz case; the only explanation I have
is that we're able to skip almost all of the prefetches (and read-ahead
likely works pretty well here).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company