BitmapHeapScan streaming read user and prelim refactoring
Hi,
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
patches 0001-0002 are assorted cleanup needed later in the set.
patch 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix; see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.
patches 0001-0009 apply on top of master, but 0010 and 0011 must be
applied on top of a commit before 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
A TBMIterateResult used to be threaded through both of these functions
and used in BitmapHeapNext(). This patch set removes all references to
TBMIterateResult from BitmapHeapNext(). Because the streaming read API
requires the callback to specify the next block, BitmapHeapNext() can
no longer pass a TBMIterateResult to table_scan_bitmap_next_block().
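To illustrate the control flow constraint, here is a rough sketch of the
kind of block-number callback the streaming read API expects, modeled on
the pg_prewarm example in attached patch 0010. The callback name and the
use of the scan descriptor as the private state are hypothetical; the
real BitmapHeapScan callback is in patch 0011.

static BlockNumber
bhs_streaming_read_next(PgStreamingRead *pgsr,
						void *pgsr_private,
						void *per_buffer_data)
{
	/* Hypothetical sketch only; not taken verbatim from the patch set. */
	TableScanDesc scan = pgsr_private;
	TBMIterateResult *tbmres;

	/* The callback, not BitmapHeapNext(), advances the bitmap iterator. */
	if (scan->shared_tbmiterator)
		tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
	else
		tbmres = tbm_iterate(scan->tbmiterator);

	/* Returning InvalidBlockNumber ends the stream. */
	if (tbmres == NULL)
		return InvalidBlockNumber;

	return tbmres->blockno;
}

Because the callback can only hand back a block number, per-block details
such as the recheck flag have to be carried some other way (the sketch
above simply drops them), which is why BitmapHeapNext() cannot keep
passing a TBMIterateResult down to the table AM.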
More subtly, table_scan_bitmap_next_block() used to return false if
there were no visible tuples on the page or if the requested block was
not valid. With these changes,
table_scan_bitmap_next_block() will only return false when the bitmap
has been exhausted and the scan can end. In order to use the streaming
read API, the user must be able to request the blocks it needs without
requiring synchronous feedback per block. Thus, this table AM function
must change its meaning.
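For reference, the table AM callback signatures before and after the set,
as they appear in the attached 0001 and 0009 patches:

/* before */
bool		(*scan_bitmap_next_block) (TableScanDesc scan,
									   struct TBMIterateResult *tbmres);
bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
									   struct TBMIterateResult *tbmres,
									   TupleTableSlot *slot);

/* after: the AM advances the iterator itself and reports the chosen block
 * and recheck flag back to the executor; returning false now means the
 * bitmap is exhausted */
bool		(*scan_bitmap_next_block) (TableScanDesc scan,
									   bool *recheck, BlockNumber *blockno);
bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
									   TupleTableSlot *slot);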
I think the way the patches are split up could be improved. I will
think more about this. There are also probably a few mistakes about
which comments are updated in which patches in the set.
- Melanie
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
[2]: /messages/by-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ+opUrXDRXdcfwFZGA@mail.gmail.com
Attachments:
v1-0003-BitmapHeapScan-begin-scan-after-bitmap-setup.patch
From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup
There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.
The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().
Instead, the worker snapshot is saved as a member of the
BitmapHeapScanState during initialization, and table_beginscan_bm()
restores it after returning from the table AM specific begin scan
function.
---
src/backend/executor/nodeBitmapHeapscan.c | 27 ++++++++++++++---------
src/include/access/tableam.h | 18 +++++++++------
src/include/nodes/execnodes.h | 2 ++
3 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 76382c91fd7..fd697d16c72 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -191,6 +191,17 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#endif /* USE_PREFETCH */
}
+
+ if (!scan)
+ {
+ scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
+ node->ss.ss_currentRelation,
+ node->ss.ps.state->es_snapshot,
+ node->worker_snapshot,
+ 0,
+ NULL);
+ }
+
node->initialized = true;
}
@@ -614,7 +625,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
PlanState *outerPlan = outerPlanState(node);
/* rescan to release any page pin */
- table_rescan(node->ss.ss_currentScanDesc, NULL);
+ if (node->ss.ss_currentScanDesc)
+ table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
if (node->tbmiterator)
@@ -691,7 +703,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* close heap scan
*/
- table_endscan(scanDesc);
+ if (scanDesc)
+ table_endscan(scanDesc);
}
/* ----------------------------------------------------------------
@@ -740,6 +753,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->can_skip_fetch = false;
+ scanstate->worker_snapshot = NULL;
/*
* Miscellaneous initialization
@@ -788,11 +802,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ss_currentRelation = currentRelation;
- scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
- estate->es_snapshot,
- 0,
- NULL);
-
/*
* all done.
*/
@@ -931,13 +940,11 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelBitmapHeapState *pstate;
- Snapshot snapshot;
Assert(node->ss.ps.state->es_query_dsa != NULL);
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->pstate = pstate;
- snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
- table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
+ node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4d495216f07..77f32a7472d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -931,6 +931,11 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+/*
+ * Update snapshot used by the scan.
+ */
+extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
+
/*
* table_beginscan_bm is an alternative entry point for setting up a
* TableScanDesc for a bitmap heap scan. Although that scan technology is
@@ -938,12 +943,16 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
* make it worth using the same data structure.
*/
static inline TableScanDesc
-table_beginscan_bm(Relation rel, Snapshot snapshot,
+table_beginscan_bm(Relation rel, Snapshot snapshot, Snapshot worker_snapshot,
int nkeys, struct ScanKeyData *key)
{
+ TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ if (worker_snapshot)
+ table_scan_update_snapshot(result, worker_snapshot);
+ return result;
}
/*
@@ -1033,11 +1042,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
allow_pagemode);
}
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
/*
* Return next tuple from `scan`, store in slot.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..00c75fb10e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,7 @@ typedef struct ParallelBitmapHeapState
* shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
+ * worker_snapshot snapshot for parallel worker
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1750,6 +1751,7 @@ typedef struct BitmapHeapScanState
TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
+ Snapshot worker_snapshot;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v1-0002-BitmapHeapScan-set-can_skip_fetch-later.patch
From 5f915bc84eae56e52b5a61e9b7e691834fdb9680 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:38:41 -0500
Subject: [PATCH v1 02/11] BitmapHeapScan set can_skip_fetch later
There is no reason for BitmapHeapScan to calculate can_skip_fetch in
ExecInitBitmapHeapScan(). Moving it into BitmapHeapNext() is a
preliminary step toward moving can_skip_fetch into table AM specific
code, as we would need to set it after the scan has begun.
---
src/backend/executor/nodeBitmapHeapscan.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index d670939246b..76382c91fd7 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ /*
+ * We can potentially skip fetching heap pages if we do not need any
+ * columns of the table, either for checking non-indexable quals or
+ * for returning data. This test is a bit simplistic, as it checks
+ * the stronger condition that there's no qual or return tlist at all.
+ * But in most cases it's probably not worth working harder than that.
+ */
+ node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ node->ss.ps.plan->targetlist == NIL);
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -729,16 +739,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or for
- * returning data. This test is a bit simplistic, as it checks the
- * stronger condition that there's no qual or return tlist at all. But in
- * most cases it's probably not worth working harder than that.
- */
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
/*
* Miscellaneous initialization
--
2.37.2
v1-0001-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch
From 575fb1f93128ebfd8125c769de628f91e0d5c592 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:13:41 -0500
Subject: [PATCH v1 01/11] Remove table_scan_bitmap_next_tuple parameter tbmres
Future commits will remove the input TBMIterateResult from
table_scan_bitmap_next_block() as the streaming read API will be
responsible for iterating through the blocks in the bitmap and not
BitmapHeapNext(). Given that this parameter will not be set from
BitmapHeapNext(), it no longer makes sense to use it as a means of
communication between table_scan_bitmap_next_tuple() and
table_scan_bitmap_next_block().
---
src/backend/access/heap/heapam_handler.c | 1 -
src/backend/executor/nodeBitmapHeapscan.c | 2 +-
src/include/access/tableam.h | 7 -------
3 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..716d477e271 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2228,7 +2228,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
static bool
heapam_scan_bitmap_next_tuple(TableScanDesc scan,
- TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..d670939246b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -304,7 +304,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* Attempt to fetch tuple from AM.
*/
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
/* nothing more to look at on this page */
node->tbmres = tbmres = NULL;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..4d495216f07 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -810,15 +810,10 @@ typedef struct TableAmRoutine
* Fetch the next tuple of a bitmap table scan into `slot` and return true
* if a visible tuple was found, false otherwise.
*
- * For some AMs it will make more sense to do all the work referencing
- * `tbmres` contents in scan_bitmap_next_block, for others it might be
- * better to defer more work to this callback.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_tuple) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot);
/*
@@ -1980,7 +1975,6 @@ table_scan_bitmap_next_block(TableScanDesc scan,
*/
static inline bool
table_scan_bitmap_next_tuple(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
/*
@@ -1992,7 +1986,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
- tbmres,
slot);
}
--
2.37.2
v1-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch
From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index ab97f308a5f..9372b49bfaa 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -55,7 +55,7 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
+ BlockNumber blockno);
static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
static inline void BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan);
@@ -226,7 +226,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
break;
}
- BitmapAdjustPrefetchIterator(node, tbmres);
+ BitmapAdjustPrefetchIterator(node, tbmres->blockno);
/*
* We can skip fetching the heap page if we don't need any fields
@@ -379,7 +379,7 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
*/
static inline void
BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
+ BlockNumber blockno)
{
#ifdef USE_PREFETCH
ParallelBitmapHeapState *pstate = node->pstate;
@@ -398,7 +398,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
+ if (tbmpre == NULL || tbmpre->blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
--
2.37.2
v1-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch
From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
fetch
Previously BitmapPrefetch() used the recheck flag for the current block
to determine whether or not it could skip prefetching the proposed
prefetch block. It makes more sense for it to use the recheck flag from
the TBMIterateResult for the prefetch block instead.
---
src/backend/executor/nodeBitmapHeapscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index fd697d16c72..ab97f308a5f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -519,7 +519,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* but is true in many cases.
*/
skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
@@ -570,7 +570,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
--
2.37.2
v1-0006-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
table AM
This resolves the long-standing FIXME in BitmapHeapNext() which said that
the optimization to skip fetching blocks of the underlying table when
none of the column data was needed should be pushed into the table AM
specific code.
heapam_scan_bitmap_next_block() now does the visibility check and
accounting of empty tuples to be returned, while
heapam_scan_bitmap_next_tuple() prepares the slot to return empty
tuples.
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.
---
src/backend/access/heap/heapam.c | 10 +++
src/backend/access/heap/heapam_handler.c | 29 +++++++
src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 17 ++--
src/include/nodes/execnodes.h | 6 --
6 files changed, 74 insertions(+), 90 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..7aae1ecf0a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->vmbuffer = InvalidBuffer;
+ scan->empty_tuples = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->vmbuffer))
+ ReleaseBuffer(scan->vmbuffer);
+ scan->vmbuffer = InvalidBuffer;
+
/*
* reinitialize scan descriptor
*/
@@ -1062,6 +1068,10 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->vmbuffer))
+ ReleaseBuffer(scan->vmbuffer);
+ scan->vmbuffer = InvalidBuffer;
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 716d477e271..baba09c87c0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -2124,6 +2125,24 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ /*
+ * We can skip fetching the heap page if we don't need any fields from the
+ * heap, and the bitmap entries don't need rechecking, and all tuples on
+ * the page are visible to our transaction.
+ */
+ if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->vmbuffer))
+ {
+ /* can't be lossy in the skip_fetch case */
+ Assert(tbmres->ntuples >= 0);
+ Assert(hscan->empty_tuples >= 0);
+
+ hscan->empty_tuples += tbmres->ntuples;
+
+ return true;
+ }
+
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2235,6 +2254,16 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
+ if (hscan->empty_tuples > 0)
+ {
+ /*
+ * If we don't have to fetch the tuple, just return nulls.
+ */
+ ExecStoreAllNullTuple(slot);
+ hscan->empty_tuples--;
+ return true;
+ }
+
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9372b49bfaa..c0fb06c9688 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ bool can_skip_fetch;
/*
* We can potentially skip fetching heap pages if we do not need any
* columns of the table, either for checking non-indexable quals or
@@ -115,7 +116,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
* the stronger condition that there's no qual or return tlist at all.
* But in most cases it's probably not worth working harder than that.
*/
- node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
node->ss.ps.plan->targetlist == NIL);
if (!pstate)
@@ -199,7 +200,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->ss.ps.state->es_snapshot,
node->worker_snapshot,
0,
- NULL);
+ NULL,
+ can_skip_fetch);
}
node->initialized = true;
@@ -207,8 +209,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -228,32 +228,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -307,46 +282,30 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
BitmapPrefetch(node, scan);
- if (node->return_empty_tuples > 0)
+ /*
+ * Attempt to fetch tuple from AM.
+ */
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
+ /* nothing more to look at on this page */
+ node->tbmres = tbmres = NULL;
+ continue;
}
- else
+
+ /*
+ * If we are using lossy info, we have to recheck the qual conditions
+ * at every tuple.
+ */
+ if (tbmres->recheck)
{
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
continue;
}
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
- {
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
- }
- }
}
/* OK to return this tuple */
@@ -518,7 +477,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* it did for the current heap page; which is not a certainty
* but is true in many cases.
*/
- skip_fetch = (node->can_skip_fetch &&
+
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
!tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -569,7 +529,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
/* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
!tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -639,8 +599,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
@@ -650,7 +608,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
node->initialized = false;
node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
- node->vmbuffer = InvalidBuffer;
node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -695,8 +652,6 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
@@ -739,8 +694,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
- scanstate->vmbuffer = InvalidBuffer;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -752,7 +705,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
- scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..2fc369a18ff 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -73,6 +73,8 @@ typedef struct HeapScanDescData
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
/* these fields only used in page-at-a-time mode and for bitmap scans */
+ Buffer vmbuffer; /* for checking if can skip fetch */
+ int empty_tuples; /* count of all NULL tuples to be returned */
int rs_cindex; /* current tuple's index in vistuples */
int rs_ntuples; /* number of visible tuples on page */
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 77f32a7472d..05e700c5055 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,6 +62,7 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
/*
@@ -780,10 +781,8 @@ typedef struct TableAmRoutine
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
- * make sense to perform tuple visibility checks at this time). For some
- * AMs it will make more sense to do all the work referencing `tbmres`
- * contents here, for others it might be better to defer more work to
- * scan_bitmap_next_tuple.
+ * make sense to perform tuple visibility checks at this time). All work
+ * referencing `tbmres` must be done here.
*
* If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
* on the page have to be returned, otherwise the tuples at offsets in
@@ -795,11 +794,6 @@ typedef struct TableAmRoutine
* performs prefetching directly using that interface. This probably
* needs to be rectified at a later point.
*
- * XXX: Currently this may only be implemented if the AM uses the
- * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
- * perform prefetching. This probably needs to be rectified at a later
- * point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
@@ -944,11 +938,14 @@ extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot, Snapshot worker_snapshot,
- int nkeys, struct ScanKeyData *key)
+ int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
{
TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ if (can_skip_fetch)
+ flags |= SO_CAN_SKIP_FETCH;
+
result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
if (worker_snapshot)
table_scan_update_snapshot(result, worker_snapshot);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 00c75fb10e2..9392923eb32 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1711,9 +1711,6 @@ typedef struct ParallelBitmapHeapState
* tbm bitmap obtained from child index scan(s)
* tbmiterator iterator for scanning current pages
* tbmres current-page data
- * can_skip_fetch can we potentially skip tuple fetches in this scan?
- * return_empty_tuples number of empty tuples to return
- * vmbuffer buffer for visibility-map lookups
* pvmbuffer ditto, for prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
@@ -1736,9 +1733,6 @@ typedef struct BitmapHeapScanState
TIDBitmap *tbm;
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
- bool can_skip_fetch;
- int return_empty_tuples;
- Buffer vmbuffer;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
--
2.37.2
v1-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch
From ccd5d688fd5c1dd16908788e0a0abd0f3e64eb77 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:17:47 -0500
Subject: [PATCH v1 08/11] Reduce scope of BitmapHeapScan tbmiterator local
variables
To simplify the diff of a future commit which will move the TBMIterators
into the scan descriptor, define them in a narrower scope now.
---
src/backend/executor/nodeBitmapHeapscan.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 19d115de06f..4d55390715c 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,8 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
@@ -90,10 +88,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
tbmres = node->tbmres;
/*
@@ -111,6 +105,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (!node->initialized)
{
bool can_skip_fetch;
+ TBMIterator *tbmiterator = NULL;
+ TBMSharedIterator *shared_tbmiterator = NULL;
/*
* We can potentially skip fetching heap pages if we do not need any
* columns of the table, either for checking non-indexable quals or
@@ -129,7 +125,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
+ tbmiterator = tbm_begin_iterate(tbm);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -182,8 +178,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
- tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
+ shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -206,6 +201,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
}
+ node->tbmiterator = tbmiterator;
+ node->shared_tbmiterator = shared_tbmiterator;
node->initialized = true;
}
@@ -219,9 +216,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmres == NULL)
{
if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
+ node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
if (tbmres == NULL)
{
/* no more entries in the bitmap */
--
2.37.2
v1-0009-Make-table_scan_bitmap_next_block-async-friendly.patch
From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
was past the known end of the table.
This forced the caller to be responsible for determining if additional
blocks should be fetched and then for invoking
table_scan_bitmap_next_block() for these blocks.
It makes more sense for table_scan_bitmap_next_block() to be responsible
for finding a block that is not past the end of the table and for
table_scan_bitmap_next_tuple() to return false if there are no visible
tuples on the page.
This also allows us to move responsibility for the iterator to table AM
specific code. This means handling invalid blocks is entirely up to
the table AM.
These changes will enable bitmapheapscan to use the future streaming
read API. The table AMs will implement a streaming read API callback
that returns the next block that needs to be fetched. In heap AM's case,
the callback will use the iterator to find the next block to be fetched.
Since choosing the next block will no longer be the responsibility of
BitmapHeapNext(), the streaming read control flow requires these changes
to table_scan_bitmap_next_block().
---
src/backend/access/heap/heapam.c | 22 ++++
src/backend/access/heap/heapam_handler.c | 56 ++++++---
src/backend/executor/nodeBitmapHeapscan.c | 132 ++++++++--------------
src/include/access/relscan.h | 3 +
src/include/access/tableam.h | 14 +--
src/include/nodes/execnodes.h | 10 +-
6 files changed, 121 insertions(+), 116 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 88b4aad5820..d8569373987 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->empty_tuples = 0;
scan->rs_base.lossy_pages = 0;
scan->rs_base.exact_pages = 0;
+ scan->rs_base.shared_tbmiterator = NULL;
+ scan->rs_base.tbmiterator = NULL;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
ReleaseBuffer(scan->vmbuffer);
scan->vmbuffer = InvalidBuffer;
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_base.shared_tbmiterator)
+ tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
+
+ if (scan->rs_base.tbmiterator)
+ tbm_end_iterate(scan->rs_base.tbmiterator);
+ }
+
+ scan->rs_base.shared_tbmiterator = NULL;
+ scan->rs_base.tbmiterator = NULL;
+
/*
* reinitialize scan descriptor
*/
@@ -1074,6 +1088,14 @@ heap_endscan(TableScanDesc sscan)
ReleaseBuffer(scan->vmbuffer);
scan->vmbuffer = InvalidBuffer;
+ if (sscan->shared_tbmiterator)
+ tbm_end_shared_iterate(sscan->shared_tbmiterator);
+ sscan->shared_tbmiterator = NULL;
+
+ if (sscan->tbmiterator)
+ tbm_end_iterate(sscan->tbmiterator);
+ sscan->tbmiterator = NULL;
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6e85ef7a946..d55ece23a35 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->empty_tuples == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+
+ /* Got a valid block */
+ *blockno = tbmres->blockno;
+ *recheck = tbmres->recheck;
+
/*
* We can skip fetching the heap page if we don't need any fields from the
* heap, and the bitmap entries don't need rechecking, and all tuples on
@@ -2143,16 +2175,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
return true;
}
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE isolation
- * though, as we need to examine all invisible tuples reachable by the
- * index.
- */
- if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
- return false;
+ block = tbmres->blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
}
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
static bool
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 4d55390715c..efc6952e353 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,7 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -88,7 +87,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
@@ -126,7 +124,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -179,7 +176,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
+
node->initialized = true;
+
+ /* Get the first block. if none, end of scan */
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ goto exit;
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ BitmapAdjustPrefetchTarget(node);
}
for (;;)
{
- CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
+ while (table_scan_bitmap_next_tuple(scan, slot))
{
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres->blockno);
-
- if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
-
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
- {
- /*
- * Continuing in previously obtained page.
- */
+ CHECK_FOR_INTERRUPTS();
#ifdef USE_PREFETCH
@@ -262,53 +235,46 @@ BitmapHeapNext(BitmapHeapScanState *node)
SpinLockRelease(&pstate->mutex);
}
#endif /* USE_PREFETCH */
- }
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- BitmapPrefetch(node, scan);
-
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
+ /*
+ * We prefetch before fetching the current pages. We expect that a
+ * future streaming read API will do this, so do it now for
+ * consistency.
+ */
+ BitmapPrefetch(node, scan);
- /*
- * If we are using lossy info, we have to recheck the qual conditions
- * at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ /*
+ * If we are using lossy info, we have to recheck the qual
+ * conditions at every tuple.
+ */
+ if (node->recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
+ continue;
+ }
}
+
+ /* OK to return this tuple */
+ return slot;
}
- /* OK to return this tuple */
- return slot;
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ break;
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ /* Adjust the prefetch target */
+ BitmapAdjustPrefetchTarget(node);
}
/*
* if we get here it means we are at the end of the scan..
*/
+exit:
BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
}
@@ -594,12 +560,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
@@ -607,13 +569,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->tbmiterator = NULL;
- node->tbmres = NULL;
node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
node->pvmbuffer = InvalidBuffer;
+ node->recheck = true;
+ node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -647,14 +608,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* release bitmaps and buffers if any
*/
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->pvmbuffer != InvalidBuffer)
@@ -697,8 +654,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->tbmiterator = NULL;
- scanstate->tbmres = NULL;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -707,10 +662,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
+ scanstate->recheck = true;
+ scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..bf7ee044268 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
#include "access/htup_details.h"
#include "access/itup.h"
+#include "nodes/tidbitmap.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/spin.h"
@@ -41,6 +42,8 @@ typedef struct TableScanDescData
ItemPointerData rs_maxtid;
/* Only used for Bitmap table scans */
+ TBMIterator *tbmiterator;
+ TBMSharedIterator *shared_tbmiterator;
long exact_pages;
long lossy_pages;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 05e700c5055..b90d9b7f3fa 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -798,7 +798,7 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ bool *recheck, BlockNumber *blockno);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1942,17 +1942,16 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
- * a bitmap table scan. `scan` needs to have been started via
- * table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise.
+ * Prepare to fetch / check / return tuples as part of a bitmap table scan.
+ * `scan` needs to have been started via table_beginscan_bm(). Returns false if
+ * there are no more blocks in the bitmap, true otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1962,8 +1961,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9392923eb32..03973a3f262 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1709,9 +1709,7 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
- * tbmres current-page data
- * pvmbuffer ditto, for prefetched pages
+ * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
* prefetch_iterator iterator for prefetching ahead of current page
@@ -1720,7 +1718,6 @@ typedef struct ParallelBitmapHeapState
* prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
@@ -1731,8 +1728,6 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- TBMIterator *tbmiterator;
- TBMIterateResult *tbmres;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
@@ -1742,10 +1737,11 @@ typedef struct BitmapHeapScanState
int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
+ bool recheck;
+ BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v1-0007-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch
From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
pages
Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Making this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam.c | 2 ++
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
src/include/access/relscan.h | 4 ++++
4 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7aae1ecf0a9..88b4aad5820 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_strategy = NULL; /* set in initscan */
scan->vmbuffer = InvalidBuffer;
scan->empty_tuples = 0;
+ scan->rs_base.lossy_pages = 0;
+ scan->rs_base.exact_pages = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index baba09c87c0..6e85ef7a946 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ /* Only count exact and lossy pages with visible tuples */
+ if (ntup > 0)
+ {
+ if (tbmres->ntuples >= 0)
+ scan->exact_pages++;
+ else
+ scan->lossy_pages++;
+ }
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c0fb06c9688..19d115de06f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
#include "utils/spccache.h"
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
BlockNumber blockno);
@@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
continue;
}
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
- node->lossy_pages++;
-
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
@@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* if we get here it means we are at the end of the scan..
*/
+ BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
}
+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan)
+{
+ node->exact_pages += scan->exact_pages;
+ scan->exact_pages = 0;
+ node->lossy_pages += scan->lossy_pages;
+ scan->lossy_pages = 0;
+}
+
/*
* BitmapDoneInitializingSharedState - Shared state is initialized
*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ long exact_pages;
+ long lossy_pages;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
--
2.37.2
v1-0010-Streaming-Read-API.patch
From 9eb510c1f2fd4d1b3c831f62af1e4c0f422a0922 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 10/11] Streaming Read API
---
contrib/pg_prewarm/pg_prewarm.c | 40 +-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/postmaster/bgwriter.c | 8 +-
src/backend/postmaster/checkpointer.c | 15 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/aio/Makefile | 14 +
src/backend/storage/aio/meson.build | 5 +
src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 560 +++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 14 +-
src/backend/storage/meson.build | 1 +
src/backend/storage/smgr/smgr.c | 49 +-
src/include/storage/bufmgr.h | 22 +
src/include/storage/smgr.h | 4 +-
src/include/storage/streaming_read.h | 45 ++
src/include/utils/rel.h | 6 -
src/tools/pgindent/typedefs.list | 2 +
17 files changed, 986 insertions(+), 238 deletions(-)
create mode 100644 src/backend/storage/aio/Makefile
create mode 100644 src/backend/storage/aio/meson.build
create mode 100644 src/backend/storage/aio/streaming_read.c
create mode 100644 src/include/storage/streaming_read.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
+#include "storage/streaming_read.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
static PGIOAlignedBlock blockbuffer;
+struct pg_prewarm_streaming_read_private
+{
+ BlockNumber blocknum;
+ int64 last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_data)
+{
+ struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+ if (p->blocknum <= p->last_block)
+ return p->blocknum++;
+
+ return InvalidBlockNumber;
+}
+
/*
* pg_prewarm(regclass, mode text, fork text,
* first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_streaming_read_private p;
+ PgStreamingRead *pgsr;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ &p,
+ 0,
+ NULL,
+ BMR_REL(rel),
+ forkNumber,
+ pg_prewarm_streaming_read_next);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+ pg_streaming_read_free(pgsr);
}
/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
* This is unnecessarily heavy-handed, as it will close SMgrRelation
* objects for other databases as well. DROP DATABASE occurs seldom enough
* that it's not worth introducing a variant of smgrclose for just this
- * purpose. XXX: Or should we rather leave the smgr entries dangling?
+ * purpose.
*/
smgrcloseall();
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
if (FirstCallSinceLastCheckpoint())
{
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the bgwriter does
+ * not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
ckpt_performed = CreateRestartPoint(flags);
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the checkpointer
+ * does not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
/*
* Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
*/
CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
- /*
- * After any checkpoint, close all smgr files. This is so we won't
- * hang onto smgr references to deleted files indefinitely.
- */
- smgrcloseall();
+ /* Free all smgr objects, as CheckpointerMain() normally would. */
+ smgrdestroyall();
return;
}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+ bool advice_issued;
+ bool need_complete;
+ BlockNumber blocknum;
+ int nblocks;
+ int per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+ Buffer buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+ int max_ios;
+ int ios_in_progress;
+ int ios_in_progress_trigger;
+ int max_pinned_buffers;
+ int pinned_buffers;
+ int pinned_buffers_trigger;
+ int next_tail_buffer;
+ bool finished;
+ void *pgsr_private;
+ PgStreamingReadBufferCB callback;
+ BufferAccessStrategy strategy;
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+
+ bool advice_enabled;
+
+ /* Next expected block, for detecting sequential access. */
+ BlockNumber seq_blocknum;
+
+ /* Space for optional per-buffer private data. */
+ size_t per_buffer_data_size;
+ void *per_buffer_data;
+ int per_buffer_data_next;
+
+ /* Circular buffer of ranges. */
+ int size;
+ int head;
+ int tail;
+ PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy)
+{
+ PgStreamingRead *pgsr;
+ int size;
+ int max_ios;
+ uint32 max_pinned_buffers;
+
+
+ /*
+ * Decide how many assumed I/Os we will allow to run concurrently. That
+ * is, advice to the kernel to tell it that we will soon read. This
+ * number also affects how far we look ahead for opportunities to start
+ * more I/Os.
+ */
+ if (flags & PGSR_FLAG_MAINTENANCE)
+ max_ios = maintenance_io_concurrency;
+ else
+ max_ios = effective_io_concurrency;
+
+ /*
+ * The desired level of I/O concurrency controls how far ahead we are
+ * willing to look. We also clamp it to at least
+ * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+ * full-sized read, even when max_ios is zero.
+ */
+ max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+ /*
+ * The *_io_concurrency GUCs might be 0. We want to allow at least
+ * one, to keep our gating logic simple.
+ */
+ max_ios = Max(max_ios, 1);
+
+ /*
+ * Don't allow this backend to pin too many buffers. For now we'll apply
+ * the limit for the shared buffer pool and the local buffer pool, without
+ * worrying which it is.
+ */
+ LimitAdditionalPins(&max_pinned_buffers);
+ LimitAdditionalLocalPins(&max_pinned_buffers);
+ Assert(max_pinned_buffers > 0);
+
+ /*
+ * pgsr->ranges is a circular buffer. When it is empty, head == tail.
+ * When it is full, there is an empty element between head and tail. Head
+ * can also be empty (nblocks == 0), therefore we need two extra elements
+ * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+ * maximum possible number of occupied ranges of the smallest possible
+ * size of one.
+ */
+ size = max_pinned_buffers + 2;
+
+ pgsr = (PgStreamingRead *)
+ palloc0(offsetof(PgStreamingRead, ranges) +
+ sizeof(pgsr->ranges[0]) * size);
+
+ pgsr->max_ios = max_ios;
+ pgsr->per_buffer_data_size = per_buffer_data_size;
+ pgsr->max_pinned_buffers = max_pinned_buffers;
+ pgsr->pgsr_private = pgsr_private;
+ pgsr->strategy = strategy;
+ pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+ /*
+ * This system supports prefetching advice. As long as direct I/O isn't
+ * enabled, and the caller hasn't promised sequential access, we can use
+ * it.
+ */
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ (flags & PGSR_FLAG_SEQUENTIAL) == 0)
+ pgsr->advice_enabled = true;
+#endif
+
+ /*
+ * We want to avoid creating ranges that are smaller than they could be
+ * just because we hit max_pinned_buffers. We only look ahead when the
+ * number of pinned buffers falls below this trigger number, or put
+ * another way, we stop looking ahead when we wouldn't be able to build a
+ * "full sized" range.
+ */
+ pgsr->pinned_buffers_trigger =
+ Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+ /* Space for the callback to store extra data along with each block. */
+ if (per_buffer_data_size)
+ pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+ return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb)
+{
+ PgStreamingRead *result;
+
+ result = pg_streaming_read_buffer_alloc_internal(flags,
+ pgsr_private,
+ per_buffer_data_size,
+ strategy);
+ result->callback = next_block_cb;
+ result->bmr = bmr;
+ result->forknum = forknum;
+
+ return result;
+}
+
+/*
+ * Start building a new range. This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know we'll
+ * soon be reading. In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *head_range;
+
+ head_range = &pgsr->ranges[pgsr->head];
+ Assert(head_range->nblocks > 0);
+
+ /*
+ * If a call to CompleteReadBuffers() will be needed, we can issue
+ * advice to the kernel to get the read started. We suppress it if the
+ * access pattern appears to be completely sequential, though, because on
+ * some systems that interferes with the kernel's own sequential read-ahead
+ * heuristics and hurts performance.
+ */
+ if (pgsr->advice_enabled)
+ {
+ BlockNumber blocknum = head_range->blocknum;
+ int nblocks = head_range->nblocks;
+
+ if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+ {
+ SMgrRelation smgr =
+ pgsr->bmr.smgr ? pgsr->bmr.smgr :
+ RelationGetSmgr(pgsr->bmr.rel);
+
+ Assert(!head_range->advice_issued);
+
+ smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+ /*
+ * Count this as an I/O that is concurrently in progress, though
+ * we don't really know if the kernel generates a physical I/O.
+ */
+ head_range->advice_issued = true;
+ pgsr->ios_in_progress++;
+ }
+
+ /* Remember the block after this range, for sequence detection. */
+ pgsr->seq_blocknum = blocknum + nblocks;
+ }
+
+ /* Create a new head range. There must be space. */
+ Assert(pgsr->size > pgsr->max_pinned_buffers);
+ Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+ if (++pgsr->head == pgsr->size)
+ pgsr->head = 0;
+ head_range = &pgsr->ranges[pgsr->head];
+ head_range->nblocks = 0;
+
+ return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+ /*
+ * If we're finished or can't start more I/O, then don't look ahead.
+ */
+ if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * We'll also wait until the number of pinned buffers falls below our
+ * trigger level, so that we have the chance to create a full range.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ do
+ {
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+ Buffer buffer;
+ bool found;
+ bool need_complete;
+ PgStreamingReadRange *head_range;
+ void *per_buffer_data;
+
+ /* Do we have a full-sized range? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks == lengthof(head_range->buffers))
+ {
+ Assert(head_range->need_complete);
+ head_range = pg_streaming_read_new_range(pgsr);
+
+ /*
+ * Give up now if I/O is saturated, or we wouldn't be able to form
+ * another full range after this due to the pin limit.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+ pgsr->ios_in_progress == pgsr->max_ios)
+ break;
+ }
+
+ per_buffer_data = (char *) pgsr->per_buffer_data +
+ pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+ /* Find out which block the callback wants to read next. */
+ blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+ if (blocknum == InvalidBlockNumber)
+ {
+ pgsr->finished = true;
+ break;
+ }
+ bmr = pgsr->bmr;
+ forknum = pgsr->forknum;
+
+ Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+ buffer = PrepareReadBuffer(bmr,
+ forknum,
+ blocknum,
+ pgsr->strategy,
+ &found);
+ pgsr->pinned_buffers++;
+
+ need_complete = !found;
+
+ /* Is there a head range that we can't extend? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks > 0 &&
+ (!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+ {
+ /* Yes, time to start building a new one. */
+ head_range = pg_streaming_read_new_range(pgsr);
+ Assert(head_range->nblocks == 0);
+ }
+
+ if (head_range->nblocks == 0)
+ {
+ /* Initialize a new range beginning at this block. */
+ head_range->blocknum = blocknum;
+ head_range->need_complete = need_complete;
+ head_range->advice_issued = false;
+ }
+ else
+ {
+ /* We can extend an existing range by one block. */
+ Assert(head_range->blocknum + head_range->nblocks == blocknum);
+ Assert(head_range->need_complete);
+ }
+
+ head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+ head_range->buffers[head_range->nblocks] = buffer;
+ head_range->nblocks++;
+
+ if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+ pgsr->per_buffer_data_next = 0;
+
+ } while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+ pgsr->ios_in_progress < pgsr->max_ios);
+
+ if (pgsr->ranges[pgsr->head].nblocks > 0)
+ pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+ pg_streaming_read_look_ahead(pgsr);
+
+ /* See if we have one buffer to return. */
+ while (pgsr->tail != pgsr->head)
+ {
+ PgStreamingReadRange *tail_range;
+
+ tail_range = &pgsr->ranges[pgsr->tail];
+
+ /*
+ * Do we need to perform an I/O before returning the buffers from this
+ * range?
+ */
+ if (tail_range->need_complete)
+ {
+ CompleteReadBuffers(pgsr->bmr,
+ tail_range->buffers,
+ pgsr->forknum,
+ tail_range->blocknum,
+ tail_range->nblocks,
+ false,
+ pgsr->strategy);
+ tail_range->need_complete = false;
+
+ /*
+ * We don't really know if the kernel generated a physical I/O
+ * when we issued advice, let alone when it finished, but it has
+ * certainly finished after a read call returns.
+ */
+ if (tail_range->advice_issued)
+ pgsr->ios_in_progress--;
+ }
+
+ /* Are there more buffers available in this range? */
+ if (pgsr->next_tail_buffer < tail_range->nblocks)
+ {
+ int buffer_index;
+ Buffer buffer;
+
+ buffer_index = pgsr->next_tail_buffer++;
+ buffer = tail_range->buffers[buffer_index];
+
+ Assert(BufferIsValid(buffer));
+
+ /* We are giving away ownership of this pinned buffer. */
+ Assert(pgsr->pinned_buffers > 0);
+ pgsr->pinned_buffers--;
+
+ if (per_buffer_data)
+ *per_buffer_data = (char *) pgsr->per_buffer_data +
+ tail_range->per_buffer_data_index[buffer_index] *
+ pgsr->per_buffer_data_size;
+
+ return buffer;
+ }
+
+ /* Advance tail to next range, if there is one. */
+ if (++pgsr->tail == pgsr->size)
+ pgsr->tail = 0;
+ pgsr->next_tail_buffer = 0;
+ }
+
+ Assert(pgsr->pinned_buffers == 0);
+
+ return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+ Buffer buffer;
+
+ /* Stop looking ahead, and unpin anything that wasn't consumed. */
+ pgsr->finished = true;
+ while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+ ReleaseBuffer(buffer);
+
+ if (pgsr->per_buffer_data)
+ pfree(pgsr->per_buffer_data);
+ pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
)
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static int SyncOneBuffer(int buf_id, bool skip_recently_used,
WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner);
static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
- /*
- * Read the buffer, and update pgstat counters to reflect a cache hit or
- * miss.
- */
- pgstat_count_buffer_read(reln);
- buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+ buf = ReadBuffer_common(BMR_REL(reln),
forkNum, blockNum, mode, strategy, &hit);
- if (hit)
- pgstat_count_buffer_hit(reln);
+
return buf;
}
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
- return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
- RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+ return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+ RELPERSISTENCE_UNLOGGED),
+ forkNum, blockNum,
mode, strategy, &hit);
}
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
bool hit;
Assert(extended_by == 0);
- buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+ buffer = ReadBuffer_common(bmr,
fork, extend_to - 1, mode, strategy,
&hit);
}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool *hit)
{
- BufferDesc *bufHdr;
- Block bufBlock;
- bool found;
- IOContext io_context;
- IOObject io_object;
- bool isLocalBuf = SmgrIsTemp(smgr);
-
- *hit = false;
+ Buffer buffer;
/*
* Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
- return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
- forkNum, strategy, flags);
+ *hit = false;
+
+ return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
- TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend);
+ buffer = PrepareReadBuffer(bmr,
+ forkNum,
+ blockNum,
+ strategy,
+ hit);
+
+ /* At this point we do NOT hold any locks. */
+ if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+ {
+ /* if we just want zeroes and a lock, we're done */
+ ZeroBuffer(buffer, mode);
+ }
+ else if (!*hit)
+ {
+ /* we might need to perform I/O */
+ CompleteReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ 1,
+ mode == RBM_ZERO_ON_ERROR,
+ strategy);
+ }
+
+ return buffer;
+}
+
+/*
+ * Prepare to read a block. The buffer is pinned. If this is a 'hit', then
+ * the returned buffer can be used immediately. Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer(). PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr)
+{
+ BufferDesc *bufHdr;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ Assert(blockNum != P_NEW);
+
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /*
- * We do not use a BufferAccessStrategy for I/O of temporary tables.
- * However, in some cases, the "strategy" may not be NULL, so we can't
- * rely on IOContextForStrategy() to set the right IOContext for us.
- * This may happen in cases like CREATE TEMPORARY TABLE AS...
- */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
- if (found)
- pgBufferUsage.local_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.local_blks_read++;
}
else
{
- /*
- * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
- * not currently in memory.
- */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
- bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found, io_context);
- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;
}
- /* At this point we do NOT hold any locks. */
+ TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend);
- /* if it was already in the buffer pool, we're done */
- if (found)
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ if (isLocalBuf)
+ {
+ bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+ if (*foundPtr)
+ pgBufferUsage.local_blks_hit++;
+ }
+ else
+ {
+ bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+ strategy, foundPtr, io_context);
+ if (*foundPtr)
+ pgBufferUsage.shared_blks_hit++;
+ }
+ if (bmr.rel)
+ {
+ /*
+ * While pgBufferUsage's "read" counter isn't bumped unless we reach
+ * CompleteReadBuffers() (so, not for hits, and not for buffers that
+ * are zeroed instead), the per-relation stats always count them.
+ */
+ pgstat_count_buffer_read(bmr.rel);
+ if (*foundPtr)
+ pgstat_count_buffer_hit(bmr.rel);
+ }
+ if (*foundPtr)
{
- /* Just need to update stats before we exit */
- *hit = true;
VacuumPageHit++;
pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageHit;
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ }
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
- * on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
+}
- return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+ if (BufferIsLocal(buffer))
+ {
+ BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+ return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
+ else
+ return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
- /*
- * if we have gotten to this point, we have allocated a buffer for the
- * page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
- * if it's a shared buffer.
- */
- Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer(). The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy)
+{
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
+ if (isLocalBuf)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
/*
- * Read in the page, unless the caller intends to overwrite it and just
- * wants us to allocate a buffer.
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out not to be true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ if (isLocalBuf)
+ pgBufferUsage.local_blks_read += nblocks;
else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ for (int i = 0; i < nblocks; ++i)
{
- instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+ int io_buffers_len;
+ Buffer io_buffers[MAX_BUFFERS_PER_TRANSFER];
+ void *io_pages[MAX_BUFFERS_PER_TRANSFER];
+ instr_time io_start;
+ BlockNumber io_first_block;
- smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
- pgstat_count_io_op_time(io_object, io_context,
- IOOP_READ, io_start, 1);
+ /*
+ * We could get all the information from buffer headers, but it can be
+ * expensive to access buffer header cache lines so we make the caller
+ * provide all the information we need, and assert that it is
+ * consistent.
+ */
+ {
+ RelFileLocator xlocator;
+ ForkNumber xforknum;
+ BlockNumber xblocknum;
+
+ BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+ Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+ Assert(xforknum == forknum);
+ Assert(xblocknum == blocknum + i);
+ }
+#endif
+
+ /*
+ * Skip this block if someone else has already completed it. If an
+ * I/O is already in progress in another backend, this will wait for
+ * the outcome: either done, or something went wrong and we will
+ * retry.
+ */
+ if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+ {
+ /*
+ * Report this as a 'hit' for this backend, even though it must
+ * have started out as a miss in PrepareReadBuffer().
+ */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ continue;
+ }
+
+ /* We found a buffer that we need to read in. */
+ io_buffers[0] = buffers[i];
+ io_pages[0] = BufferGetBlock(buffers[i]);
+ io_first_block = blocknum + i;
+ io_buffers_len = 1;
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
+ /*
+ * How many neighboring-on-disk blocks can we scatter-read into
+ * other buffers at the same time? In this case we don't wait if we
+ * see an I/O already in progress. We already hold BM_IO_IN_PROGRESS
+ * for the head block, so we should get on with that I/O as soon as
+ * possible. We'll come back to this block again, above.
+ */
+ while ((i + 1) < nblocks &&
+ CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+ {
+ /* Must be consecutive block numbers. */
+ Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+ BufferGetBlockNumber(buffers[i]) + 1);
+
+ io_buffers[io_buffers_len] = buffers[++i];
+ io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+ }
+
+ io_start = pgstat_prepare_io_time(track_io_timing);
+ smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ io_buffers_len);
+
+ /* Verify each block we read, and terminate the I/O. */
+ for (int j = 0; j < io_buffers_len; ++j)
{
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ BufferDesc *bufHdr;
+ Block bufBlock;
+
+ if (isLocalBuf)
{
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
}
else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
- }
-
- /*
- * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
- * content lock before marking the page as valid, to make sure that no
- * other backend sees the zeroed page before the caller has had a chance
- * to initialize it.
- *
- * Since no-one else can be looking at the page contents yet, there is no
- * difference between an exclusive lock and a cleanup-strength lock. (Note
- * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
- * they assert that the buffer is already valid.)
- */
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
- {
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
- }
+ {
+ bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+ bufBlock = BufHdrGetBlock(bufHdr);
+ }
- if (isLocalBuf)
- {
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (zero_on_error || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ memset(bufBlock, 0, BLCKSZ);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ }
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
- }
+ /* Terminate I/O and set BM_VALID. */
+ if (isLocalBuf)
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
- VacuumPageMiss++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss;
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ }
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ /* Report I/Os as completing individually. */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ false);
+ }
- return BufferDescriptorGetBuffer(bufHdr);
+ VacuumPageMiss += io_buffers_len;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ }
}
/*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
- * set true. Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true. Otherwise, *foundPtr is set false. A read should be
+ * performed with CompleteReadBuffers().
*
* io_context is passed as an output parameter to avoid calling
* IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(existing_buf_hdr, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to obtain the right to
- * start I/O. If StartBufferIO returns false, then someone else managed
- * to read it before we did, so there's nothing left for BufferAlloc() to
- * do.
+ * Buffer contents are currently invalid.
*/
- if (StartBufferIO(victim_buf_hdr, true))
- *foundPtr = false;
- else
- *foundPtr = true;
+ *foundPtr = false;
return victim_buf_hdr;
}
@@ -1774,7 +1899,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
buf_state &= ~BM_VALID;
UnlockBufHdr(existing_hdr, buf_state);
- } while (!StartBufferIO(existing_hdr, true));
+ } while (!StartBufferIO(existing_hdr, true, false));
}
else
{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
LWLockRelease(partition_lock);
/* XXX: could combine the locked operations in it with the above */
- StartBufferIO(victim_buf_hdr, true);
+ StartBufferIO(victim_buf_hdr, true, false);
}
}
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/*
- * If we previously pinned the buffer, it must surely be valid.
+ * If we previously pinned the buffer, it is likely to be valid, but
+ * it may not be if PrepareReadBuffer() was called and
+ * CompleteReadBuffers() hasn't been called yet. We'll check by
+ * loading the flags without locking. This is racy, but it's OK to
+ * return false spuriously: when CompleteReadBuffers() calls
+ * StartBufferIO(), it'll see that it's now valid.
*
* Note: We deliberately avoid a Valgrind client request here.
* Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
* that the buffer page is legitimately non-accessible here. We
* cannot meddle with that.
*/
- result = true;
+ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
}
ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, false))
return;
/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would. The buffer must already be pinned. It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+ if (BufferIsLocal(buffer))
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ else
+ {
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ else
+ LockBufferForCleanup(buffer);
+ }
+
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ if (BufferIsLocal(buffer))
+ {
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ buf_state = LockBufHdr(bufHdr);
+ buf_state |= BM_VALID;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
/*
* Verify that this backend is pinning the buffer exactly once.
*
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
*
* Returns true if we successfully marked the buffer as I/O busy,
* false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend. In that case, false indicates either that the I/O was already
+ * finished, or is still in progress. This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
*/
static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
UnlockBufHdr(buf, buf_state);
+ if (nowait)
+ return false;
WaitIO(buf);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
}
/* see LimitAdditionalPins() */
-static void
+void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
uint32 max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
/*
* In contrast to LimitAdditionalPins() other backends don't play a role
- * here. We can allow up to NLocBuffer pins in total.
+ * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+ * not be initialized yet, so read num_temp_buffers instead.
*/
- max_pins = (NLocBuffer - NLocalPinnedBuffers);
+ max_pins = (num_temp_buffers - NLocalPinnedBuffers);
if (*additional_pins >= max_pins)
*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('aio')
subdir('buffer')
subdir('file')
subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
/*
* smgropen() -- Return an SMgrRelation object, creating it if need be.
*
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files. The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
*/
SMgrRelation
smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
}
/*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
*/
void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
{
SMgrRelation *owner;
ForkNumber forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
}
/*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
*
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr(). It may be re-owned if it is accessed by a
+ * relation before then.
*/
void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
{
for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
reln->smgr_targblock = InvalidBlockNumber;
+
+ if (reln->smgr_owner)
+ {
+ *reln->smgr_owner = NULL;
+ reln->smgr_owner = NULL;
+ dlist_push_tail(&unowned_relns, &reln->node);
+ }
}
/*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
*/
void
-smgrreleaseall(void)
+smgrcloseall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrrelease(reln);
+ smgrclose(reln);
}
/*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
*/
void
-smgrcloseall(void)
+smgrdestroyall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ smgrdestroy(reln);
}
/*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
* AtEOXact_SMgr
*
* This routine is called during transaction commit or abort (it doesn't
- * particularly care which). All transient SMgrRelation objects are closed.
+ * particularly care which). All transient SMgrRelation objects are
+ * destroyed.
*
* We do this as a compromise between wanting transient SMgrRelations to
* live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
dlist_mutable_iter iter;
/*
- * Zap all unowned SMgrRelations. We rely on smgrclose() to remove each
+ * Zap all unowned SMgrRelations. We rely on smgrdestroy() to remove each
* one from the list.
*/
dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
Assert(rel->smgr_owner == NULL);
- smgrclose(rel);
+ smgrdestroy(rel);
}
}
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
bool
ProcessBarrierSmgrRelease(void)
{
- smgrreleaseall();
+ smgrcloseall();
return true;
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
#ifndef BUFMGR_H
#define BUFMGR_H
+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
/*
* prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
/* in buf_init.c */
extern void InitBufferPool(void);
extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected. Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_private_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
*
* Very little code is authorized to touch rel->rd_smgr directly. Instead
* use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period. Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation. It's quite cheap in
- * comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b7..8007f17320a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
PgStat_TableStatus
PgStat_TableXactStatus
PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
PgXmlErrorContext
PgXmlStrictness
Pg_finfo_record
--
2.37.2
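
Another aside for reviewers (not part of the patches): the shape of a
streaming read user is roughly the following, mirroring the pg_prewarm hunk
above. read_all_blocks_of_fork() and demo_next_block() are made-up
illustrative names; only the pg_streaming_read_* calls come from the patch:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/streaming_read.h"
#include "utils/rel.h"

/* Private state for the block-number callback. */
typedef struct demo_read_private
{
	BlockNumber next;
	BlockNumber nblocks;
} demo_read_private;

/* Callback: hand back blocks 0 .. nblocks-1, then InvalidBlockNumber to end the stream. */
static BlockNumber
demo_next_block(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
{
	demo_read_private *p = pgsr_private;

	if (p->next < p->nblocks)
		return p->next++;
	return InvalidBlockNumber;
}

/* Pull every block of one fork through the streaming read machinery. */
static void
read_all_blocks_of_fork(Relation rel, ForkNumber forknum)
{
	demo_read_private p;
	PgStreamingRead *pgsr;
	Buffer		buf;

	p.next = 0;
	p.nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
										  &p,
										  0,	/* no per-buffer data */
										  NULL, /* default strategy */
										  BMR_REL(rel),
										  forknum,
										  demo_next_block);

	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
		ReleaseBuffer(buf);		/* a real caller would use the page here */

	pg_streaming_read_free(pgsr);
}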
v1-0011-BitmapHeapScan-uses-streaming-read-API.patch (text/x-patch; charset=US-ASCII)
From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:27:57 -0500
Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API
Remove all of the prefetching code from BitmapHeapScan and rely on the
streaming read API's prefetching instead. The heap table AM implements a
streaming read callback which uses the TBM iterator to get the next valid
block that the streaming read API should fetch.
---
src/backend/access/gin/ginget.c | 15 +-
src/backend/access/gin/ginscan.c | 7 +
src/backend/access/heap/heapam.c | 71 +++++
src/backend/access/heap/heapam_handler.c | 78 +++--
src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
src/backend/nodes/tidbitmap.c | 80 +++---
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
src/include/nodes/tidbitmap.h | 8 +-
10 files changed, 178 insertions(+), 444 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3ce28078a6f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
if (entry->matchBitmap)
{
if (entry->matchIterator)
+ {
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->matchResult);
+ }
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -386,6 +389,7 @@ restartScanEntry:
if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
{
entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+ entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);
entry->isFinished = false;
}
}
@@ -823,21 +827,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
{
/*
* If we've exhausted all items on this block, move to next block
- * in the bitmap.
+ * in the bitmap. tbm_iterate() sets matchResult->blockno to
+ * InvalidBlockNumber when the bitmap is exhausted.
*/
- while (entry->matchResult == NULL ||
+ while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
(entry->matchResult->ntuples >= 0 &&
entry->offset >= entry->matchResult->ntuples) ||
entry->matchResult->blockno < advancePastBlk ||
(ItemPointerIsLossyPage(&advancePast) &&
entry->matchResult->blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
- if (entry->matchResult == NULL)
+ tbm_iterate(entry->matchIterator, entry->matchResult);
+ if (!BlockNumberIsValid(entry->matchResult->blockno))
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->matchResult);
+ entry->matchResult = NULL;
entry->matchIterator = NULL;
entry->isFinished = true;
break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..be27f9fe07e 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
if (entry->list)
pfree(entry->list);
if (entry->matchIterator)
+ {
tbm_end_iterate(entry->matchIterator);
+ if (entry->matchResult)
+ {
+ pfree(entry->matchResult);
+ entry->matchResult = NULL;
+ }
+ }
if (entry->matchBitmap)
tbm_free(entry->matchBitmap);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d8569373987..86484c6c72a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -115,6 +115,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -335,6 +337,22 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
if (key != NULL && scan->rs_base.rs_nkeys > 0)
memcpy(scan->rs_base.rs_key, key, scan->rs_base.rs_nkeys * sizeof(ScanKeyData));
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->pgsr)
+ pg_streaming_read_free(scan->pgsr);
+
+ scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ scan,
+ TBM_ITERATE_RESULT_SIZE,
+ scan->rs_strategy,
+ BMR_REL(scan->rs_base.rs_rd),
+ MAIN_FORKNUM,
+ bitmapheap_pgsr_next_single);
+
+
+ }
+
/*
* Currently, we only have a stats counter for sequential heap scans (but
* e.g for bitmap scans the underlying bitmap index scans will be counted,
@@ -955,6 +973,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->pgsr = NULL;
scan->vmbuffer = InvalidBuffer;
scan->empty_tuples = 0;
scan->rs_base.lossy_pages = 0;
@@ -1113,6 +1132,13 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->pgsr)
+ pg_streaming_read_free(scan->pgsr);
+ scan->pgsr = NULL;
+ }
+
pfree(scan);
}
@@ -10270,3 +10296,48 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
CheckForSerializableConflictOut(relation, xid, snapshot);
}
+
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+ {
+ hdesc->empty_tuples += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d55ece23a35..0cd586cd4b8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2113,77 +2113,65 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
*/
static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
+ void *io_private;
BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
TBMIterateResult *tbmres;
+ Assert(hscan->pgsr);
+
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- *blockno = InvalidBlockNumber;
*recheck = true;
- do
+ /* Release buffer containing previous block. */
+ if (BufferIsValid(hscan->rs_cbuf))
{
- if (scan->shared_tbmiterator)
- tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
- else
- tbmres = tbm_iterate(scan->tbmiterator);
+ ReleaseBuffer(hscan->rs_cbuf);
+ hscan->rs_cbuf = InvalidBuffer;
+ }
+
+ hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
- if (tbmres == NULL)
+ if (BufferIsInvalid(hscan->rs_cbuf))
+ {
+ if (BufferIsValid(hscan->vmbuffer))
{
- /* no more entries in the bitmap */
- Assert(hscan->empty_tuples == 0);
- return false;
+ ReleaseBuffer(hscan->vmbuffer);
+ hscan->vmbuffer = InvalidBuffer;
}
/*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE
- * isolation though, as we need to examine all invisible tuples
- * reachable by the index.
+ * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
+ * all empty tuples at the end instead of emitting them per block we
+ * skip fetching. This is necessary because the streaming read API will
+ * only return TBMIterateResults for blocks actually fetched. When we
+ * skip fetching a block, we keep track of how many empty tuples to
+ * emit at the end of the BitmapHeapScan. We do not recheck all NULL
+ * tuples.
*/
- } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+ *recheck = false;
+ return hscan->empty_tuples > 0;
+ }
- /* Got a valid block */
- *blockno = tbmres->blockno;
- *recheck = tbmres->recheck;
+ Assert(io_private);
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, and the bitmap entries don't need rechecking, and all tuples on
- * the page are visible to our transaction.
- */
- if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
- Assert(hscan->empty_tuples >= 0);
+ tbmres = (TBMIterateResult *) io_private;
- hscan->empty_tuples += tbmres->ntuples;
+ Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
- return true;
- }
+ *recheck = tbmres->recheck;
- block = tbmres->blockno;
+ hscan->rs_cblock = tbmres->blockno;
+ hscan->rs_ntuples = tbmres->ntuples;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- block);
- hscan->rs_cblock = block;
+ block = tbmres->blockno;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index efc6952e353..8b7f87a4779 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -56,11 +56,6 @@ static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapAccumCounters(BitmapHeapScanState *node,
TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
@@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
}
else
{
@@ -155,20 +141,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
* multiple processes to iterate jointly.
*/
pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
@@ -176,14 +148,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
if (!scan)
@@ -203,46 +167,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->initialized = true;
/* Get the first block. if none, end of scan */
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck))
goto exit;
- BitmapAdjustPrefetchIterator(node, node->blockno);
- BitmapAdjustPrefetchTarget(node);
}
- for (;;)
+ do
{
while (table_scan_bitmap_next_tuple(scan, slot))
{
CHECK_FOR_INTERRUPTS();
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-
- /*
- * We prefetch before fetching the current pages. We expect that a
- * future streaming read API will do this, so do it now for
- * consistency.
- */
- BitmapPrefetch(node, scan);
-
/*
* If we are using lossy info, we have to recheck the qual
* conditions at every tuple.
@@ -263,13 +197,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
return slot;
}
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
- break;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
+ } while (table_scan_bitmap_next_block(scan, &node->recheck));
/*
* if we get here it means we are at the end of the scan..
@@ -304,215 +232,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
- if (tbmpre == NULL || tbmpre->blockno != blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
-
- if (tbmpre == NULL)
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
- */
-
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult *tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
- }
- }
- }
-#endif /* USE_PREFETCH */
-}
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
@@ -559,22 +278,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->ss.ss_currentScanDesc)
table_rescan(node->ss.ss_currentScanDesc, NULL);
- /* release bitmaps and buffers if any */
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
+ /* release bitmaps if any */
if (node->tbm)
tbm_free(node->tbm);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_prefetch_iterator = NULL;
- node->pvmbuffer = InvalidBuffer;
node->recheck = true;
- node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -606,16 +315,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
ExecEndNode(outerPlanState(node));
/*
- * release bitmaps and buffers if any
+ * release bitmaps if any
*/
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
/*
* close heap scan
@@ -654,19 +357,13 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
scanstate->recheck = true;
- scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
@@ -706,13 +403,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
/*
@@ -796,14 +486,10 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
return;
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
-
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -835,11 +521,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 0f4850065fb..ccb511fb608 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
};
/*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
};
/* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
* Create the TBMIterator struct, with enough trailing space to serve the
* needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -957,20 +954,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
/*
* tbm_iterate - scan through next page of a TIDBitmap
*
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan. Pages are guaranteed to be delivered in numerical
- * order. If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition. If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway. (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order. tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition. If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway. (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -998,6 +996,7 @@ tbm_iterate(TBMIterator *iterator)
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
+ Assert(tbmres);
if (iterator->schunkptr < tbm->nchunks)
{
PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1008,11 +1007,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1028,18 +1027,20 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
+
/*
* tbm_shared_iterate - scan through next page of a TIDBitmap
*
@@ -1047,10 +1048,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1101,13 +1101,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1117,21 +1117,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
@@ -1470,8 +1471,7 @@ tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
* Create the TBMSharedIterator struct, with enough trailing space to
* serve the needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator));
istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2fc369a18ff..33e8a7e0bba 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
#include "storage/dsm.h"
#include "storage/lockdefs.h"
#include "storage/shm_toc.h"
+#include "storage/streaming_read.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -73,6 +74,7 @@ typedef struct HeapScanDescData
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
/* these fields only used in page-at-a-time mode and for bitmap scans */
+ PgStreamingRead *pgsr;
Buffer vmbuffer; /* for checking if can skip fetch */
int empty_tuples; /* count of all NULL tuples to be returned */
int rs_cindex; /* current tuple's index in vistuples */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b90d9b7f3fa..adde320d1eb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -788,17 +788,10 @@ typedef struct TableAmRoutine
* on the page have to be returned, otherwise the tuples at offsets in
* `tbmres->offsets` need to be returned.
*
- * XXX: Currently this may only be implemented if the AM uses md.c as its
- * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
- * blockids directly to the underlying storage. nodeBitmapHeapscan.c
- * performs prefetching directly using that interface. This probably
- * needs to be rectified at a later point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
- bool (*scan_bitmap_next_block) (TableScanDesc scan,
- bool *recheck, BlockNumber *blockno);
+ bool (*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1950,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* used after verifying the presence (at plan time or such).
*/
static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1961,7 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 03973a3f262..96afabc67e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,8 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
* mutex mutual exclusion for the prefetching variable
* and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1695,10 +1692,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,16 +1703,10 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
* ----------------
@@ -1728,20 +1716,13 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
bool recheck;
- BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..672608200ba 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,12 +64,16 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
dsa_pointer dp);
extern long tbm_calculate_entries(double maxbytes);
+#define TBM_ITERATE_RESULT_SIZE \
+ (offsetof(TBMIterateResult, offsets) + \
+ MaxHeapTuplesPerPage * sizeof(OffsetNumber))
+
#endif /* TIDBITMAP_H */
--
2.37.2
On Feb 13, 2024, at 3:11 PM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Thanks for the patch...
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.
patches 0001-0009 apply on top of master but 0010 and 0011 must be
applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).
I followed your lead and applied them to 6a8ffe812d194ba6f4f26791b6388a4837d17d6c. `make check` worked fine, though I expect you know that already.
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
You might want an independent perspective on how much of a hassle those breaking changes are, so I took a stab at that. Having written a custom proprietary TAM for postgresql 15 here at EDB, and having ported it and released it for postgresql 16, I thought I'd try porting it to the above commit with your patches. Even without your patches, I already see breaking changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which creates a similar amount of breakage for me as do your patches. Dealing with the combined breakage might amount to a day of work, including testing, half of which I think I've already finished. In other words, it doesn't seem like a big deal.
Were postgresql 17 shaping up to be compatible with TAMs written for 16, your patch would change that qualitatively, but since things are already incompatible, I think you're in the clear.
A TBMIterateResult used to be threaded through both of these functions
and used in BitmapHeapNext(). This patch set removes all references to
TBMIterateResults from BitmapHeapNext. Because the streaming read API
requires the callback to specify the next block, BitmapHeapNext() can
no longer pass a TBMIterateResult to table_scan_bitmap_next_block().
More subtly, table_scan_bitmap_next_block() used to return false if
there were no more visible tuples on the page or if the block that was
requested was not valid. With these changes,
table_scan_bitmap_next_block() will only return false when the bitmap
has been exhausted and the scan can end. In order to use the streaming
read API, the user must be able to request the blocks it needs without
requiring synchronous feedback per block. Thus, this table AM function
must change its meaning.
I think the way the patches are split up could be improved. I will
think more about this. There are also probably a few mistakes with
which comments are updated in which patches in the set.
I look forward to the next version of the patch set. Thanks again for working on this.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 13, 2024 at 11:34 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
On Feb 13, 2024, at 3:11 PM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Thanks for the patch...
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
patches 0005-0006 push the skip fetch optimization into the table AMs
patches 0007-0009 change the control flow of BitmapHeapNext() to match
that required by the streaming read API
patch 0010 is the streaming read code not yet in master
patch 0011 is the actual bitmapheapscan streaming read user.
patches 0001-0009 apply on top of master but 0010 and 0011 must be
applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased
version of the streaming read API is on the mailing list).
I followed your lead and applied them to 6a8ffe812d194ba6f4f26791b6388a4837d17d6c. `make check` worked fine, though I expect you know that already.
Thanks for taking a look!
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
You might want an independent perspective on how much of a hassle those breaking changes are, so I took a stab at that. Having written a custom proprietary TAM for postgresql 15 here at EDB, and having ported it and released it for postgresql 16, I thought I'd try porting it to the above commit with your patches. Even without your patches, I already see breaking changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which creates a similar amount of breakage for me as do your patches. Dealing with the combined breakage might amount to a day of work, including testing, half of which I think I've already finished. In other words, it doesn't seem like a big deal.
Were postgresql 17 shaping up to be compatible with TAMs written for 16, your patch would change that qualitatively, but since things are already incompatible, I think you're in the clear.
Oh, good to know! I'm very happy to have the perspective of a table AM
author. Just curious, did your table AM implement
table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
and, if so, did you use the TBMIterateResult? Since it is not used in
BitmapHeapNext() in my version, table AMs would have to change how
they use TBMIterateResults anyway. But I assume they could add it to a
table AM specific scan descriptor if they want access to a
TBMIterateResult of their own making in both
table_scan_bitmap_next_block() and next_tuple()?
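Something along these lines is what I'm imagining (all of the names below are hypothetical, just to illustrate; assumes access/relscan.h and nodes/tidbitmap.h are included):

typedef struct MyAMScanDescData
{
    TableScanDescData rs_base;      /* AM-independent part, must be first */

    /*
     * Filled by this AM's scan_bitmap_next_block() via tbm_iterate() and
     * consumed by its scan_bitmap_next_tuple().
     */
    TBMIterateResult *bm_result;
} MyAMScanDescData;
typedef struct MyAMScanDescData *MyAMScanDesc;

The AM would allocate bm_result (TBM_ITERATE_RESULT_SIZE bytes) in its begin
scan function and free it in its end scan function, so nothing about it would
need to be visible to the executor.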
- Melanie
On Feb 14, 2024, at 6:47 AM, Melanie Plageman <melanieplageman@gmail.com> wrote:
Just curious, did your table AM implement
table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
and, if so, did you use the TBMIterateResult? Since it is not used in
BitmapHeapNext() in my version, table AMs would have to change how
they use TBMIterateResults anyway. But I assume they could add it to a
table AM specific scan descriptor if they want access to a
TBMIterateResult of their own making in both
table_scan_bitmap_next_block() and next_tuple()?
My table AM does implement those two functions and does use the TBMIterateResult *tbmres argument, yes. I would deal with the issue in very much the same way that your patches modify heapam. I don't really have any additional comments about that.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
Attached is a patch set which refactors BitmapHeapScan such that it
can use the streaming read API [1]. It also resolves the long-standing
FIXME in the BitmapHeapScan code suggesting that the skip fetch
optimization should be pushed into the table AMs. Additionally, it
moves table scan initialization to after the index scan and bitmap
initialization.
Thanks for working on this! While I have some quibbles with details, I think
this is quite a bit of progress in the right direction.
patches 0001-0002 are assorted cleanup needed later in the set.
patches 0003 moves the table scan initialization to after bitmap creation
patch 0004 is, I think, a bug fix. see [2].
I'd not quite call it a bugfix, it's not like it leads to wrong
behaviour. Seems more like an optimization. But whatever :)
The caveat is that these patches introduce breaking changes to two
table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
and table_scan_bitmap_next_tuple().
That's to be expected, I don't think it's worth worrying about. Right now a
bunch of TAMs can't implement bitmap scans, this goes a fair bit towards
allowing that...
From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup
There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.
The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().
By saving that worker snapshot as a member in the BitmapHeapScanState
during initialization, it can be restored in table_beginscan_bm() after
returning from the table AM specific begin scan function.
I don't understand what the point of passing two different snapshots to
table_beginscan_bm() is. What does that even mean? Why can't we just use the
correct snapshot initially?
From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
fetch
Previously BitmapPrefetch() used the recheck flag for the current block
to determine whether or not it could skip prefetching the proposed
prefetch block. It makes more sense for it to use the recheck flag from
the TBMIterateResult for the prefetch block instead.
I'd mention the commit that introduced the current logic and link to the
the thread that you started about this.
From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
Hm - I'm not convinced this is a good direction - doesn't that arguably
*increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
heap scans in a TAM without blocks, but still.
From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
table AM
This resolves the long-standing FIXME in BitmapHeapNext() which said that
the optmization to skip fetching blocks of the underlying table when
none of the column data was needed should be pushed into the table AM
specific code.
Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
enough to drink, at least in some countries. :)
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.
---
src/backend/access/heap/heapam.c | 10 +++
src/backend/access/heap/heapam_handler.c | 29 +++++++
src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 17 ++--
src/include/nodes/execnodes.h | 6 --
6 files changed, 74 insertions(+), 90 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..7aae1ecf0a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->vmbuffer = InvalidBuffer;
+ scan->empty_tuples = 0;
These don't follow the existing naming pattern for HeapScanDescData. While I
explicitly dislike the practice of adding prefixes to struct members, I don't
think mixing conventions within a single struct improves things.
I also think it'd be good to note in comments that the vm buffer currently is
only used for bitmap heap scans, otherwise one might think they'd also be used
for normal scans, where we don't need them, because of the page level flag.
Also, perhaps worth renaming "empty_tuples" to something indicating that it's
the number of empty tuples to be returned later? num_empty_tuples_pending or
such? Or the current "return_empty_tuples".
@@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->vmbuffer))
+ ReleaseBuffer(scan->vmbuffer);
+ scan->vmbuffer = InvalidBuffer;
It does not matter one iota here, but personally I prefer moving the write
inside the if, as dirtying the cacheline after we just figured out we don't need it seems unnecessary.
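I.e. roughly this shape (just a sketch of what I mean):

if (BufferIsValid(scan->vmbuffer))
{
    ReleaseBuffer(scan->vmbuffer);
    scan->vmbuffer = InvalidBuffer;
}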
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9372b49bfaa..c0fb06c9688 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ bool can_skip_fetch;
/*
* We can potentially skip fetching heap pages if we do not need any
* columns of the table, either for checking non-indexable quals or
Pretty sure pgindent will move this around.
+++ b/src/include/access/tableam.h
@@ -62,6 +62,7 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
Would be nice to add a comment explaining what this flag means.
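Perhaps something like this (wording is only a suggestion):

    /* unregister snapshot at scan end? */
    SO_TEMP_SNAPSHOT = 1 << 9,

    /*
     * At the table AM's discretion, fetching a block may be skipped when no
     * tuple data is needed and all tuples on the page are visible.
     */
    SO_CAN_SKIP_FETCH = 1 << 10,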
From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
pages
Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam.c | 2 ++
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
src/include/access/relscan.h | 4 ++++
4 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7aae1ecf0a9..88b4aad5820 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_strategy = NULL; /* set in initscan */
scan->vmbuffer = InvalidBuffer;
scan->empty_tuples = 0;
+ scan->rs_base.lossy_pages = 0;
+ scan->rs_base.exact_pages = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index baba09c87c0..6e85ef7a946 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ /* Only count exact and lossy pages with visible tuples */
+ if (ntup > 0)
+ {
+ if (tbmres->ntuples >= 0)
+ scan->exact_pages++;
+ else
+ scan->lossy_pages++;
+ }
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c0fb06c9688..19d115de06f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
#include "utils/spccache.h"
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
BlockNumber blockno);
@@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
continue;
}
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
- node->lossy_pages++;
-
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
@@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* if we get here it means we are at the end of the scan..
*/
+ BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
}
+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan)
+{
+ node->exact_pages += scan->exact_pages;
+ scan->exact_pages = 0;
+ node->lossy_pages += scan->lossy_pages;
+ scan->lossy_pages = 0;
+}
+
I don't think this is quite right - you're calling BitmapAccumCounters() only
when the scan doesn't return anything anymore, but there's no guarantee
that'll ever be reached. E.g. a bitmap heap scan below a limit node. I think
this needs to be in a) ExecEndBitmapHeapScan() b) ExecReScanBitmapHeapScan()
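I.e. something like this in both of those functions (untested sketch):

    /*
     * Flush the per-scan-descriptor counters into the node before the scan
     * descriptor is rescanned or closed.
     */
    if (node->ss.ss_currentScanDesc)
        BitmapAccumCounters(node, node->ss.ss_currentScanDesc);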
/*
* BitmapDoneInitializingSharedState - Shared state is initialized
*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ long exact_pages;
+ long lossy_pages;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
I wonder if this really is the best place for the data to be accumulated. This
requires the accounting to be implemented in each AM, which doesn't obviously
seem required. Why can't the accounting continue to live in
nodeBitmapHeapscan.c, to be done after each table_scan_bitmap_next_block()
call?
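E.g. the counting could stay in BitmapHeapNext() if the AM reported the page's lossiness back somehow; a hypothetical extra out parameter would look roughly like:

    bool        lossy;

    if (table_scan_bitmap_next_block(scan, &node->recheck, &lossy))
    {
        if (lossy)
            node->lossy_pages++;
        else
            node->exact_pages++;
    }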
From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the known end of the table.
This sounds a bit like the block is actually past the end of the table,
but in reality this happens if the block is past the end of the table as it
was when the scan was started. Somehow that feels significant, but I don't
really know why I think that.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 88b4aad5820..d8569373987 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->empty_tuples = 0;
scan->rs_base.lossy_pages = 0;
scan->rs_base.exact_pages = 0;
+ scan->rs_base.shared_tbmiterator = NULL;
+ scan->rs_base.tbmiterator = NULL;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
ReleaseBuffer(scan->vmbuffer);
scan->vmbuffer = InvalidBuffer;
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_base.shared_tbmiterator)
+ tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
+
+ if (scan->rs_base.tbmiterator)
+ tbm_end_iterate(scan->rs_base.tbmiterator);
+ }
+
+ scan->rs_base.shared_tbmiterator = NULL;
+ scan->rs_base.tbmiterator = NULL;
+
/*
* reinitialize scan descriptor
*/
If every AM would need to implement this, perhaps this shouldn't be done here,
but in generic code?
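E.g. the generic wrapper could own that cleanup (sketch, assuming the iterators stay in TableScanDescData):

static inline void
table_rescan(TableScanDesc scan, struct ScanKeyData *key)
{
    /* end any bitmap iteration state before the AM resets its own state */
    if (scan->shared_tbmiterator)
    {
        tbm_end_shared_iterate(scan->shared_tbmiterator);
        scan->shared_tbmiterator = NULL;
    }
    if (scan->tbmiterator)
    {
        tbm_end_iterate(scan->tbmiterator);
        scan->tbmiterator = NULL;
    }

    scan->rs_rd->rd_tableam->scan_rescan(scan, key, false, false, false, false);
}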
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->empty_tuples == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
Hm. Isn't it a problem that we have no CHECK_FOR_INTERRUPTS() in this loop?
@@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
}
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
Why can't we fetch the next block immediately?
@@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
It seems a bit odd that this code modifies the scan descriptor, instead of
passing the iterator, or perhaps better the bitmap itself, to
table_beginscan_bm()?
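I.e. a hypothetical signature along the lines of:

/* hypothetical: hand the bitmap to the AM when the scan starts */
TableScanDesc table_beginscan_bm(Relation rel, Snapshot snapshot,
                                 int nkeys, struct ScanKeyData *key,
                                 TIDBitmap *bitmap, bool need_tuples);

so the AM can set up its own (shared) iterator internally rather than having
the executor poke iterators into the scan descriptor afterwards.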
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..bf7ee044268 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
#include "access/htup_details.h"
#include "access/itup.h"
+#include "nodes/tidbitmap.h"
I'd like to avoid exposing this to everything including relscan.h. I think we
could just forward declare the structs and use them here to avoid that?
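Something like this (sketch):

/* in relscan.h, instead of #include "nodes/tidbitmap.h" */
struct TBMIterator;
struct TBMSharedIterator;

/* ...and the TableScanDescData members would become... */
struct TBMIterator *tbmiterator;
struct TBMSharedIterator *shared_tbmiterator;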
From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:27:57 -0500
Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API
Remove all of the code to do prefetching from BitmapHeapScan code and
rely on the streaming read API prefetching. Heap table AM implements a
streaming read callback which uses the iterator to get the next valid
block that needs to be fetched for the streaming read API.
---
src/backend/access/gin/ginget.c | 15 +-
src/backend/access/gin/ginscan.c | 7 +
src/backend/access/heap/heapam.c | 71 +++++
src/backend/access/heap/heapam_handler.c | 78 +++--
src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
src/backend/nodes/tidbitmap.c | 80 +++---
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
src/include/nodes/tidbitmap.h | 8 +-
10 files changed, 178 insertions(+), 444 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3ce28078a6f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
if (entry->matchBitmap)
{
if (entry->matchIterator)
+ {
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->matchResult);
+ }
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -386,6 +389,7 @@ restartScanEntry:
if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
{
entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+ entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);
Do we actually have to use palloc0? TBM_ITERATE_RESULT_SIZE ain't small, so
zeroing all of it isn't free.
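E.g. (sketch) a plain palloc plus initializing the one field the loop inspects before the first tbm_iterate() call might be enough:

entry->matchResult = palloc(TBM_ITERATE_RESULT_SIZE);
entry->matchResult->blockno = InvalidBlockNumber;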
+static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
Is it correct to have _single in the name here? Aren't we also using it for
parallel scans?
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+ {
+ hdesc->empty_tuples += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
Need to check for interrupts somewhere here.
@@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
There's still a comment in BitmapHeapNext talking about prefetching with two
iterators etc. That seems outdated now.
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan.  If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
Hm - it seems a tad odd that we later have to find out if the scan is done
iterating by checking if blockno is valid, when tbm_iterate already knew. But
I guess the code would be a bit uglier if we needed the result of
tbm_[shared_]iterate(), due to the two functions.
Right now ExecEndBitmapHeapScan() frees the tbm before it does table_endscan()
- which seems problematic, as heap_endscan() will do stuff like
tbm_end_iterate(), which imo shouldn't be called after the tbm has been freed,
even if that works today.
It seems a bit confusing that your changes seem to treat
BitmapHeapScanState->initialized as separate from ->scan, even though afaict
scan should be NULL iff initialized is false and vice versa.
Independent of your patches, but brr, it's ugly that
BitmapShouldInitializeSharedState() blocks.
Greetings,
Andres Freund
Thank you so much for this thorough review!!!!
On Wed, Feb 14, 2024 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup

There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.

The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().

By saving that worker snapshot as a member in the BitmapHeapScanState
during initialization, it can be restored in table_beginscan_bm() after
returning from the table AM specific begin scan function.

I don't understand what the point of passing two different snapshots to
table_beginscan_bm() is. What does that even mean? Why can't we just use the
correct snapshot initially?
Indeed. Honestly, it was an unlabeled TODO for me. I wasn't quite sure
how to get the same behavior as in master. Fixed in attached v2.
Now the parallel worker still restores and registers that snapshot in
ExecBitmapHeapInitializeWorker() and then saves it in the
BitmapHeapScanState. We then pass SO_TEMP_SNAPSHOT as an extra flag
(to set rs_flags) to table_beginscan_bm() if there is a parallel
worker snapshot saved in the BitmapHeapScanState.
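To sketch the idea (simplified; the saved-snapshot field name and the extra
flags parameter are illustrative here, not necessarily the exact names in v2):

/* in BitmapHeapNext(), when lazily creating the scan descriptor */
Snapshot	snapshot = node->ss.ps.state->es_snapshot;
uint32		extra_flags = 0;

/*
 * In a parallel worker, ExecBitmapHeapInitializeWorker() restored and
 * registered the snapshot from the parallel state and saved it in the node,
 * so use it and ask the AM to unregister it at end of scan.
 */
if (node->worker_snapshot != NULL)
{
	snapshot = node->worker_snapshot;
	extra_flags |= SO_TEMP_SNAPSHOT;
}

scan = table_beginscan_bm(node->ss.ss_currentRelation, snapshot,
						  0, NULL, extra_flags);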
From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
fetch

Previously BitmapPrefetch() used the recheck flag for the current block
to determine whether or not it could skip prefetching the proposed
prefetch block. It makes more sense for it to use the recheck flag from
the TBMIterateResult for the prefetch block instead.

I'd mention the commit that introduced the current logic and link to
the thread that you started about this.
Done
From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber

BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.

Hm - I'm not convinced this is a good direction - doesn't that arguably
*increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
heap scans in a TAM without blocks, but still.
This is removed in later commits and is an intermediate state to try
and move the TBMIterateResult out of BitmapHeapNext(). I can find
another way to achieve this if it is important.
From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
table AM

This resolves the long-standing FIXME in BitmapHeapNext() which said that
the optimization to skip fetching blocks of the underlying table when
none of the column data was needed should be pushed into the table AM
specific code.

Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
enough to drink, at least in some countries. :)
;) I've updated the commit message. Though it is longstanding in that
it predates Melanie + Postgres.
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.
---
src/backend/access/heap/heapam.c | 10 +++
src/backend/access/heap/heapam_handler.c | 29 +++++++
src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 17 ++--
src/include/nodes/execnodes.h | 6 --
 6 files changed, 74 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..7aae1ecf0a9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_flags = flags;
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;

These don't follow the existing naming pattern for HeapScanDescData. While I
explicitly dislike the practice of adding prefixes to struct members, I don't
think mixing conventions within a single struct improves things.
I've updated the names. What does rs even stand for?
I also think it'd be good to note in comments that the vm buffer currently is
only used for bitmap heap scans, otherwise one might think they'd also be used
for normal scans, where we don't need them, because of the page level flag.
Done.
Also, perhaps worth renaming "empty_tuples" to something indicating that it's
the number of empty tuples to be returned later? num_empty_tuples_pending or
such? Or the current "return_empty_tuples".
Done.
@@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);

+	if (BufferIsValid(scan->vmbuffer))
+		ReleaseBuffer(scan->vmbuffer);
+	scan->vmbuffer = InvalidBuffer;

It does not matter one iota here, but personally I prefer moving the write
inside the if, as dirtying the cacheline after we just figured out we don't
need to write it seems wasteful.
I've now followed this convention throughout my patchset in the places
where I noticed it.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9372b49bfaa..c0fb06c9688 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	 */
 	if (!node->initialized)
 	{
+		bool can_skip_fetch;
 		/*
 		 * We can potentially skip fetching heap pages if we do not need any
 		 * columns of the table, either for checking non-indexable quals or

Pretty sure pgindent will move this around.
This is gone now, but I have pgindented all the commits so it
shouldn't be a problem again.
+++ b/src/include/access/tableam.h
@@ -62,6 +62,7 @@ typedef enum ScanOptions

 	/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+ SO_CAN_SKIP_FETCH = 1 << 10,
 } ScanOptions;

Would be nice to add a comment explaining what this flag means.
Done.
From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
pages

Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam.c | 2 ++
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
src/include/access/relscan.h | 4 ++++
 4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7aae1ecf0a9..88b4aad5820 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->vmbuffer = InvalidBuffer;
 	scan->empty_tuples = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.exact_pages = 0;

 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index baba09c87c0..6e85ef7a946 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;

+	/* Only count exact and lossy pages with visible tuples */
+	if (ntup > 0)
+	{
+		if (tbmres->ntuples >= 0)
+			scan->exact_pages++;
+		else
+			scan->lossy_pages++;
+	}
+
 	return ntup > 0;
 }

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c0fb06c9688..19d115de06f 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
 #include "utils/spccache.h"

 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+									   TableScanDesc scan);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
 static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
 												BlockNumber blockno);
@@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				continue;
 			}

-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
 			/* Adjust the prefetch target */
 			BitmapAdjustPrefetchTarget(node);
 		}
@@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	/*
 	 * if we get here it means we are at the end of the scan..
 	 */
+	BitmapAccumCounters(node, scan);
 	return ExecClearTuple(slot);
 }

+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+					TableScanDesc scan)
+{
+	node->exact_pages += scan->exact_pages;
+	scan->exact_pages = 0;
+	node->lossy_pages += scan->lossy_pages;
+	scan->lossy_pages = 0;
+}
+

I don't think this is quite right - you're calling BitmapAccumCounters() only
when the scan doesn't return anything anymore, but there's no guarantee
that'll ever be reached. E.g. a bitmap heap scan below a limit node. I think
this needs to be in a) ExecEndBitmapHeapScan() b) ExecReScanBitmapHeapScan()
The scan descriptor isn't available in ExecEnd/ReScanBitmapHeapScan().
So, if we count in the scan descriptor we can't accumulate into the
BitmapHeapScanState there. The reason to count in the scan descriptor
is that it is in the table AM where we know if we have a lossy or
exact page -- and we only have the scan descriptor not the
BitmapHeapScanState in the table AM.
I added a call to BitmapAccumCounters before the tuple is returned for
correctness in this version (not ideal, I realize). See below for
thoughts about what we could do instead.
 /*
  * BitmapDoneInitializingSharedState - Shared state is initialized
  *

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;

+	/* Only used for Bitmap table scans */
+	long		exact_pages;
+	long		lossy_pages;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).

I wonder if this really is the best place for the data to be accumulated. This
requires the accounting to be implemented in each AM, which doesn't obviously
seem required. Why can't the accounting continue to live in
nodeBitmapHeapscan.c, to be done after each table_scan_bitmap_next_block()
call?
Yes, I would really prefer not to do it in the table AM. But, we only
count exact and lossy pages for which at least one or more tuples were
visible (change this and you'll see tests fail). So, we need to decide
if we are going to increment the counters somewhere where we have
access to that information. In the case of heap, that is really only
once I have the value of ntup in heapam_scan_bitmap_next_block(). To
get that information back out to BitmapHeapNext(), I considered adding
another parameter to heapam_scan_bitmap_next_block() -- maybe an enum
like this:
/*
* BitmapHeapScans's bitmaps can choose to store per page information in a
* lossy or exact way. Exact pages in the bitmap have the individual tuple
* offsets that need to be visited while lossy pages in the bitmap have only the
* block number of the page.
*/
typedef enum BitmapBlockResolution
{
BITMAP_BLOCK_NO_VISIBLE,
BITMAP_BLOCK_LOSSY,
BITMAP_BLOCK_EXACT,
} BitmapBlockResolution;
which we then use to increment the counter. But while I was writing
this code, I found myself narrating in the comment that the reason
this had to be set inside of the table AM is that only the table AM
knows if it wants to count the block as lossy, exact, or not count it.
So, that made me question if it really should be in the
BitmapHeapScanState.
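For illustration, if the enum were returned, the counting on the
BitmapHeapNext() side could look roughly like this (hypothetical sketch, not
something in the attached patches):

BitmapBlockResolution resolution;

while (table_scan_bitmap_next_block(scan, &recheck, &blockno, &resolution))
{
	/* only pages with at least one visible tuple are counted */
	if (resolution == BITMAP_BLOCK_EXACT)
		node->exact_pages++;
	else if (resolution == BITMAP_BLOCK_LOSSY)
		node->lossy_pages++;

	/* ... go on to return the tuples from this block ... */
}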
I also explored passing the table scan descriptor to
show_tidbitmap_info() -- but that had its own problems.
From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly

table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the known end of the table.

This sounds a bit like the block is actually past the end of the table,
but in reality this happens if the block is past the end of the table as it
was when the scan was started. Somehow that feels significant, but I don't
really know why I think that.
I have tried to update the commit message to make it clearer. I was
actually wondering: now that we do table_beginscan_bm() in
BitmapHeapNext() instead of ExecInitBitmapHeapScan(), have we reduced
or eliminated the opportunity for this to be true? initscan() sets
rs_nblocks and that now happens much later.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 88b4aad5820..d8569373987 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->empty_tuples = 0;
 	scan->rs_base.lossy_pages = 0;
 	scan->rs_base.exact_pages = 0;
+	scan->rs_base.shared_tbmiterator = NULL;
+	scan->rs_base.tbmiterator = NULL;

 	/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
ReleaseBuffer(scan->vmbuffer);
 		scan->vmbuffer = InvalidBuffer;

+	if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		if (scan->rs_base.shared_tbmiterator)
+			tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
+
+		if (scan->rs_base.tbmiterator)
+			tbm_end_iterate(scan->rs_base.tbmiterator);
+	}
+
+	scan->rs_base.shared_tbmiterator = NULL;
+	scan->rs_base.tbmiterator = NULL;
+
 	/*
 	 * reinitialize scan descriptor
 	 */

If every AM would need to implement this, perhaps this shouldn't be done here,
but in generic code?
Fixed.
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,

 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+							  bool *recheck, BlockNumber *blockno)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	BlockNumber block;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
+	TBMIterateResult *tbmres;

 	hscan->rs_cindex = 0;
 	hscan->rs_ntuples = 0;

+	*blockno = InvalidBlockNumber;
+	*recheck = true;
+
+	do
+	{
+		if (scan->shared_tbmiterator)
+			tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+		else
+			tbmres = tbm_iterate(scan->tbmiterator);
+
+		if (tbmres == NULL)
+		{
+			/* no more entries in the bitmap */
+			Assert(hscan->empty_tuples == 0);
+			return false;
+		}
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend). We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+	} while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);

Hm. Isn't it a problem that we have no CHECK_FOR_INTERRUPTS() in this loop?
Yes. fixed.
@@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
 	}

-	return ntup > 0;
+	/*
+	 * Return true to indicate that a valid block was found and the bitmap is
+	 * not exhausted. If there are no visible tuples on this page,
+	 * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+	 * return false returning control to this function to advance to the next
+	 * block in the bitmap.
+	 */
+	return true;
 }

Why can't we fetch the next block immediately?
We don't know that we want another block until we've gone through this
page and seen there were no visible tuples, so we'd somehow have to
jump back up to the top of the function to get the next block -- which
is basically what is happening in my revised control flow. We call
heapam_scan_bitmap_next_tuple() and rs_ntuples is 0, so we end up
calling heapam_scan_bitmap_next_block() right away.
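Spelled out, the resulting loop has roughly this shape (heavily simplified
pseudocode, not the actual code; the need_new_block flag is just illustrative):

for (;;)
{
	CHECK_FOR_INTERRUPTS();

	if (need_new_block)
	{
		/* only returns false once the bitmap is exhausted */
		if (!table_scan_bitmap_next_block(scan, &node->recheck, &blockno))
			break;
		need_new_block = false;
	}

	/* returns false when there are no (more) visible tuples on this page */
	if (!table_scan_bitmap_next_tuple(scan, slot))
	{
		need_new_block = true;
		continue;
	}

	/* ... recheck quals if necessary and return the tuple ... */
}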
@@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
can_skip_fetch);
 		}

-		node->tbmiterator = tbmiterator;
-		node->shared_tbmiterator = shared_tbmiterator;
+		scan->tbmiterator = tbmiterator;
+		scan->shared_tbmiterator = shared_tbmiterator;

It seems a bit odd that this code modifies the scan descriptor, instead of
passing the iterator, or perhaps better the bitmap itself, to
table_beginscan_bm()?
On rescan we actually will have initialized = false and make new
iterators but have the old scan descriptor. So, we need to be able to
set the iterator in the scan to the new iterator.
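Concretely, the initialization branch now has roughly this shape (simplified;
parallel case omitted and the table_beginscan_bm() arguments abbreviated):

if (!node->initialized)
{
	/* build (or rebuild, on rescan) the bitmap and an iterator over it */
	tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
	tbmiterator = tbm_begin_iterate(tbm);

	/* the scan descriptor is only created the first time through */
	if (scan == NULL)
		scan = table_beginscan_bm(node->ss.ss_currentRelation,
								  snapshot, 0, NULL, extra_flags);

	/* on rescan the descriptor is reused, so point it at the new iterator */
	scan->tbmiterator = tbmiterator;

	node->initialized = true;
}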
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..bf7ee044268 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 #include "access/htup_details.h"
#include "access/itup.h"
+#include "nodes/tidbitmap.h"I'd like to avoid exposing this to everything including relscan.h. I think we
could just forward declare the structs and use them here to avoid that?
Done
From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:27:57 -0500
Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API

Remove all of the code to do prefetching from BitmapHeapScan code and
rely on the streaming read API prefetching. Heap table AM implements a
streaming read callback which uses the iterator to get the next valid
block that needs to be fetched for the streaming read API.
---
src/backend/access/gin/ginget.c | 15 +-
src/backend/access/gin/ginscan.c | 7 +
src/backend/access/heap/heapam.c | 71 +++++
src/backend/access/heap/heapam_handler.c | 78 +++--
src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
src/backend/nodes/tidbitmap.c | 80 +++---
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
src/include/nodes/tidbitmap.h | 8 +-
 10 files changed, 178 insertions(+), 444 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3ce28078a6f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 	if (entry->matchBitmap)
 	{
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			pfree(entry->matchResult);
+		}
 		entry->matchIterator = NULL;
 		tbm_free(entry->matchBitmap);
 		entry->matchBitmap = NULL;
@@ -386,6 +389,7 @@ restartScanEntry:
 	if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 	{
 		entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+		entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);

Do we actually have to use palloc0? TBM_ITERATE_RESULT_SIZE ain't small, so
zeroing all of it isn't free.
Tests actually did fail when I didn't use palloc0.
This code is different now though. There are a few new patches in v2
that 1) make the offsets array in the TBMIterateResult fixed size and
then this makes it possible to 2) make matchResult an inline member of
the GinScanEntry. I have a TODO in the code asking if setting blockno
in the TBMIterateResult to InvalidBlockNumber is sufficient
"resetting".
+static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+											   void *per_buffer_data);

Is it correct to have _single in the name here? Aren't we also using for
parallel scans?
Right. I had a separate parallel version and then deleted it. This is now fixed.
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres = per_buffer_data;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		/* no more entries in the bitmap */
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend). We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+
+	/* not reachable */
+	Assert(false);
+}

Need to check for interrupts somewhere here.
Done.
@@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
There's still a comment in BitmapHeapNext talking about prefetching with two
iterators etc. That seems outdated now.
Fixed.
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan.  If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)

Hm - it seems a tad odd that we later have to find out if the scan is done
iterating by checking if blockno is valid, when tbm_iterate already knew. But
I guess the code would be a bit uglier if we needed the result of
tbm_[shared_]iterate(), due to the two functions.
Yes.
Right now ExecEndBitmapHeapScan() frees the tbm before it does table_endscan()
- which seems problematic, as heap_endscan() will do stuff like
tbm_end_iterate(), which imo shouldn't be called after the tbm has been freed,
even if that works today.
I've flipped the order -- I end the scan then free the bitmap.
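So ExecEndBitmapHeapScan() now does, roughly (other cleanup omitted):

	if (scanDesc != NULL)
		table_endscan(scanDesc);	/* lets the AM end its iterators first */

	if (node->tbm != NULL)
		tbm_free(node->tbm);		/* only free the bitmap afterwards */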
It seems a bit confusing that your changes seem to treat
BitmapHeapScanState->initialized as separate from ->scan, even though afaict
scan should be NULL iff initialized is false and vice versa.
I thought so too, but it seems on rescan that the node->initialized is
set to false but the scan is reused. So, we want to only make a new
scan descriptor if it is truly the beginning of a new scan.
- Melanie
Attachments:
v2-0011-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patchtext/x-patch; charset=US-ASCII; name=v2-0011-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patchDownload
From b0417f661ffab18058a392327370eb8690b49c38 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:23:41 -0500
Subject: [PATCH v2 11/13] Separate TBM[Shared]Iterator and TBMIterateResult
Remove the TBMIterateResult from the TBMIterator and TBMSharedIterator
and have tbm_[shared_]iterate() take a TBMIterateResult as a parameter.
This will allow multiple TBMIterateResults to exist concurrently
allowing asynchronous use of the TIDBitmap for prefetching, for example.
tbm_[shared]_iterate() now sets blockno to InvalidBlockNumber when the
bitmap is exhausted instead of returning NULL.
BitmapHeapScan callers of tbm_iterate make a TBMIterateResult locally
and pass it in.
Because GIN only needs a single TBMIterateResult, inline the matchResult
in the GinScanEntry to avoid having to separately manage memory for the
TBMIterateResult.
---
src/backend/access/gin/ginget.c | 48 +++++++++------
src/backend/access/gin/ginscan.c | 2 +-
src/backend/access/heap/heapam_handler.c | 32 +++++-----
src/backend/executor/nodeBitmapHeapscan.c | 33 +++++-----
src/backend/nodes/tidbitmap.c | 73 ++++++++++++-----------
src/include/access/gin_private.h | 2 +-
src/include/nodes/tidbitmap.h | 4 +-
7 files changed, 107 insertions(+), 87 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..831941271c4 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -332,10 +332,22 @@ restartScanEntry:
entry->list = NULL;
entry->nlist = 0;
entry->matchBitmap = NULL;
- entry->matchResult = NULL;
entry->reduceResult = false;
entry->predictNumberResult = 0;
+ /*
+ * MTODO: is it enough to set blockno to InvalidBlockNumber? In all the
+ * places where we previously set matchResult to NULL, I just set blockno to
+ * InvalidBlockNumber. It seems like this should be okay because that is
+ * usually what we check before using the matchResult members. But it might
+ * be safer to zero out the offsets array. But that is expensive.
+ */
+ entry->matchResult.blockno = InvalidBlockNumber;
+ entry->matchResult.ntuples = 0;
+ entry->matchResult.recheck = true;
+ memset(entry->matchResult.offsets, 0,
+ sizeof(OffsetNumber) * MaxHeapTuplesPerPage);
+
/*
* we should find entry, and begin scan of posting tree or just store
* posting list in memory
@@ -374,6 +386,7 @@ restartScanEntry:
{
if (entry->matchIterator)
tbm_end_iterate(entry->matchIterator);
+ entry->matchResult.blockno = InvalidBlockNumber;
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -823,18 +836,19 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
{
/*
* If we've exhausted all items on this block, move to next block
- * in the bitmap.
+ * in the bitmap. tbm_iterate() sets matchResult->blockno to
+ * InvalidBlockNumber when the bitmap is exhausted.
*/
- while (entry->matchResult == NULL ||
- (entry->matchResult->ntuples >= 0 &&
- entry->offset >= entry->matchResult->ntuples) ||
- entry->matchResult->blockno < advancePastBlk ||
+ while ((!BlockNumberIsValid(entry->matchResult.blockno)) ||
+ (entry->matchResult.ntuples >= 0 &&
+ entry->offset >= entry->matchResult.ntuples) ||
+ entry->matchResult.blockno < advancePastBlk ||
(ItemPointerIsLossyPage(&advancePast) &&
- entry->matchResult->blockno == advancePastBlk))
+ entry->matchResult.blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
+ tbm_iterate(entry->matchIterator, &entry->matchResult);
- if (entry->matchResult == NULL)
+ if (!BlockNumberIsValid(entry->matchResult.blockno))
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
@@ -858,10 +872,10 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* We're now on the first page after advancePast which has any
* items on it. If it's a lossy result, return that.
*/
- if (entry->matchResult->ntuples < 0)
+ if (entry->matchResult.ntuples < 0)
{
ItemPointerSetLossyPage(&entry->curItem,
- entry->matchResult->blockno);
+ entry->matchResult.blockno);
/*
* We might as well fall out of the loop; we could not
@@ -875,27 +889,27 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* Not a lossy page. Skip over any offsets <= advancePast, and
* return that.
*/
- if (entry->matchResult->blockno == advancePastBlk)
+ if (entry->matchResult.blockno == advancePastBlk)
{
/*
* First, do a quick check against the last offset on the
* page. If that's > advancePast, so are all the other
* offsets, so just go back to the top to get the next page.
*/
- if (entry->matchResult->offsets[entry->matchResult->ntuples - 1] <= advancePastOff)
+ if (entry->matchResult.offsets[entry->matchResult.ntuples - 1] <= advancePastOff)
{
- entry->offset = entry->matchResult->ntuples;
+ entry->offset = entry->matchResult.ntuples;
continue;
}
/* Otherwise scan to find the first item > advancePast */
- while (entry->matchResult->offsets[entry->offset] <= advancePastOff)
+ while (entry->matchResult.offsets[entry->offset] <= advancePastOff)
entry->offset++;
}
ItemPointerSet(&entry->curItem,
- entry->matchResult->blockno,
- entry->matchResult->offsets[entry->offset]);
+ entry->matchResult.blockno,
+ entry->matchResult.offsets[entry->offset]);
entry->offset++;
/* Done unless we need to reduce the result */
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..033d5253394 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -106,7 +106,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
ItemPointerSetMin(&scanEntry->curItem);
scanEntry->matchBitmap = NULL;
scanEntry->matchIterator = NULL;
- scanEntry->matchResult = NULL;
+ scanEntry->matchResult.blockno = InvalidBlockNumber;
scanEntry->list = NULL;
scanEntry->nlist = 0;
scanEntry->offset = InvalidOffsetNumber;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c8da3def645..ba6793a749c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2121,7 +2121,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult *tbmres;
+ TBMIterateResult tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
@@ -2134,11 +2134,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
CHECK_FOR_INTERRUPTS();
if (scan->shared_tbmiterator)
- tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
else
- tbmres = tbm_iterate(scan->tbmiterator);
+ tbm_iterate(scan->tbmiterator, &tbmres);
- if (tbmres == NULL)
+ if (!BlockNumberIsValid(tbmres.blockno))
{
/* no more entries in the bitmap */
Assert(hscan->rs_empty_tuples_pending == 0);
@@ -2153,11 +2153,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* isolation though, as we need to examine all invisible tuples
* reachable by the index.
*/
- } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+ } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
/* Got a valid block */
- *blockno = tbmres->blockno;
- *recheck = tbmres->recheck;
+ *blockno = tbmres.blockno;
+ *recheck = tbmres.recheck;
/*
* We can skip fetching the heap page if we don't need any fields from the
@@ -2165,19 +2165,19 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* the page are visible to our transaction.
*/
if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ !tbmres.recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
{
/* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
+ Assert(tbmres.ntuples >= 0);
Assert(hscan->rs_empty_tuples_pending >= 0);
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
+ hscan->rs_empty_tuples_pending += tbmres.ntuples;
return true;
}
- block = tbmres->blockno;
+ block = tbmres.blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2206,7 +2206,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2215,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres.ntuples; curslot++)
{
- OffsetNumber offnum = tbmres->offsets[curslot];
+ OffsetNumber offnum = tbmres.offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2270,7 +2270,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/* Only count exact and lossy pages with visible tuples */
if (ntup > 0)
{
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
scan->exact_pages++;
else
scan->lossy_pages++;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index ae837785116..284641fa8ea 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -340,9 +340,10 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
else if (prefetch_iterator)
{
/* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
+ tbm_iterate(prefetch_iterator, &tbmpre);
- if (tbmpre == NULL || tbmpre->blockno != blockno)
+ if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
@@ -360,6 +361,8 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
}
else
{
+ TBMIterateResult tbmpre;
+
/* Release the mutex before iterating */
SpinLockRelease(&pstate->mutex);
@@ -372,7 +375,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
* case.
*/
if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
}
}
#endif /* USE_PREFETCH */
@@ -439,10 +442,12 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (node->prefetch_pages < node->prefetch_target)
{
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
bool skip_fetch;
- if (tbmpre == NULL)
+ tbm_iterate(prefetch_iterator, &tbmpre);
+
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_iterate(prefetch_iterator);
@@ -464,13 +469,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
*/
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
@@ -485,7 +490,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (1)
{
- TBMIterateResult *tbmpre;
+ TBMIterateResult tbmpre;
bool do_prefetch = false;
bool skip_fetch;
@@ -504,8 +509,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (!do_prefetch)
return;
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_shared_iterate(prefetch_iterator);
@@ -515,13 +520,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
}
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index f711c056143..b4dcb1cbb88 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -171,7 +171,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output;
};
/*
@@ -212,7 +211,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
};
/* Local function prototypes */
@@ -943,20 +941,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
/*
* tbm_iterate - scan through next page of a TIDBitmap
*
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan. Pages are guaranteed to be delivered in numerical
- * order. If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition. If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway. (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order. tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition. If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway. (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -984,6 +983,7 @@ tbm_iterate(TBMIterator *iterator)
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
+ Assert(tbmres);
if (iterator->schunkptr < tbm->nchunks)
{
PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -994,11 +994,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1014,16 +1014,17 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
@@ -1033,10 +1034,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1087,13 +1087,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1103,21 +1103,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 51d0c74a6b0..e423d92b41c 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,7 +352,7 @@ typedef struct GinScanEntryData
/* for a partial-match or full-scan query, we accumulate all TIDs here */
TIDBitmap *matchBitmap;
TBMIterator *matchIterator;
- TBMIterateResult *matchResult;
+ TBMIterateResult matchResult;
/* used for Posting list and one page in Posting tree */
ItemPointerData *list;
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 432fae52962..f000c1af28f 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -72,8 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
--
2.37.2
v2-0012-Streaming-Read-API.patchtext/x-patch; charset=US-ASCII; name=v2-0012-Streaming-Read-API.patchDownload
From d84a520da846da83717b748a2bd30f4185d36ebe Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v2 12/13] Streaming Read API
---
contrib/pg_prewarm/pg_prewarm.c | 40 +-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/postmaster/bgwriter.c | 8 +-
src/backend/postmaster/checkpointer.c | 15 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/aio/Makefile | 14 +
src/backend/storage/aio/meson.build | 5 +
src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 560 +++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 14 +-
src/backend/storage/meson.build | 1 +
src/backend/storage/smgr/smgr.c | 49 +-
src/include/storage/bufmgr.h | 22 +
src/include/storage/smgr.h | 4 +-
src/include/storage/streaming_read.h | 45 ++
src/include/utils/rel.h | 6 -
src/tools/pgindent/typedefs.list | 2 +
17 files changed, 986 insertions(+), 238 deletions(-)
create mode 100644 src/backend/storage/aio/Makefile
create mode 100644 src/backend/storage/aio/meson.build
create mode 100644 src/backend/storage/aio/streaming_read.c
create mode 100644 src/include/storage/streaming_read.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
+#include "storage/streaming_read.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
static PGIOAlignedBlock blockbuffer;
+struct pg_prewarm_streaming_read_private
+{
+ BlockNumber blocknum;
+ int64 last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_data)
+{
+ struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+ if (p->blocknum <= p->last_block)
+ return p->blocknum++;
+
+ return InvalidBlockNumber;
+}
+
/*
* pg_prewarm(regclass, mode text, fork text,
* first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_streaming_read_private p;
+ PgStreamingRead *pgsr;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ &p,
+ 0,
+ NULL,
+ BMR_REL(rel),
+ forkNumber,
+ pg_prewarm_streaming_read_next);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+ pg_streaming_read_free(pgsr);
}
/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
* This is unnecessarily heavy-handed, as it will close SMgrRelation
* objects for other databases as well. DROP DATABASE occurs seldom enough
* that it's not worth introducing a variant of smgrclose for just this
- * purpose. XXX: Or should we rather leave the smgr entries dangling?
+ * purpose.
*/
smgrcloseall();
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
if (FirstCallSinceLastCheckpoint())
{
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the bgwriter does
+ * not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
ckpt_performed = CreateRestartPoint(flags);
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the checkpointer
+ * does not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
/*
* Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
*/
CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
- /*
- * After any checkpoint, close all smgr files. This is so we won't
- * hang onto smgr references to deleted files indefinitely.
- */
- smgrcloseall();
+ /* Free all smgr objects, as CheckpointerMain() normally would. */
+ smgrdestroyall();
return;
}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+ bool advice_issued;
+ bool need_complete;
+ BlockNumber blocknum;
+ int nblocks;
+ int per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+ Buffer buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+ int max_ios;
+ int ios_in_progress;
+ int ios_in_progress_trigger;
+ int max_pinned_buffers;
+ int pinned_buffers;
+ int pinned_buffers_trigger;
+ int next_tail_buffer;
+ bool finished;
+ void *pgsr_private;
+ PgStreamingReadBufferCB callback;
+ BufferAccessStrategy strategy;
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+
+ bool advice_enabled;
+
+ /* Next expected block, for detecting sequential access. */
+ BlockNumber seq_blocknum;
+
+ /* Space for optional per-buffer private data. */
+ size_t per_buffer_data_size;
+ void *per_buffer_data;
+ int per_buffer_data_next;
+
+ /* Circular buffer of ranges. */
+ int size;
+ int head;
+ int tail;
+ PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy)
+{
+ PgStreamingRead *pgsr;
+ int size;
+ int max_ios;
+ uint32 max_pinned_buffers;
+
+
+ /*
+ * Decide how many assumed I/Os we will allow to run concurrently. That
+ * is, advice to the kernel to tell it that we will soon read. This
+ * number also affects how far we look ahead for opportunities to start
+ * more I/Os.
+ */
+ if (flags & PGSR_FLAG_MAINTENANCE)
+ max_ios = maintenance_io_concurrency;
+ else
+ max_ios = effective_io_concurrency;
+
+ /*
+ * The desired level of I/O concurrency controls how far ahead we are
+ * willing to look ahead. We also clamp it to at least
+ * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+ * sized read, even when max_ios is zero.
+ */
+ max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+ /*
+ * With the *_io_concurrency GUCs we might have 0. We want to allow at least
+ * one, to keep our gating logic simple.
+ */
+ max_ios = Max(max_ios, 1);
+
+ /*
+ * Don't allow this backend to pin too many buffers. For now we'll apply
+ * the limit for the shared buffer pool and the local buffer pool, without
+ * worrying which it is.
+ */
+ LimitAdditionalPins(&max_pinned_buffers);
+ LimitAdditionalLocalPins(&max_pinned_buffers);
+ Assert(max_pinned_buffers > 0);
+
+ /*
+ * pgsr->ranges is a circular buffer. When it is empty, head == tail.
+ * When it is full, there is an empty element between head and tail. Head
+ * can also be empty (nblocks == 0), therefore we need two extra elements
+ * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+ * maximum possible number of occupied ranges of the smallest possible
+ * size of one.
+ */
+ size = max_pinned_buffers + 2;
+
+ pgsr = (PgStreamingRead *)
+ palloc0(offsetof(PgStreamingRead, ranges) +
+ sizeof(pgsr->ranges[0]) * size);
+
+ pgsr->max_ios = max_ios;
+ pgsr->per_buffer_data_size = per_buffer_data_size;
+ pgsr->max_pinned_buffers = max_pinned_buffers;
+ pgsr->pgsr_private = pgsr_private;
+ pgsr->strategy = strategy;
+ pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+ /*
+ * This system supports prefetching advice. As long as direct I/O isn't
+ * enabled, and the caller hasn't promised sequential access, we can use
+ * it.
+ */
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ (flags & PGSR_FLAG_SEQUENTIAL) == 0)
+ pgsr->advice_enabled = true;
+#endif
+
+ /*
+ * We want to avoid creating ranges that are smaller than they could be
+ * just because we hit max_pinned_buffers. We only look ahead when the
+ * number of pinned buffers falls below this trigger number, or put
+ * another way, we stop looking ahead when we wouldn't be able to build a
+ * "full sized" range.
+ */
+ pgsr->pinned_buffers_trigger =
+ Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+ /* Space for the callback to store extra data along with each block. */
+ if (per_buffer_data_size)
+ pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+ return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb)
+{
+ PgStreamingRead *result;
+
+ result = pg_streaming_read_buffer_alloc_internal(flags,
+ pgsr_private,
+ per_buffer_data_size,
+ strategy);
+ result->callback = next_block_cb;
+ result->bmr = bmr;
+ result->forknum = forknum;
+
+ return result;
+}
+
+/*
+ * Start building a new range. This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know we'll soon
+ * be reading it. In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *head_range;
+
+ head_range = &pgsr->ranges[pgsr->head];
+ Assert(head_range->nblocks > 0);
+
+ /*
+ * If a call to CompleteReadBuffers() will be needed, we can issue advice
+ * to the kernel to get the read started. We suppress it if the access
+ * pattern appears to be completely sequential, though, because on some
+ * systems that interferes with the kernel's own sequential read-ahead
+ * heuristics and hurts performance.
+ */
+ if (pgsr->advice_enabled)
+ {
+ BlockNumber blocknum = head_range->blocknum;
+ int nblocks = head_range->nblocks;
+
+ if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+ {
+ SMgrRelation smgr =
+ pgsr->bmr.smgr ? pgsr->bmr.smgr :
+ RelationGetSmgr(pgsr->bmr.rel);
+
+ Assert(!head_range->advice_issued);
+
+ smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+ /*
+ * Count this as an I/O that is concurrently in progress, though
+ * we don't really know if the kernel generates a physical I/O.
+ */
+ head_range->advice_issued = true;
+ pgsr->ios_in_progress++;
+ }
+
+ /* Remember the block after this range, for sequence detection. */
+ pgsr->seq_blocknum = blocknum + nblocks;
+ }
+
+ /* Create a new head range. There must be space. */
+ Assert(pgsr->size > pgsr->max_pinned_buffers);
+ Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+ if (++pgsr->head == pgsr->size)
+ pgsr->head = 0;
+ head_range = &pgsr->ranges[pgsr->head];
+ head_range->nblocks = 0;
+
+ return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+ /*
+ * If we're finished or can't start more I/O, then don't look ahead.
+ */
+ if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * We'll also wait until the number of pinned buffers falls below our
+ * trigger level, so that we have the chance to create a full range.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ do
+ {
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+ Buffer buffer;
+ bool found;
+ bool need_complete;
+ PgStreamingReadRange *head_range;
+ void *per_buffer_data;
+
+ /* Do we have a full-sized range? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks == lengthof(head_range->buffers))
+ {
+ Assert(head_range->need_complete);
+ head_range = pg_streaming_read_new_range(pgsr);
+
+ /*
+ * Give up now if I/O is saturated, or we wouldn't be able to form
+ * another full range after this due to the pin limit.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+ pgsr->ios_in_progress == pgsr->max_ios)
+ break;
+ }
+
+ per_buffer_data = (char *) pgsr->per_buffer_data +
+ pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+ /* Find out which block the callback wants to read next. */
+ blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+ if (blocknum == InvalidBlockNumber)
+ {
+ pgsr->finished = true;
+ break;
+ }
+ bmr = pgsr->bmr;
+ forknum = pgsr->forknum;
+
+ Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+ buffer = PrepareReadBuffer(bmr,
+ forknum,
+ blocknum,
+ pgsr->strategy,
+ &found);
+ pgsr->pinned_buffers++;
+
+ need_complete = !found;
+
+ /* Is there a head range that we can't extend? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks > 0 &&
+ (!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+ {
+ /* Yes, time to start building a new one. */
+ head_range = pg_streaming_read_new_range(pgsr);
+ Assert(head_range->nblocks == 0);
+ }
+
+ if (head_range->nblocks == 0)
+ {
+ /* Initialize a new range beginning at this block. */
+ head_range->blocknum = blocknum;
+ head_range->need_complete = need_complete;
+ head_range->advice_issued = false;
+ }
+ else
+ {
+ /* We can extend an existing range by one block. */
+ Assert(head_range->blocknum + head_range->nblocks == blocknum);
+ Assert(head_range->need_complete);
+ }
+
+ head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+ head_range->buffers[head_range->nblocks] = buffer;
+ head_range->nblocks++;
+
+ if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+ pgsr->per_buffer_data_next = 0;
+
+ } while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+ pgsr->ios_in_progress < pgsr->max_ios);
+
+ if (pgsr->ranges[pgsr->head].nblocks > 0)
+ pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+ pg_streaming_read_look_ahead(pgsr);
+
+ /* See if we have one buffer to return. */
+ while (pgsr->tail != pgsr->head)
+ {
+ PgStreamingReadRange *tail_range;
+
+ tail_range = &pgsr->ranges[pgsr->tail];
+
+ /*
+ * Do we need to perform an I/O before returning the buffers from this
+ * range?
+ */
+ if (tail_range->need_complete)
+ {
+ CompleteReadBuffers(pgsr->bmr,
+ tail_range->buffers,
+ pgsr->forknum,
+ tail_range->blocknum,
+ tail_range->nblocks,
+ false,
+ pgsr->strategy);
+ tail_range->need_complete = false;
+
+ /*
+ * We don't really know if the kernel generated a physical I/O
+ * when we issued advice, let alone when it finished, but it has
+ * certainly finished after a read call returns.
+ */
+ if (tail_range->advice_issued)
+ pgsr->ios_in_progress--;
+ }
+
+ /* Are there more buffers available in this range? */
+ if (pgsr->next_tail_buffer < tail_range->nblocks)
+ {
+ int buffer_index;
+ Buffer buffer;
+
+ buffer_index = pgsr->next_tail_buffer++;
+ buffer = tail_range->buffers[buffer_index];
+
+ Assert(BufferIsValid(buffer));
+
+ /* We are giving away ownership of this pinned buffer. */
+ Assert(pgsr->pinned_buffers > 0);
+ pgsr->pinned_buffers--;
+
+ if (per_buffer_data)
+ *per_buffer_data = (char *) pgsr->per_buffer_data +
+ tail_range->per_buffer_data_index[buffer_index] *
+ pgsr->per_buffer_data_size;
+
+ return buffer;
+ }
+
+ /* Advance tail to next range, if there is one. */
+ if (++pgsr->tail == pgsr->size)
+ pgsr->tail = 0;
+ pgsr->next_tail_buffer = 0;
+ }
+
+ Assert(pgsr->pinned_buffers == 0);
+
+ return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+ Buffer buffer;
+
+ /* Stop looking ahead, and unpin anything that wasn't consumed. */
+ pgsr->finished = true;
+ while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+ ReleaseBuffer(buffer);
+
+ if (pgsr->per_buffer_data)
+ pfree(pgsr->per_buffer_data);
+ pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
)
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static int SyncOneBuffer(int buf_id, bool skip_recently_used,
WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner);
static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
- /*
- * Read the buffer, and update pgstat counters to reflect a cache hit or
- * miss.
- */
- pgstat_count_buffer_read(reln);
- buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+ buf = ReadBuffer_common(BMR_REL(reln),
forkNum, blockNum, mode, strategy, &hit);
- if (hit)
- pgstat_count_buffer_hit(reln);
+
return buf;
}
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
- return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
- RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+ return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+ RELPERSISTENCE_UNLOGGED),
+ forkNum, blockNum,
mode, strategy, &hit);
}
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
bool hit;
Assert(extended_by == 0);
- buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+ buffer = ReadBuffer_common(bmr,
fork, extend_to - 1, mode, strategy,
&hit);
}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool *hit)
{
- BufferDesc *bufHdr;
- Block bufBlock;
- bool found;
- IOContext io_context;
- IOObject io_object;
- bool isLocalBuf = SmgrIsTemp(smgr);
-
- *hit = false;
+ Buffer buffer;
/*
* Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
- return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
- forkNum, strategy, flags);
+ *hit = false;
+
+ return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
- TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend);
+ buffer = PrepareReadBuffer(bmr,
+ forkNum,
+ blockNum,
+ strategy,
+ hit);
+
+ /* At this point we do NOT hold any locks. */
+ if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+ {
+ /* if we just want zeroes and a lock, we're done */
+ ZeroBuffer(buffer, mode);
+ }
+ else if (!*hit)
+ {
+ /* we might need to perform I/O */
+ CompleteReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ 1,
+ mode == RBM_ZERO_ON_ERROR,
+ strategy);
+ }
+
+ return buffer;
+}
+
+/*
+ * Prepare to read a block. The buffer is pinned. If this is a 'hit', then
+ * the returned buffer can be used immediately. Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer(). PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr)
+{
+ BufferDesc *bufHdr;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ Assert(blockNum != P_NEW);
+
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /*
- * We do not use a BufferAccessStrategy for I/O of temporary tables.
- * However, in some cases, the "strategy" may not be NULL, so we can't
- * rely on IOContextForStrategy() to set the right IOContext for us.
- * This may happen in cases like CREATE TEMPORARY TABLE AS...
- */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
- if (found)
- pgBufferUsage.local_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.local_blks_read++;
}
else
{
- /*
- * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
- * not currently in memory.
- */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
- bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found, io_context);
- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;
}
- /* At this point we do NOT hold any locks. */
+ TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend);
- /* if it was already in the buffer pool, we're done */
- if (found)
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ if (isLocalBuf)
+ {
+ bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+ if (*foundPtr)
+ pgBufferUsage.local_blks_hit++;
+ }
+ else
+ {
+ bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+ strategy, foundPtr, io_context);
+ if (*foundPtr)
+ pgBufferUsage.shared_blks_hit++;
+ }
+ if (bmr.rel)
+ {
+ /*
+ * While pgBufferUsage's "read" counter isn't bumped unless we reach
+ * CompleteReadBuffers() (so, not for hits, and not for buffers that
+ * are zeroed instead), the per-relation stats always count them.
+ */
+ pgstat_count_buffer_read(bmr.rel);
+ if (*foundPtr)
+ pgstat_count_buffer_hit(bmr.rel);
+ }
+ if (*foundPtr)
{
- /* Just need to update stats before we exit */
- *hit = true;
VacuumPageHit++;
pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageHit;
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ }
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
- * on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
+}
- return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+ if (BufferIsLocal(buffer))
+ {
+ BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+ return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
+ else
+ return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
- /*
- * if we have gotten to this point, we have allocated a buffer for the
- * page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
- * if it's a shared buffer.
- */
- Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer(). The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy)
+{
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
+ if (isLocalBuf)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
/*
- * Read in the page, unless the caller intends to overwrite it and just
- * wants us to allocate a buffer.
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out not to be true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ if (isLocalBuf)
+ pgBufferUsage.local_blks_read += nblocks;
else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ for (int i = 0; i < nblocks; ++i)
{
- instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+ int io_buffers_len;
+ Buffer io_buffers[MAX_BUFFERS_PER_TRANSFER];
+ void *io_pages[MAX_BUFFERS_PER_TRANSFER];
+ instr_time io_start;
+ BlockNumber io_first_block;
- smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
- pgstat_count_io_op_time(io_object, io_context,
- IOOP_READ, io_start, 1);
+ /*
+ * We could get all the information from buffer headers, but it can be
+ * expensive to access buffer header cache lines so we make the caller
+ * provide all the information we need, and assert that it is
+ * consistent.
+ */
+ {
+ RelFileLocator xlocator;
+ ForkNumber xforknum;
+ BlockNumber xblocknum;
+
+ BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+ Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+ Assert(xforknum == forknum);
+ Assert(xblocknum == blocknum + i);
+ }
+#endif
+
+ /*
+ * Skip this block if someone else has already completed it. If an
+ * I/O is already in progress in another backend, this will wait for
+ * the outcome: either done, or something went wrong and we will
+ * retry.
+ */
+ if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+ {
+ /*
+ * Report this as a 'hit' for this backend, even though it must
+ * have started out as a miss in PrepareReadBuffer().
+ */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ continue;
+ }
+
+ /* We found a buffer that we need to read in. */
+ io_buffers[0] = buffers[i];
+ io_pages[0] = BufferGetBlock(buffers[i]);
+ io_first_block = blocknum + i;
+ io_buffers_len = 1;
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
+ /*
+ * How many neighboring-on-disk blocks can we scatter-read into
+ * other buffers at the same time? In this case we don't wait if we
+ * see an I/O already in progress. We already hold BM_IO_IN_PROGRESS
+ * for the head block, so we should get on with that I/O as soon as
+ * possible. Any block we skip here is picked up again by the outer
+ * loop, above.
+ */
+ while ((i + 1) < nblocks &&
+ CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+ {
+ /* Must be consecutive block numbers. */
+ Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+ BufferGetBlockNumber(buffers[i]) + 1);
+
+ io_buffers[io_buffers_len] = buffers[++i];
+ io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+ }
+
+ io_start = pgstat_prepare_io_time(track_io_timing);
+ smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ io_buffers_len);
+
+ /* Verify each block we read, and terminate the I/O. */
+ for (int j = 0; j < io_buffers_len; ++j)
{
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ BufferDesc *bufHdr;
+ Block bufBlock;
+
+ if (isLocalBuf)
{
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
}
else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
- }
-
- /*
- * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
- * content lock before marking the page as valid, to make sure that no
- * other backend sees the zeroed page before the caller has had a chance
- * to initialize it.
- *
- * Since no-one else can be looking at the page contents yet, there is no
- * difference between an exclusive lock and a cleanup-strength lock. (Note
- * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
- * they assert that the buffer is already valid.)
- */
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
- {
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
- }
+ {
+ bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+ bufBlock = BufHdrGetBlock(bufHdr);
+ }
- if (isLocalBuf)
- {
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (zero_on_error || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ memset(bufBlock, 0, BLCKSZ);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ }
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
- }
+ /* Terminate I/O and set BM_VALID. */
+ if (isLocalBuf)
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
- VacuumPageMiss++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss;
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ }
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ /* Report I/Os as completing individually. */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ false);
+ }
- return BufferDescriptorGetBuffer(bufHdr);
+ VacuumPageMiss += io_buffers_len;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ }
}
/*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
- * set true. Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true. Otherwise, *foundPtr is set false. A read should be
+ * performed with CompleteReadBuffers().
*
* io_context is passed as an output parameter to avoid calling
* IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(existing_buf_hdr, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to obtain the right to
- * start I/O. If StartBufferIO returns false, then someone else managed
- * to read it before we did, so there's nothing left for BufferAlloc() to
- * do.
+ * Buffer contents are currently invalid.
*/
- if (StartBufferIO(victim_buf_hdr, true))
- *foundPtr = false;
- else
- *foundPtr = true;
+ *foundPtr = false;
return victim_buf_hdr;
}
@@ -1774,7 +1899,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
buf_state &= ~BM_VALID;
UnlockBufHdr(existing_hdr, buf_state);
- } while (!StartBufferIO(existing_hdr, true));
+ } while (!StartBufferIO(existing_hdr, true, false));
}
else
{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
LWLockRelease(partition_lock);
/* XXX: could combine the locked operations in it with the above */
- StartBufferIO(victim_buf_hdr, true);
+ StartBufferIO(victim_buf_hdr, true, false);
}
}
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/*
- * If we previously pinned the buffer, it must surely be valid.
+ * If we previously pinned the buffer, it is likely to be valid, but
+ * it may not be if PrepareReadBuffer() was called and
+ * CompleteReadBuffers() hasn't been called yet. We'll check by
+ * loading the flags without locking. This is racy, but it's OK to
+ * return false spuriously: when CompleteReadBuffers() calls
+ * StartBufferIO(), it'll see that it's now valid.
*
* Note: We deliberately avoid a Valgrind client request here.
* Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
* that the buffer page is legitimately non-accessible here. We
* cannot meddle with that.
*/
- result = true;
+ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
}
ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, false))
return;
/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would. The buffer must already be pinned. It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+ if (BufferIsLocal(buffer))
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ else
+ {
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ else
+ LockBufferForCleanup(buffer);
+ }
+
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ if (BufferIsLocal(buffer))
+ {
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ buf_state = LockBufHdr(bufHdr);
+ buf_state |= BM_VALID;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
/*
* Verify that this backend is pinning the buffer exactly once.
*
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
*
* Returns true if we successfully marked the buffer as I/O busy,
* false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend. In that case, false indicates either that the I/O was already
+ * finished, or that it is still in progress. This is useful for callers that
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
*/
static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
UnlockBufHdr(buf, buf_state);
+ if (nowait)
+ return false;
WaitIO(buf);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
}
/* see LimitAdditionalPins() */
-static void
+void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
uint32 max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
/*
* In contrast to LimitAdditionalPins() other backends don't play a role
- * here. We can allow up to NLocBuffer pins in total.
+ * here. We can allow up to NLocBuffer pins in total, but it might not be
+ * initialized yet, so read num_temp_buffers instead.
*/
- max_pins = (NLocBuffer - NLocalPinnedBuffers);
+ max_pins = (num_temp_buffers - NLocalPinnedBuffers);
if (*additional_pins >= max_pins)
*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('aio')
subdir('buffer')
subdir('file')
subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
/*
* smgropen() -- Return an SMgrRelation object, creating it if need be.
*
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files. The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
*/
SMgrRelation
smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
}
/*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
*/
void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
{
SMgrRelation *owner;
ForkNumber forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
}
/*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
*
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr(). It may be re-owned if it is accessed by a
+ * relation before then.
*/
void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
{
for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
reln->smgr_targblock = InvalidBlockNumber;
+
+ if (reln->smgr_owner)
+ {
+ *reln->smgr_owner = NULL;
+ reln->smgr_owner = NULL;
+ dlist_push_tail(&unowned_relns, &reln->node);
+ }
}
/*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
*/
void
-smgrreleaseall(void)
+smgrcloseall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrrelease(reln);
+ smgrclose(reln);
}
/*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
*/
void
-smgrcloseall(void)
+smgrdestroyall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ smgrdestroy(reln);
}
/*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
* AtEOXact_SMgr
*
* This routine is called during transaction commit or abort (it doesn't
- * particularly care which). All transient SMgrRelation objects are closed.
+ * particularly care which). All transient SMgrRelation objects are
+ * destroyed.
*
* We do this as a compromise between wanting transient SMgrRelations to
* live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
dlist_mutable_iter iter;
/*
- * Zap all unowned SMgrRelations. We rely on smgrclose() to remove each
+ * Zap all unowned SMgrRelations. We rely on smgrdestroy() to remove each
* one from the list.
*/
dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
Assert(rel->smgr_owner == NULL);
- smgrclose(rel);
+ smgrdestroy(rel);
}
}
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
bool
ProcessBarrierSmgrRelease(void)
{
- smgrreleaseall();
+ smgrcloseall();
return true;
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
#ifndef BUFMGR_H
#define BUFMGR_H
+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and PG_IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
/*
* prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
/* in buf_init.c */
extern void InitBufferPool(void);
extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected. Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_private_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
*
* Very little code is authorized to touch rel->rd_smgr directly. Instead
* use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period. Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation. It's quite cheap in
- * comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b7..8007f17320a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
PgStat_TableStatus
PgStat_TableXactStatus
PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
PgXmlErrorContext
PgXmlStrictness
Pg_finfo_record
--
2.37.2
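For review convenience, here is a rough sketch (not part of any patch) of how
a caller might drive the streaming read API added above. The MyScanState
struct, my_scan_next_block() and my_scan() are invented for illustration;
only the pg_streaming_read_* functions, the PGSR_* flags and BMR_REL() come
from the patch.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/streaming_read.h"
#include "utils/rel.h"

/* Sketch only: hypothetical private state for the callback. */
typedef struct MyScanState
{
	BlockNumber next;		/* next block to request */
	BlockNumber nblocks;	/* stop after this many */
} MyScanState;

/* Callback: return the next block to read, or InvalidBlockNumber to stop. */
static BlockNumber
my_scan_next_block(PgStreamingRead *pgsr, void *pgsr_private,
				   void *per_buffer_data)
{
	MyScanState *state = (MyScanState *) pgsr_private;

	if (state->next >= state->nblocks)
		return InvalidBlockNumber;
	return state->next++;
}

static void
my_scan(Relation rel)
{
	MyScanState state = {0, RelationGetNumberOfBlocks(rel)};
	PgStreamingRead *pgsr;
	Buffer		buf;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
										  &state,
										  0,	/* no per-buffer data */
										  NULL, /* default strategy */
										  BMR_REL(rel),
										  MAIN_FORKNUM,
										  my_scan_next_block);

	/* Consume pinned buffers one at a time; reads are combined internally. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
	{
		/* ... examine the page ... */
		ReleaseBuffer(buf);
	}

	pg_streaming_read_free(pgsr);
}

The point is that the caller only ever names one block at a time in the
callback; the read combining, pin limiting and prefetch advice all happen
inside the streaming read machinery.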
Attachment: v2-0010-Hard-code-TBMIterateResult-offsets-array-size.patch (text/x-patch)
From 0ee5aaf02cd59bec3f42a319d37f7b8755a53554 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 20:13:43 -0500
Subject: [PATCH v2 10/13] Hard-code TBMIterateResult offsets array size
TIDBitmap's TBMIterateResult had a flexible-sized array of tuple offsets,
but the API always allocated MaxHeapTuplesPerPage OffsetNumbers.
Creating a fixed-size array of size MaxHeapTuplesPerPage is clearer for
the API user.
---
src/backend/nodes/tidbitmap.c | 27 ++++++---------------------
src/include/nodes/tidbitmap.h | 12 ++++++++++--
2 files changed, 16 insertions(+), 23 deletions(-)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 0f4850065fb..f711c056143 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -40,21 +40,12 @@
#include <limits.h>
-#include "access/htup_details.h"
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
-/*
- * The maximum number of tuples per page is not large (typically 256 with
- * 8K pages, or 1024 with 32K pages). So there's not much point in making
- * the per-page bitmaps variable size. We just legislate that the size
- * is this:
- */
-#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
-
/*
* When we have to switch over to lossy storage, we use a data structure
* with one bit per page, where all pages having the same number DIV
@@ -66,7 +57,7 @@
* table, using identical data structures. (This is because the memory
* management for hashtables doesn't easily/efficiently allow space to be
* transferred easily from one hashtable to another.) Therefore it's best
- * if PAGES_PER_CHUNK is the same as MAX_TUPLES_PER_PAGE, or at least not
+ * if PAGES_PER_CHUNK is the same as MaxHeapTuplesPerPage, or at least not
* too different. But we also want PAGES_PER_CHUNK to be a power of 2 to
* avoid expensive integer remainder operations. So, define it like this:
*/
@@ -78,7 +69,7 @@
#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
/* number of active words for an exact page: */
-#define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
+#define WORDS_PER_PAGE ((MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1)
/* number of active words for a lossy chunk: */
#define WORDS_PER_CHUNK ((PAGES_PER_CHUNK - 1) / BITS_PER_BITMAPWORD + 1)
@@ -180,7 +171,7 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/*
@@ -389,7 +380,7 @@ tbm_add_tuples(TIDBitmap *tbm, const ItemPointer tids, int ntids,
bitnum;
/* safety check to ensure we don't overrun bit array bounds */
- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxHeapTuplesPerPage)
elog(ERROR, "tuple offset out of range: %u", off);
/*
@@ -691,12 +682,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
Assert(tbm->iterating != TBM_ITERATING_SHARED);
- /*
- * Create the TBMIterator struct, with enough trailing space to serve the
- * needs of the TBMIterateResult sub-struct.
- */
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -1470,8 +1456,7 @@ tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
* Create the TBMSharedIterator struct, with enough trailing space to
* serve the needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator));
istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..432fae52962 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -22,6 +22,7 @@
#ifndef TIDBITMAP_H
#define TIDBITMAP_H
+#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "utils/dsa.h"
@@ -41,9 +42,16 @@ typedef struct TBMIterateResult
{
BlockNumber blockno; /* page number containing tuples */
int ntuples; /* -1 indicates lossy result */
- bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
- OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+ bool recheck; /* should the tuples be rechecked? */
+
+ /*
+ * The maximum number of tuples per page is not large (typically 256 with
+ * 8K pages, or 1024 with 32K pages). So there's not much point in making
+ * the per-page bitmaps variable size. We just legislate that the size is
+ * this:
+ */
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResult;
/* function prototypes in nodes/tidbitmap.c */
--
2.37.2
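To illustrate what 0010 buys us (again, not part of the patch): with
offsets[] now a fixed-size array, a TBMIterateResult can live in
caller-owned storage, on the stack or in the streaming read's per-buffer
data, instead of being palloc'd with a trailing flexible array. A rough
sketch, assuming the two-argument tbm_iterate() used later in this series
and an already-built tbmiterator:

	TBMIterateResult tbmres;	/* no separately palloc'd offsets[] needed */

	for (;;)
	{
		/* assumes the two-argument tbm_iterate() from this patch series */
		tbm_iterate(tbmiterator, &tbmres);
		if (!BlockNumberIsValid(tbmres.blockno))
			break;				/* bitmap exhausted */

		/* lossy pages have ntuples == -1 and don't use offsets[] */
		for (int i = 0; i < tbmres.ntuples; i++)
		{
			OffsetNumber off = tbmres.offsets[i];

			/* ... visit tuple (tbmres.blockno, off) ... */
		}
	}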
Attachment: v2-0013-BitmapHeapScan-uses-streaming-read-API.patch (text/x-patch)
From 9473ddcf05c4c3142fbc3fbc2371df2b8a8113e8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:04:18 -0500
Subject: [PATCH v2 13/13] BitmapHeapScan uses streaming read API
Remove all of the prefetching code from BitmapHeapScan and rely on the
streaming read API's prefetching instead. The heap table AM implements a
streaming read callback which uses the TBM iterator to get the next valid
block that needs to be fetched for the streaming read API.
---
src/backend/access/heap/heapam.c | 68 +++++
src/backend/access/heap/heapam_handler.c | 88 +++---
src/backend/executor/nodeBitmapHeapscan.c | 343 +---------------------
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 14 +-
src/include/nodes/execnodes.h | 19 --
6 files changed, 117 insertions(+), 419 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b93f243c282..c965048af60 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -115,6 +115,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static BlockNumber bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -335,6 +337,22 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
if (key != NULL && scan->rs_base.rs_nkeys > 0)
memcpy(scan->rs_base.rs_key, key, scan->rs_base.rs_nkeys * sizeof(ScanKeyData));
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
+ scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ scan,
+ sizeof(TBMIterateResult),
+ scan->rs_strategy,
+ BMR_REL(scan->rs_base.rs_rd),
+ MAIN_FORKNUM,
+ bitmapheap_pgsr_next);
+
+
+ }
+
/*
* Currently, we only have a stats counter for sequential heap scans (but
* e.g for bitmap scans the underlying bitmap index scans will be counted,
@@ -955,6 +973,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_pgsr = NULL;
scan->rs_vmbuffer = InvalidBuffer;
scan->rs_empty_tuples_pending = 0;
@@ -1093,6 +1112,9 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN && scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
pfree(scan);
}
@@ -10250,3 +10272,49 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
CheckForSerializableConflictOut(relation, xid, snapshot);
}
+
+static BlockNumber
+bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->rs_vmbuffer))
+ {
+ hdesc->rs_empty_tuples_pending += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ba6793a749c..53812584774 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2113,79 +2113,65 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
*/
static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
+ void *io_private;
BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult tbmres;
+ TBMIterateResult *tbmres;
+
+ Assert(hscan->rs_pgsr);
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- *blockno = InvalidBlockNumber;
*recheck = true;
- do
+ /* Release buffer containing previous block. */
+ if (BufferIsValid(hscan->rs_cbuf))
{
- CHECK_FOR_INTERRUPTS();
+ ReleaseBuffer(hscan->rs_cbuf);
+ hscan->rs_cbuf = InvalidBuffer;
+ }
- if (scan->shared_tbmiterator)
- tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
- else
- tbm_iterate(scan->tbmiterator, &tbmres);
+ hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr, &io_private);
- if (!BlockNumberIsValid(tbmres.blockno))
+ if (BufferIsInvalid(hscan->rs_cbuf))
+ {
+ if (BufferIsValid(hscan->rs_vmbuffer))
{
- /* no more entries in the bitmap */
- Assert(hscan->rs_empty_tuples_pending == 0);
- return false;
+ ReleaseBuffer(hscan->rs_vmbuffer);
+ hscan->rs_vmbuffer = InvalidBuffer;
}
/*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE
- * isolation though, as we need to examine all invisible tuples
- * reachable by the index.
+ * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
+ * all empty tuples at the end instead of emitting them per block we
+ * skip fetching. This is necessary because the streaming read API will
+ * only return TBMIterateResults for blocks actually fetched. When we
+ * skip fetching a block, we keep track of how many empty tuples to
+ * emit at the end of the BitmapHeapScan. We do not recheck all NULL
+ * tuples.
*/
- } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
+ *recheck = false;
+ return hscan->rs_empty_tuples_pending > 0;
+ }
- /* Got a valid block */
- *blockno = tbmres.blockno;
- *recheck = tbmres.recheck;
+ Assert(io_private);
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, and the bitmap entries don't need rechecking, and all tuples on
- * the page are visible to our transaction.
- */
- if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres.recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres.ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
+ tbmres = io_private;
- hscan->rs_empty_tuples_pending += tbmres.ntuples;
+ Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
- return true;
- }
+ *recheck = tbmres->recheck;
- block = tbmres.blockno;
+ hscan->rs_cblock = tbmres->blockno;
+ hscan->rs_ntuples = tbmres->ntuples;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- block);
- hscan->rs_cblock = block;
+ block = tbmres->blockno;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
@@ -2206,7 +2192,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2201,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres.ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres->ntuples; curslot++)
{
- OffsetNumber offnum = tbmres.offsets[curslot];
+ OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2270,7 +2256,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/* Only count exact and lossy pages with visible tuples */
if (ntup > 0)
{
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
scan->exact_pages++;
else
scan->lossy_pages++;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 284641fa8ea..128621f1306 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -56,11 +56,6 @@ static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapAccumCounters(BitmapHeapScanState *node,
TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
@@ -91,14 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* If we haven't yet performed the underlying index scan, do it, and begin
* the iteration over the bitmap.
- *
- * For prefetching, we use *two* iterators, one for the pages we are
- * actually scanning and another that runs ahead of the first for
- * prefetching. node->prefetch_pages tracks exactly how many pages ahead
- * the prefetch iterator is. Also, node->prefetch_target tracks the
- * desired prefetch distance, which starts small and increases up to the
- * node->prefetch_maximum. This is to avoid doing a lot of prefetching in
- * a scan that stops after a few tuples because of a LIMIT.
*/
if (!node->initialized)
{
@@ -114,15 +101,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
}
else
{
@@ -145,20 +123,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
* multiple processes to iterate jointly.
*/
pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
@@ -166,14 +130,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
if (!scan)
@@ -216,47 +172,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->initialized = true;
/* Get the first block. if none, end of scan */
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck))
return ExecClearTuple(slot);
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- BitmapAdjustPrefetchTarget(node);
}
- for (;;)
+ do
{
while (table_scan_bitmap_next_tuple(scan, slot))
{
CHECK_FOR_INTERRUPTS();
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-
- /*
- * We prefetch before fetching the current pages. We expect that a
- * future streaming read API will do this, so do it now for
- * consistency.
- */
- BitmapPrefetch(node, scan);
-
/*
* If we are using lossy info, we have to recheck the qual
* conditions at every tuple.
@@ -278,13 +203,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
return slot;
}
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
- break;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
+ } while (table_scan_bitmap_next_block(scan, &node->recheck));
/*
* if we get here it means we are at the end of the scan..
@@ -318,221 +237,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult tbmpre;
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- TBMIterateResult tbmpre;
-
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult tbmpre;
- bool skip_fetch;
-
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
- */
-
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
- }
-#endif /* USE_PREFETCH */
-}
-
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
*/
@@ -578,22 +282,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->ss.ss_currentScanDesc)
table_rescan(node->ss.ss_currentScanDesc, NULL);
- /* release bitmaps and buffers if any */
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
+ /* release bitmaps if any */
if (node->tbm)
tbm_free(node->tbm);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_prefetch_iterator = NULL;
- node->pvmbuffer = InvalidBuffer;
node->recheck = true;
- node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -632,16 +326,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
table_endscan(scanDesc);
/*
- * release bitmaps and buffers if any
+ * release bitmaps if any
*/
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
}
/* ----------------------------------------------------------------
@@ -674,19 +362,13 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
scanstate->recheck = true;
- scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
@@ -726,13 +408,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
/*
@@ -816,14 +491,10 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
return;
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
-
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -855,11 +526,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3dfb19ec7d5..1cad9c04f01 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
#include "storage/dsm.h"
#include "storage/lockdefs.h"
#include "storage/shm_toc.h"
+#include "storage/streaming_read.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -72,6 +73,9 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /* Streaming read control object for scans supporting it */
+ PgStreamingRead *rs_pgsr;
+
/*
* These fields are only used for bitmap scans for the "skip fetch"
* optimization. Bitmap scans needing no fields from the heap may skip
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ef1fcc02b1a..56683f9e4aa 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -795,17 +795,10 @@ typedef struct TableAmRoutine
* on the page have to be returned, otherwise the tuples at offsets in
* `tbmres->offsets` need to be returned.
*
- * XXX: Currently this may only be implemented if the AM uses md.c as its
- * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
- * blockids directly to the underlying storage. nodeBitmapHeapscan.c
- * performs prefetching directly using that interface. This probably
- * needs to be rectified at a later point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
- bool (*scan_bitmap_next_block) (TableScanDesc scan,
- bool *recheck, BlockNumber *blockno);
+ bool (*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1982,8 +1975,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* used after verifying the presence (at plan time or such).
*/
static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1993,7 +1985,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a59df51dd69..d41a3e134d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,8 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
* mutex mutual exclusion for the prefetching variable
* and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1695,10 +1692,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,16 +1703,10 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
* recheck do current page's tuples need recheck
@@ -1729,20 +1717,13 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
bool recheck;
- BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
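For reviewers, here is a condensed sketch of the control flow BitmapHeapNext() ends up with once the streaming read user patch above is applied: the table AM advances through the bitmap, and the executor only loops over blocks and tuples. This is an illustrative excerpt assembled from the hunks above (parallel-state handling and other details elided), not code meant to compile on its own:

    /* BitmapHeapNext(), after the bitmap and scan have been set up */
    if (!node->initialized)
    {
        /* ... run the index scan, build the bitmap, begin the table scan ... */
        node->initialized = true;

        /* Get the first block; if there is none, the scan is over. */
        if (!table_scan_bitmap_next_block(scan, &node->recheck))
            return ExecClearTuple(slot);
    }

    do
    {
        /* Return each visible tuple from the current block. */
        while (table_scan_bitmap_next_tuple(scan, slot))
        {
            CHECK_FOR_INTERRUPTS();

            /* Lossy bitmap pages require rechecking the quals per tuple. */
            if (node->recheck)
            {
                econtext->ecxt_scantuple = slot;
                if (!ExecQualAndReset(node->bitmapqualorig, econtext))
                {
                    InstrCountFiltered2(node, 1);
                    ExecClearTuple(slot);
                    continue;
                }
            }

            BitmapAccumCounters(node, scan);
            return slot;
        }
        /* No more tuples on this block; ask the AM for the next one. */
    } while (table_scan_bitmap_next_block(scan, &node->recheck));

    /* Bitmap exhausted. */
    BitmapAccumCounters(node, scan);
    return ExecClearTuple(slot);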
Attachment: v2-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (text/x-patch)
From 8632e2b57bd4b532e5cbe94df89f2c1123fed62c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:17:47 -0500
Subject: [PATCH v2 08/13] Reduce scope of BitmapHeapScan tbmiterator local
variables
To simplify the diff of a future commit which will move the TBMIterators
into the scan descriptor, define them in a narrower scope now.
---
src/backend/executor/nodeBitmapHeapscan.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3b89e7e6c63..c62f978f5d7 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,8 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
@@ -90,10 +88,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
tbmres = node->tbmres;
/*
@@ -110,6 +104,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ TBMIterator *tbmiterator = NULL;
+ TBMSharedIterator *shared_tbmiterator = NULL;
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -118,7 +115,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
+ tbmiterator = tbm_begin_iterate(tbm);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -171,8 +168,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
- tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
+ shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -218,6 +214,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
+ node->tbmiterator = tbmiterator;
+ node->shared_tbmiterator = shared_tbmiterator;
node->initialized = true;
}
@@ -231,9 +229,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmres == NULL)
{
if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
+ node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
if (tbmres == NULL)
{
/* no more entries in the bitmap */
--
2.37.2
Attachment: v2-0009-Make-table_scan_bitmap_next_block-async-friendly.patch (text/x-patch)
From 59fdb0423ddc9032380247f987b682944a52d476 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v2 09/13] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
was past the known end of the table when the scan started.
This forced the caller to be responsible for determining if additional
blocks should be fetched and then for invoking
table_scan_bitmap_next_block() for these blocks.
It makes more sense for table_scan_bitmap_next_block() to be responsible
for finding a block that is not past the end of the table (as of the
time that the scan began) and for table_scan_bitmap_next_tuple() to
return false if there are no visible tuples on the page.
This also allows us to move responsibility for the iterator to table AM
specific code. This means handling invalid blocks is entirely up to
the table AM.
These changes will enable bitmapheapscan to use the future streaming
read API. The table AMs will implement a streaming read API callback
that returns the next block that needs to be fetched. In heap AM's case,
the callback will use the iterator to find the next block to be fetched.
Since choosing the next block will no longer be the responsibility of
BitmapHeapNext(), the streaming read control flow requires these changes
to table_scan_bitmap_next_block().
---
src/backend/access/heap/heapam_handler.c | 58 +++++++--
src/backend/executor/nodeBitmapHeapscan.c | 148 ++++++++--------------
src/include/access/relscan.h | 5 +
src/include/access/tableam.h | 47 +++++--
src/include/nodes/execnodes.h | 11 +-
5 files changed, 145 insertions(+), 124 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3af9466b9ca..c8da3def645 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,51 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->rs_empty_tuples_pending == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+
+ /* Got a valid block */
+ *blockno = tbmres->blockno;
+ *recheck = tbmres->recheck;
+
/*
* We can skip fetching the heap page if we don't need any fields from the
* heap, and the bitmap entries don't need rechecking, and all tuples on
@@ -2143,16 +2177,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
return true;
}
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE isolation
- * though, as we need to examine all invisible tuples reachable by the
- * index.
- */
- if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
- return false;
+ block = tbmres->blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2251,7 +2276,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
}
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
static bool
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c62f978f5d7..ae837785116 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,7 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -88,7 +87,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
@@ -116,7 +114,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -169,7 +166,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -214,46 +210,24 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
+
node->initialized = true;
+
+ /* Get the first block. if none, end of scan */
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ return ExecClearTuple(slot);
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ BitmapAdjustPrefetchTarget(node);
}
for (;;)
{
- CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
- {
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres->blockno);
-
- if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
-
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
+ while (table_scan_bitmap_next_tuple(scan, slot))
{
- /*
- * Continuing in previously obtained page.
- */
+ CHECK_FOR_INTERRUPTS();
#ifdef USE_PREFETCH
@@ -275,49 +249,41 @@ BitmapHeapNext(BitmapHeapScanState *node)
SpinLockRelease(&pstate->mutex);
}
#endif /* USE_PREFETCH */
- }
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- BitmapPrefetch(node, scan);
-
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
+ /*
+ * We prefetch before fetching the current pages. We expect that a
+ * future streaming read API will do this, so do it now for
+ * consistency.
+ */
+ BitmapPrefetch(node, scan);
- /*
- * If we are using lossy info, we have to recheck the qual conditions
- * at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ /*
+ * If we are using lossy info, we have to recheck the qual
+ * conditions at every tuple.
+ */
+ if (node->recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
+ continue;
+ }
}
+
+ /* OK to return this tuple */
+ BitmapAccumCounters(node, scan);
+ return slot;
}
- /* OK to return this tuple */
- BitmapAccumCounters(node, scan);
- return slot;
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ break;
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ /* Adjust the prefetch target */
+ BitmapAdjustPrefetchTarget(node);
}
/*
@@ -608,12 +574,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
@@ -621,13 +583,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->tbmiterator = NULL;
- node->tbmres = NULL;
node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
node->pvmbuffer = InvalidBuffer;
+ node->recheck = true;
+ node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -658,28 +619,24 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
ExecEndNode(outerPlanState(node));
+
+ /*
+ * close heap scan
+ */
+ if (scanDesc)
+ table_endscan(scanDesc);
+
/*
* release bitmaps and buffers if any
*/
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
-
- /*
- * close heap scan
- */
- if (scanDesc)
- table_endscan(scanDesc);
-
}
/* ----------------------------------------------------------------
@@ -712,8 +669,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->tbmiterator = NULL;
- scanstate->tbmres = NULL;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -722,10 +677,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
+ scanstate->recheck = true;
+ scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..5dea9c7a03d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -24,6 +24,9 @@
struct ParallelTableScanDescData;
+struct TBMIterator;
+struct TBMSharedIterator;
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -41,6 +44,8 @@ typedef struct TableScanDescData
ItemPointerData rs_maxtid;
/* Only used for Bitmap table scans */
+ struct TBMIterator *tbmiterator;
+ struct TBMSharedIterator *shared_tbmiterator;
long exact_pages;
long lossy_pages;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a3e30c4eda7..ef1fcc02b1a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "nodes/tidbitmap.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -804,7 +805,7 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ bool *recheck, BlockNumber *blockno);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -948,6 +949,8 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
result->lossy_pages = 0;
result->exact_pages = 0;
+ result->shared_tbmiterator = NULL;
+ result->tbmiterator = NULL;
return result;
}
@@ -1008,6 +1011,21 @@ table_beginscan_analyze(Relation rel)
static inline void
table_endscan(TableScanDesc scan)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_end(scan);
}
@@ -1018,6 +1036,21 @@ static inline void
table_rescan(TableScanDesc scan,
struct ScanKeyData *key)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_rescan(scan, key, false, false, false, false);
}
@@ -1941,17 +1974,16 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
- * a bitmap table scan. `scan` needs to have been started via
- * table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise.
+ * Prepare to fetch / check / return tuples as part of a bitmap table scan.
+ * `scan` needs to have been started via table_beginscan_bm(). Returns false if
+ * there are no more blocks in the bitmap, true otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1961,8 +1993,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9392923eb32..a59df51dd69 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1709,9 +1709,7 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
- * tbmres current-page data
- * pvmbuffer ditto, for prefetched pages
+ * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
* prefetch_iterator iterator for prefetching ahead of current page
@@ -1720,10 +1718,10 @@ typedef struct ParallelBitmapHeapState
* prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
+ * recheck do current page's tuples need recheck
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1731,8 +1729,6 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- TBMIterator *tbmiterator;
- TBMIterateResult *tbmres;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
@@ -1742,10 +1738,11 @@ typedef struct BitmapHeapScanState
int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
+ bool recheck;
+ BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
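To make the new contract of table_scan_bitmap_next_block() in v2-0009 concrete: the heap AM now keeps iterating inside the callback until it either finds a block it is willing to scan or the bitmap runs out, so a false return always means the scan can end. A rough sketch of the heap AM side, condensed from the patch above and not meant to compile standalone:

    /* heapam_scan_bitmap_next_block(): advance to the next usable block */
    do
    {
        CHECK_FOR_INTERRUPTS();

        if (scan->shared_tbmiterator)
            tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
        else
            tbmres = tbm_iterate(scan->tbmiterator);

        if (tbmres == NULL)
            return false;       /* bitmap exhausted: the scan can end */

        /*
         * Skip entries past the end of the relation as of the start of
         * the scan, except under SERIALIZABLE isolation.
         */
    } while (!IsolationIsSerializable() &&
             tbmres->blockno >= hscan->rs_nblocks);

    *blockno = tbmres->blockno;
    *recheck = tbmres->recheck;

    /* ... skip-fetch check, pin the page, collect visible offsets ... */

    /*
     * Return true even if no tuples on the page turn out to be visible;
     * heapam_scan_bitmap_next_tuple() will simply return false and the
     * caller will ask for the next block.
     */
    return true;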
Attachment: v2-0003-BitmapHeapScan-begin-scan-after-bitmap-setup.patch (text/x-patch)
From 6cfa8fee46a83936789062eedc37d0c06b59dc46 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v2 03/13] BitmapHeapScan begin scan after bitmap setup
There is no reason for table_beginscan_bm() to begin the actual scan of
the underlying table in ExecInitBitmapHeapScan(). We can begin the
underlying table scan after the index scan has been completed and the
bitmap built.
The one use of the scan descriptor during initialization was
ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
with one from an array in the parallel state. This overwrote the
snapshot set in table_beginscan_bm().
By saving that worker snapshot as a member in the BitmapHeapScanState
during initialization, it can be restored in table_beginscan_bm() after
returning from the table AM specific begin scan function.
---
src/backend/access/table/tableam.c | 11 ------
src/backend/executor/nodeBitmapHeapscan.c | 43 +++++++++++++++++------
src/include/access/tableam.h | 10 ++----
src/include/nodes/execnodes.h | 2 ++
4 files changed, 38 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e78d793f69c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,17 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
NULL, flags);
}
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
- Assert(IsMVCCSnapshot(snapshot));
-
- RegisterSnapshot(snapshot);
- scan->rs_snapshot = snapshot;
- scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
-
/* ----------------------------------------------------------------------------
* Parallel table scan related functions.
* ----------------------------------------------------------------------------
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 76382c91fd7..be08bd785ae 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -191,6 +191,30 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#endif /* USE_PREFETCH */
}
+
+ if (!scan)
+ {
+ Snapshot snapshot = node->ss.ps.state->es_snapshot;
+ uint32 extra_flags = 0;
+
+ /*
+ * Parallel workers must use the snapshot initialized by the
+ * parallel leader.
+ */
+ if (node->worker_snapshot)
+ {
+ snapshot = node->worker_snapshot;
+ extra_flags |= SO_TEMP_SNAPSHOT;
+ }
+
+ scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
+ node->ss.ss_currentRelation,
+ snapshot,
+ 0,
+ NULL,
+ extra_flags);
+ }
+
node->initialized = true;
}
@@ -614,7 +638,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
PlanState *outerPlan = outerPlanState(node);
/* rescan to release any page pin */
- table_rescan(node->ss.ss_currentScanDesc, NULL);
+ if (node->ss.ss_currentScanDesc)
+ table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
if (node->tbmiterator)
@@ -691,7 +716,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* close heap scan
*/
- table_endscan(scanDesc);
+ if (scanDesc)
+ table_endscan(scanDesc);
+
}
/* ----------------------------------------------------------------
@@ -740,6 +767,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->can_skip_fetch = false;
+ scanstate->worker_snapshot = NULL;
/*
* Miscellaneous initialization
@@ -788,11 +816,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ss_currentRelation = currentRelation;
- scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
- estate->es_snapshot,
- 0,
- NULL);
-
/*
* all done.
*/
@@ -931,13 +954,13 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelBitmapHeapState *pstate;
- Snapshot snapshot;
Assert(node->ss.ps.state->es_query_dsa != NULL);
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->pstate = pstate;
- snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
- table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
+ node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
+ Assert(IsMVCCSnapshot(node->worker_snapshot));
+ RegisterSnapshot(node->worker_snapshot);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4d495216f07..8ef6b5ca25b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -939,9 +939,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key)
+ int nkeys, struct ScanKeyData *key,
+ uint32 extra_flags)
{
- uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1033,11 +1034,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
allow_pagemode);
}
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
/*
* Return next tuple from `scan`, store in slot.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..00c75fb10e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,7 @@ typedef struct ParallelBitmapHeapState
* shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
+ * worker_snapshot snapshot for parallel worker
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1750,6 +1751,7 @@ typedef struct BitmapHeapScanState
TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
+ Snapshot worker_snapshot;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
Attachment: v2-0001-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (text/x-patch)
From 575fb1f93128ebfd8125c769de628f91e0d5c592 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:13:41 -0500
Subject: [PATCH v2 01/13] Remove table_scan_bitmap_next_tuple parameter tbmres
Future commits will remove the input TBMIterateResult from
table_scan_bitmap_next_block() as the streaming read API will be
responsible for iterating through the blocks in the bitmap and not
BitmapHeapNext(). Given that this parameter will not be set from
BitmapHeapNext(), it no longer makes sense to use it as a means of
communication between table_scan_bitmap_next_tuple() and
table_scan_bitmap_next_block().
---
src/backend/access/heap/heapam_handler.c | 1 -
src/backend/executor/nodeBitmapHeapscan.c | 2 +-
src/include/access/tableam.h | 7 -------
3 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..716d477e271 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2228,7 +2228,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
static bool
heapam_scan_bitmap_next_tuple(TableScanDesc scan,
- TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..d670939246b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -304,7 +304,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* Attempt to fetch tuple from AM.
*/
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
/* nothing more to look at on this page */
node->tbmres = tbmres = NULL;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..4d495216f07 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -810,15 +810,10 @@ typedef struct TableAmRoutine
* Fetch the next tuple of a bitmap table scan into `slot` and return true
* if a visible tuple was found, false otherwise.
*
- * For some AMs it will make more sense to do all the work referencing
- * `tbmres` contents in scan_bitmap_next_block, for others it might be
- * better to defer more work to this callback.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_tuple) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot);
/*
@@ -1980,7 +1975,6 @@ table_scan_bitmap_next_block(TableScanDesc scan,
*/
static inline bool
table_scan_bitmap_next_tuple(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
/*
@@ -1992,7 +1986,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
- tbmres,
slot);
}
--
2.37.2
Attachment: v2-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (text/x-patch)
From 2a11dc5ef62408bb455277eb73c83abc8f9d2bf3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v2 04/13] BitmapPrefetch use prefetch block recheck for skip
fetch
As of 7c70996ebf0949b142a9, BitmapPrefetch() used the recheck flag for
the current block to determine whether or not it could skip prefetching
the proposed prefetch block. It makes more sense for it to use the
recheck flag from the TBMIterateResult for the prefetch block instead.
See this [1] thread on hackers reporting the issue.
[1] https://www.postgresql.org/message-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ%2BopUrXDRXdcfwFZGA%40mail.gmail.com
---
src/backend/executor/nodeBitmapHeapscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index be08bd785ae..c6dfdf8cae9 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -532,7 +532,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* but is true in many cases.
*/
skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
@@ -583,7 +583,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
--
2.37.2
Attachment: v2-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (text/x-patch)
From 5f915bc84eae56e52b5a61e9b7e691834fdb9680 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:38:41 -0500
Subject: [PATCH v2 02/13] BitmapHeapScan set can_skip_fetch later
There is no reason for BitmapHeapScan to calculate can_skip_fetch in
ExecInitBitmapHeapScan(). Moving it into BitmapHeapNext() is a
preliminary step toward moving can_skip_fetch into table AM specific
code, as we would need to set it after the scan has begun.
---
src/backend/executor/nodeBitmapHeapscan.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index d670939246b..76382c91fd7 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ /*
+ * We can potentially skip fetching heap pages if we do not need any
+ * columns of the table, either for checking non-indexable quals or
+ * for returning data. This test is a bit simplistic, as it checks
+ * the stronger condition that there's no qual or return tlist at all.
+ * But in most cases it's probably not worth working harder than that.
+ */
+ node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ node->ss.ps.plan->targetlist == NIL);
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -729,16 +739,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or for
- * returning data. This test is a bit simplistic, as it checks the
- * stronger condition that there's no qual or return tlist at all. But in
- * most cases it's probably not worth working harder than that.
- */
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
/*
* Miscellaneous initialization
--
2.37.2
Attachment: v2-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (text/x-patch)
From d578ca2c47857794622c49319087f53918fe4c6c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v2 05/13] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c6dfdf8cae9..07a218ec03e 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -55,7 +55,7 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
+ BlockNumber blockno);
static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
static inline void BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan);
@@ -239,7 +239,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
break;
}
- BitmapAdjustPrefetchIterator(node, tbmres);
+ BitmapAdjustPrefetchIterator(node, tbmres->blockno);
/*
* We can skip fetching the heap page if we don't need any fields
@@ -392,7 +392,7 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
*/
static inline void
BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
+ BlockNumber blockno)
{
#ifdef USE_PREFETCH
ParallelBitmapHeapState *pstate = node->pstate;
@@ -411,7 +411,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
+ if (tbmpre == NULL || tbmpre->blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
--
2.37.2
Attachment: v2-0007-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch (text/x-patch)
From 224a7e4e8eb7106c1e7159df8ca3d7ede6732be8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v2 07/13] BitmapHeapScan scan desc counts lossy and exact
pages
Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 19 ++++++++++++++-----
src/include/access/relscan.h | 4 ++++
src/include/access/tableam.h | 6 +++++-
4 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d775756fa53..3af9466b9ca 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ /* Only count exact and lossy pages with visible tuples */
+ if (ntup > 0)
+ {
+ if (tbmres->ntuples >= 0)
+ scan->exact_pages++;
+ else
+ scan->lossy_pages++;
+ }
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c884771e826..3b89e7e6c63 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
#include "utils/spccache.h"
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
BlockNumber blockno);
@@ -246,11 +248,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
continue;
}
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
- node->lossy_pages++;
-
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
@@ -321,15 +318,27 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* OK to return this tuple */
+ BitmapAccumCounters(node, scan);
return slot;
}
/*
* if we get here it means we are at the end of the scan..
*/
+ BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
}
+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan)
+{
+ node->exact_pages += scan->exact_pages;
+ scan->exact_pages = 0;
+ node->lossy_pages += scan->lossy_pages;
+ scan->lossy_pages = 0;
+}
+
/*
* BitmapDoneInitializingSharedState - Shared state is initialized
*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ long exact_pages;
+ long lossy_pages;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b5d65a9528c..a3e30c4eda7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -942,9 +942,13 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
uint32 extra_flags)
{
+ TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
- return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result->lossy_pages = 0;
+ result->exact_pages = 0;
+ return result;
}
/*
--
2.37.2
v2-0006-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (text/x-patch; charset=US-ASCII)
From 9104d9c36462119a1875d7620b21d37b994216f9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v2 06/13] Push BitmapHeapScan skip fetch optimization into
table AM
This resolves the FIXME in BitmapHeapNext() which said that the
optimization to skip fetching blocks of the underlying table when none of
the column data was needed should be pushed into the table AM-specific
code.
heapam_scan_bitmap_next_block() now does the visibility check and
counts the empty tuples to be returned, while
heapam_scan_bitmap_next_tuple() prepares the slot to return those empty
tuples.
The table AM agnostic functions for prefetching still need to know if
skipping fetching is permitted for this scan. However, this dependency
will be removed when that prefetching code is removed in favor of the
upcoming streaming read API.
---
src/backend/access/heap/heapam.c | 14 +++
src/backend/access/heap/heapam_handler.c | 29 ++++++
src/backend/executor/nodeBitmapHeapscan.c | 115 +++++++---------------
src/include/access/heapam.h | 10 ++
src/include/access/tableam.h | 18 ++--
src/include/nodes/execnodes.h | 6 --
6 files changed, 95 insertions(+), 97 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..b93f243c282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_vmbuffer = InvalidBuffer;
+ scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1043,6 +1045,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* reinitialize scan descriptor
*/
@@ -1062,6 +1070,12 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 716d477e271..d775756fa53 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -2124,6 +2125,24 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ /*
+ * We can skip fetching the heap page if we don't need any fields from the
+ * heap, and the bitmap entries don't need rechecking, and all tuples on
+ * the page are visible to our transaction.
+ */
+ if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ {
+ /* can't be lossy in the skip_fetch case */
+ Assert(tbmres->ntuples >= 0);
+ Assert(hscan->rs_empty_tuples_pending >= 0);
+
+ hscan->rs_empty_tuples_pending += tbmres->ntuples;
+
+ return true;
+ }
+
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2235,6 +2254,16 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
+ if (hscan->rs_empty_tuples_pending > 0)
+ {
+ /*
+ * If we don't have to fetch the tuple, just return nulls.
+ */
+ ExecStoreAllNullTuple(slot);
+ hscan->rs_empty_tuples_pending--;
+ return true;
+ }
+
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 07a218ec03e..c884771e826 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,16 +108,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or
- * for returning data. This test is a bit simplistic, as it checks
- * the stronger condition that there's no qual or return tlist at all.
- * But in most cases it's probably not worth working harder than that.
- */
- node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
- node->ss.ps.plan->targetlist == NIL);
-
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -207,6 +197,17 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags |= SO_TEMP_SNAPSHOT;
}
+ /*
+ * We can potentially skip fetching heap pages if we do not need
+ * any columns of the table, either for checking non-indexable
+ * quals or for returning data. This test is a bit simplistic, as
+ * it checks the stronger condition that there's no qual or return
+ * tlist at all. But in most cases it's probably not worth working
+ * harder than that.
+ */
+ if (node->ss.ps.plan->qual == NIL && node->ss.ps.plan->targetlist == NIL)
+ extra_flags |= SO_CAN_SKIP_FETCH;
+
scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
node->ss.ss_currentRelation,
snapshot,
@@ -220,8 +221,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -241,32 +240,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -320,46 +294,30 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
BitmapPrefetch(node, scan);
- if (node->return_empty_tuples > 0)
+ /*
+ * Attempt to fetch tuple from AM.
+ */
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
+ /* nothing more to look at on this page */
+ node->tbmres = tbmres = NULL;
+ continue;
}
- else
+
+ /*
+ * If we are using lossy info, we have to recheck the qual conditions
+ * at every tuple.
+ */
+ if (tbmres->recheck)
{
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
continue;
}
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
- {
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
- }
- }
}
/* OK to return this tuple */
@@ -531,7 +489,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* it did for the current heap page; which is not a certainty
* but is true in many cases.
*/
- skip_fetch = (node->can_skip_fetch &&
+
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
!tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -582,7 +541,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
/* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
!tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -652,8 +611,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
@@ -663,7 +620,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
node->initialized = false;
node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
- node->vmbuffer = InvalidBuffer;
node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -708,8 +664,6 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
@@ -753,8 +707,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
- scanstate->vmbuffer = InvalidBuffer;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -766,7 +718,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
- scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..3dfb19ec7d5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,16 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /*
+ * These fields are only used for bitmap scans for the "skip fetch"
+ * optimization. Bitmap scans needing no fields from the heap may skip
+ * fetching an all visible block, instead using the number of tuples per
+ * block reported by the bitmap to determine how many NULL-filled tuples
+ * to return.
+ */
+ Buffer rs_vmbuffer;
+ int rs_empty_tuples_pending;
+
/* these fields only used in page-at-a-time mode and for bitmap scans */
int rs_cindex; /* current tuple's index in vistuples */
int rs_ntuples; /* number of visible tuples on page */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8ef6b5ca25b..b5d65a9528c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,6 +62,13 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+
+ /*
+ * At the discretion of the table AM, bitmap table scans may be able to
+ * skip fetching a block from the table if none of the table data is
+ * needed.
+ */
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
/*
@@ -780,10 +787,8 @@ typedef struct TableAmRoutine
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
- * make sense to perform tuple visibility checks at this time). For some
- * AMs it will make more sense to do all the work referencing `tbmres`
- * contents here, for others it might be better to defer more work to
- * scan_bitmap_next_tuple.
+ * make sense to perform tuple visibility checks at this time). All work
+ * referencing `tbmres` must be done here.
*
* If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
* on the page have to be returned, otherwise the tuples at offsets in
@@ -795,11 +800,6 @@ typedef struct TableAmRoutine
* performs prefetching directly using that interface. This probably
* needs to be rectified at a later point.
*
- * XXX: Currently this may only be implemented if the AM uses the
- * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
- * perform prefetching. This probably needs to be rectified at a later
- * point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 00c75fb10e2..9392923eb32 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1711,9 +1711,6 @@ typedef struct ParallelBitmapHeapState
* tbm bitmap obtained from child index scan(s)
* tbmiterator iterator for scanning current pages
* tbmres current-page data
- * can_skip_fetch can we potentially skip tuple fetches in this scan?
- * return_empty_tuples number of empty tuples to return
- * vmbuffer buffer for visibility-map lookups
* pvmbuffer ditto, for prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
@@ -1736,9 +1733,6 @@ typedef struct BitmapHeapScanState
TIDBitmap *tbm;
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
- bool can_skip_fetch;
- int return_empty_tuples;
- Buffer vmbuffer;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
--
2.37.2
In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.
I've also made some updates to the TIDBitmap API that seem like a
general clarity improvement. They also reduce the diff for GIN when
separating the TBMIterateResult from the TBM[Shared]Iterator, and the
TIDBitmap API changes are now all in their own commits (previously they
were in the same commit that adds the BitmapHeapScan streaming read
user).
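For anyone skimming the thread, here is a rough sketch of the kind of
separation I mean. The signatures below are illustrative guesses only,
not necessarily what the attached TIDBitmap patches do; the point is
just that the caller owns the TBMIterateResult instead of getting a
pointer into the iterator's private storage.

    /*
     * Illustrative sketch only -- assumed signatures, not the patch contents.
     */
    TBMIterator      iterator;
    TBMIterateResult tbmres;    /* caller-supplied storage */

    tbm_begin_iterate(tbm, &iterator);
    while (tbm_iterate(&iterator, &tbmres))  /* fills tbmres, false when done */
    {
        /* blockno, ntuples, and recheck are read from the caller's copy */
        process_block(tbmres.blockno);       /* placeholder for real work */
    }
    tbm_end_iterate(&iterator);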
The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous
email
2) the TODO in the TIDBitmap API changes about being sure that setting
TBMIterateResult->blockno to InvalidBlockNumber is sufficient for
indicating an invalid TBMIterateResult (and an exhausted bitmap); see
the sketch after this list
3) the streaming read API is not committed yet, so the last two patches
are not "done"
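To make issue 2 concrete, the convention in question would look roughly
like this at a call site. This is only a sketch, reusing the assumed
caller-supplied-result signature from above, and it assumes blockno is
the only field callers inspect to detect the end of the bitmap; whether
that assumption holds everywhere is exactly what the TODO asks.

    /* sketch only: detect an exhausted bitmap via blockno alone */
    tbm_iterate(&iterator, &tbmres);
    if (!BlockNumberIsValid(tbmres.blockno))
    {
        /* no more entries in the bitmap; end the scan */
        return false;
    }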
- Melanie
Attachments:
v3-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (text/x-diff; charset=us-ascii)
From e0cee301b81400209a0e727a3d7daa1f435ba999 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v3 01/13] BitmapHeapScan begin scan after bitmap creation
There is no reason for a BitmapHeapScan to begin the scan of the
underlying table in ExecInitBitmapHeapScan(). Instead, do so after
completing the index scan and building the bitmap.
ExecBitmapHeapInitializeWorker() overwrote the snapshot in the scan
descriptor with the correct one provided by the parallel leader. Since
ExecBitmapHeapInitializeWorker() is now called before the scan
descriptor has been created, save the worker's snapshot in the
BitmapHeapScanState and pass it to table_beginscan_bm().
---
src/backend/access/table/tableam.c | 11 ------
src/backend/executor/nodeBitmapHeapscan.c | 47 ++++++++++++++++++-----
src/include/access/tableam.h | 10 ++---
src/include/nodes/execnodes.h | 2 +
4 files changed, 42 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e78d793f69c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,17 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
NULL, flags);
}
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
- Assert(IsMVCCSnapshot(snapshot));
-
- RegisterSnapshot(snapshot);
- scan->rs_snapshot = snapshot;
- scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
-
/* ----------------------------------------------------------------------------
* Parallel table scan related functions.
* ----------------------------------------------------------------------------
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..44bf38be3c9 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -181,6 +181,34 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#endif /* USE_PREFETCH */
}
+
+ /*
+ * If this is the first scan of the underlying table, create the table
+ * scan descriptor and begin the scan.
+ */
+ if (!scan)
+ {
+ Snapshot snapshot = node->ss.ps.state->es_snapshot;
+ uint32 extra_flags = 0;
+
+ /*
+ * Parallel workers must use the snapshot initialized by the
+ * parallel leader.
+ */
+ if (node->worker_snapshot)
+ {
+ snapshot = node->worker_snapshot;
+ extra_flags |= SO_TEMP_SNAPSHOT;
+ }
+
+ scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
+ node->ss.ss_currentRelation,
+ snapshot,
+ 0,
+ NULL,
+ extra_flags);
+ }
+
node->initialized = true;
}
@@ -604,7 +632,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
PlanState *outerPlan = outerPlanState(node);
/* rescan to release any page pin */
- table_rescan(node->ss.ss_currentScanDesc, NULL);
+ if (node->ss.ss_currentScanDesc)
+ table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
if (node->tbmiterator)
@@ -681,7 +710,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* close heap scan
*/
- table_endscan(scanDesc);
+ if (scanDesc)
+ table_endscan(scanDesc);
+
}
/* ----------------------------------------------------------------
@@ -739,6 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
*/
scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
node->scan.plan.targetlist == NIL);
+ scanstate->worker_snapshot = NULL;
/*
* Miscellaneous initialization
@@ -787,11 +819,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ss_currentRelation = currentRelation;
- scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
- estate->es_snapshot,
- 0,
- NULL);
-
/*
* all done.
*/
@@ -930,13 +957,13 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelBitmapHeapState *pstate;
- Snapshot snapshot;
Assert(node->ss.ps.state->es_query_dsa != NULL);
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->pstate = pstate;
- snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
- table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
+ node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
+ Assert(IsMVCCSnapshot(node->worker_snapshot));
+ RegisterSnapshot(node->worker_snapshot);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..5375dd7150f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -944,9 +944,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key)
+ int nkeys, struct ScanKeyData *key,
+ uint32 extra_flags)
{
- uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1038,11 +1039,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
allow_pagemode);
}
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
/*
* Return next tuple from `scan`, store in slot.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..00c75fb10e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,7 @@ typedef struct ParallelBitmapHeapState
* shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
+ * worker_snapshot snapshot for parallel worker
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1750,6 +1751,7 @@ typedef struct BitmapHeapScanState
TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
+ Snapshot worker_snapshot;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v3-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (text/x-diff; charset=us-ascii)
From 69cd001bcdade976a51985e714d1b30b090bb388 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:38:41 -0500
Subject: [PATCH v3 02/13] BitmapHeapScan set can_skip_fetch later
Set BitmapHeapScanState->can_skip_fetch in BitmapHeapNext() when
!BitmapHeapScanState->initialized instead of in
ExecInitBitmapHeapScan(). This is a preliminary step to removing
can_skip_fetch from BitmapHeapScanState and setting it in table AM
specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 44bf38be3c9..a9ba2bdfb88 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ /*
+ * We can potentially skip fetching heap pages if we do not need any
+ * columns of the table, either for checking non-indexable quals or
+ * for returning data. This test is a bit simplistic, as it checks
+ * the stronger condition that there's no qual or return tlist at all.
+ * But in most cases it's probably not worth working harder than that.
+ */
+ node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ node->ss.ps.plan->targetlist == NIL);
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -760,16 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or for
- * returning data. This test is a bit simplistic, as it checks the
- * stronger condition that there's no qual or return tlist at all. But in
- * most cases it's probably not worth working harder than that.
- */
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
--
2.37.2
v3-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (text/x-diff; charset=us-ascii)
From b29df9592f8b3a3966cf6fab40f56a0c113f3d57 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v3 03/13] Push BitmapHeapScan skip fetch optimization into
table AM
7c70996ebf0949b142 introduced an optimization to allow bitmap table
scans to skip fetching a block from the heap if none of the underlying
data was needed and the block is marked all visible in the visibility
map. With the addition of table AMs, a FIXME was added to this code
indicating that it should be pushed into table AM specific code, as not
all table AMs may use a visibility map in the same way.
Resolve this FIXME for the current block and implement it for the heap
table AM by moving the vmbuffer and other fields needed for the
optimization from the BitmapHeapScanState into the HeapScanDescData.
heapam_scan_bitmap_next_block() now decides whether or not to skip
fetching the block before reading it in and
heapam_scan_bitmap_next_tuple() returns NULL-filled tuples for skipped
blocks.
The layering violation is still present in BitmapHeapScan's prefetching
code. However, this will be eliminated when prefetching is implemented
using the upcoming streaming read API discussed in [1].
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 14 +++
src/backend/access/heap/heapam_handler.c | 29 ++++++
src/backend/executor/nodeBitmapHeapscan.c | 118 ++++++----------------
src/include/access/heapam.h | 10 ++
src/include/access/tableam.h | 7 ++
src/include/nodes/execnodes.h | 8 +-
6 files changed, 94 insertions(+), 92 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..b93f243c282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_vmbuffer = InvalidBuffer;
+ scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1043,6 +1045,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* reinitialize scan descriptor
*/
@@ -1062,6 +1070,12 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..7661acac3a8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -2124,6 +2125,24 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ /*
+ * We can skip fetching the heap page if we don't need any fields from the
+ * heap, and the bitmap entries don't need rechecking, and all tuples on
+ * the page are visible to our transaction.
+ */
+ if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ {
+ /* can't be lossy in the skip_fetch case */
+ Assert(tbmres->ntuples >= 0);
+ Assert(hscan->rs_empty_tuples_pending >= 0);
+
+ hscan->rs_empty_tuples_pending += tbmres->ntuples;
+
+ return true;
+ }
+
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2236,6 +2255,16 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
+ if (hscan->rs_empty_tuples_pending > 0)
+ {
+ /*
+ * If we don't have to fetch the tuple, just return nulls.
+ */
+ ExecStoreAllNullTuple(slot);
+ hscan->rs_empty_tuples_pending--;
+ return true;
+ }
+
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a9ba2bdfb88..2e4f87ea3a3 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,16 +108,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or
- * for returning data. This test is a bit simplistic, as it checks
- * the stronger condition that there's no qual or return tlist at all.
- * But in most cases it's probably not worth working harder than that.
- */
- node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
- node->ss.ps.plan->targetlist == NIL);
-
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -211,6 +201,17 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags |= SO_TEMP_SNAPSHOT;
}
+ /*
+ * We can potentially skip fetching heap pages if we do not need
+ * any columns of the table, either for checking non-indexable
+ * quals or for returning data. This test is a bit simplistic, as
+ * it checks the stronger condition that there's no qual or return
+ * tlist at all. But in most cases it's probably not worth working
+ * harder than that.
+ */
+ if (node->ss.ps.plan->qual == NIL && node->ss.ps.plan->targetlist == NIL)
+ extra_flags |= SO_CAN_SKIP_FETCH;
+
scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
node->ss.ss_currentRelation,
snapshot,
@@ -224,8 +225,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -245,32 +244,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres);
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -318,52 +292,33 @@ BitmapHeapNext(BitmapHeapScanState *node)
* should happen only when we have determined there is still something
* to do on the current page, else we may uselessly prefetch the same
* page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
*/
BitmapPrefetch(node, scan);
- if (node->return_empty_tuples > 0)
+ /*
+ * Attempt to fetch tuple from AM.
+ */
+ if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
+ /* nothing more to look at on this page */
+ node->tbmres = tbmres = NULL;
+ continue;
}
- else
+
+ /*
+ * If we are using lossy info, we have to recheck the qual conditions
+ * at every tuple.
+ */
+ if (tbmres->recheck)
{
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
continue;
}
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
- {
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
- }
- }
}
/* OK to return this tuple */
@@ -535,7 +490,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* it did for the current heap page; which is not a certainty
* but is true in many cases.
*/
- skip_fetch = (node->can_skip_fetch &&
+
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -586,7 +542,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
/* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -656,8 +612,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
@@ -667,7 +621,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
node->initialized = false;
node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
- node->vmbuffer = InvalidBuffer;
node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -712,8 +665,6 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
@@ -757,8 +708,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
- scanstate->vmbuffer = InvalidBuffer;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -770,7 +719,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
- scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..3dfb19ec7d5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,16 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /*
+ * These fields are only used for bitmap scans for the "skip fetch"
+ * optimization. Bitmap scans needing no fields from the heap may skip
+ * fetching an all visible block, instead using the number of tuples per
+ * block reported by the bitmap to determine how many NULL-filled tuples
+ * to return.
+ */
+ Buffer rs_vmbuffer;
+ int rs_empty_tuples_pending;
+
/* these fields only used in page-at-a-time mode and for bitmap scans */
int rs_cindex; /* current tuple's index in vistuples */
int rs_ntuples; /* number of visible tuples on page */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5375dd7150f..c193ea5db43 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,6 +62,13 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+
+ /*
+ * At the discretion of the table AM, bitmap table scans may be able to
+ * skip fetching a block from the table if none of the table data is
+ * needed.
+ */
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 00c75fb10e2..6fb4ec07c5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1711,10 +1711,7 @@ typedef struct ParallelBitmapHeapState
* tbm bitmap obtained from child index scan(s)
* tbmiterator iterator for scanning current pages
* tbmres current-page data
- * can_skip_fetch can we potentially skip tuple fetches in this scan?
- * return_empty_tuples number of empty tuples to return
- * vmbuffer buffer for visibility-map lookups
- * pvmbuffer ditto, for prefetched pages
+ * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
* prefetch_iterator iterator for prefetching ahead of current page
@@ -1736,9 +1733,6 @@ typedef struct BitmapHeapScanState
TIDBitmap *tbm;
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
- bool can_skip_fetch;
- int return_empty_tuples;
- Buffer vmbuffer;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
--
2.37.2
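One note on the patch above, since the point of pushing the optimization
below tableam is that not every AM has a visibility map: SO_CAN_SKIP_FETCH
is only advisory, so an AM can simply ignore it. A hypothetical (made-up,
not part of this patch set) AM callback might look like this sketch,
using the pre-0009 signature that still takes a TBMIterateResult:

    /*
     * Hypothetical table AM callback, for illustration only.  An AM with no
     * visibility-map equivalent never takes the skip-fetch path; it always
     * reads the block and lets its scan_bitmap_next_tuple() do the filtering.
     */
    static bool
    example_scan_bitmap_next_block(TableScanDesc scan, TBMIterateResult *tbmres)
    {
        ExampleScanDesc escan = (ExampleScanDesc) scan; /* assumed AM-specific desc */

        /* SO_CAN_SKIP_FETCH is a hint; this AM ignores it entirely */
        if (!example_read_block(escan, tbmres->blockno))
            return false;       /* block past the end of the relation */

        /* remember what the bitmap said about this page for next_tuple */
        escan->cur_recheck = tbmres->recheck;
        escan->cur_ntuples = tbmres->ntuples;   /* -1 means lossy: visit all */

        return true;
    }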
v3-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (text/x-diff; charset=us-ascii)
From 17fc9d4c35e42b6e870b7e7f7c3495114e393e8a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v3 04/13] BitmapPrefetch use prefetch block recheck for skip
fetch
As of 7c70996ebf0949b142a9, BitmapPrefetch() used the recheck flag for
the current block to determine whether or not it could skip prefetching
the proposed prefetch block. It makes more sense for it to use the
recheck flag from the TBMIterateResult for the prefetch block instead.
See this [1] thread on hackers reporting the issue.
[1] https://www.postgresql.org/message-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ%2BopUrXDRXdcfwFZGA%40mail.gmail.com
---
src/backend/executor/nodeBitmapHeapscan.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2e4f87ea3a3..35ef26221ba 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,15 +484,9 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* skip this prefetch call, but continue to run the prefetch
* logic normally. (Would it be better not to increment
* prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
*/
-
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
@@ -543,7 +537,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
--
2.37.2
v3-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (text/x-diff; charset=us-ascii)
From 67a9fb1848718cabfcfd5c98368ab2aa79a6b213 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v3 05/13] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 35ef26221ba..3439c02e989 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -55,7 +55,7 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
+ BlockNumber blockno);
static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
static inline void BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan);
@@ -242,7 +242,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
break;
}
- BitmapAdjustPrefetchIterator(node, tbmres);
+ BitmapAdjustPrefetchIterator(node, tbmres->blockno);
if (!table_scan_bitmap_next_block(scan, tbmres))
{
@@ -351,7 +351,7 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
*/
static inline void
BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
+ BlockNumber blockno)
{
#ifdef USE_PREFETCH
ParallelBitmapHeapState *pstate = node->pstate;
@@ -370,7 +370,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
+ if (tbmpre == NULL || tbmpre->blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
--
2.37.2
v3-0006-BitmapHeapScan-scan-desc-counts-lossy-and-exact-p.patch (text/x-diff; charset=us-ascii)
From efbb311eddc765dd761154e1460e337fc2d29323 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:05:04 -0500
Subject: [PATCH v3 06/13] BitmapHeapScan scan desc counts lossy and exact
pages
Future commits will remove the TBMIterateResult from BitmapHeapNext(),
pushing it into the table AM-specific code. So we will have to keep
track of the number of lossy and exact pages in the scan descriptor.
Doing this change to lossy/exact page counting in a separate commit just
simplifies the diff.
---
src/backend/access/heap/heapam_handler.c | 9 +++++++++
src/backend/executor/nodeBitmapHeapscan.c | 19 ++++++++++++++-----
src/include/access/relscan.h | 4 ++++
src/include/access/tableam.h | 6 +++++-
4 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7661acac3a8..9fc99a87fdf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ /* Only count exact and lossy pages with visible tuples */
+ if (ntup > 0)
+ {
+ if (tbmres->ntuples >= 0)
+ scan->exact_pages++;
+ else
+ scan->lossy_pages++;
+ }
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3439c02e989..eee90b8785b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -53,6 +53,8 @@
#include "utils/spccache.h"
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
+static inline void BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
BlockNumber blockno);
@@ -250,11 +252,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
continue;
}
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
- node->lossy_pages++;
-
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
@@ -322,15 +319,27 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* OK to return this tuple */
+ BitmapAccumCounters(node, scan);
return slot;
}
/*
* if we get here it means we are at the end of the scan..
*/
+ BitmapAccumCounters(node, scan);
return ExecClearTuple(slot);
}
+static inline void
+BitmapAccumCounters(BitmapHeapScanState *node,
+ TableScanDesc scan)
+{
+ node->exact_pages += scan->exact_pages;
+ scan->exact_pages = 0;
+ node->lossy_pages += scan->lossy_pages;
+ scan->lossy_pages = 0;
+}
+
/*
* BitmapDoneInitializingSharedState - Shared state is initialized
*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..b74e08dd745 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -40,6 +40,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ long exact_pages;
+ long lossy_pages;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c193ea5db43..7dfb291800c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -954,9 +954,13 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
uint32 extra_flags)
{
+ TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
- return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result->lossy_pages = 0;
+ result->exact_pages = 0;
+ return result;
}
/*
--
2.37.2
v3-0007-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (text/x-diff; charset=us-ascii)
From e42eea35eb863303eb0a914b96fe33103e3afcd9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:17:47 -0500
Subject: [PATCH v3 07/13] Reduce scope of BitmapHeapScan tbmiterator local
variables
To simplify the diff of a future commit which will move the TBMIterators
into the scan descriptor, define them in a narrower scope now.
---
src/backend/executor/nodeBitmapHeapscan.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index eee90b8785b..a0fe65fde58 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,8 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
@@ -90,10 +88,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
tbmres = node->tbmres;
/*
@@ -110,6 +104,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ TBMIterator *tbmiterator = NULL;
+ TBMSharedIterator *shared_tbmiterator = NULL;
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -118,7 +115,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
+ tbmiterator = tbm_begin_iterate(tbm);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -171,8 +168,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
- tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
+ shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -222,6 +218,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
+ node->tbmiterator = tbmiterator;
+ node->shared_tbmiterator = shared_tbmiterator;
node->initialized = true;
}
@@ -235,9 +233,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmres == NULL)
{
if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
+ node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
if (tbmres == NULL)
{
/* no more entries in the bitmap */
--
2.37.2
v3-0008-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (text/x-diff; charset=us-ascii)
From eee14d6a4cd7191201b158ed77e79abbefe6349f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:13:41 -0500
Subject: [PATCH v3 08/13] Remove table_scan_bitmap_next_tuple parameter tbmres
With the addition of the proposed streaming read API [1],
table_scan_bitmap_next_block() will no longer take a TBMIterateResult as
an input. Instead, table AMs will be responsible for implementing a
streaming read API callback that specifies which blocks should be
prefetched and read.
Thus, it no longer makes sense to use the TBMIterateResult as a means of
communication between table_scan_bitmap_next_tuple() and
table_scan_bitmap_next_block().
Note that this parameter was unused by heap AM's implementation of
table_scan_bitmap_next_tuple().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam_handler.c | 1 -
src/backend/executor/nodeBitmapHeapscan.c | 2 +-
src/include/access/tableam.h | 12 +-----------
3 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9fc99a87fdf..3af9466b9ca 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2256,7 +2256,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
static bool
heapam_scan_bitmap_next_tuple(TableScanDesc scan,
- TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a0fe65fde58..b4333184576 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -293,7 +293,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* Attempt to fetch tuple from AM.
*/
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
/* nothing more to look at on this page */
node->tbmres = tbmres = NULL;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7dfb291800c..2dc79583bcf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -787,10 +787,7 @@ typedef struct TableAmRoutine
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
- * make sense to perform tuple visibility checks at this time). For some
- * AMs it will make more sense to do all the work referencing `tbmres`
- * contents here, for others it might be better to defer more work to
- * scan_bitmap_next_tuple.
+ * make sense to perform tuple visibility checks at this time).
*
* If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
* on the page have to be returned, otherwise the tuples at offsets in
@@ -817,15 +814,10 @@ typedef struct TableAmRoutine
* Fetch the next tuple of a bitmap table scan into `slot` and return true
* if a visible tuple was found, false otherwise.
*
- * For some AMs it will make more sense to do all the work referencing
- * `tbmres` contents in scan_bitmap_next_block, for others it might be
- * better to defer more work to this callback.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_tuple) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot);
/*
@@ -1987,7 +1979,6 @@ table_scan_bitmap_next_block(TableScanDesc scan,
*/
static inline bool
table_scan_bitmap_next_tuple(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
/*
@@ -1999,7 +1990,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
- tbmres,
slot);
}
--
2.37.2
v3-0009-Make-table_scan_bitmap_next_block-async-friendly.patch (text/x-diff; charset=us-ascii)
From d579faa35292c4d3730a7fd112606fc419b7886a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v3 09/13] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or when,
due to concurrent activity on the table, the block returned by the
iterator was past the end of the table recorded when the scan started.
This forced the caller to be responsible for determining if additional
blocks should be fetched and then for invoking
table_scan_bitmap_next_block() for these blocks.
It makes more sense for table_scan_bitmap_next_block() to be responsible
for finding a block that is not past the end of the table (as of the
time that the scan began) and for table_scan_bitmap_next_tuple() to
return false if there are no visible tuples on the page.
This also allows us to move responsibility for the iterator to table AM
specific code. This means handling invalid blocks is entirely up to
the table AM.
These changes will enable bitmapheapscan to use the future streaming
read API [1]. Table AMs will implement a streaming read API callback
returning the next block to fetch. In heap AM's case, the callback will
use the iterator to identify the next block to fetch. Since choosing the
next block will no longer be the responsibility of BitmapHeapNext(), the
streaming read control flow requires these changes to
table_scan_bitmap_next_block().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
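
To illustrate the new contract, here is a rough sketch of the caller's loop
(simplified from the executor changes below; prefetching, parallel state, and
instrumentation are omitted, and the local variable names are illustrative):

	bool		recheck;
	BlockNumber blockno;

	/* Get the first block; if the bitmap is already exhausted, end the scan. */
	if (!table_scan_bitmap_next_block(scan, &recheck, &blockno))
		return ExecClearTuple(slot);

	for (;;)
	{
		/* Returns false when no more visible tuples remain on this page. */
		while (table_scan_bitmap_next_tuple(scan, slot))
		{
			/* ... recheck quals if 'recheck' is set, then return the tuple ... */
		}

		/* false now means the bitmap is exhausted, not "skip this block". */
		if (!table_scan_bitmap_next_block(scan, &recheck, &blockno))
			break;
	}
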
src/backend/access/heap/heapam_handler.c | 58 +++++++--
src/backend/executor/nodeBitmapHeapscan.c | 148 ++++++++--------------
src/include/access/relscan.h | 5 +
src/include/access/tableam.h | 58 ++++++---
src/include/nodes/execnodes.h | 9 +-
5 files changed, 150 insertions(+), 128 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3af9466b9ca..c8da3def645 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,17 +2114,51 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->rs_empty_tuples_pending == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+
+ /* Got a valid block */
+ *blockno = tbmres->blockno;
+ *recheck = tbmres->recheck;
+
/*
* We can skip fetching the heap page if we don't need any fields from the
* heap, and the bitmap entries don't need rechecking, and all tuples on
@@ -2143,16 +2177,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
return true;
}
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE isolation
- * though, as we need to examine all invisible tuples reachable by the
- * index.
- */
- if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
- return false;
+ block = tbmres->blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2251,7 +2276,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
scan->lossy_pages++;
}
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false, returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
static bool
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index b4333184576..9109e8ddddf 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -76,7 +76,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -88,7 +87,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
@@ -116,7 +114,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -169,7 +166,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -218,46 +214,24 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
+
node->initialized = true;
+
+ /* Get the first block. If none, end of scan. */
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ return ExecClearTuple(slot);
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ BitmapAdjustPrefetchTarget(node);
}
for (;;)
{
- CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
+ while (table_scan_bitmap_next_tuple(scan, slot))
{
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres->blockno);
-
- if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
-
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
- {
- /*
- * Continuing in previously obtained page.
- */
+ CHECK_FOR_INTERRUPTS();
#ifdef USE_PREFETCH
@@ -279,46 +253,44 @@ BitmapHeapNext(BitmapHeapScanState *node)
SpinLockRelease(&pstate->mutex);
}
#endif /* USE_PREFETCH */
- }
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- */
- BitmapPrefetch(node, scan);
-
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
+ /*
+ * We prefetch before fetching the current page. We expect that a
+ * future streaming read API will do this, so do it this way now
+ * for consistency. Also, this should happen only when we have
+ * determined there is still something to do on the current page,
+ * else we may uselessly prefetch the same page we are just about
+ * to request for real.
+ */
+ BitmapPrefetch(node, scan);
- /*
- * If we are using lossy info, we have to recheck the qual conditions
- * at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ /*
+ * If we are using lossy info, we have to recheck the qual
+ * conditions at every tuple.
+ */
+ if (node->recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
+ continue;
+ }
}
+
+ /* OK to return this tuple */
+ BitmapAccumCounters(node, scan);
+ return slot;
}
- /* OK to return this tuple */
- BitmapAccumCounters(node, scan);
- return slot;
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ break;
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ /* Adjust the prefetch target */
+ BitmapAdjustPrefetchTarget(node);
}
/*
@@ -603,12 +575,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
@@ -616,13 +584,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->tbmiterator = NULL;
- node->tbmres = NULL;
node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
node->pvmbuffer = InvalidBuffer;
+ node->recheck = true;
+ node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -653,28 +620,24 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
ExecEndNode(outerPlanState(node));
+
+ /*
+ * close heap scan
+ */
+ if (scanDesc)
+ table_endscan(scanDesc);
+
/*
* release bitmaps and buffers if any
*/
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
-
- /*
- * close heap scan
- */
- if (scanDesc)
- table_endscan(scanDesc);
-
}
/* ----------------------------------------------------------------
@@ -707,8 +670,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->tbmiterator = NULL;
- scanstate->tbmres = NULL;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -717,10 +678,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
+ scanstate->recheck = true;
+ scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b74e08dd745..5dea9c7a03d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -24,6 +24,9 @@
struct ParallelTableScanDescData;
+struct TBMIterator;
+struct TBMSharedIterator;
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -41,6 +44,8 @@ typedef struct TableScanDescData
ItemPointerData rs_maxtid;
/* Only used for Bitmap table scans */
+ struct TBMIterator *tbmiterator;
+ struct TBMSharedIterator *shared_tbmiterator;
long exact_pages;
long lossy_pages;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2dc79583bcf..f1f5b7ab1d0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "nodes/tidbitmap.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -780,19 +781,14 @@ typedef struct TableAmRoutine
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
- * of a bitmap table scan. `scan` was started via table_beginscan_bm().
- * Return false if there are no tuples to be found on the page, true
- * otherwise.
+ * Prepare to fetch / check / return tuples from `blockno` as part of a
+ * bitmap table scan. `scan` was started via table_beginscan_bm(). Return
+ * false if the bitmap is exhausted and true otherwise.
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
* make sense to perform tuple visibility checks at this time).
*
- * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
- * on the page have to be returned, otherwise the tuples at offsets in
- * `tbmres->offsets` need to be returned.
- *
* XXX: Currently this may only be implemented if the AM uses md.c as its
* storage manager, and uses ItemPointer->ip_blkid in a manner that maps
* blockids directly to the underlying storage. nodeBitmapHeapscan.c
@@ -808,7 +804,7 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ bool *recheck, BlockNumber *blockno);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -952,6 +948,8 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
result->lossy_pages = 0;
result->exact_pages = 0;
+ result->shared_tbmiterator = NULL;
+ result->tbmiterator = NULL;
return result;
}
@@ -1012,6 +1010,21 @@ table_beginscan_analyze(Relation rel)
static inline void
table_endscan(TableScanDesc scan)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_end(scan);
}
@@ -1022,6 +1035,21 @@ static inline void
table_rescan(TableScanDesc scan,
struct ScanKeyData *key)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_rescan(scan, key, false, false, false, false);
}
@@ -1945,17 +1973,16 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
- * a bitmap table scan. `scan` needs to have been started via
- * table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise.
+ * Prepare to fetch / check / return tuples as part of a bitmap table scan.
+ * `scan` needs to have been started via table_beginscan_bm(). Returns false if
+ * there are no more blocks in the bitmap, true otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ bool *recheck, BlockNumber *blockno)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1965,8 +1992,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6fb4ec07c5f..a59df51dd69 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1709,8 +1709,6 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
- * tbmres current-page data
* pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
@@ -1720,10 +1718,10 @@ typedef struct ParallelBitmapHeapState
* prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
+ * recheck do current page's tuples need recheck
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1731,8 +1729,6 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- TBMIterator *tbmiterator;
- TBMIterateResult *tbmres;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
@@ -1742,10 +1738,11 @@ typedef struct BitmapHeapScanState
int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
+ bool recheck;
+ BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v3-0010-Hard-code-TBMIterateResult-offsets-array-size.patchtext/x-diff; charset=us-asciiDownload
From c2ba82cb19d21f79090598b81aee3184cd45a654 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 20:13:43 -0500
Subject: [PATCH v3 10/13] Hard-code TBMIterateResult offsets array size
TIDBitmap's TBMIterateResult had a flexible-sized array of tuple offsets
but the API always allocated MaxHeapTuplesPerPage OffsetNumbers.
Creating a fixed-size array of size MaxHeapTuplesPerPage is clearer
for the API user.
---
src/backend/nodes/tidbitmap.c | 29 +++++++----------------------
src/include/nodes/tidbitmap.h | 12 ++++++++++--
2 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 0f4850065fb..689a959b467 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -40,21 +40,12 @@
#include <limits.h>
-#include "access/htup_details.h"
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
-/*
- * The maximum number of tuples per page is not large (typically 256 with
- * 8K pages, or 1024 with 32K pages). So there's not much point in making
- * the per-page bitmaps variable size. We just legislate that the size
- * is this:
- */
-#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
-
/*
* When we have to switch over to lossy storage, we use a data structure
* with one bit per page, where all pages having the same number DIV
@@ -66,7 +57,7 @@
* table, using identical data structures. (This is because the memory
* management for hashtables doesn't easily/efficiently allow space to be
* transferred easily from one hashtable to another.) Therefore it's best
- * if PAGES_PER_CHUNK is the same as MAX_TUPLES_PER_PAGE, or at least not
+ * if PAGES_PER_CHUNK is the same as MaxHeapTuplesPerPage, or at least not
* too different. But we also want PAGES_PER_CHUNK to be a power of 2 to
* avoid expensive integer remainder operations. So, define it like this:
*/
@@ -78,7 +69,7 @@
#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
/* number of active words for an exact page: */
-#define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
+#define WORDS_PER_PAGE ((MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1)
/* number of active words for a lossy chunk: */
#define WORDS_PER_CHUNK ((PAGES_PER_CHUNK - 1) / BITS_PER_BITMAPWORD + 1)
@@ -180,7 +171,7 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/*
@@ -221,7 +212,7 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/* Local function prototypes */
@@ -389,7 +380,7 @@ tbm_add_tuples(TIDBitmap *tbm, const ItemPointer tids, int ntids,
bitnum;
/* safety check to ensure we don't overrun bit array bounds */
- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxHeapTuplesPerPage)
elog(ERROR, "tuple offset out of range: %u", off);
/*
@@ -691,12 +682,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
Assert(tbm->iterating != TBM_ITERATING_SHARED);
- /*
- * Create the TBMIterator struct, with enough trailing space to serve the
- * needs of the TBMIterateResult sub-struct.
- */
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -1470,8 +1456,7 @@ tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
* Create the TBMSharedIterator struct, with enough trailing space to
* serve the needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator));
istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..432fae52962 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -22,6 +22,7 @@
#ifndef TIDBITMAP_H
#define TIDBITMAP_H
+#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "utils/dsa.h"
@@ -41,9 +42,16 @@ typedef struct TBMIterateResult
{
BlockNumber blockno; /* page number containing tuples */
int ntuples; /* -1 indicates lossy result */
- bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
- OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+ bool recheck; /* should the tuples be rechecked? */
+
+ /*
+ * The maximum number of tuples per page is not large (typically 256 with
+ * 8K pages, or 1024 with 32K pages). So there's not much point in making
+ * the per-page bitmaps variable size. We just legislate that the size is
+ * this:
+ */
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResult;
/* function prototypes in nodes/tidbitmap.c */
--
2.37.2
v3-0011-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patchtext/x-diff; charset=us-asciiDownload
From 3f763a0fb8b16a84ef666cd9086402cb01171fab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:23:41 -0500
Subject: [PATCH v3 11/13] Separate TBM[Shared]Iterator and TBMIterateResult
Remove the TBMIterateResult from the TBMIterator and TBMSharedIterator
and have tbm_[shared_]iterate() take a TBMIterateResult as a parameter.
This will allow multiple TBMIterateResults to exist concurrently,
enabling asynchronous use of the TIDBitmap for prefetching, for example.
tbm_[shared]_iterate() now sets blockno to InvalidBlockNumber when the
bitmap is exhausted instead of returning NULL.
BitmapHeapScan callers of tbm_iterate make a TBMIterateResult locally
and pass it in.
Because GIN only needs a single TBMIterateResult, inline the matchResult
in the GinScanEntry to avoid having to separately manage memory for the
TBMIterateResult.
---
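
For reference, a caller under the new convention looks roughly like this (a
sketch, not taken verbatim from the patch):

	TBMIterateResult tbmres;

	for (;;)
	{
		tbm_iterate(iterator, &tbmres);

		/* blockno is set to InvalidBlockNumber once the bitmap is exhausted */
		if (!BlockNumberIsValid(tbmres.blockno))
			break;

		/* ... use tbmres.blockno, tbmres.ntuples, tbmres.offsets, tbmres.recheck ... */
	}
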
src/backend/access/gin/ginget.c | 48 +++++++++------
src/backend/access/gin/ginscan.c | 2 +-
src/backend/access/heap/heapam_handler.c | 32 +++++-----
src/backend/executor/nodeBitmapHeapscan.c | 33 +++++-----
src/backend/nodes/tidbitmap.c | 73 ++++++++++++-----------
src/include/access/gin_private.h | 2 +-
src/include/nodes/tidbitmap.h | 4 +-
7 files changed, 107 insertions(+), 87 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3aa457a29e1 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -332,10 +332,22 @@ restartScanEntry:
entry->list = NULL;
entry->nlist = 0;
entry->matchBitmap = NULL;
- entry->matchResult = NULL;
entry->reduceResult = false;
entry->predictNumberResult = 0;
+ /*
+ * MTODO: is it enough to set blockno to InvalidBlockNumber? In all the
+ * places where we previously set matchResult to NULL, I just set blockno
+ * to InvalidBlockNumber. It seems like this should be okay because that
+ * is usually what we check before using the matchResult members. But it
+ * might be safer to zero out the offsets array. But that is expensive.
+ */
+ entry->matchResult.blockno = InvalidBlockNumber;
+ entry->matchResult.ntuples = 0;
+ entry->matchResult.recheck = true;
+ memset(entry->matchResult.offsets, 0,
+ sizeof(OffsetNumber) * MaxHeapTuplesPerPage);
+
/*
* we should find entry, and begin scan of posting tree or just store
* posting list in memory
@@ -374,6 +386,7 @@ restartScanEntry:
{
if (entry->matchIterator)
tbm_end_iterate(entry->matchIterator);
+ entry->matchResult.blockno = InvalidBlockNumber;
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -823,18 +836,19 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
{
/*
* If we've exhausted all items on this block, move to next block
- * in the bitmap.
+ * in the bitmap. tbm_iterate() sets matchResult->blockno to
+ * InvalidBlockNumber when the bitmap is exhausted.
*/
- while (entry->matchResult == NULL ||
- (entry->matchResult->ntuples >= 0 &&
- entry->offset >= entry->matchResult->ntuples) ||
- entry->matchResult->blockno < advancePastBlk ||
+ while ((!BlockNumberIsValid(entry->matchResult.blockno)) ||
+ (entry->matchResult.ntuples >= 0 &&
+ entry->offset >= entry->matchResult.ntuples) ||
+ entry->matchResult.blockno < advancePastBlk ||
(ItemPointerIsLossyPage(&advancePast) &&
- entry->matchResult->blockno == advancePastBlk))
+ entry->matchResult.blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
+ tbm_iterate(entry->matchIterator, &entry->matchResult);
- if (entry->matchResult == NULL)
+ if (!BlockNumberIsValid(entry->matchResult.blockno))
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
@@ -858,10 +872,10 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* We're now on the first page after advancePast which has any
* items on it. If it's a lossy result, return that.
*/
- if (entry->matchResult->ntuples < 0)
+ if (entry->matchResult.ntuples < 0)
{
ItemPointerSetLossyPage(&entry->curItem,
- entry->matchResult->blockno);
+ entry->matchResult.blockno);
/*
* We might as well fall out of the loop; we could not
@@ -875,27 +889,27 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* Not a lossy page. Skip over any offsets <= advancePast, and
* return that.
*/
- if (entry->matchResult->blockno == advancePastBlk)
+ if (entry->matchResult.blockno == advancePastBlk)
{
/*
* First, do a quick check against the last offset on the
* page. If that's > advancePast, so are all the other
* offsets, so just go back to the top to get the next page.
*/
- if (entry->matchResult->offsets[entry->matchResult->ntuples - 1] <= advancePastOff)
+ if (entry->matchResult.offsets[entry->matchResult.ntuples - 1] <= advancePastOff)
{
- entry->offset = entry->matchResult->ntuples;
+ entry->offset = entry->matchResult.ntuples;
continue;
}
/* Otherwise scan to find the first item > advancePast */
- while (entry->matchResult->offsets[entry->offset] <= advancePastOff)
+ while (entry->matchResult.offsets[entry->offset] <= advancePastOff)
entry->offset++;
}
ItemPointerSet(&entry->curItem,
- entry->matchResult->blockno,
- entry->matchResult->offsets[entry->offset]);
+ entry->matchResult.blockno,
+ entry->matchResult.offsets[entry->offset]);
entry->offset++;
/* Done unless we need to reduce the result */
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..033d5253394 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -106,7 +106,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
ItemPointerSetMin(&scanEntry->curItem);
scanEntry->matchBitmap = NULL;
scanEntry->matchIterator = NULL;
- scanEntry->matchResult = NULL;
+ scanEntry->matchResult.blockno = InvalidBlockNumber;
scanEntry->list = NULL;
scanEntry->nlist = 0;
scanEntry->offset = InvalidOffsetNumber;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c8da3def645..ba6793a749c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2121,7 +2121,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult *tbmres;
+ TBMIterateResult tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
@@ -2134,11 +2134,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
CHECK_FOR_INTERRUPTS();
if (scan->shared_tbmiterator)
- tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
else
- tbmres = tbm_iterate(scan->tbmiterator);
+ tbm_iterate(scan->tbmiterator, &tbmres);
- if (tbmres == NULL)
+ if (!BlockNumberIsValid(tbmres.blockno))
{
/* no more entries in the bitmap */
Assert(hscan->rs_empty_tuples_pending == 0);
@@ -2153,11 +2153,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* isolation though, as we need to examine all invisible tuples
* reachable by the index.
*/
- } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+ } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
/* Got a valid block */
- *blockno = tbmres->blockno;
- *recheck = tbmres->recheck;
+ *blockno = tbmres.blockno;
+ *recheck = tbmres.recheck;
/*
* We can skip fetching the heap page if we don't need any fields from the
@@ -2165,19 +2165,19 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* the page are visible to our transaction.
*/
if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ !tbmres.recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
{
/* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
+ Assert(tbmres.ntuples >= 0);
Assert(hscan->rs_empty_tuples_pending >= 0);
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
+ hscan->rs_empty_tuples_pending += tbmres.ntuples;
return true;
}
- block = tbmres->blockno;
+ block = tbmres.blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2206,7 +2206,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2215,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres.ntuples; curslot++)
{
- OffsetNumber offnum = tbmres->offsets[curslot];
+ OffsetNumber offnum = tbmres.offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2270,7 +2270,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/* Only count exact and lossy pages with visible tuples */
if (ntup > 0)
{
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
scan->exact_pages++;
else
scan->lossy_pages++;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9109e8ddddf..bcc60d3cf98 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -347,9 +347,10 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
else if (prefetch_iterator)
{
/* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
+ tbm_iterate(prefetch_iterator, &tbmpre);
- if (tbmpre == NULL || tbmpre->blockno != blockno)
+ if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
@@ -367,6 +368,8 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
}
else
{
+ TBMIterateResult tbmpre;
+
/* Release the mutex before iterating */
SpinLockRelease(&pstate->mutex);
@@ -379,7 +382,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
* case.
*/
if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
}
}
#endif /* USE_PREFETCH */
@@ -446,10 +449,12 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (node->prefetch_pages < node->prefetch_target)
{
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
bool skip_fetch;
- if (tbmpre == NULL)
+ tbm_iterate(prefetch_iterator, &tbmpre);
+
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_iterate(prefetch_iterator);
@@ -465,13 +470,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* prefetch_pages?)
*/
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
@@ -486,7 +491,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (1)
{
- TBMIterateResult *tbmpre;
+ TBMIterateResult tbmpre;
bool do_prefetch = false;
bool skip_fetch;
@@ -505,8 +510,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (!do_prefetch)
return;
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_shared_iterate(prefetch_iterator);
@@ -516,13 +521,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
}
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 689a959b467..b4dcb1cbb88 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -171,7 +171,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output;
};
/*
@@ -212,7 +211,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output;
};
/* Local function prototypes */
@@ -943,20 +941,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
/*
* tbm_iterate - scan through next page of a TIDBitmap
*
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan. Pages are guaranteed to be delivered in numerical
- * order. If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition. If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway. (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order. tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition. If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway. (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -984,6 +983,7 @@ tbm_iterate(TBMIterator *iterator)
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
+ Assert(tbmres);
if (iterator->schunkptr < tbm->nchunks)
{
PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -994,11 +994,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1014,16 +1014,17 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
@@ -1033,10 +1034,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1087,13 +1087,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1103,21 +1103,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 51d0c74a6b0..e423d92b41c 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,7 +352,7 @@ typedef struct GinScanEntryData
/* for a partial-match or full-scan query, we accumulate all TIDs here */
TIDBitmap *matchBitmap;
TBMIterator *matchIterator;
- TBMIterateResult *matchResult;
+ TBMIterateResult matchResult;
/* used for Posting list and one page in Posting tree */
ItemPointerData *list;
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 432fae52962..f000c1af28f 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -72,8 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
--
2.37.2
v3-0012-Streaming-Read-API.patchtext/x-diff; charset=us-asciiDownload
From 6b9989da160c8a96a8e70ae276796b460c205ff0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v3 12/13] Streaming Read API
---
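For reference, the usage pattern this API expects, distilled from the
pg_prewarm changes below (the callback, struct, and variable names here are
illustrative only, not from the patch):

	struct my_read_state
	{
		BlockNumber next;
		BlockNumber last;
	};

	static BlockNumber
	my_next_block_cb(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
	{
		struct my_read_state *p = pgsr_private;

		/* Return InvalidBlockNumber when there is nothing left to read. */
		return p->next <= p->last ? p->next++ : InvalidBlockNumber;
	}

	...
	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT, &state, 0, NULL,
										  BMR_REL(rel), MAIN_FORKNUM,
										  my_next_block_cb);

	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
	{
		/* ... use the pinned buffer ... */
		ReleaseBuffer(buf);
	}

	pg_streaming_read_free(pgsr);
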
contrib/pg_prewarm/pg_prewarm.c | 40 +-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/postmaster/bgwriter.c | 8 +-
src/backend/postmaster/checkpointer.c | 15 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/aio/Makefile | 14 +
src/backend/storage/aio/meson.build | 5 +
src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 560 +++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 14 +-
src/backend/storage/meson.build | 1 +
src/backend/storage/smgr/smgr.c | 49 +-
src/include/storage/bufmgr.h | 22 +
src/include/storage/smgr.h | 4 +-
src/include/storage/streaming_read.h | 45 ++
src/include/utils/rel.h | 6 -
src/tools/pgindent/typedefs.list | 2 +
17 files changed, 986 insertions(+), 238 deletions(-)
create mode 100644 src/backend/storage/aio/Makefile
create mode 100644 src/backend/storage/aio/meson.build
create mode 100644 src/backend/storage/aio/streaming_read.c
create mode 100644 src/include/storage/streaming_read.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
+#include "storage/streaming_read.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
static PGIOAlignedBlock blockbuffer;
+struct pg_prewarm_streaming_read_private
+{
+ BlockNumber blocknum;
+ int64 last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_data)
+{
+ struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+ if (p->blocknum <= p->last_block)
+ return p->blocknum++;
+
+ return InvalidBlockNumber;
+}
+
/*
* pg_prewarm(regclass, mode text, fork text,
* first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_streaming_read_private p;
+ PgStreamingRead *pgsr;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ &p,
+ 0,
+ NULL,
+ BMR_REL(rel),
+ forkNumber,
+ pg_prewarm_streaming_read_next);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+ pg_streaming_read_free(pgsr);
}
/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
* This is unnecessarily heavy-handed, as it will close SMgrRelation
* objects for other databases as well. DROP DATABASE occurs seldom enough
* that it's not worth introducing a variant of smgrclose for just this
- * purpose. XXX: Or should we rather leave the smgr entries dangling?
+ * purpose.
*/
smgrcloseall();
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
if (FirstCallSinceLastCheckpoint())
{
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the bgwriter does
+ * not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
ckpt_performed = CreateRestartPoint(flags);
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the checkpointer
+ * does not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
/*
* Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
*/
CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
- /*
- * After any checkpoint, close all smgr files. This is so we won't
- * hang onto smgr references to deleted files indefinitely.
- */
- smgrcloseall();
+ /* Free all smgr objects, as CheckpointerMain() normally would. */
+ smgrdestroyall();
return;
}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+ bool advice_issued;
+ bool need_complete;
+ BlockNumber blocknum;
+ int nblocks;
+ int per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+ Buffer buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+ int max_ios;
+ int ios_in_progress;
+ int ios_in_progress_trigger;
+ int max_pinned_buffers;
+ int pinned_buffers;
+ int pinned_buffers_trigger;
+ int next_tail_buffer;
+ bool finished;
+ void *pgsr_private;
+ PgStreamingReadBufferCB callback;
+ BufferAccessStrategy strategy;
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+
+ bool advice_enabled;
+
+ /* Next expected block, for detecting sequential access. */
+ BlockNumber seq_blocknum;
+
+ /* Space for optional per-buffer private data. */
+ size_t per_buffer_data_size;
+ void *per_buffer_data;
+ int per_buffer_data_next;
+
+ /* Circular buffer of ranges. */
+ int size;
+ int head;
+ int tail;
+ PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy)
+{
+ PgStreamingRead *pgsr;
+ int size;
+ int max_ios;
+ uint32 max_pinned_buffers;
+
+
+ /*
+ * Decide how many assumed I/Os we will allow to run concurrently. That
+ * is, advice to the kernel to tell it that we will soon read. This
+ * number also affects how far we look ahead for opportunities to start
+ * more I/Os.
+ */
+ if (flags & PGSR_FLAG_MAINTENANCE)
+ max_ios = maintenance_io_concurrency;
+ else
+ max_ios = effective_io_concurrency;
+
+ /*
+ * The desired level of I/O concurrency controls how far ahead we are
+ * willing to look. We also clamp it to at least
+ * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+ * sized read, even when max_ios is zero.
+ */
+ max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+ /*
+ * The *_io_concurrency GUCs might be 0. We want to allow at least
+ * one, to keep our gating logic simple.
+ */
+ max_ios = Max(max_ios, 1);
+
+ /*
+ * Don't allow this backend to pin too many buffers. For now we'll apply
+ * the limit for the shared buffer pool and the local buffer pool, without
+ * worrying which it is.
+ */
+ LimitAdditionalPins(&max_pinned_buffers);
+ LimitAdditionalLocalPins(&max_pinned_buffers);
+ Assert(max_pinned_buffers > 0);
+
+ /*
+ * pgsr->ranges is a circular buffer. When it is empty, head == tail.
+ * When it is full, there is an empty element between head and tail. Head
+ * can also be empty (nblocks == 0), therefore we need two extra elements
+ * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+ * maximum possible number of occupied ranges of the smallest possible
+ * size of one.
+ */
+ size = max_pinned_buffers + 2;
+
+ pgsr = (PgStreamingRead *)
+ palloc0(offsetof(PgStreamingRead, ranges) +
+ sizeof(pgsr->ranges[0]) * size);
+
+ pgsr->max_ios = max_ios;
+ pgsr->per_buffer_data_size = per_buffer_data_size;
+ pgsr->max_pinned_buffers = max_pinned_buffers;
+ pgsr->pgsr_private = pgsr_private;
+ pgsr->strategy = strategy;
+ pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+ /*
+ * This system supports prefetching advice. As long as direct I/O isn't
+ * enabled, and the caller hasn't promised sequential access, we can use
+ * it.
+ */
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ (flags & PGSR_FLAG_SEQUENTIAL) == 0)
+ pgsr->advice_enabled = true;
+#endif
+
+ /*
+ * We want to avoid creating ranges that are smaller than they could be
+ * just because we hit max_pinned_buffers. We only look ahead when the
+ * number of pinned buffers falls below this trigger number, or put
+ * another way, we stop looking ahead when we wouldn't be able to build a
+ * "full sized" range.
+ */
+ pgsr->pinned_buffers_trigger =
+ Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+ /* Space for the callback to store extra data along with each block. */
+ if (per_buffer_data_size)
+ pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+ return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb)
+{
+ PgStreamingRead *result;
+
+ result = pg_streaming_read_buffer_alloc_internal(flags,
+ pgsr_private,
+ per_buffer_data_size,
+ strategy);
+ result->callback = next_block_cb;
+ result->bmr = bmr;
+ result->forknum = forknum;
+
+ return result;
+}
+
+/*
+ * Start building a new range. This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that'll
+ * soon be reading. In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *head_range;
+
+ head_range = &pgsr->ranges[pgsr->head];
+ Assert(head_range->nblocks > 0);
+
+ /*
+ * If a call to CompleteReadBuffers() will be needed, we can issue advice
+ * to the kernel to get the read started. We suppress it if the access
+ * pattern appears to be completely sequential, though, because on some
+ * systems that interferes with the kernel's own sequential read-ahead
+ * heuristics and hurts performance.
+ */
+ if (pgsr->advice_enabled)
+ {
+ BlockNumber blocknum = head_range->blocknum;
+ int nblocks = head_range->nblocks;
+
+ if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+ {
+ SMgrRelation smgr =
+ pgsr->bmr.smgr ? pgsr->bmr.smgr :
+ RelationGetSmgr(pgsr->bmr.rel);
+
+ Assert(!head_range->advice_issued);
+
+ smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+ /*
+ * Count this as an I/O that is concurrently in progress, though
+ * we don't really know if the kernel generates a physical I/O.
+ */
+ head_range->advice_issued = true;
+ pgsr->ios_in_progress++;
+ }
+
+ /* Remember the block after this range, for sequence detection. */
+ pgsr->seq_blocknum = blocknum + nblocks;
+ }
+
+ /* Create a new head range. There must be space. */
+ Assert(pgsr->size > pgsr->max_pinned_buffers);
+ Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+ if (++pgsr->head == pgsr->size)
+ pgsr->head = 0;
+ head_range = &pgsr->ranges[pgsr->head];
+ head_range->nblocks = 0;
+
+ return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+ /*
+ * If we're finished or can't start more I/O, then don't look ahead.
+ */
+ if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * We'll also wait until the number of pinned buffers falls below our
+ * trigger level, so that we have the chance to create a full range.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ do
+ {
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+ Buffer buffer;
+ bool found;
+ bool need_complete;
+ PgStreamingReadRange *head_range;
+ void *per_buffer_data;
+
+ /* Do we have a full-sized range? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks == lengthof(head_range->buffers))
+ {
+ Assert(head_range->need_complete);
+ head_range = pg_streaming_read_new_range(pgsr);
+
+ /*
+ * Give up now if I/O is saturated, or we wouldn't be able to form
+ * another full range after this due to the pin limit.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+ pgsr->ios_in_progress == pgsr->max_ios)
+ break;
+ }
+
+ per_buffer_data = (char *) pgsr->per_buffer_data +
+ pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+ /* Find out which block the callback wants to read next. */
+ blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+ if (blocknum == InvalidBlockNumber)
+ {
+ pgsr->finished = true;
+ break;
+ }
+ bmr = pgsr->bmr;
+ forknum = pgsr->forknum;
+
+ Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+ buffer = PrepareReadBuffer(bmr,
+ forknum,
+ blocknum,
+ pgsr->strategy,
+ &found);
+ pgsr->pinned_buffers++;
+
+ need_complete = !found;
+
+ /* Is there a head range that we can't extend? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks > 0 &&
+ (!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+ {
+ /* Yes, time to start building a new one. */
+ head_range = pg_streaming_read_new_range(pgsr);
+ Assert(head_range->nblocks == 0);
+ }
+
+ if (head_range->nblocks == 0)
+ {
+ /* Initialize a new range beginning at this block. */
+ head_range->blocknum = blocknum;
+ head_range->need_complete = need_complete;
+ head_range->advice_issued = false;
+ }
+ else
+ {
+ /* We can extend an existing range by one block. */
+ Assert(head_range->blocknum + head_range->nblocks == blocknum);
+ Assert(head_range->need_complete);
+ }
+
+ head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+ head_range->buffers[head_range->nblocks] = buffer;
+ head_range->nblocks++;
+
+ if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+ pgsr->per_buffer_data_next = 0;
+
+ } while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+ pgsr->ios_in_progress < pgsr->max_ios);
+
+ if (pgsr->ranges[pgsr->head].nblocks > 0)
+ pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+ pg_streaming_read_look_ahead(pgsr);
+
+ /* See if we have one buffer to return. */
+ while (pgsr->tail != pgsr->head)
+ {
+ PgStreamingReadRange *tail_range;
+
+ tail_range = &pgsr->ranges[pgsr->tail];
+
+ /*
+ * Do we need to perform an I/O before returning the buffers from this
+ * range?
+ */
+ if (tail_range->need_complete)
+ {
+ CompleteReadBuffers(pgsr->bmr,
+ tail_range->buffers,
+ pgsr->forknum,
+ tail_range->blocknum,
+ tail_range->nblocks,
+ false,
+ pgsr->strategy);
+ tail_range->need_complete = false;
+
+ /*
+ * We don't really know if the kernel generated a physical I/O
+ * when we issued advice, let alone when it finished, but it has
+ * certainly finished after a read call returns.
+ */
+ if (tail_range->advice_issued)
+ pgsr->ios_in_progress--;
+ }
+
+ /* Are there more buffers available in this range? */
+ if (pgsr->next_tail_buffer < tail_range->nblocks)
+ {
+ int buffer_index;
+ Buffer buffer;
+
+ buffer_index = pgsr->next_tail_buffer++;
+ buffer = tail_range->buffers[buffer_index];
+
+ Assert(BufferIsValid(buffer));
+
+ /* We are giving away ownership of this pinned buffer. */
+ Assert(pgsr->pinned_buffers > 0);
+ pgsr->pinned_buffers--;
+
+ if (per_buffer_data)
+ *per_buffer_data = (char *) pgsr->per_buffer_data +
+ tail_range->per_buffer_data_index[buffer_index] *
+ pgsr->per_buffer_data_size;
+
+ return buffer;
+ }
+
+ /* Advance tail to next range, if there is one. */
+ if (++pgsr->tail == pgsr->size)
+ pgsr->tail = 0;
+ pgsr->next_tail_buffer = 0;
+ }
+
+ Assert(pgsr->pinned_buffers == 0);
+
+ return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+ Buffer buffer;
+
+ /* Stop looking ahead, and unpin anything that wasn't consumed. */
+ pgsr->finished = true;
+ while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+ ReleaseBuffer(buffer);
+
+ if (pgsr->per_buffer_data)
+ pfree(pgsr->per_buffer_data);
+ pfree(pgsr);
+}
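For orientation, here is a minimal sketch of the calling pattern the API above is designed for. It is not part of the patch, and MyScanState, my_next_block() and my_scan() are invented names: a callback hands back one block number per call (InvalidBlockNumber to finish), and the consumer loop drains pinned buffers until the stream returns InvalidBuffer.

/*
 * Sketch only, not part of the patch: minimal streaming read user.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/streaming_read.h"
#include "utils/rel.h"

typedef struct MyScanState
{
    BlockNumber next_block;    /* next block this scan wants */
    BlockNumber nblocks;       /* blocks in the relation at scan start */
} MyScanState;

/* Callback: return the next block to read, or InvalidBlockNumber when done. */
static BlockNumber
my_next_block(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
{
    MyScanState *state = (MyScanState *) pgsr_private;

    if (state->next_block >= state->nblocks)
        return InvalidBlockNumber;
    return state->next_block++;
}

static void
my_scan(Relation rel, MyScanState *state)
{
    PgStreamingRead *pgsr;
    Buffer      buf;

    pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
                                          state,
                                          0,     /* no per-buffer data */
                                          NULL,  /* default strategy */
                                          BMR_REL(rel),
                                          MAIN_FORKNUM,
                                          my_next_block);

    /* Buffers come back pinned, in the order the callback produced them. */
    while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
    {
        /* ... examine BufferGetPage(buf) ... */
        ReleaseBuffer(buf);
    }

    pg_streaming_read_free(pgsr);
}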
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
)
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static int SyncOneBuffer(int buf_id, bool skip_recently_used,
WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner);
static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
- /*
- * Read the buffer, and update pgstat counters to reflect a cache hit or
- * miss.
- */
- pgstat_count_buffer_read(reln);
- buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+ buf = ReadBuffer_common(BMR_REL(reln),
forkNum, blockNum, mode, strategy, &hit);
- if (hit)
- pgstat_count_buffer_hit(reln);
+
return buf;
}
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
- return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
- RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+ return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+ RELPERSISTENCE_UNLOGGED),
+ forkNum, blockNum,
mode, strategy, &hit);
}
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
bool hit;
Assert(extended_by == 0);
- buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+ buffer = ReadBuffer_common(bmr,
fork, extend_to - 1, mode, strategy,
&hit);
}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool *hit)
{
- BufferDesc *bufHdr;
- Block bufBlock;
- bool found;
- IOContext io_context;
- IOObject io_object;
- bool isLocalBuf = SmgrIsTemp(smgr);
-
- *hit = false;
+ Buffer buffer;
/*
* Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
- return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
- forkNum, strategy, flags);
+ *hit = false;
+
+ return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
- TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend);
+ buffer = PrepareReadBuffer(bmr,
+ forkNum,
+ blockNum,
+ strategy,
+ hit);
+
+ /* At this point we do NOT hold any locks. */
+ if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+ {
+ /* if we just want zeroes and a lock, we're done */
+ ZeroBuffer(buffer, mode);
+ }
+ else if (!*hit)
+ {
+ /* we might need to perform I/O */
+ CompleteReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ 1,
+ mode == RBM_ZERO_ON_ERROR,
+ strategy);
+ }
+
+ return buffer;
+}
+
+/*
+ * Prepare to read a block. The buffer is pinned. If this is a 'hit', then
+ * the returned buffer can be used immediately. Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer(). PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr)
+{
+ BufferDesc *bufHdr;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ Assert(blockNum != P_NEW);
+
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /*
- * We do not use a BufferAccessStrategy for I/O of temporary tables.
- * However, in some cases, the "strategy" may not be NULL, so we can't
- * rely on IOContextForStrategy() to set the right IOContext for us.
- * This may happen in cases like CREATE TEMPORARY TABLE AS...
- */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
- if (found)
- pgBufferUsage.local_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.local_blks_read++;
}
else
{
- /*
- * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
- * not currently in memory.
- */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
- bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found, io_context);
- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;
}
- /* At this point we do NOT hold any locks. */
+ TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend);
- /* if it was already in the buffer pool, we're done */
- if (found)
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ if (isLocalBuf)
+ {
+ bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+ if (*foundPtr)
+ pgBufferUsage.local_blks_hit++;
+ }
+ else
+ {
+ bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+ strategy, foundPtr, io_context);
+ if (*foundPtr)
+ pgBufferUsage.shared_blks_hit++;
+ }
+ if (bmr.rel)
+ {
+ /*
+ * While pgBufferUsage's "read" counter isn't bumped unless we reach
+ * CompleteReadBuffers() (so, not for hits, and not for buffers that
+ * are zeroed instead), the per-relation stats always count them.
+ */
+ pgstat_count_buffer_read(bmr.rel);
+ if (*foundPtr)
+ pgstat_count_buffer_hit(bmr.rel);
+ }
+ if (*foundPtr)
{
- /* Just need to update stats before we exit */
- *hit = true;
VacuumPageHit++;
pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageHit;
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ }
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
- * on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
+}
- return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+ if (BufferIsLocal(buffer))
+ {
+ BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+ return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
+ else
+ return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
- /*
- * if we have gotten to this point, we have allocated a buffer for the
- * page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
- * if it's a shared buffer.
- */
- Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer(). The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy)
+{
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
+ if (isLocalBuf)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
/*
- * Read in the page, unless the caller intends to overwrite it and just
- * wants us to allocate a buffer.
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out not to be true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ if (isLocalBuf)
+ pgBufferUsage.local_blks_read += nblocks;
else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ for (int i = 0; i < nblocks; ++i)
{
- instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+ int io_buffers_len;
+ Buffer io_buffers[MAX_BUFFERS_PER_TRANSFER];
+ void *io_pages[MAX_BUFFERS_PER_TRANSFER];
+ instr_time io_start;
+ BlockNumber io_first_block;
- smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
- pgstat_count_io_op_time(io_object, io_context,
- IOOP_READ, io_start, 1);
+ /*
+ * We could get all the information from buffer headers, but it can be
+ * expensive to access buffer header cache lines so we make the caller
+ * provide all the information we need, and assert that it is
+ * consistent.
+ */
+ {
+ RelFileLocator xlocator;
+ ForkNumber xforknum;
+ BlockNumber xblocknum;
+
+ BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+ Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+ Assert(xforknum == forknum);
+ Assert(xblocknum == blocknum + i);
+ }
+#endif
+
+ /*
+ * Skip this block if someone else has already completed it. If an
+ * I/O is already in progress in another backend, this will wait for
+ * the outcome: either done, or something went wrong and we will
+ * retry.
+ */
+ if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+ {
+ /*
+ * Report this as a 'hit' for this backend, even though it must
+ * have started out as a miss in PrepareReadBuffer().
+ */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ continue;
+ }
+
+ /* We found a buffer that we need to read in. */
+ io_buffers[0] = buffers[i];
+ io_pages[0] = BufferGetBlock(buffers[i]);
+ io_first_block = blocknum + i;
+ io_buffers_len = 1;
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
+ /*
+ * How many neighboring-on-disk blocks can we scatter-read into
+ * other buffers at the same time? In this case we don't wait if we
+ * see an I/O already in progress. We already hold BM_IO_IN_PROGRESS
+ * for the head block, so we should get on with that I/O as soon as
+ * possible. We'll come back to this block again, above.
+ */
+ while ((i + 1) < nblocks &&
+ CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+ {
+ /* Must be consecutive block numbers. */
+ Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+ BufferGetBlockNumber(buffers[i]) + 1);
+
+ io_buffers[io_buffers_len] = buffers[++i];
+ io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+ }
+
+ io_start = pgstat_prepare_io_time(track_io_timing);
+ smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ io_buffers_len);
+
+ /* Verify each block we read, and terminate the I/O. */
+ for (int j = 0; j < io_buffers_len; ++j)
{
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ BufferDesc *bufHdr;
+ Block bufBlock;
+
+ if (isLocalBuf)
{
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
}
else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
- }
-
- /*
- * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
- * content lock before marking the page as valid, to make sure that no
- * other backend sees the zeroed page before the caller has had a chance
- * to initialize it.
- *
- * Since no-one else can be looking at the page contents yet, there is no
- * difference between an exclusive lock and a cleanup-strength lock. (Note
- * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
- * they assert that the buffer is already valid.)
- */
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
- {
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
- }
+ {
+ bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+ bufBlock = BufHdrGetBlock(bufHdr);
+ }
- if (isLocalBuf)
- {
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (zero_on_error || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ memset(bufBlock, 0, BLCKSZ);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ }
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
- }
+ /* Terminate I/O and set BM_VALID. */
+ if (isLocalBuf)
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
- VacuumPageMiss++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss;
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ }
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ /* Report I/Os as completing individually. */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ false);
+ }
- return BufferDescriptorGetBuffer(bufHdr);
+ VacuumPageMiss += io_buffers_len;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ }
}
/*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
- * set true. Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true. Otherwise, *foundPtr is set false. A read should be
+ * performed with CompleteReadBuffers().
*
* io_context is passed as an output parameter to avoid calling
* IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(existing_buf_hdr, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to obtain the right to
- * start I/O. If StartBufferIO returns false, then someone else managed
- * to read it before we did, so there's nothing left for BufferAlloc() to
- * do.
+ * Buffer contents are currently invalid.
*/
- if (StartBufferIO(victim_buf_hdr, true))
- *foundPtr = false;
- else
- *foundPtr = true;
+ *foundPtr = false;
return victim_buf_hdr;
}
@@ -1774,7 +1899,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
buf_state &= ~BM_VALID;
UnlockBufHdr(existing_hdr, buf_state);
- } while (!StartBufferIO(existing_hdr, true));
+ } while (!StartBufferIO(existing_hdr, true, false));
}
else
{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
LWLockRelease(partition_lock);
/* XXX: could combine the locked operations in it with the above */
- StartBufferIO(victim_buf_hdr, true);
+ StartBufferIO(victim_buf_hdr, true, false);
}
}
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/*
- * If we previously pinned the buffer, it must surely be valid.
+ * If we previously pinned the buffer, it is likely to be valid, but
+ * it may not be if PrepareReadBuffer() was called and
+ * CompleteReadBuffers() hasn't been called yet. We'll check by
+ * loading the flags without locking. This is racy, but it's OK to
+ * return false spuriously: when CompleteReadBuffers() calls
+ * StartBufferIO(), it'll see that it's now valid.
*
* Note: We deliberately avoid a Valgrind client request here.
* Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
* that the buffer page is legitimately non-accessible here. We
* cannot meddle with that.
*/
- result = true;
+ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
}
ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, false))
return;
/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would. The buffer must be already pinned. It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+ if (BufferIsLocal(buffer))
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ else
+ {
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ else
+ LockBufferForCleanup(buffer);
+ }
+
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ if (BufferIsLocal(buffer))
+ {
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ buf_state = LockBufHdr(bufHdr);
+ buf_state |= BM_VALID;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
/*
* Verify that this backend is pinning the buffer exactly once.
*
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
*
* Returns true if we successfully marked the buffer as I/O busy,
* false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend. In that case, false indicates either that the I/O was already
+ * finished, or is still in progress. This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
*/
static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
UnlockBufHdr(buf, buf_state);
+ if (nowait)
+ return false;
WaitIO(buf);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
}
/* see LimitAdditionalPins() */
-static void
+void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
uint32 max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
/*
* In contrast to LimitAdditionalPins() other backends don't play a role
- * here. We can allow up to NLocBuffer pins in total.
+ * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+ * not be initialized yet, so read num_temp_buffers instead.
*/
- max_pins = (NLocBuffer - NLocalPinnedBuffers);
+ max_pins = (num_temp_buffers - NLocalPinnedBuffers);
if (*additional_pins >= max_pins)
*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('aio')
subdir('buffer')
subdir('file')
subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
/*
* smgropen() -- Return an SMgrRelation object, creating it if need be.
*
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files. The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
*/
SMgrRelation
smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
}
/*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
*/
void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
{
SMgrRelation *owner;
ForkNumber forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
}
/*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
*
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr(). It may be re-owned if it is accessed by a
+ * relation before then.
*/
void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
{
for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
reln->smgr_targblock = InvalidBlockNumber;
+
+ if (reln->smgr_owner)
+ {
+ *reln->smgr_owner = NULL;
+ reln->smgr_owner = NULL;
+ dlist_push_tail(&unowned_relns, &reln->node);
+ }
}
/*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
*/
void
-smgrreleaseall(void)
+smgrcloseall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrrelease(reln);
+ smgrclose(reln);
}
/*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
*/
void
-smgrcloseall(void)
+smgrdestroyall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ smgrdestroy(reln);
}
/*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
* AtEOXact_SMgr
*
* This routine is called during transaction commit or abort (it doesn't
- * particularly care which). All transient SMgrRelation objects are closed.
+ * particularly care which). All transient SMgrRelation objects are
+ * destroyed.
*
* We do this as a compromise between wanting transient SMgrRelations to
* live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
dlist_mutable_iter iter;
/*
- * Zap all unowned SMgrRelations. We rely on smgrclose() to remove each
+ * Zap all unowned SMgrRelations. We rely on smgrdestroy() to remove each
* one from the list.
*/
dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
Assert(rel->smgr_owner == NULL);
- smgrclose(rel);
+ smgrdestroy(rel);
}
}
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
bool
ProcessBarrierSmgrRelease(void)
{
- smgrreleaseall();
+ smgrcloseall();
return true;
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
#ifndef BUFMGR_H
#define BUFMGR_H
+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
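+ * (With the default 8kB BLCKSZ this works out to at most 16 buffers per
+ * transfer, or fewer if PG_IOV_MAX is smaller.)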
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
/*
* prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
/* in buf_init.c */
extern void InitBufferPool(void);
extern Size BufferShmemSize(void);
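To illustrate how the two new bufmgr entry points declared above are meant to be combined, here is another sketch that is not part of the patch; read_block_run() and the block range are invented. A caller pins a run of neighboring blocks with PrepareReadBuffer() and then completes them with a single CompleteReadBuffers() call, which typically issues one vectored read.

/*
 * Sketch only, not part of the patch: read a short run of neighboring
 * blocks with one CompleteReadBuffers() call.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
read_block_run(Relation rel, BlockNumber start, int nblocks)
{
    Buffer      buffers[MAX_BUFFERS_PER_TRANSFER];
    bool        need_complete = false;

    Assert(nblocks > 0 && nblocks <= MAX_BUFFERS_PER_TRANSFER);

    /* Pin each block; remember whether any of them still needs I/O. */
    for (int i = 0; i < nblocks; i++)
    {
        bool        found;

        buffers[i] = PrepareReadBuffer(BMR_REL(rel), MAIN_FORKNUM, start + i,
                                       NULL, &found);
        if (!found)
            need_complete = true;
    }

    /*
     * One call covers the whole run; blocks that turned out to be hits, or
     * were read by another backend in the meantime, are skipped internally.
     */
    if (need_complete)
        CompleteReadBuffers(BMR_REL(rel), buffers, MAIN_FORKNUM, start,
                            nblocks, false, NULL);

    for (int i = 0; i < nblocks; i++)
    {
        /* ... examine BufferGetPage(buffers[i]) ... */
        ReleaseBuffer(buffers[i]);
    }
}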
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected. Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_private_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
*
* Very little code is authorized to touch rel->rd_smgr directly. Instead
* use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period. Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation. It's quite cheap in
- * comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b7..8007f17320a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
PgStat_TableStatus
PgStat_TableXactStatus
PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
PgXmlErrorContext
PgXmlStrictness
Pg_finfo_record
--
2.37.2
v3-0013-BitmapHeapScan-uses-streaming-read-API.patch (text/x-diff; charset=us-ascii)
From 6469df2a68926093e40f82df15d85ceacc6e0ca5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:04:18 -0500
Subject: [PATCH v3 13/13] BitmapHeapScan uses streaming read API
Remove all of the prefetching code from BitmapHeapScan and rely on the
streaming read API's prefetching instead. The heap table AM implements a
streaming read callback which uses the TBM iterator to get the next valid
block that needs to be fetched for the streaming read API.
ci-os-only:
---
src/backend/access/heap/heapam.c | 68 +++++
src/backend/access/heap/heapam_handler.c | 88 +++---
src/backend/executor/nodeBitmapHeapscan.c | 340 +---------------------
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 19 +-
src/include/nodes/execnodes.h | 19 --
6 files changed, 117 insertions(+), 421 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b93f243c282..c965048af60 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -115,6 +115,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static BlockNumber bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -335,6 +337,22 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
if (key != NULL && scan->rs_base.rs_nkeys > 0)
memcpy(scan->rs_base.rs_key, key, scan->rs_base.rs_nkeys * sizeof(ScanKeyData));
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
+ scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ scan,
+ sizeof(TBMIterateResult),
+ scan->rs_strategy,
+ BMR_REL(scan->rs_base.rs_rd),
+ MAIN_FORKNUM,
+ bitmapheap_pgsr_next);
+
+
+ }
+
/*
* Currently, we only have a stats counter for sequential heap scans (but
* e.g for bitmap scans the underlying bitmap index scans will be counted,
@@ -955,6 +973,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_pgsr = NULL;
scan->rs_vmbuffer = InvalidBuffer;
scan->rs_empty_tuples_pending = 0;
@@ -1093,6 +1112,9 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN && scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
pfree(scan);
}
@@ -10250,3 +10272,49 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
CheckForSerializableConflictOut(relation, xid, snapshot);
}
+
+static BlockNumber
+bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->rs_vmbuffer))
+ {
+ hdesc->rs_empty_tuples_pending += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
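The consuming side of this callback lives in heapam_handler.c (next hunk); condensed to its essentials it looks roughly like the sketch below, which is not part of the patch and uses the invented name next_block_sketch(). The point is that whatever the callback wrote into per_buffer_data (a TBMIterateResult) is handed back together with the matching pinned buffer.

/*
 * Sketch only, not part of the patch: consuming the per-buffer data that
 * bitmapheap_pgsr_next() produced. Releasing the previously returned
 * buffer is elided here.
 */
#include "postgres.h"

#include "access/heapam.h"
#include "nodes/tidbitmap.h"
#include "storage/bufmgr.h"
#include "storage/streaming_read.h"

static bool
next_block_sketch(HeapScanDesc hscan)
{
    void       *io_private;
    TBMIterateResult *tbmres;

    hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr,
                                                        &io_private);
    if (!BufferIsValid(hscan->rs_cbuf))
        return false;           /* bitmap, and therefore stream, exhausted */

    tbmres = (TBMIterateResult *) io_private;
    Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

    hscan->rs_cblock = tbmres->blockno;
    /* ... collect the visible tuple offsets for this page ... */
    return true;
}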
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ba6793a749c..0237cd52b61 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2113,79 +2113,65 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
*/
static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
+ void *io_private;
BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult tbmres;
+ TBMIterateResult *tbmres;
+
+ Assert(hscan->rs_pgsr);
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- *blockno = InvalidBlockNumber;
*recheck = true;
- do
+ /* Release buffer containing previous block. */
+ if (BufferIsValid(hscan->rs_cbuf))
{
- CHECK_FOR_INTERRUPTS();
+ ReleaseBuffer(hscan->rs_cbuf);
+ hscan->rs_cbuf = InvalidBuffer;
+ }
- if (scan->shared_tbmiterator)
- tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
- else
- tbm_iterate(scan->tbmiterator, &tbmres);
+ hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr, &io_private);
- if (!BlockNumberIsValid(tbmres.blockno))
+ if (BufferIsInvalid(hscan->rs_cbuf))
+ {
+ if (BufferIsValid(hscan->rs_vmbuffer))
{
- /* no more entries in the bitmap */
- Assert(hscan->rs_empty_tuples_pending == 0);
- return false;
+ ReleaseBuffer(hscan->rs_vmbuffer);
+ hscan->rs_vmbuffer = InvalidBuffer;
}
/*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE
- * isolation though, as we need to examine all invisible tuples
- * reachable by the index.
+ * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
+ * all empty tuples at the end instead of emitting them per block we
+ * skip fetching. This is necessary because the streaming read API
+ * will only return TBMIterateResults for blocks actually fetched.
+ * When we skip fetching a block, we keep track of how many empty
+ * tuples to emit at the end of the BitmapHeapScan. These all-NULL
+ * tuples do not need to be rechecked.
*/
- } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
+ *recheck = false;
+ return hscan->rs_empty_tuples_pending > 0;
+ }
- /* Got a valid block */
- *blockno = tbmres.blockno;
- *recheck = tbmres.recheck;
+ Assert(io_private);
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, and the bitmap entries don't need rechecking, and all tuples on
- * the page are visible to our transaction.
- */
- if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres.recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres.ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
+ tbmres = io_private;
- hscan->rs_empty_tuples_pending += tbmres.ntuples;
+ Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
- return true;
- }
+ *recheck = tbmres->recheck;
- block = tbmres.blockno;
+ hscan->rs_cblock = tbmres->blockno;
+ hscan->rs_ntuples = tbmres->ntuples;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- block);
- hscan->rs_cblock = block;
+ block = tbmres->blockno;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
@@ -2206,7 +2192,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2201,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres.ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres->ntuples; curslot++)
{
- OffsetNumber offnum = tbmres.offsets[curslot];
+ OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2270,7 +2256,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/* Only count exact and lossy pages with visible tuples */
if (ntup > 0)
{
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
scan->exact_pages++;
else
scan->lossy_pages++;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index bcc60d3cf98..5fd760a0f66 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -56,11 +56,6 @@ static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapAccumCounters(BitmapHeapScanState *node,
TableScanDesc scan);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
@@ -91,14 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* If we haven't yet performed the underlying index scan, do it, and begin
* the iteration over the bitmap.
- *
- * For prefetching, we use *two* iterators, one for the pages we are
- * actually scanning and another that runs ahead of the first for
- * prefetching. node->prefetch_pages tracks exactly how many pages ahead
- * the prefetch iterator is. Also, node->prefetch_target tracks the
- * desired prefetch distance, which starts small and increases up to the
- * node->prefetch_maximum. This is to avoid doing a lot of prefetching in
- * a scan that stops after a few tuples because of a LIMIT.
*/
if (!node->initialized)
{
@@ -114,15 +101,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
}
else
{
@@ -145,20 +123,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
* multiple processes to iterate jointly.
*/
pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
@@ -166,14 +130,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
/*
@@ -220,50 +176,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->initialized = true;
/* Get the first block. if none, end of scan */
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck))
return ExecClearTuple(slot);
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- BitmapAdjustPrefetchTarget(node);
}
- for (;;)
+ do
{
while (table_scan_bitmap_next_tuple(scan, slot))
{
CHECK_FOR_INTERRUPTS();
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-
- /*
- * We prefetch before fetching the current pages. We expect that a
- * future streaming read API will do this, so do it this way now
- * for consistency. Also, this should happen only when we have
- * determined there is still something to do on the current page,
- * else we may uselessly prefetch the same page we are just about
- * to request for real.
- */
- BitmapPrefetch(node, scan);
-
/*
* If we are using lossy info, we have to recheck the qual
* conditions at every tuple.
@@ -285,13 +207,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
return slot;
}
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &node->blockno))
- break;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
+ } while (table_scan_bitmap_next_block(scan, &node->recheck));
/*
* if we get here it means we are at the end of the scan..
@@ -325,215 +241,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult tbmpre;
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- TBMIterateResult tbmpre;
-
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult tbmpre;
- bool skip_fetch;
-
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
- }
-#endif /* USE_PREFETCH */
-}
-
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
*/
@@ -579,22 +286,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->ss.ss_currentScanDesc)
table_rescan(node->ss.ss_currentScanDesc, NULL);
- /* release bitmaps and buffers if any */
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
+ /* release bitmaps if any */
if (node->tbm)
tbm_free(node->tbm);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_prefetch_iterator = NULL;
- node->pvmbuffer = InvalidBuffer;
node->recheck = true;
- node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -633,16 +330,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
table_endscan(scanDesc);
/*
- * release bitmaps and buffers if any
+ * release bitmaps if any
*/
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
}
/* ----------------------------------------------------------------
@@ -675,19 +366,13 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
scanstate->recheck = true;
- scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
@@ -727,13 +412,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
/*
@@ -817,14 +495,10 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
return;
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
-
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -856,11 +530,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3dfb19ec7d5..1cad9c04f01 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
#include "storage/dsm.h"
#include "storage/lockdefs.h"
#include "storage/shm_toc.h"
+#include "storage/streaming_read.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -72,6 +73,9 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /* Streaming read control object for scans supporting it */
+ PgStreamingRead *rs_pgsr;
+
/*
* These fields are only used for bitmap scans for the "skip fetch"
* optimization. Bitmap scans needing no fields from the heap may skip
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f1f5b7ab1d0..9fad92675f4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -789,22 +789,10 @@ typedef struct TableAmRoutine
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
* make sense to perform tuple visibility checks at this time).
*
- * XXX: Currently this may only be implemented if the AM uses md.c as its
- * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
- * blockids directly to the underlying storage. nodeBitmapHeapscan.c
- * performs prefetching directly using that interface. This probably
- * needs to be rectified at a later point.
- *
- * XXX: Currently this may only be implemented if the AM uses the
- * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
- * perform prefetching. This probably needs to be rectified at a later
- * point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
- bool (*scan_bitmap_next_block) (TableScanDesc scan,
- bool *recheck, BlockNumber *blockno);
+ bool (*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1981,8 +1969,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* used after verifying the presence (at plan time or such).
*/
static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, BlockNumber *blockno)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1992,7 +1979,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, blockno);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a59df51dd69..d41a3e134d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,8 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
* mutex mutual exclusion for the prefetching variable
* and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1695,10 +1692,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,16 +1703,10 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
* recheck do current page's tuples need recheck
@@ -1729,20 +1717,13 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
bool recheck;
- BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
On Fri, Feb 16, 2024 at 12:35:59PM -0500, Melanie Plageman wrote:
In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.

I've also made some updates to the TIDBitmap API that seem like a
clarity improvement to the API in general. These also reduce the diff
for GIN when separating the TBMIterateResult from the
TBM[Shared]Iterator. And these TIDBitmap API changes are now all in
their own commits (previously those were in the same commit as adding
the BitmapHeapScan streaming read user).

The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous
I've resolved this. I added a new patch to the set which starts counting
even pages with no visible tuples toward lossy and exact pages. After an
off-list conversation with Andres, it seems that this omission in master
may not have been intentional.
Once we have only two types of pages to differentiate between (lossy and
exact [no longer have to care about "has no visible tuples"]), it is
easy enough to pass a "lossy" boolean parameter to
table_scan_bitmap_next_block(). I've done this in the attached v4.
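
For illustration, a minimal sketch of how the caller inside
BitmapHeapNext()'s main loop might consume that "lossy" out parameter
(the names follow the attached patches, but this snippet is illustrative
rather than lifted from them):

    bool        valid;
    bool        lossy;

    /* Ask the table AM to process the next block from the bitmap. */
    valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);

    /* Count the page by how it is represented in the bitmap. */
    if (lossy)
        node->lossy_pages++;
    else
        node->exact_pages++;

    if (!valid)
        continue;               /* nothing to fetch from this block */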
- Melanie
Attachments:
v4-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (text/x-diff)
From e0cee301b81400209a0e727a3d7daa1f435ba999 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v4 01/14] BitmapHeapScan begin scan after bitmap creation
There is no reason for a BitmapHeapScan to begin the scan of the
underlying table in ExecInitBitmapHeapScan(). Instead, do so after
completing the index scan and building the bitmap.
ExecBitmapHeapInitializeWorker() overwrote the snapshot in the scan
descriptor with the correct one provided by the parallel leader. Since
ExecBitmapHeapInitializeWorker() is now called before the scan
descriptor has been created, save the worker's snapshot in the
BitmapHeapScanState and pass it to table_beginscan_bm().
---
src/backend/access/table/tableam.c | 11 ------
src/backend/executor/nodeBitmapHeapscan.c | 47 ++++++++++++++++++-----
src/include/access/tableam.h | 10 ++---
src/include/nodes/execnodes.h | 2 +
4 files changed, 42 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e78d793f69c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,17 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
NULL, flags);
}
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
- Assert(IsMVCCSnapshot(snapshot));
-
- RegisterSnapshot(snapshot);
- scan->rs_snapshot = snapshot;
- scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
-
/* ----------------------------------------------------------------------------
* Parallel table scan related functions.
* ----------------------------------------------------------------------------
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..44bf38be3c9 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -181,6 +181,34 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#endif /* USE_PREFETCH */
}
+
+ /*
+ * If this is the first scan of the underlying table, create the table
+ * scan descriptor and begin the scan.
+ */
+ if (!scan)
+ {
+ Snapshot snapshot = node->ss.ps.state->es_snapshot;
+ uint32 extra_flags = 0;
+
+ /*
+ * Parallel workers must use the snapshot initialized by the
+ * parallel leader.
+ */
+ if (node->worker_snapshot)
+ {
+ snapshot = node->worker_snapshot;
+ extra_flags |= SO_TEMP_SNAPSHOT;
+ }
+
+ scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
+ node->ss.ss_currentRelation,
+ snapshot,
+ 0,
+ NULL,
+ extra_flags);
+ }
+
node->initialized = true;
}
@@ -604,7 +632,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
PlanState *outerPlan = outerPlanState(node);
/* rescan to release any page pin */
- table_rescan(node->ss.ss_currentScanDesc, NULL);
+ if (node->ss.ss_currentScanDesc)
+ table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
if (node->tbmiterator)
@@ -681,7 +710,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* close heap scan
*/
- table_endscan(scanDesc);
+ if (scanDesc)
+ table_endscan(scanDesc);
+
}
/* ----------------------------------------------------------------
@@ -739,6 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
*/
scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
node->scan.plan.targetlist == NIL);
+ scanstate->worker_snapshot = NULL;
/*
* Miscellaneous initialization
@@ -787,11 +819,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ss_currentRelation = currentRelation;
- scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
- estate->es_snapshot,
- 0,
- NULL);
-
/*
* all done.
*/
@@ -930,13 +957,13 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelBitmapHeapState *pstate;
- Snapshot snapshot;
Assert(node->ss.ps.state->es_query_dsa != NULL);
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->pstate = pstate;
- snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
- table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
+ node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
+ Assert(IsMVCCSnapshot(node->worker_snapshot));
+ RegisterSnapshot(node->worker_snapshot);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..5375dd7150f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -944,9 +944,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key)
+ int nkeys, struct ScanKeyData *key,
+ uint32 extra_flags)
{
- uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1038,11 +1039,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
allow_pagemode);
}
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
/*
* Return next tuple from `scan`, store in slot.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..00c75fb10e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,7 @@ typedef struct ParallelBitmapHeapState
* shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
+ * worker_snapshot snapshot for parallel worker
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1750,6 +1751,7 @@ typedef struct BitmapHeapScanState
TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
+ Snapshot worker_snapshot;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v4-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (text/x-diff)
From 69cd001bcdade976a51985e714d1b30b090bb388 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:38:41 -0500
Subject: [PATCH v4 02/14] BitmapHeapScan set can_skip_fetch later
Set BitmapHeapScanState->can_skip_fetch in BitmapHeapNext() when
!BitmapHeapScanState->initialized instead of in
ExecInitBitmapHeapScan(). This is a preliminary step to removing
can_skip_fetch from BitmapHeapScanState and setting it in table AM
specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 44bf38be3c9..a9ba2bdfb88 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ /*
+ * We can potentially skip fetching heap pages if we do not need any
+ * columns of the table, either for checking non-indexable quals or
+ * for returning data. This test is a bit simplistic, as it checks
+ * the stronger condition that there's no qual or return tlist at all.
+ * But in most cases it's probably not worth working harder than that.
+ */
+ node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ node->ss.ps.plan->targetlist == NIL);
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -760,16 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or for
- * returning data. This test is a bit simplistic, as it checks the
- * stronger condition that there's no qual or return tlist at all. But in
- * most cases it's probably not worth working harder than that.
- */
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
--
2.37.2
v4-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (text/x-diff)
From b29df9592f8b3a3966cf6fab40f56a0c113f3d57 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v4 03/14] Push BitmapHeapScan skip fetch optimization into
table AM
7c70996ebf0949b142 introduced an optimization to allow bitmap table
scans to skip fetching a block from the heap if none of the underlying
data was needed and the block is marked all visible in the visibility
map. With the addition of table AMs, a FIXME was added to this code
indicating that it should be pushed into table AM specific code, as not
all table AMs may use a visibility map in the same way.
Resolve this FIXME for the current block and implement it for the heap
table AM by moving the vmbuffer and other fields needed for the
optimization from the BitmapHeapScanState into the HeapScanDescData.
heapam_scan_bitmap_next_block() now decides whether or not to skip
fetching the block before reading it in and
heapam_scan_bitmap_next_tuple() returns NULL-filled tuples for skipped
blocks.
The layering violation is still present in BitmapHeapScan's prefetching
code. However, this will be eliminated when prefetching is implemented
using the upcoming streaming read API discussed in [1].
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 14 +++
src/backend/access/heap/heapam_handler.c | 29 ++++++
src/backend/executor/nodeBitmapHeapscan.c | 118 ++++++----------------
src/include/access/heapam.h | 10 ++
src/include/access/tableam.h | 7 ++
src/include/nodes/execnodes.h | 8 +-
6 files changed, 94 insertions(+), 92 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..b93f243c282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_vmbuffer = InvalidBuffer;
+ scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1043,6 +1045,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* reinitialize scan descriptor
*/
@@ -1062,6 +1070,12 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..7661acac3a8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -2124,6 +2125,24 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ /*
+ * We can skip fetching the heap page if we don't need any fields from the
+ * heap, and the bitmap entries don't need rechecking, and all tuples on
+ * the page are visible to our transaction.
+ */
+ if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ {
+ /* can't be lossy in the skip_fetch case */
+ Assert(tbmres->ntuples >= 0);
+ Assert(hscan->rs_empty_tuples_pending >= 0);
+
+ hscan->rs_empty_tuples_pending += tbmres->ntuples;
+
+ return true;
+ }
+
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2236,6 +2255,16 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
+ if (hscan->rs_empty_tuples_pending > 0)
+ {
+ /*
+ * If we don't have to fetch the tuple, just return nulls.
+ */
+ ExecStoreAllNullTuple(slot);
+ hscan->rs_empty_tuples_pending--;
+ return true;
+ }
+
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a9ba2bdfb88..2e4f87ea3a3 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,16 +108,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or
- * for returning data. This test is a bit simplistic, as it checks
- * the stronger condition that there's no qual or return tlist at all.
- * But in most cases it's probably not worth working harder than that.
- */
- node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
- node->ss.ps.plan->targetlist == NIL);
-
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -211,6 +201,17 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags |= SO_TEMP_SNAPSHOT;
}
+ /*
+ * We can potentially skip fetching heap pages if we do not need
+ * any columns of the table, either for checking non-indexable
+ * quals or for returning data. This test is a bit simplistic, as
+ * it checks the stronger condition that there's no qual or return
+ * tlist at all. But in most cases it's probably not worth working
+ * harder than that.
+ */
+ if (node->ss.ps.plan->qual == NIL && node->ss.ps.plan->targetlist == NIL)
+ extra_flags |= SO_CAN_SKIP_FETCH;
+
scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
node->ss.ss_currentRelation,
snapshot,
@@ -224,8 +225,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -245,32 +244,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres);
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -318,52 +292,33 @@ BitmapHeapNext(BitmapHeapScanState *node)
* should happen only when we have determined there is still something
* to do on the current page, else we may uselessly prefetch the same
* page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
*/
BitmapPrefetch(node, scan);
- if (node->return_empty_tuples > 0)
+ /*
+ * Attempt to fetch tuple from AM.
+ */
+ if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
+ /* nothing more to look at on this page */
+ node->tbmres = tbmres = NULL;
+ continue;
}
- else
+
+ /*
+ * If we are using lossy info, we have to recheck the qual conditions
+ * at every tuple.
+ */
+ if (tbmres->recheck)
{
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
continue;
}
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
- {
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
- }
- }
}
/* OK to return this tuple */
@@ -535,7 +490,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* it did for the current heap page; which is not a certainty
* but is true in many cases.
*/
- skip_fetch = (node->can_skip_fetch &&
+
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -586,7 +542,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
/* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -656,8 +612,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
@@ -667,7 +621,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
node->initialized = false;
node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
- node->vmbuffer = InvalidBuffer;
node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -712,8 +665,6 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
@@ -757,8 +708,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
- scanstate->vmbuffer = InvalidBuffer;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -770,7 +719,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
- scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..3dfb19ec7d5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,16 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /*
+ * These fields are only used for bitmap scans for the "skip fetch"
+ * optimization. Bitmap scans needing no fields from the heap may skip
+ * fetching an all visible block, instead using the number of tuples per
+ * block reported by the bitmap to determine how many NULL-filled tuples
+ * to return.
+ */
+ Buffer rs_vmbuffer;
+ int rs_empty_tuples_pending;
+
/* these fields only used in page-at-a-time mode and for bitmap scans */
int rs_cindex; /* current tuple's index in vistuples */
int rs_ntuples; /* number of visible tuples on page */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5375dd7150f..c193ea5db43 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,6 +62,13 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+
+ /*
+ * At the discretion of the table AM, bitmap table scans may be able to
+ * skip fetching a block from the table if none of the table data is
+ * needed.
+ */
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 00c75fb10e2..6fb4ec07c5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1711,10 +1711,7 @@ typedef struct ParallelBitmapHeapState
* tbm bitmap obtained from child index scan(s)
* tbmiterator iterator for scanning current pages
* tbmres current-page data
- * can_skip_fetch can we potentially skip tuple fetches in this scan?
- * return_empty_tuples number of empty tuples to return
- * vmbuffer buffer for visibility-map lookups
- * pvmbuffer ditto, for prefetched pages
+ * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
* prefetch_iterator iterator for prefetching ahead of current page
@@ -1736,9 +1733,6 @@ typedef struct BitmapHeapScanState
TIDBitmap *tbm;
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
- bool can_skip_fetch;
- int return_empty_tuples;
- Buffer vmbuffer;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
--
2.37.2
v4-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (text/x-diff)
From 17fc9d4c35e42b6e870b7e7f7c3495114e393e8a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v4 04/14] BitmapPrefetch use prefetch block recheck for skip
fetch
As of 7c70996ebf0949b142a9, BitmapPrefetch() used the recheck flag for
the current block to determine whether or not it could skip prefetching
the proposed prefetch block. It makes more sense for it to use the
recheck flag from the TBMIterateResult for the prefetch block instead.
See this [1] thread on hackers reporting the issue.
[1] https://www.postgresql.org/message-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ%2BopUrXDRXdcfwFZGA%40mail.gmail.com
---
src/backend/executor/nodeBitmapHeapscan.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2e4f87ea3a3..35ef26221ba 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,15 +484,9 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* skip this prefetch call, but continue to run the prefetch
* logic normally. (Would it be better not to increment
* prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
*/
-
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
@@ -543,7 +537,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
--
2.37.2
v4-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (text/x-diff)
From 67a9fb1848718cabfcfd5c98368ab2aa79a6b213 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v4 05/14] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 35ef26221ba..3439c02e989 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -55,7 +55,7 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
+ BlockNumber blockno);
static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
static inline void BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan);
@@ -242,7 +242,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
break;
}
- BitmapAdjustPrefetchIterator(node, tbmres);
+ BitmapAdjustPrefetchIterator(node, tbmres->blockno);
if (!table_scan_bitmap_next_block(scan, tbmres))
{
@@ -351,7 +351,7 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
*/
static inline void
BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
+ BlockNumber blockno)
{
#ifdef USE_PREFETCH
ParallelBitmapHeapState *pstate = node->pstate;
@@ -370,7 +370,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
+ if (tbmpre == NULL || tbmpre->blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
--
2.37.2
v4-0006-EXPLAIN-Bitmap-table-scan-also-count-no-visible-t.patch (text/x-diff)
From 4ad9d2798dff02537d0b5e7b807a5e80c7f0551d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 18:35:28 -0500
Subject: [PATCH v4 06/14] EXPLAIN Bitmap table scan also count no visible
tuple pages
Previously, bitmap heap scans only counted lossy and exact pages for
explain when there was at least one visible tuple on the page.
heapam_scan_bitmap_next_block() returned true only if there was a
"valid" page with tuples to be processed. However, the lossy and exact
page counters in EXPLAIN should count the number of pages represented in
a lossy or non-lossy way in the constructed bitmap, so it doesn't make
sense to omit pages without visible tuples.
---
src/backend/executor/nodeBitmapHeapscan.c | 15 ++++++++++-----
src/test/regress/expected/partition_prune.out | 4 +++-
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3439c02e989..75e896074bf 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -225,6 +225,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
+ bool valid;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -244,17 +246,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
+ valid = table_scan_bitmap_next_block(scan, tbmres);
if (tbmres->ntuples >= 0)
node->exact_pages++;
else
node->lossy_pages++;
+ if (!valid)
+ {
+ /* AM doesn't think this block is valid, skip */
+ continue;
+ }
+
+
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 9a4c48c0556..d9ec6492f96 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2709,6 +2709,7 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;
Index Cond: (a = 1)
-> Bitmap Heap Scan on ab_a1_b3 ab_a1_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
+ Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
-> Materialize (actual rows=1 loops=1)
@@ -2724,9 +2725,10 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;
Index Cond: (a = 1)
-> Bitmap Heap Scan on ab_a1_b3 ab_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
+ Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
-(34 rows)
+(36 rows)
table ab;
a | b
--
2.37.2
v4-0007-table_scan_bitmap_next_block-returns-lossy-or-exa.patch (text/x-diff)
From 1d179a330c870ac6cf78cc4be56fb6e48298d093 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 20:34:07 -0500
Subject: [PATCH v4 07/14] table_scan_bitmap_next_block() returns lossy or
exact
Future commits will remove the TBMIterateResult from BitmapHeapNext() --
pushing it into the table AM-specific code. So, the table AM must inform
BitmapHeapNext() whether or not the current block is lossy or exact for
the purposes of the counters used in EXPLAIN.
---
src/backend/access/heap/heapam_handler.c | 5 ++++-
src/backend/executor/nodeBitmapHeapscan.c | 10 +++++-----
src/include/access/tableam.h | 14 ++++++++++----
3 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7661acac3a8..a6e52671d9b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,7 +2114,8 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ TBMIterateResult *tbmres,
+ bool *lossy)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
BlockNumber block = tbmres->blockno;
@@ -2242,6 +2243,8 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ *lossy = tbmres->ntuples < 0;
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 75e896074bf..054f745eeba 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -225,7 +225,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool valid;
+ bool valid, lossy;
CHECK_FOR_INTERRUPTS();
@@ -246,12 +246,12 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- valid = table_scan_bitmap_next_block(scan, tbmres);
+ valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
+ if (lossy)
node->lossy_pages++;
+ else
+ node->exact_pages++;
if (!valid)
{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c193ea5db43..8280035e39f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -796,6 +796,9 @@ typedef struct TableAmRoutine
* on the page have to be returned, otherwise the tuples at offsets in
* `tbmres->offsets` need to be returned.
*
+ * lossy indicates whether or not the block's representation in the bitmap
+ * is lossy or exact.
+ *
* XXX: Currently this may only be implemented if the AM uses md.c as its
* storage manager, and uses ItemPointer->ip_blkid in a manner that maps
* blockids directly to the underlying storage. nodeBitmapHeapscan.c
@@ -811,7 +814,8 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ struct TBMIterateResult *tbmres,
+ bool *lossy);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1952,14 +1956,16 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
* a bitmap table scan. `scan` needs to have been started via
* table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise.
+ * the page, true otherwise. lossy is set to true if bitmap is lossy for the
+ * selected block and false otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ struct TBMIterateResult *tbmres,
+ bool *lossy)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1970,7 +1976,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres);
+ tbmres, lossy);
}
/*
--
2.37.2
v4-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (text/x-diff)
From b5c9f5aef18124f93886c25fceb56706dcdb813a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:17:47 -0500
Subject: [PATCH v4 08/14] Reduce scope of BitmapHeapScan tbmiterator local
variables
To simplify the diff of a future commit which will move the TBMIterators
into the scan descriptor, define them in a narrower scope now.
---
src/backend/executor/nodeBitmapHeapscan.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 054f745eeba..a639d6e7415 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -74,8 +74,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
@@ -88,10 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
tbmres = node->tbmres;
/*
@@ -108,6 +102,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ TBMIterator *tbmiterator = NULL;
+ TBMSharedIterator *shared_tbmiterator = NULL;
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -116,7 +113,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
+ tbmiterator = tbm_begin_iterate(tbm);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -169,8 +166,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
- tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
+ shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -220,6 +216,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
+ node->tbmiterator = tbmiterator;
+ node->shared_tbmiterator = shared_tbmiterator;
node->initialized = true;
}
@@ -235,9 +233,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmres == NULL)
{
if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
+ node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
if (tbmres == NULL)
{
/* no more entries in the bitmap */
--
2.37.2
v4-0009-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (text/x-diff)
From 46b086ad26c7c6a832892d91d3da2fd75f0a2039 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:13:41 -0500
Subject: [PATCH v4 09/14] Remove table_scan_bitmap_next_tuple parameter tbmres
With the addition of the proposed streaming read API [1],
table_scan_bitmap_next_block() will no longer take a TBMIterateResult as
an input. Instead table AMs will be responsible for implementing a
callback for the streaming read API which specifies which blocks should
be prefetched and read.
Thus, it no longer makes sense to use the TBMIterateResult as a means of
communication between table_scan_bitmap_next_tuple() and
table_scan_bitmap_next_block().
Note that this parameter was unused by heap AM's implementation of
table_scan_bitmap_next_tuple().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam_handler.c | 1 -
src/backend/executor/nodeBitmapHeapscan.c | 2 +-
src/include/access/tableam.h | 12 +-----------
3 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a6e52671d9b..5dc9c51ca95 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2250,7 +2250,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
static bool
heapam_scan_bitmap_next_tuple(TableScanDesc scan,
- TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a639d6e7415..87991266931 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -301,7 +301,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* Attempt to fetch tuple from AM.
*/
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
/* nothing more to look at on this page */
node->tbmres = tbmres = NULL;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8280035e39f..8d7c800d157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -787,10 +787,7 @@ typedef struct TableAmRoutine
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
- * make sense to perform tuple visibility checks at this time). For some
- * AMs it will make more sense to do all the work referencing `tbmres`
- * contents here, for others it might be better to defer more work to
- * scan_bitmap_next_tuple.
+ * make sense to perform tuple visibility checks at this time).
*
* If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
* on the page have to be returned, otherwise the tuples at offsets in
@@ -821,15 +818,10 @@ typedef struct TableAmRoutine
* Fetch the next tuple of a bitmap table scan into `slot` and return true
* if a visible tuple was found, false otherwise.
*
- * For some AMs it will make more sense to do all the work referencing
- * `tbmres` contents in scan_bitmap_next_block, for others it might be
- * better to defer more work to this callback.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_tuple) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot);
/*
@@ -1989,7 +1981,6 @@ table_scan_bitmap_next_block(TableScanDesc scan,
*/
static inline bool
table_scan_bitmap_next_tuple(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
/*
@@ -2001,7 +1992,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
- tbmres,
slot);
}
--
2.37.2
v4-0010-Make-table_scan_bitmap_next_block-async-friendly.patch (text/x-diff)
From f43d9b0913815b98b2a6216440a2b5e87ad95936 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v4 10/14] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the end of the table recorded when the scan started.
This forced the caller to be responsible for determining if additional
blocks should be fetched and then for invoking
table_scan_bitmap_next_block() for these blocks.
It makes more sense for table_scan_bitmap_next_block() to be responsible
for finding a block that is not past the end of the table (as of the
time that the scan began) and for table_scan_bitmap_next_tuple() to
return false if there are no visible tuples on the page.
This also allows us to move responsibility for the iterator to table AM
specific code. This means handling invalid blocks is entirely up to
the table AM.
These changes will enable bitmapheapscan to use the future streaming
read API [1]. Table AMs will implement a streaming read API callback
returning the next block to fetch. In heap AM's case, the callback will
use the iterator to identify the next block to fetch. Since choosing the
next block will no longer be the responsibility of BitmapHeapNext(), the
streaming read control flow requires these changes to
table_scan_bitmap_next_block().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
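
To sketch the idea (illustrative only; the streaming read API and the
actual callback are introduced in later patches in this set, and the
function name below is not final), the heap AM's block-selection
callback could boil down to:

    static BlockNumber
    bitmapheap_stream_read_next(TableScanDesc scan)
    {
        TBMIterateResult *tbmres;

        /* Advance whichever bitmap iterator this scan is using. */
        if (scan->shared_tbmiterator)
            tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
        else
            tbmres = tbm_iterate(scan->tbmiterator);

        /* Bitmap exhausted: tell the streaming read machinery to stop. */
        if (tbmres == NULL)
            return InvalidBlockNumber;

        return tbmres->blockno;
    }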
---
src/backend/access/heap/heapam_handler.c | 59 ++++++--
src/backend/executor/nodeBitmapHeapscan.c | 167 +++++++++-------------
src/include/access/relscan.h | 7 +
src/include/access/tableam.h | 68 ++++++---
src/include/nodes/execnodes.h | 9 +-
5 files changed, 168 insertions(+), 142 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5dc9c51ca95..a439ddc87bf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2114,18 +2114,51 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres,
- bool *lossy)
+ bool *recheck, bool *lossy, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->rs_empty_tuples_pending == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+
+ /* Got a valid block */
+ *blockno = tbmres->blockno;
+ *recheck = tbmres->recheck;
+
/*
* We can skip fetching the heap page if we don't need any fields from the
* heap, and the bitmap entries don't need rechecking, and all tuples on
@@ -2144,16 +2177,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
return true;
}
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE isolation
- * though, as we need to examine all invisible tuples reachable by the
- * index.
- */
- if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
- return false;
+ block = tbmres->blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2245,7 +2269,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*lossy = tbmres->ntuples < 0;
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
static bool
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 87991266931..3be433ea6e1 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -73,8 +73,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
{
ExprContext *econtext;
TableScanDesc scan;
+ bool lossy;
TIDBitmap *tbm;
- TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -86,7 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
@@ -114,7 +113,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -167,7 +165,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -216,56 +213,29 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
- node->initialized = true;
- }
-
- for (;;)
- {
- bool valid, lossy;
-
- CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
- {
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres->blockno);
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
- valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);
+ node->initialized = true;
- if (lossy)
- node->lossy_pages++;
- else
- node->exact_pages++;
+ /* Get the first block. If none, end of scan. */
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ return ExecClearTuple(slot);
- if (!valid)
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
+ if (lossy)
+ node->lossy_pages++;
+ else
+ node->exact_pages++;
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ BitmapAdjustPrefetchTarget(node);
+ }
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
+ for (;;)
+ {
+ while (table_scan_bitmap_next_tuple(scan, slot))
{
- /*
- * Continuing in previously obtained page.
- */
+ CHECK_FOR_INTERRUPTS();
#ifdef USE_PREFETCH
@@ -287,45 +257,48 @@ BitmapHeapNext(BitmapHeapScanState *node)
SpinLockRelease(&pstate->mutex);
}
#endif /* USE_PREFETCH */
- }
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- */
- BitmapPrefetch(node, scan);
-
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
+ /*
+ * We prefetch before fetching the current page. We expect that a
+ * future streaming read API will do this, so do it this way now
+ * for consistency. Also, this should happen only when we have
+ * determined there is still something to do on the current page,
+ * else we may uselessly prefetch the same page we are just about
+ * to request for real.
+ */
+ BitmapPrefetch(node, scan);
- /*
- * If we are using lossy info, we have to recheck the qual conditions
- * at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ /*
+ * If we are using lossy info, we have to recheck the qual
+ * conditions at every tuple.
+ */
+ if (node->recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
+ continue;
+ }
}
+
+ /* OK to return this tuple */
+ return slot;
}
- /* OK to return this tuple */
- return slot;
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ break;
+
+ if (lossy)
+ node->lossy_pages++;
+ else
+ node->exact_pages++;
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ /* Adjust the prefetch target */
+ BitmapAdjustPrefetchTarget(node);
}
/*
@@ -599,12 +572,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
@@ -612,13 +581,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->tbmiterator = NULL;
- node->tbmres = NULL;
node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
node->pvmbuffer = InvalidBuffer;
+ node->recheck = true;
+ node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -649,28 +617,24 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
ExecEndNode(outerPlanState(node));
+
+ /*
+ * close heap scan
+ */
+ if (scanDesc)
+ table_endscan(scanDesc);
+
/*
* release bitmaps and buffers if any
*/
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
-
- /*
- * close heap scan
- */
- if (scanDesc)
- table_endscan(scanDesc);
-
}
/* ----------------------------------------------------------------
@@ -703,8 +667,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->tbmiterator = NULL;
- scanstate->tbmres = NULL;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -713,10 +675,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
+ scanstate->recheck = true;
+ scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..92b829cebc7 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -24,6 +24,9 @@
struct ParallelTableScanDescData;
+struct TBMIterator;
+struct TBMSharedIterator;
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -40,6 +43,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ struct TBMIterator *tbmiterator;
+ struct TBMSharedIterator *shared_tbmiterator;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8d7c800d157..2adead958cb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "nodes/tidbitmap.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -780,19 +781,14 @@ typedef struct TableAmRoutine
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
- * of a bitmap table scan. `scan` was started via table_beginscan_bm().
- * Return false if there are no tuples to be found on the page, true
- * otherwise.
+ * Prepare to fetch / check / return tuples from `blockno` as part of a
+ * bitmap table scan. `scan` was started via table_beginscan_bm(). Return
+ * false if the bitmap is exhausted and true otherwise.
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
* make sense to perform tuple visibility checks at this time).
*
- * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
- * on the page have to be returned, otherwise the tuples at offsets in
- * `tbmres->offsets` need to be returned.
- *
* lossy indicates whether or not the block's representation in the bitmap
* is lossy or exact.
*
@@ -811,8 +807,8 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
- bool *lossy);
+ bool *recheck, bool *lossy,
+ BlockNumber *blockno);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -950,9 +946,13 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
uint32 extra_flags)
{
+ TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
- return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result->shared_tbmiterator = NULL;
+ result->tbmiterator = NULL;
+ return result;
}
/*
@@ -1012,6 +1012,21 @@ table_beginscan_analyze(Relation rel)
static inline void
table_endscan(TableScanDesc scan)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_end(scan);
}
@@ -1022,6 +1037,21 @@ static inline void
table_rescan(TableScanDesc scan,
struct ScanKeyData *key)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_rescan(scan, key, false, false, false, false);
}
@@ -1945,19 +1975,17 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
- * a bitmap table scan. `scan` needs to have been started via
- * table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise. lossy is set to true if bitmap is lossy for the
- * selected block and false otherwise.
+ * Prepare to fetch / check / return tuples as part of a bitmap table scan.
+ * `scan` needs to have been started via table_beginscan_bm(). Returns false if
+ * there are no more blocks in the bitmap, true otherwise. lossy is set to true
+ * if bitmap is lossy for the selected block and false otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
- bool *lossy)
+ bool *recheck, bool *lossy, BlockNumber *blockno)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1967,8 +1995,8 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres, lossy);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck,
+ lossy, blockno);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6fb4ec07c5f..a59df51dd69 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1709,8 +1709,6 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
- * tbmres current-page data
* pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
@@ -1720,10 +1718,10 @@ typedef struct ParallelBitmapHeapState
* prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
+ * recheck do current page's tuples need recheck
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1731,8 +1729,6 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- TBMIterator *tbmiterator;
- TBMIterateResult *tbmres;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
@@ -1742,10 +1738,11 @@ typedef struct BitmapHeapScanState
int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
+ bool recheck;
+ BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
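For reviewers skimming the diff above, here is a minimal sketch of the
caller-side contract after this change (example_count_bitmap_tuples is a
hypothetical helper, not part of the patch set): the table AM now advances
the bitmap iterator itself, so table_scan_bitmap_next_block() only returns
false once the bitmap is exhausted and the executor loop collapses into two
nested loops.

#include "postgres.h"
#include "access/tableam.h"

/* Hypothetical illustration of the revised table AM contract. */
static uint64
example_count_bitmap_tuples(TableScanDesc scan, TupleTableSlot *slot)
{
	bool		recheck;
	bool		lossy;
	BlockNumber blockno;
	uint64		ntuples = 0;

	/* false now means "bitmap exhausted", not "skip this block" */
	while (table_scan_bitmap_next_block(scan, &recheck, &lossy, &blockno))
	{
		/* next_tuple() returns false once this block is drained */
		while (table_scan_bitmap_next_tuple(scan, slot))
			ntuples++;
	}
	return ntuples;
}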
Attachment: v4-0011-Hard-code-TBMIterateResult-offsets-array-size.patch (text/x-diff; charset=us-ascii)
From 9c4b0c681205cdb4f48f544832fd0d4cd965f3c5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 20:13:43 -0500
Subject: [PATCH v4 11/14] Hard-code TBMIterateResult offsets array size
TIDBitmap's TBMIterateResult had a flexible sized array of tuple offsets
but the API always allocated MaxHeapTuplesPerPage OffsetNumbers.
Creating a fixed-size array of size MaxHeapTuplesPerPage is clearer
for the API user.
---
src/backend/nodes/tidbitmap.c | 29 +++++++----------------------
src/include/nodes/tidbitmap.h | 12 ++++++++++--
2 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 0f4850065fb..689a959b467 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -40,21 +40,12 @@
#include <limits.h>
-#include "access/htup_details.h"
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
-/*
- * The maximum number of tuples per page is not large (typically 256 with
- * 8K pages, or 1024 with 32K pages). So there's not much point in making
- * the per-page bitmaps variable size. We just legislate that the size
- * is this:
- */
-#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
-
/*
* When we have to switch over to lossy storage, we use a data structure
* with one bit per page, where all pages having the same number DIV
@@ -66,7 +57,7 @@
* table, using identical data structures. (This is because the memory
* management for hashtables doesn't easily/efficiently allow space to be
* transferred easily from one hashtable to another.) Therefore it's best
- * if PAGES_PER_CHUNK is the same as MAX_TUPLES_PER_PAGE, or at least not
+ * if PAGES_PER_CHUNK is the same as MaxHeapTuplesPerPage, or at least not
* too different. But we also want PAGES_PER_CHUNK to be a power of 2 to
* avoid expensive integer remainder operations. So, define it like this:
*/
@@ -78,7 +69,7 @@
#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
/* number of active words for an exact page: */
-#define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
+#define WORDS_PER_PAGE ((MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1)
/* number of active words for a lossy chunk: */
#define WORDS_PER_CHUNK ((PAGES_PER_CHUNK - 1) / BITS_PER_BITMAPWORD + 1)
@@ -180,7 +171,7 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/*
@@ -221,7 +212,7 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/* Local function prototypes */
@@ -389,7 +380,7 @@ tbm_add_tuples(TIDBitmap *tbm, const ItemPointer tids, int ntids,
bitnum;
/* safety check to ensure we don't overrun bit array bounds */
- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxHeapTuplesPerPage)
elog(ERROR, "tuple offset out of range: %u", off);
/*
@@ -691,12 +682,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
Assert(tbm->iterating != TBM_ITERATING_SHARED);
- /*
- * Create the TBMIterator struct, with enough trailing space to serve the
- * needs of the TBMIterateResult sub-struct.
- */
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -1470,8 +1456,7 @@ tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
* Create the TBMSharedIterator struct, with enough trailing space to
* serve the needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator));
istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..432fae52962 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -22,6 +22,7 @@
#ifndef TIDBITMAP_H
#define TIDBITMAP_H
+#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "utils/dsa.h"
@@ -41,9 +42,16 @@ typedef struct TBMIterateResult
{
BlockNumber blockno; /* page number containing tuples */
int ntuples; /* -1 indicates lossy result */
- bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
- OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+ bool recheck; /* should the tuples be rechecked? */
+
+ /*
+ * The maximum number of tuples per page is not large (typically 256 with
+ * 8K pages, or 1024 with 32K pages). So there's not much point in making
+ * the per-page bitmaps variable size. We just legislate that the size is
+ * this:
+ */
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResult;
/* function prototypes in nodes/tidbitmap.c */
--
2.37.2
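A small hedged illustration of why the fixed-size array helps
(ExampleIterState is a hypothetical struct, not something in the patch):
with offsets[] sized at MaxHeapTuplesPerPage, TBMIterateResult has a
compile-time-known size, so callers can embed results in their own state or
keep several of them around without palloc'ing trailing space.

#include "postgres.h"
#include "nodes/tidbitmap.h"

/* Hypothetical: results can now be embedded or arrayed directly. */
typedef struct ExampleIterState
{
	TBMIterator    *iterator;
	TBMIterateResult results[4];	/* e.g. a small look-ahead ring */
} ExampleIterState;

That property is what the following patch relies on when it makes
tbm_iterate() fill a caller-supplied result.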
Attachment: v4-0012-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (text/x-diff; charset=us-ascii)
From e2bf17c5a936c3d536d9f25150b81d00969963b1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:23:41 -0500
Subject: [PATCH v4 12/14] Separate TBM[Shared]Iterator and TBMIterateResult
Remove the TBMIterateResult from the TBMIterator and TBMSharedIterator
and have tbm_[shared_]iterate() take a TBMIterateResult as a parameter.
This will allow multiple TBMIterateResults to exist concurrently,
allowing asynchronous use of the TIDBitmap for prefetching, for example.
tbm_[shared]_iterate() now sets blockno to InvalidBlockNumber when the
bitmap is exhausted instead of returning NULL.
BitmapHeapScan callers of tbm_iterate make a TBMIterateResult locally
and pass it in.
Because GIN only needs a single TBMIterateResult, inline the matchResult
in the GinScanEntry to avoid having to separately manage memory for the
TBMIterateResult.
---
src/backend/access/gin/ginget.c | 48 +++++++++------
src/backend/access/gin/ginscan.c | 2 +-
src/backend/access/heap/heapam_handler.c | 32 +++++-----
src/backend/executor/nodeBitmapHeapscan.c | 33 +++++-----
src/backend/nodes/tidbitmap.c | 73 ++++++++++++-----------
src/include/access/gin_private.h | 2 +-
src/include/nodes/tidbitmap.h | 4 +-
7 files changed, 107 insertions(+), 87 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3aa457a29e1 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -332,10 +332,22 @@ restartScanEntry:
entry->list = NULL;
entry->nlist = 0;
entry->matchBitmap = NULL;
- entry->matchResult = NULL;
entry->reduceResult = false;
entry->predictNumberResult = 0;
+ /*
+ * MTODO: is it enough to set blockno to InvalidBlockNumber? In all the
+ * places where we previously set matchResult to NULL, I just set blockno
+ * to InvalidBlockNumber. It seems like this should be okay because that
+ * is usually what we check before using the matchResult members. But it
+ * might be safer to zero out the offsets array. But that is expensive.
+ */
+ entry->matchResult.blockno = InvalidBlockNumber;
+ entry->matchResult.ntuples = 0;
+ entry->matchResult.recheck = true;
+ memset(entry->matchResult.offsets, 0,
+ sizeof(OffsetNumber) * MaxHeapTuplesPerPage);
+
/*
* we should find entry, and begin scan of posting tree or just store
* posting list in memory
@@ -374,6 +386,7 @@ restartScanEntry:
{
if (entry->matchIterator)
tbm_end_iterate(entry->matchIterator);
+ entry->matchResult.blockno = InvalidBlockNumber;
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -823,18 +836,19 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
{
/*
* If we've exhausted all items on this block, move to next block
- * in the bitmap.
+ * in the bitmap. tbm_iterate() sets matchResult->blockno to
+ * InvalidBlockNumber when the bitmap is exhausted.
*/
- while (entry->matchResult == NULL ||
- (entry->matchResult->ntuples >= 0 &&
- entry->offset >= entry->matchResult->ntuples) ||
- entry->matchResult->blockno < advancePastBlk ||
+ while ((!BlockNumberIsValid(entry->matchResult.blockno)) ||
+ (entry->matchResult.ntuples >= 0 &&
+ entry->offset >= entry->matchResult.ntuples) ||
+ entry->matchResult.blockno < advancePastBlk ||
(ItemPointerIsLossyPage(&advancePast) &&
- entry->matchResult->blockno == advancePastBlk))
+ entry->matchResult.blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
+ tbm_iterate(entry->matchIterator, &entry->matchResult);
- if (entry->matchResult == NULL)
+ if (!BlockNumberIsValid(entry->matchResult.blockno))
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
@@ -858,10 +872,10 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* We're now on the first page after advancePast which has any
* items on it. If it's a lossy result, return that.
*/
- if (entry->matchResult->ntuples < 0)
+ if (entry->matchResult.ntuples < 0)
{
ItemPointerSetLossyPage(&entry->curItem,
- entry->matchResult->blockno);
+ entry->matchResult.blockno);
/*
* We might as well fall out of the loop; we could not
@@ -875,27 +889,27 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* Not a lossy page. Skip over any offsets <= advancePast, and
* return that.
*/
- if (entry->matchResult->blockno == advancePastBlk)
+ if (entry->matchResult.blockno == advancePastBlk)
{
/*
* First, do a quick check against the last offset on the
* page. If that's > advancePast, so are all the other
* offsets, so just go back to the top to get the next page.
*/
- if (entry->matchResult->offsets[entry->matchResult->ntuples - 1] <= advancePastOff)
+ if (entry->matchResult.offsets[entry->matchResult.ntuples - 1] <= advancePastOff)
{
- entry->offset = entry->matchResult->ntuples;
+ entry->offset = entry->matchResult.ntuples;
continue;
}
/* Otherwise scan to find the first item > advancePast */
- while (entry->matchResult->offsets[entry->offset] <= advancePastOff)
+ while (entry->matchResult.offsets[entry->offset] <= advancePastOff)
entry->offset++;
}
ItemPointerSet(&entry->curItem,
- entry->matchResult->blockno,
- entry->matchResult->offsets[entry->offset]);
+ entry->matchResult.blockno,
+ entry->matchResult.offsets[entry->offset]);
entry->offset++;
/* Done unless we need to reduce the result */
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..033d5253394 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -106,7 +106,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
ItemPointerSetMin(&scanEntry->curItem);
scanEntry->matchBitmap = NULL;
scanEntry->matchIterator = NULL;
- scanEntry->matchResult = NULL;
+ scanEntry->matchResult.blockno = InvalidBlockNumber;
scanEntry->list = NULL;
scanEntry->nlist = 0;
scanEntry->offset = InvalidOffsetNumber;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a439ddc87bf..daa5902e24d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2121,7 +2121,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult *tbmres;
+ TBMIterateResult tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
@@ -2134,11 +2134,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
CHECK_FOR_INTERRUPTS();
if (scan->shared_tbmiterator)
- tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
else
- tbmres = tbm_iterate(scan->tbmiterator);
+ tbm_iterate(scan->tbmiterator, &tbmres);
- if (tbmres == NULL)
+ if (!BlockNumberIsValid(tbmres.blockno))
{
/* no more entries in the bitmap */
Assert(hscan->rs_empty_tuples_pending == 0);
@@ -2153,11 +2153,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* isolation though, as we need to examine all invisible tuples
* reachable by the index.
*/
- } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+ } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
/* Got a valid block */
- *blockno = tbmres->blockno;
- *recheck = tbmres->recheck;
+ *blockno = tbmres.blockno;
+ *recheck = tbmres.recheck;
/*
* We can skip fetching the heap page if we don't need any fields from the
@@ -2165,19 +2165,19 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* the page are visible to our transaction.
*/
if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ !tbmres.recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
{
/* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
+ Assert(tbmres.ntuples >= 0);
Assert(hscan->rs_empty_tuples_pending >= 0);
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
+ hscan->rs_empty_tuples_pending += tbmres.ntuples;
return true;
}
- block = tbmres->blockno;
+ block = tbmres.blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2206,7 +2206,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2215,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres.ntuples; curslot++)
{
- OffsetNumber offnum = tbmres->offsets[curslot];
+ OffsetNumber offnum = tbmres.offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2267,7 +2267,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
- *lossy = tbmres->ntuples < 0;
+ *lossy = tbmres.ntuples < 0;
/*
* Return true to indicate that a valid block was found and the bitmap is
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3be433ea6e1..74b92d4cbf4 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -344,9 +344,10 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
else if (prefetch_iterator)
{
/* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
+ tbm_iterate(prefetch_iterator, &tbmpre);
- if (tbmpre == NULL || tbmpre->blockno != blockno)
+ if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
@@ -364,6 +365,8 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
}
else
{
+ TBMIterateResult tbmpre;
+
/* Release the mutex before iterating */
SpinLockRelease(&pstate->mutex);
@@ -376,7 +379,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
* case.
*/
if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
}
}
#endif /* USE_PREFETCH */
@@ -443,10 +446,12 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (node->prefetch_pages < node->prefetch_target)
{
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
bool skip_fetch;
- if (tbmpre == NULL)
+ tbm_iterate(prefetch_iterator, &tbmpre);
+
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_iterate(prefetch_iterator);
@@ -462,13 +467,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* prefetch_pages?)
*/
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
@@ -483,7 +488,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (1)
{
- TBMIterateResult *tbmpre;
+ TBMIterateResult tbmpre;
bool do_prefetch = false;
bool skip_fetch;
@@ -502,8 +507,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (!do_prefetch)
return;
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_shared_iterate(prefetch_iterator);
@@ -513,13 +518,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
}
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 689a959b467..b4dcb1cbb88 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -171,7 +171,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output;
};
/*
@@ -212,7 +211,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output;
};
/* Local function prototypes */
@@ -943,20 +941,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
/*
* tbm_iterate - scan through next page of a TIDBitmap
*
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan. Pages are guaranteed to be delivered in numerical
- * order. If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition. If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway. (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order. tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition. If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway. (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -984,6 +983,7 @@ tbm_iterate(TBMIterator *iterator)
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
+ Assert(tbmres);
if (iterator->schunkptr < tbm->nchunks)
{
PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -994,11 +994,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1014,16 +1014,17 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
@@ -1033,10 +1034,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1087,13 +1087,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1103,21 +1103,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 51d0c74a6b0..e423d92b41c 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,7 +352,7 @@ typedef struct GinScanEntryData
/* for a partial-match or full-scan query, we accumulate all TIDs here */
TIDBitmap *matchBitmap;
TBMIterator *matchIterator;
- TBMIterateResult *matchResult;
+ TBMIterateResult matchResult;
/* used for Posting list and one page in Posting tree */
ItemPointerData *list;
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 432fae52962..f000c1af28f 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -72,8 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
--
2.37.2
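To make the new calling convention concrete, here is a minimal sketch of a
caller loop under the revised API (example_walk_bitmap is hypothetical and
not part of the patch): the caller owns the TBMIterateResult and tests
blockno against InvalidBlockNumber instead of checking a returned pointer
for NULL.

#include "postgres.h"
#include "nodes/tidbitmap.h"

static void
example_walk_bitmap(TIDBitmap *tbm)
{
	TBMIterator *it = tbm_begin_iterate(tbm);
	TBMIterateResult res;

	for (;;)
	{
		tbm_iterate(it, &res);
		if (!BlockNumberIsValid(res.blockno))
			break;				/* bitmap exhausted */

		if (res.ntuples < 0)
		{
			/* lossy page: every tuple on res.blockno must be checked */
		}
		else
		{
			/* exact page: res.offsets[0 .. res.ntuples - 1] hold offsets */
		}
	}
	tbm_end_iterate(it);
}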
Attachment: v4-0013-Streaming-Read-API.patch (text/x-diff; charset=us-ascii)
From 0a6454968309ddaa85653ff9efacd54072f7fc33 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v4 13/14] Streaming Read API
---
contrib/pg_prewarm/pg_prewarm.c | 40 +-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/postmaster/bgwriter.c | 8 +-
src/backend/postmaster/checkpointer.c | 15 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/aio/Makefile | 14 +
src/backend/storage/aio/meson.build | 5 +
src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 560 +++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 14 +-
src/backend/storage/meson.build | 1 +
src/backend/storage/smgr/smgr.c | 49 +-
src/include/storage/bufmgr.h | 22 +
src/include/storage/smgr.h | 4 +-
src/include/storage/streaming_read.h | 45 ++
src/include/utils/rel.h | 6 -
src/tools/pgindent/typedefs.list | 2 +
17 files changed, 986 insertions(+), 238 deletions(-)
create mode 100644 src/backend/storage/aio/Makefile
create mode 100644 src/backend/storage/aio/meson.build
create mode 100644 src/backend/storage/aio/streaming_read.c
create mode 100644 src/include/storage/streaming_read.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
+#include "storage/streaming_read.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
static PGIOAlignedBlock blockbuffer;
+struct pg_prewarm_streaming_read_private
+{
+ BlockNumber blocknum;
+ int64 last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_data)
+{
+ struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+ if (p->blocknum <= p->last_block)
+ return p->blocknum++;
+
+ return InvalidBlockNumber;
+}
+
/*
* pg_prewarm(regclass, mode text, fork text,
* first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_streaming_read_private p;
+ PgStreamingRead *pgsr;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ &p,
+ 0,
+ NULL,
+ BMR_REL(rel),
+ forkNumber,
+ pg_prewarm_streaming_read_next);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+ pg_streaming_read_free(pgsr);
}
/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
* This is unnecessarily heavy-handed, as it will close SMgrRelation
* objects for other databases as well. DROP DATABASE occurs seldom enough
* that it's not worth introducing a variant of smgrclose for just this
- * purpose. XXX: Or should we rather leave the smgr entries dangling?
+ * purpose.
*/
smgrcloseall();
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
if (FirstCallSinceLastCheckpoint())
{
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the bgwriter does
+ * not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
ckpt_performed = CreateRestartPoint(flags);
/*
- * After any checkpoint, close all smgr files. This is so we
- * won't hang onto smgr references to deleted files indefinitely.
+ * After any checkpoint, free all smgr objects. Otherwise we
+ * would never do so for dropped relations, as the checkpointer
+ * does not process shared invalidation messages or call
+ * AtEOXact_SMgr().
*/
- smgrcloseall();
+ smgrdestroyall();
/*
* Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
*/
CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
- /*
- * After any checkpoint, close all smgr files. This is so we won't
- * hang onto smgr references to deleted files indefinitely.
- */
- smgrcloseall();
+ /* Free all smgr objects, as CheckpointerMain() normally would. */
+ smgrdestroyall();
return;
}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+ bool advice_issued;
+ bool need_complete;
+ BlockNumber blocknum;
+ int nblocks;
+ int per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+ Buffer buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+ int max_ios;
+ int ios_in_progress;
+ int ios_in_progress_trigger;
+ int max_pinned_buffers;
+ int pinned_buffers;
+ int pinned_buffers_trigger;
+ int next_tail_buffer;
+ bool finished;
+ void *pgsr_private;
+ PgStreamingReadBufferCB callback;
+ BufferAccessStrategy strategy;
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+
+ bool advice_enabled;
+
+ /* Next expected block, for detecting sequential access. */
+ BlockNumber seq_blocknum;
+
+ /* Space for optional per-buffer private data. */
+ size_t per_buffer_data_size;
+ void *per_buffer_data;
+ int per_buffer_data_next;
+
+ /* Circular buffer of ranges. */
+ int size;
+ int head;
+ int tail;
+ PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy)
+{
+ PgStreamingRead *pgsr;
+ int size;
+ int max_ios;
+ uint32 max_pinned_buffers;
+
+
+ /*
+ * Decide how many assumed I/Os we will allow to run concurrently. That
+ * is, advice to the kernel to tell it that we will soon read. This
+ * number also affects how far we look ahead for opportunities to start
+ * more I/Os.
+ */
+ if (flags & PGSR_FLAG_MAINTENANCE)
+ max_ios = maintenance_io_concurrency;
+ else
+ max_ios = effective_io_concurrency;
+
+ /*
+ * The desired level of I/O concurrency controls how far we are willing
+ * to look ahead. We also clamp it to at least
+ * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+ * sized read, even when max_ios is zero.
+ */
+ max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+ /*
+ * With the *_io_concurrency GUCs we might have 0. We want to allow at
+ * least one, to keep our gating logic simple.
+ */
+ max_ios = Max(max_ios, 1);
+
+ /*
+ * Don't allow this backend to pin too many buffers. For now we'll apply
+ * the limit for the shared buffer pool and the local buffer pool, without
+ * worrying which it is.
+ */
+ LimitAdditionalPins(&max_pinned_buffers);
+ LimitAdditionalLocalPins(&max_pinned_buffers);
+ Assert(max_pinned_buffers > 0);
+
+ /*
+ * pgsr->ranges is a circular buffer. When it is empty, head == tail.
+ * When it is full, there is an empty element between head and tail. Head
+ * can also be empty (nblocks == 0), therefore we need two extra elements
+ * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+ * maximum possible number of occupied ranges of the smallest possible
+ * size of one.
+ */
+ size = max_pinned_buffers + 2;
+
+ pgsr = (PgStreamingRead *)
+ palloc0(offsetof(PgStreamingRead, ranges) +
+ sizeof(pgsr->ranges[0]) * size);
+
+ pgsr->max_ios = max_ios;
+ pgsr->per_buffer_data_size = per_buffer_data_size;
+ pgsr->max_pinned_buffers = max_pinned_buffers;
+ pgsr->pgsr_private = pgsr_private;
+ pgsr->strategy = strategy;
+ pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+ /*
+ * This system supports prefetching advice. As long as direct I/O isn't
+ * enabled, and the caller hasn't promised sequential access, we can use
+ * it.
+ */
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ (flags & PGSR_FLAG_SEQUENTIAL) == 0)
+ pgsr->advice_enabled = true;
+#endif
+
+ /*
+ * We want to avoid creating ranges that are smaller than they could be
+ * just because we hit max_pinned_buffers. We only look ahead when the
+ * number of pinned buffers falls below this trigger number, or put
+ * another way, we stop looking ahead when we wouldn't be able to build a
+ * "full sized" range.
+ */
+ pgsr->pinned_buffers_trigger =
+ Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+ /* Space for the callback to store extra data along with each block. */
+ if (per_buffer_data_size)
+ pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+ return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb)
+{
+ PgStreamingRead *result;
+
+ result = pg_streaming_read_buffer_alloc_internal(flags,
+ pgsr_private,
+ per_buffer_data_size,
+ strategy);
+ result->callback = next_block_cb;
+ result->bmr = bmr;
+ result->forknum = forknum;
+
+ return result;
+}
+
+/*
+ * Start building a new range. This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that'll
+ * soon be reading. In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *head_range;
+
+ head_range = &pgsr->ranges[pgsr->head];
+ Assert(head_range->nblocks > 0);
+
+ /*
+ * If a call to CompleteReadBuffers() will be needed, we can issue
+ * advice to the kernel to get the read started. We suppress it if the
+ * access pattern appears to be completely sequential, though, because on
+ * some systems that interferes with the kernel's own sequential read ahead
+ * heuristics and hurts performance.
+ */
+ if (pgsr->advice_enabled)
+ {
+ BlockNumber blocknum = head_range->blocknum;
+ int nblocks = head_range->nblocks;
+
+ if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+ {
+ SMgrRelation smgr =
+ pgsr->bmr.smgr ? pgsr->bmr.smgr :
+ RelationGetSmgr(pgsr->bmr.rel);
+
+ Assert(!head_range->advice_issued);
+
+ smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+ /*
+ * Count this as an I/O that is concurrently in progress, though
+ * we don't really know if the kernel generates a physical I/O.
+ */
+ head_range->advice_issued = true;
+ pgsr->ios_in_progress++;
+ }
+
+ /* Remember the block after this range, for sequence detection. */
+ pgsr->seq_blocknum = blocknum + nblocks;
+ }
+
+ /* Create a new head range. There must be space. */
+ Assert(pgsr->size > pgsr->max_pinned_buffers);
+ Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+ if (++pgsr->head == pgsr->size)
+ pgsr->head = 0;
+ head_range = &pgsr->ranges[pgsr->head];
+ head_range->nblocks = 0;
+
+ return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+ /*
+ * If we're finished or can't start more I/O, then don't look ahead.
+ */
+ if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * We'll also wait until the number of pinned buffers falls below our
+ * trigger level, so that we have the chance to create a full range.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ do
+ {
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+ Buffer buffer;
+ bool found;
+ bool need_complete;
+ PgStreamingReadRange *head_range;
+ void *per_buffer_data;
+
+ /* Do we have a full-sized range? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks == lengthof(head_range->buffers))
+ {
+ Assert(head_range->need_complete);
+ head_range = pg_streaming_read_new_range(pgsr);
+
+ /*
+ * Give up now if I/O is saturated, or we wouldn't be able to form
+ * another full range after this due to the pin limit.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+ pgsr->ios_in_progress == pgsr->max_ios)
+ break;
+ }
+
+ per_buffer_data = (char *) pgsr->per_buffer_data +
+ pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+ /* Find out which block the callback wants to read next. */
+ blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+ if (blocknum == InvalidBlockNumber)
+ {
+ pgsr->finished = true;
+ break;
+ }
+ bmr = pgsr->bmr;
+ forknum = pgsr->forknum;
+
+ Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+ buffer = PrepareReadBuffer(bmr,
+ forknum,
+ blocknum,
+ pgsr->strategy,
+ &found);
+ pgsr->pinned_buffers++;
+
+ need_complete = !found;
+
+ /* Is there a head range that we can't extend? */
+ head_range = &pgsr->ranges[pgsr->head];
+ if (head_range->nblocks > 0 &&
+ (!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+ {
+ /* Yes, time to start building a new one. */
+ head_range = pg_streaming_read_new_range(pgsr);
+ Assert(head_range->nblocks == 0);
+ }
+
+ if (head_range->nblocks == 0)
+ {
+ /* Initialize a new range beginning at this block. */
+ head_range->blocknum = blocknum;
+ head_range->need_complete = need_complete;
+ head_range->advice_issued = false;
+ }
+ else
+ {
+ /* We can extend an existing range by one block. */
+ Assert(head_range->blocknum + head_range->nblocks == blocknum);
+ Assert(head_range->need_complete);
+ }
+
+ head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+ head_range->buffers[head_range->nblocks] = buffer;
+ head_range->nblocks++;
+
+ if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+ pgsr->per_buffer_data_next = 0;
+
+ } while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+ pgsr->ios_in_progress < pgsr->max_ios);
+
+ if (pgsr->ranges[pgsr->head].nblocks > 0)
+ pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+ pg_streaming_read_look_ahead(pgsr);
+
+ /* See if we have one buffer to return. */
+ while (pgsr->tail != pgsr->head)
+ {
+ PgStreamingReadRange *tail_range;
+
+ tail_range = &pgsr->ranges[pgsr->tail];
+
+ /*
+ * Do we need to perform an I/O before returning the buffers from this
+ * range?
+ */
+ if (tail_range->need_complete)
+ {
+ CompleteReadBuffers(pgsr->bmr,
+ tail_range->buffers,
+ pgsr->forknum,
+ tail_range->blocknum,
+ tail_range->nblocks,
+ false,
+ pgsr->strategy);
+ tail_range->need_complete = false;
+
+ /*
+ * We don't really know if the kernel generated a physical I/O
+ * when we issued advice, let alone when it finished, but it has
+ * certainly finished after a read call returns.
+ */
+ if (tail_range->advice_issued)
+ pgsr->ios_in_progress--;
+ }
+
+ /* Are there more buffers available in this range? */
+ if (pgsr->next_tail_buffer < tail_range->nblocks)
+ {
+ int buffer_index;
+ Buffer buffer;
+
+ buffer_index = pgsr->next_tail_buffer++;
+ buffer = tail_range->buffers[buffer_index];
+
+ Assert(BufferIsValid(buffer));
+
+ /* We are giving away ownership of this pinned buffer. */
+ Assert(pgsr->pinned_buffers > 0);
+ pgsr->pinned_buffers--;
+
+ if (per_buffer_data)
+ *per_buffer_data = (char *) pgsr->per_buffer_data +
+ tail_range->per_buffer_data_index[buffer_index] *
+ pgsr->per_buffer_data_size;
+
+ return buffer;
+ }
+
+ /* Advance tail to next range, if there is one. */
+ if (++pgsr->tail == pgsr->size)
+ pgsr->tail = 0;
+ pgsr->next_tail_buffer = 0;
+ }
+
+ Assert(pgsr->pinned_buffers == 0);
+
+ return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+ Buffer buffer;
+
+ /* Stop looking ahead, and unpin anything that wasn't consumed. */
+ pgsr->finished = true;
+ while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+ ReleaseBuffer(buffer);
+
+ if (pgsr->per_buffer_data)
+ pfree(pgsr->per_buffer_data);
+ pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
)
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static int SyncOneBuffer(int buf_id, bool skip_recently_used,
WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner);
static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
- /*
- * Read the buffer, and update pgstat counters to reflect a cache hit or
- * miss.
- */
- pgstat_count_buffer_read(reln);
- buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+ buf = ReadBuffer_common(BMR_REL(reln),
forkNum, blockNum, mode, strategy, &hit);
- if (hit)
- pgstat_count_buffer_hit(reln);
+
return buf;
}
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
- return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
- RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+ return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+ RELPERSISTENCE_UNLOGGED),
+ forkNum, blockNum,
mode, strategy, &hit);
}
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
bool hit;
Assert(extended_by == 0);
- buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+ buffer = ReadBuffer_common(bmr,
fork, extend_to - 1, mode, strategy,
&hit);
}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool *hit)
{
- BufferDesc *bufHdr;
- Block bufBlock;
- bool found;
- IOContext io_context;
- IOObject io_object;
- bool isLocalBuf = SmgrIsTemp(smgr);
-
- *hit = false;
+ Buffer buffer;
/*
* Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
- return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
- forkNum, strategy, flags);
+ *hit = false;
+
+ return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
- TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend);
+ buffer = PrepareReadBuffer(bmr,
+ forkNum,
+ blockNum,
+ strategy,
+ hit);
+
+ /* At this point we do NOT hold any locks. */
+ if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+ {
+ /* if we just want zeroes and a lock, we're done */
+ ZeroBuffer(buffer, mode);
+ }
+ else if (!*hit)
+ {
+ /* we might need to perform I/O */
+ CompleteReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ 1,
+ mode == RBM_ZERO_ON_ERROR,
+ strategy);
+ }
+
+ return buffer;
+}
+
+/*
+ * Prepare to read a block. The buffer is pinned. If this is a 'hit', then
+ * the returned buffer can be used immediately. Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer(). PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr)
+{
+ BufferDesc *bufHdr;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ Assert(blockNum != P_NEW);
+
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /*
- * We do not use a BufferAccessStrategy for I/O of temporary tables.
- * However, in some cases, the "strategy" may not be NULL, so we can't
- * rely on IOContextForStrategy() to set the right IOContext for us.
- * This may happen in cases like CREATE TEMPORARY TABLE AS...
- */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
- if (found)
- pgBufferUsage.local_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.local_blks_read++;
}
else
{
- /*
- * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
- * not currently in memory.
- */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
- bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found, io_context);
- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;
}
- /* At this point we do NOT hold any locks. */
+ TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend);
- /* if it was already in the buffer pool, we're done */
- if (found)
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ if (isLocalBuf)
+ {
+ bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+ if (*foundPtr)
+ pgBufferUsage.local_blks_hit++;
+ }
+ else
+ {
+ bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+ strategy, foundPtr, io_context);
+ if (*foundPtr)
+ pgBufferUsage.shared_blks_hit++;
+ }
+ if (bmr.rel)
+ {
+ /*
+ * While pgBufferUsage's "read" counter isn't bumped unless we reach
+ * CompleteReadBuffers() (so, not for hits, and not for buffers that
+ * are zeroed instead), the per-relation stats always count them.
+ */
+ pgstat_count_buffer_read(bmr.rel);
+ if (*foundPtr)
+ pgstat_count_buffer_hit(bmr.rel);
+ }
+ if (*foundPtr)
{
- /* Just need to update stats before we exit */
- *hit = true;
VacuumPageHit++;
pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageHit;
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ }
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
- * on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
+}
- return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+ if (BufferIsLocal(buffer))
+ {
+ BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+ return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
+ else
+ return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
- /*
- * if we have gotten to this point, we have allocated a buffer for the
- * page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
- * if it's a shared buffer.
- */
- Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer(). The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy)
+{
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+ }
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
+ if (isLocalBuf)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
/*
- * Read in the page, unless the caller intends to overwrite it and just
- * wants us to allocate a buffer.
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out not to be true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ if (isLocalBuf)
+ pgBufferUsage.local_blks_read += nblocks;
else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ for (int i = 0; i < nblocks; ++i)
{
- instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+ int io_buffers_len;
+ Buffer io_buffers[MAX_BUFFERS_PER_TRANSFER];
+ void *io_pages[MAX_BUFFERS_PER_TRANSFER];
+ instr_time io_start;
+ BlockNumber io_first_block;
- smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
- pgstat_count_io_op_time(io_object, io_context,
- IOOP_READ, io_start, 1);
+ /*
+ * We could get all the information from buffer headers, but it can be
+ * expensive to access buffer header cache lines so we make the caller
+ * provide all the information we need, and assert that it is
+ * consistent.
+ */
+ {
+ RelFileLocator xlocator;
+ ForkNumber xforknum;
+ BlockNumber xblocknum;
+
+ BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+ Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+ Assert(xforknum == forknum);
+ Assert(xblocknum == blocknum + i);
+ }
+#endif
+
+ /*
+ * Skip this block if someone else has already completed it. If an
+ * I/O is already in progress in another backend, this will wait for
+ * the outcome: either done, or something went wrong and we will
+ * retry.
+ */
+ if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+ {
+ /*
+ * Report this as a 'hit' for this backend, even though it must
+ * have started out as a miss in PrepareReadBuffer().
+ */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ continue;
+ }
+
+ /* We found a buffer that we need to read in. */
+ io_buffers[0] = buffers[i];
+ io_pages[0] = BufferGetBlock(buffers[i]);
+ io_first_block = blocknum + i;
+ io_buffers_len = 1;
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
+ /*
+ * How many neighboring-on-disk blocks can we scatter-read into
+ * other buffers at the same time? In this case we don't wait if we
+ * see an I/O already in progress. We already hold BM_IO_IN_PROGRESS
+ * for the head block, so we should get on with that I/O as soon as
+ * possible. We'll come back to this block again, above.
+ */
+ while ((i + 1) < nblocks &&
+ CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+ {
+ /* Must be consecutive block numbers. */
+ Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+ BufferGetBlockNumber(buffers[i]) + 1);
+
+ io_buffers[io_buffers_len] = buffers[++i];
+ io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+ }
+
+ io_start = pgstat_prepare_io_time(track_io_timing);
+ smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ io_buffers_len);
+
+ /* Verify each block we read, and terminate the I/O. */
+ for (int j = 0; j < io_buffers_len; ++j)
{
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ BufferDesc *bufHdr;
+ Block bufBlock;
+
+ if (isLocalBuf)
{
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
+ bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
}
else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
- }
-
- /*
- * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
- * content lock before marking the page as valid, to make sure that no
- * other backend sees the zeroed page before the caller has had a chance
- * to initialize it.
- *
- * Since no-one else can be looking at the page contents yet, there is no
- * difference between an exclusive lock and a cleanup-strength lock. (Note
- * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
- * they assert that the buffer is already valid.)
- */
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
- {
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
- }
+ {
+ bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+ bufBlock = BufHdrGetBlock(bufHdr);
+ }
- if (isLocalBuf)
- {
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (zero_on_error || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ memset(bufBlock, 0, BLCKSZ);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ }
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
- }
+ /* Terminate I/O and set BM_VALID. */
+ if (isLocalBuf)
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
- VacuumPageMiss++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss;
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ }
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ /* Report I/Os as completing individually. */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ false);
+ }
- return BufferDescriptorGetBuffer(bufHdr);
+ VacuumPageMiss += io_buffers_len;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ }
}
/*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
- * set true. Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true. Otherwise, *foundPtr is set false. A read should be
+ * performed with CompleteReadBuffers().
*
* io_context is passed as an output parameter to avoid calling
* IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called PrepareReadBuffer() but not yet CompleteReadBuffers().
*/
- if (StartBufferIO(existing_buf_hdr, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to obtain the right to
- * start I/O. If StartBufferIO returns false, then someone else managed
- * to read it before we did, so there's nothing left for BufferAlloc() to
- * do.
+ * Buffer contents are currently invalid.
*/
- if (StartBufferIO(victim_buf_hdr, true))
- *foundPtr = false;
- else
- *foundPtr = true;
+ *foundPtr = false;
return victim_buf_hdr;
}
@@ -1774,7 +1899,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
buf_state &= ~BM_VALID;
UnlockBufHdr(existing_hdr, buf_state);
- } while (!StartBufferIO(existing_hdr, true));
+ } while (!StartBufferIO(existing_hdr, true, false));
}
else
{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
LWLockRelease(partition_lock);
/* XXX: could combine the locked operations in it with the above */
- StartBufferIO(victim_buf_hdr, true);
+ StartBufferIO(victim_buf_hdr, true, false);
}
}
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/*
- * If we previously pinned the buffer, it must surely be valid.
+ * If we previously pinned the buffer, it is likely to be valid, but
+ * it may not be if PrepareReadBuffer() was called and
+ * CompleteReadBuffers() hasn't been called yet. We'll check by
+ * loading the flags without locking. This is racy, but it's OK to
+ * return false spuriously: when CompleteReadBuffers() calls
+ * StartBufferIO(), it'll see that it's now valid.
*
* Note: We deliberately avoid a Valgrind client request here.
* Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
* that the buffer page is legitimately non-accessible here. We
* cannot meddle with that.
*/
- result = true;
+ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
}
ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, false))
return;
/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would. The buffer must already be pinned. It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+ if (BufferIsLocal(buffer))
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ else
+ {
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ else
+ LockBufferForCleanup(buffer);
+ }
+
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ if (BufferIsLocal(buffer))
+ {
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ buf_state = LockBufHdr(bufHdr);
+ buf_state |= BM_VALID;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
/*
* Verify that this backend is pinning the buffer exactly once.
*
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
*
* Returns true if we successfully marked the buffer as I/O busy,
* false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend. In that case, false indicates either that the I/O was already
+ * finished, or is still in progress. This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
*/
static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
UnlockBufHdr(buf, buf_state);
+ if (nowait)
+ return false;
WaitIO(buf);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
}
/* see LimitAdditionalPins() */
-static void
+void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
uint32 max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
/*
* In contrast to LimitAdditionalPins() other backends don't play a role
- * here. We can allow up to NLocBuffer pins in total.
+ * here. We can allow up to NLocBuffer pins in total, but it might not be
+ * initialized yet, so read num_temp_buffers instead.
*/
- max_pins = (NLocBuffer - NLocalPinnedBuffers);
+ max_pins = (num_temp_buffers - NLocalPinnedBuffers);
if (*additional_pins >= max_pins)
*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('aio')
subdir('buffer')
subdir('file')
subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
/*
* smgropen() -- Return an SMgrRelation object, creating it if need be.
*
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files. The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
*/
SMgrRelation
smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
}
/*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
*/
void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
{
SMgrRelation *owner;
ForkNumber forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
}
/*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
*
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr(). It may be re-owned if it is accessed by a
+ * relation before then.
*/
void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
{
for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
reln->smgr_targblock = InvalidBlockNumber;
+
+ if (reln->smgr_owner)
+ {
+ *reln->smgr_owner = NULL;
+ reln->smgr_owner = NULL;
+ dlist_push_tail(&unowned_relns, &reln->node);
+ }
}
/*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
*/
void
-smgrreleaseall(void)
+smgrcloseall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrrelease(reln);
+ smgrclose(reln);
}
/*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
*/
void
-smgrcloseall(void)
+smgrdestroyall(void)
{
HASH_SEQ_STATUS status;
SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
hash_seq_init(&status, SMgrRelationHash);
while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ smgrdestroy(reln);
}
/*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
* AtEOXact_SMgr
*
* This routine is called during transaction commit or abort (it doesn't
- * particularly care which). All transient SMgrRelation objects are closed.
+ * particularly care which). All transient SMgrRelation objects are
+ * destroyed.
*
* We do this as a compromise between wanting transient SMgrRelations to
* live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
dlist_mutable_iter iter;
/*
- * Zap all unowned SMgrRelations. We rely on smgrclose() to remove each
+ * Zap all unowned SMgrRelations. We rely on smgrdestroy() to remove each
* one from the list.
*/
dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
Assert(rel->smgr_owner == NULL);
- smgrclose(rel);
+ smgrdestroy(rel);
}
}
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
bool
ProcessBarrierSmgrRelease(void)
{
- smgrreleaseall();
+ smgrcloseall();
return true;
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
#ifndef BUFMGR_H
#define BUFMGR_H
+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
/*
* prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool zero_on_error,
+ BufferAccessStrategy strategy);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
/* in buf_init.c */
extern void InitBufferPool(void);
extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected. Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_private_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
*
* Very little code is authorized to touch rel->rd_smgr directly. Instead
* use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period. Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation. It's quite cheap in
- * comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b7..8007f17320a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
PgStat_TableStatus
PgStat_TableXactStatus
PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
PgXmlErrorContext
PgXmlStrictness
Pg_finfo_record
--
2.37.2
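For reference, here is a minimal sketch of how a caller of the new API fits
together, using only the declarations in streaming_read.h above. The scan
state struct, the callback, and the driver function are invented for
illustration and are not part of the patch set:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/streaming_read.h"
#include "utils/rel.h"

/* Hypothetical per-scan state passed to the callback as pgsr_private. */
typedef struct MyScanState
{
	Relation	rel;
	BlockNumber	next_block;
	BlockNumber	nblocks;
} MyScanState;

/* Callback: report the next block to read, or InvalidBlockNumber to stop. */
static BlockNumber
my_scan_next_block(PgStreamingRead *pgsr, void *pgsr_private,
				   void *per_buffer_data)
{
	MyScanState *state = (MyScanState *) pgsr_private;

	if (state->next_block >= state->nblocks)
		return InvalidBlockNumber;	/* stream exhausted */

	return state->next_block++;
}

static void
my_scan(MyScanState *state)
{
	PgStreamingRead *pgsr;
	Buffer		buf;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
										  state,
										  0,	/* no per-buffer data */
										  NULL, /* default strategy */
										  BMR_REL(state->rel),
										  MAIN_FORKNUM,
										  my_scan_next_block);

	/* Buffers come back pinned, in the order the callback returned blocks. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
	{
		/* ... examine the page contents here ... */
		ReleaseBuffer(buf);
	}

	pg_streaming_read_free(pgsr);
}

The per_buffer_data argument is how a real user (for example the bitmap heap
scan callback in the patch below) hands a TBMIterateResult from the callback
through to the consumer of each returned buffer.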
Attachment: v4-0014-BitmapHeapScan-uses-streaming-read-API.patch (text/x-diff; charset=us-ascii)
From 4c9c90df25b4e421c34913b5da3da071fd4b15e1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:04:18 -0500
Subject: [PATCH v4 14/14] BitmapHeapScan uses streaming read API
Remove all of the prefetching code from BitmapHeapScan and rely on the
streaming read API's prefetching. The heap table AM implements a
streaming read callback which uses the TBM iterator to get the next
valid block that needs to be fetched for the streaming read API.
---
src/backend/access/heap/heapam.c | 68 +++++
src/backend/access/heap/heapam_handler.c | 88 +++---
src/backend/executor/nodeBitmapHeapscan.c | 336 +---------------------
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 22 +-
src/include/nodes/execnodes.h | 19 --
6 files changed, 117 insertions(+), 420 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b93f243c282..c965048af60 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -115,6 +115,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static BlockNumber bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -335,6 +337,22 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
if (key != NULL && scan->rs_base.rs_nkeys > 0)
memcpy(scan->rs_base.rs_key, key, scan->rs_base.rs_nkeys * sizeof(ScanKeyData));
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
+ scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ scan,
+ sizeof(TBMIterateResult),
+ scan->rs_strategy,
+ BMR_REL(scan->rs_base.rs_rd),
+ MAIN_FORKNUM,
+ bitmapheap_pgsr_next);
+
+
+ }
+
/*
* Currently, we only have a stats counter for sequential heap scans (but
* e.g for bitmap scans the underlying bitmap index scans will be counted,
@@ -955,6 +973,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_pgsr = NULL;
scan->rs_vmbuffer = InvalidBuffer;
scan->rs_empty_tuples_pending = 0;
@@ -1093,6 +1112,9 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN && scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
pfree(scan);
}
@@ -10250,3 +10272,49 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
CheckForSerializableConflictOut(relation, xid, snapshot);
}
+
+static BlockNumber
+bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->rs_vmbuffer))
+ {
+ hdesc->rs_empty_tuples_pending += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index daa5902e24d..cade7edd900 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2113,79 +2113,65 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
*/
static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, bool *lossy, BlockNumber *blockno)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck, bool *lossy)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
+ void *io_private;
BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult tbmres;
+ TBMIterateResult *tbmres;
+
+ Assert(hscan->rs_pgsr);
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- *blockno = InvalidBlockNumber;
*recheck = true;
- do
+ /* Release buffer containing previous block. */
+ if (BufferIsValid(hscan->rs_cbuf))
{
- CHECK_FOR_INTERRUPTS();
+ ReleaseBuffer(hscan->rs_cbuf);
+ hscan->rs_cbuf = InvalidBuffer;
+ }
- if (scan->shared_tbmiterator)
- tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
- else
- tbm_iterate(scan->tbmiterator, &tbmres);
+ hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr, &io_private);
- if (!BlockNumberIsValid(tbmres.blockno))
+ if (BufferIsInvalid(hscan->rs_cbuf))
+ {
+ if (BufferIsValid(hscan->rs_vmbuffer))
{
- /* no more entries in the bitmap */
- Assert(hscan->rs_empty_tuples_pending == 0);
- return false;
+ ReleaseBuffer(hscan->rs_vmbuffer);
+ hscan->rs_vmbuffer = InvalidBuffer;
}
/*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE
- * isolation though, as we need to examine all invisible tuples
- * reachable by the index.
+ * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
+ * all empty tuples at the end instead of emitting them per block we
+ * skip fetching. This is necessary because the streaming read API
+ * will only return TBMIterateResults for blocks actually fetched.
+ * When we skip fetching a block, we keep track of how many empty
+ * tuples to emit at the end of the BitmapHeapScan. We do not recheck
+ * all NULL tuples.
*/
- } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
+ *recheck = false;
+ return hscan->rs_empty_tuples_pending > 0;
+ }
- /* Got a valid block */
- *blockno = tbmres.blockno;
- *recheck = tbmres.recheck;
+ Assert(io_private);
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, and the bitmap entries don't need rechecking, and all tuples on
- * the page are visible to our transaction.
- */
- if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres.recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres.ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
+ tbmres = io_private;
- hscan->rs_empty_tuples_pending += tbmres.ntuples;
+ Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
- return true;
- }
+ *recheck = tbmres->recheck;
- block = tbmres.blockno;
+ hscan->rs_cblock = tbmres->blockno;
+ hscan->rs_ntuples = tbmres->ntuples;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- block);
- hscan->rs_cblock = block;
+ block = tbmres->blockno;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
@@ -2206,7 +2192,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2215,9 +2201,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres.ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres->ntuples; curslot++)
{
- OffsetNumber offnum = tbmres.offsets[curslot];
+ OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2267,7 +2253,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
- *lossy = tbmres.ntuples < 0;
+ *lossy = tbmres->ntuples < 0;
/*
* Return true to indicate that a valid block was found and the bitmap is
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 74b92d4cbf4..c5a482cc175 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -54,11 +54,6 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
@@ -90,14 +85,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* If we haven't yet performed the underlying index scan, do it, and begin
* the iteration over the bitmap.
- *
- * For prefetching, we use *two* iterators, one for the pages we are
- * actually scanning and another that runs ahead of the first for
- * prefetching. node->prefetch_pages tracks exactly how many pages ahead
- * the prefetch iterator is. Also, node->prefetch_target tracks the
- * desired prefetch distance, which starts small and increases up to the
- * node->prefetch_maximum. This is to avoid doing a lot of prefetching in
- * a scan that stops after a few tuples because of a LIMIT.
*/
if (!node->initialized)
{
@@ -113,15 +100,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
}
else
{
@@ -144,20 +122,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
* multiple processes to iterate jointly.
*/
pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
@@ -165,14 +129,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
/*
@@ -219,16 +175,13 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->initialized = true;
/* Get the first block. if none, end of scan */
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy))
return ExecClearTuple(slot);
if (lossy)
node->lossy_pages++;
else
node->exact_pages++;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- BitmapAdjustPrefetchTarget(node);
}
for (;;)
@@ -237,37 +190,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
{
CHECK_FOR_INTERRUPTS();
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-
- /*
- * We prefetch before fetching the current pages. We expect that a
- * future streaming read API will do this, so do it this way now
- * for consistency. Also, this should happen only when we have
- * determined there is still something to do on the current page,
- * else we may uselessly prefetch the same page we are just about
- * to request for real.
- */
- BitmapPrefetch(node, scan);
-
/*
* If we are using lossy info, we have to recheck the qual
* conditions at every tuple.
@@ -288,17 +210,13 @@ BitmapHeapNext(BitmapHeapScanState *node)
return slot;
}
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy))
break;
if (lossy)
node->lossy_pages++;
else
node->exact_pages++;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
}
/*
@@ -322,215 +240,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult tbmpre;
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- TBMIterateResult tbmpre;
-
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult tbmpre;
- bool skip_fetch;
-
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
- }
-#endif /* USE_PREFETCH */
-}
-
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
*/
@@ -576,22 +285,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->ss.ss_currentScanDesc)
table_rescan(node->ss.ss_currentScanDesc, NULL);
- /* release bitmaps and buffers if any */
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
+ /* release bitmaps if any */
if (node->tbm)
tbm_free(node->tbm);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_prefetch_iterator = NULL;
- node->pvmbuffer = InvalidBuffer;
node->recheck = true;
- node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -630,16 +329,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
table_endscan(scanDesc);
/*
- * release bitmaps and buffers if any
+ * release bitmaps if any
*/
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
}
/* ----------------------------------------------------------------
@@ -672,19 +365,13 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
scanstate->recheck = true;
- scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
@@ -724,13 +411,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
/*
@@ -814,14 +494,10 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
return;
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
-
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -853,11 +529,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3dfb19ec7d5..1cad9c04f01 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
#include "storage/dsm.h"
#include "storage/lockdefs.h"
#include "storage/shm_toc.h"
+#include "storage/streaming_read.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -72,6 +73,9 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /* Streaming read control object for scans supporting it */
+ PgStreamingRead *rs_pgsr;
+
/*
* These fields are only used for bitmap scans for the "skip fetch"
* optimization. Bitmap scans needing no fields from the heap may skip
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2adead958cb..1a7b9db8b40 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -792,23 +792,11 @@ typedef struct TableAmRoutine
* lossy indicates whether or not the block's representation in the bitmap
* is lossy or exact.
*
- * XXX: Currently this may only be implemented if the AM uses md.c as its
- * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
- * blockids directly to the underlying storage. nodeBitmapHeapscan.c
- * performs prefetching directly using that interface. This probably
- * needs to be rectified at a later point.
- *
- * XXX: Currently this may only be implemented if the AM uses the
- * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
- * perform prefetching. This probably needs to be rectified at a later
- * point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
- bool (*scan_bitmap_next_block) (TableScanDesc scan,
- bool *recheck, bool *lossy,
- BlockNumber *blockno);
+ bool (*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck,
+ bool *lossy);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1984,8 +1972,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* used after verifying the presence (at plan time or such).
*/
static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, bool *lossy, BlockNumber *blockno)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck, bool *lossy)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1995,8 +1982,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck,
- lossy, blockno);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, lossy);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a59df51dd69..d41a3e134d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,8 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
* mutex mutual exclusion for the prefetching variable
* and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1695,10 +1692,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,16 +1703,10 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
* recheck do current page's tuples need recheck
@@ -1729,20 +1717,13 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
bool recheck;
- BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
On Mon, Feb 26, 2024 at 08:50:28PM -0500, Melanie Plageman wrote:
On Fri, Feb 16, 2024 at 12:35:59PM -0500, Melanie Plageman wrote:
In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.

I've also made some updates to the TIDBitmap API that seem like a
clarity improvement to the API in general. These also reduce the diff
for GIN when separating the TBMIterateResult from the
TBM[Shared]Iterator. And these TIDBitmap API changes are now all in
their own commits (previously those were in the same commit as adding
the BitmapHeapScan streaming read user).

The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous

I've resolved this. I added a new patch to the set which starts counting
even pages with no visible tuples toward lossy and exact pages. After an
off-list conversation with Andres, it seems that this omission in master
may not have been intentional.

Once we have only two types of pages to differentiate between (lossy and
exact [no longer have to care about "has no visible tuples"]), it is
easy enough to pass a "lossy" boolean parameter to
table_scan_bitmap_next_block(). I've done this in the attached v4.
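
To make the shape of that change concrete, the call site in
BitmapHeapNext() ends up looking roughly like this (a condensed sketch
of what the attached v5-0007 patch does, not the verbatim diff):

    bool        valid,
                lossy;

    /* the AM now reports whether this block's bitmap entry was lossy */
    valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);

    /* count the page for EXPLAIN whether or not it has visible tuples */
    if (lossy)
        node->lossy_pages++;
    else
        node->exact_pages++;

    if (!valid)
        continue;               /* nothing to fetch from this block */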
Thomas posted a new version of the Streaming Read API [1], so here is a
rebased v5. This should make it easier to review as it can be applied on
top of master.
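
As an aside, for anyone skimming the attachments: the key control flow
change enabling the streaming read user is that the table AM now owns
the bitmap iterator and produces block numbers itself. A rough sketch of
that loop, condensed from the v5-0010 patch below (the function name and
how it would ultimately be registered as a streaming read callback are
assumptions on my part, since that glue is not in this set):

    static BlockNumber
    bitmapheap_next_block_to_read(TableScanDesc scan)
    {
        HeapScanDesc hscan = (HeapScanDesc) scan;
        TBMIterateResult *tbmres;

        for (;;)
        {
            if (scan->shared_tbmiterator)
                tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
            else
                tbmres = tbm_iterate(scan->tbmiterator);

            /* bitmap exhausted: no more blocks to read */
            if (tbmres == NULL)
                return InvalidBlockNumber;

            /*
             * Ignore blocks past the end of the relation as of the start
             * of the scan, except under SERIALIZABLE isolation, matching
             * the do/while loop in heapam_scan_bitmap_next_block().
             */
            if (IsolationIsSerializable() ||
                tbmres->blockno < hscan->rs_nblocks)
                return tbmres->blockno;
        }
    }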
- Melanie
[1]: /messages/by-id/CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com
Attachments:
v5-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch (text/x-diff; charset=us-ascii)
From 5f523e4839c935f3b126b0c388129eb919c82b81 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v5 01/14] BitmapHeapScan begin scan after bitmap creation
There is no reason for a BitmapHeapScan to begin the scan of the
underlying table in ExecInitBitmapHeapScan(). Instead, do so after
completing the index scan and building the bitmap.
ExecBitmapHeapInitializeWorker() overwrote the snapshot in the scan
descriptor with the correct one provided by the parallel leader. Since
ExecBitmapHeapInitializeWorker() is now called before the scan
descriptor has been created, save the worker's snapshot in the
BitmapHeapScanState and pass it to table_beginscan_bm().
---
src/backend/access/table/tableam.c | 11 ------
src/backend/executor/nodeBitmapHeapscan.c | 47 ++++++++++++++++++-----
src/include/access/tableam.h | 10 ++---
src/include/nodes/execnodes.h | 2 +
4 files changed, 42 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e78d793f69c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,17 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
NULL, flags);
}
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
- Assert(IsMVCCSnapshot(snapshot));
-
- RegisterSnapshot(snapshot);
- scan->rs_snapshot = snapshot;
- scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
-
/* ----------------------------------------------------------------------------
* Parallel table scan related functions.
* ----------------------------------------------------------------------------
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..44bf38be3c9 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -181,6 +181,34 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#endif /* USE_PREFETCH */
}
+
+ /*
+ * If this is the first scan of the underlying table, create the table
+ * scan descriptor and begin the scan.
+ */
+ if (!scan)
+ {
+ Snapshot snapshot = node->ss.ps.state->es_snapshot;
+ uint32 extra_flags = 0;
+
+ /*
+ * Parallel workers must use the snapshot initialized by the
+ * parallel leader.
+ */
+ if (node->worker_snapshot)
+ {
+ snapshot = node->worker_snapshot;
+ extra_flags |= SO_TEMP_SNAPSHOT;
+ }
+
+ scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
+ node->ss.ss_currentRelation,
+ snapshot,
+ 0,
+ NULL,
+ extra_flags);
+ }
+
node->initialized = true;
}
@@ -604,7 +632,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
PlanState *outerPlan = outerPlanState(node);
/* rescan to release any page pin */
- table_rescan(node->ss.ss_currentScanDesc, NULL);
+ if (node->ss.ss_currentScanDesc)
+ table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
if (node->tbmiterator)
@@ -681,7 +710,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
/*
* close heap scan
*/
- table_endscan(scanDesc);
+ if (scanDesc)
+ table_endscan(scanDesc);
+
}
/* ----------------------------------------------------------------
@@ -739,6 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
*/
scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
node->scan.plan.targetlist == NIL);
+ scanstate->worker_snapshot = NULL;
/*
* Miscellaneous initialization
@@ -787,11 +819,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ss_currentRelation = currentRelation;
- scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
- estate->es_snapshot,
- 0,
- NULL);
-
/*
* all done.
*/
@@ -930,13 +957,13 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelBitmapHeapState *pstate;
- Snapshot snapshot;
Assert(node->ss.ps.state->es_query_dsa != NULL);
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->pstate = pstate;
- snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
- table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
+ node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
+ Assert(IsMVCCSnapshot(node->worker_snapshot));
+ RegisterSnapshot(node->worker_snapshot);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..5375dd7150f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -944,9 +944,10 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, struct ScanKeyData *key)
+ int nkeys, struct ScanKeyData *key,
+ uint32 extra_flags)
{
- uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1038,11 +1039,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
allow_pagemode);
}
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
/*
* Return next tuple from `scan`, store in slot.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..00c75fb10e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1726,6 +1726,7 @@ typedef struct ParallelBitmapHeapState
* shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
+ * worker_snapshot snapshot for parallel worker
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1750,6 +1751,7 @@ typedef struct BitmapHeapScanState
TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
+ Snapshot worker_snapshot;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
v5-0002-BitmapHeapScan-set-can_skip_fetch-later.patch (text/x-diff; charset=us-ascii)
From c76a0dc384143a23ac58d421df4d4956fee58961 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 14:38:41 -0500
Subject: [PATCH v5 02/14] BitmapHeapScan set can_skip_fetch later
Set BitmapHeapScanState->can_skip_fetch in BitmapHeapNext() when
!BitmapHeapScanState->initialized instead of in
ExecInitBitmapHeapScan(). This is a preliminary step to removing
can_skip_fetch from BitmapHeapScanState and setting it in table AM
specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 44bf38be3c9..a9ba2bdfb88 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,6 +108,16 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ /*
+ * We can potentially skip fetching heap pages if we do not need any
+ * columns of the table, either for checking non-indexable quals or
+ * for returning data. This test is a bit simplistic, as it checks
+ * the stronger condition that there's no qual or return tlist at all.
+ * But in most cases it's probably not worth working harder than that.
+ */
+ node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
+ node->ss.ps.plan->targetlist == NIL);
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -760,16 +770,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
-
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or for
- * returning data. This test is a bit simplistic, as it checks the
- * stronger condition that there's no qual or return tlist at all. But in
- * most cases it's probably not worth working harder than that.
- */
- scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
--
2.37.2
v5-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch (text/x-diff; charset=us-ascii)
From b4148fc01e789700309cb144ca5e68bbcd9c2aa6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 20:15:05 -0500
Subject: [PATCH v5 03/14] Push BitmapHeapScan skip fetch optimization into
table AM
7c70996ebf0949b142 introduced an optimization to allow bitmap table
scans to skip fetching a block from the heap if none of the underlying
data was needed and the block is marked all visible in the visibility
map. With the addition of table AMs, a FIXME was added to this code
indicating that it should be pushed into table AM specific code, as not
all table AMs may use a visibility map in the same way.
Resolve this FIXME for the current block and implement it for the heap
table AM by moving the vmbuffer and other fields needed for the
optimization from the BitmapHeapScanState into the HeapScanDescData.
heapam_scan_bitmap_next_block() now decides whether or not to skip
fetching the block before reading it in and
heapam_scan_bitmap_next_tuple() returns NULL-filled tuples for skipped
blocks.
The layering violation is still present in BitmapHeapScan's prefetching
code. However, this will be eliminated when prefetching is implemented
using the upcoming streaming read API discussed in [1].
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 14 +++
src/backend/access/heap/heapam_handler.c | 29 ++++++
src/backend/executor/nodeBitmapHeapscan.c | 118 ++++++----------------
src/include/access/heapam.h | 10 ++
src/include/access/tableam.h | 7 ++
src/include/nodes/execnodes.h | 8 +-
6 files changed, 94 insertions(+), 92 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..b93f243c282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_vmbuffer = InvalidBuffer;
+ scan->rs_empty_tuples_pending = 0;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1043,6 +1045,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* reinitialize scan descriptor
*/
@@ -1062,6 +1070,12 @@ heap_endscan(TableScanDesc sscan)
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_vmbuffer))
+ {
+ ReleaseBuffer(scan->rs_vmbuffer);
+ scan->rs_vmbuffer = InvalidBuffer;
+ }
+
/*
* decrement relation reference count and free scan descriptor storage
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..c9b9b4c00f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -2122,6 +2123,24 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ /*
+ * We can skip fetching the heap page if we don't need any fields from the
+ * heap, and the bitmap entries don't need rechecking, and all tuples on
+ * the page are visible to our transaction.
+ */
+ if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ {
+ /* can't be lossy in the skip_fetch case */
+ Assert(tbmres->ntuples >= 0);
+ Assert(hscan->rs_empty_tuples_pending >= 0);
+
+ hscan->rs_empty_tuples_pending += tbmres->ntuples;
+
+ return true;
+ }
+
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
@@ -2234,6 +2253,16 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page page;
ItemId lp;
+ if (hscan->rs_empty_tuples_pending > 0)
+ {
+ /*
+ * If we don't have to fetch the tuple, just return nulls.
+ */
+ ExecStoreAllNullTuple(slot);
+ hscan->rs_empty_tuples_pending--;
+ return true;
+ }
+
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a9ba2bdfb88..2e4f87ea3a3 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -108,16 +108,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
- /*
- * We can potentially skip fetching heap pages if we do not need any
- * columns of the table, either for checking non-indexable quals or
- * for returning data. This test is a bit simplistic, as it checks
- * the stronger condition that there's no qual or return tlist at all.
- * But in most cases it's probably not worth working harder than that.
- */
- node->can_skip_fetch = (node->ss.ps.plan->qual == NIL &&
- node->ss.ps.plan->targetlist == NIL);
-
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -211,6 +201,17 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags |= SO_TEMP_SNAPSHOT;
}
+ /*
+ * We can potentially skip fetching heap pages if we do not need
+ * any columns of the table, either for checking non-indexable
+ * quals or for returning data. This test is a bit simplistic, as
+ * it checks the stronger condition that there's no qual or return
+ * tlist at all. But in most cases it's probably not worth working
+ * harder than that.
+ */
+ if (node->ss.ps.plan->qual == NIL && node->ss.ps.plan->targetlist == NIL)
+ extra_flags |= SO_CAN_SKIP_FETCH;
+
scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
node->ss.ss_currentRelation,
snapshot,
@@ -224,8 +225,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
/*
@@ -245,32 +244,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres);
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
+ if (!table_scan_bitmap_next_block(scan, tbmres))
{
/* AM doesn't think this block is valid, skip */
continue;
@@ -318,52 +292,33 @@ BitmapHeapNext(BitmapHeapScanState *node)
* should happen only when we have determined there is still something
* to do on the current page, else we may uselessly prefetch the same
* page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
*/
BitmapPrefetch(node, scan);
- if (node->return_empty_tuples > 0)
+ /*
+ * Attempt to fetch tuple from AM.
+ */
+ if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
+ /* nothing more to look at on this page */
+ node->tbmres = tbmres = NULL;
+ continue;
}
- else
+
+ /*
+ * If we are using lossy info, we have to recheck the qual conditions
+ * at every tuple.
+ */
+ if (tbmres->recheck)
{
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
continue;
}
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
- {
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
- }
- }
}
/* OK to return this tuple */
@@ -535,7 +490,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* it did for the current heap page; which is not a certainty
* but is true in many cases.
*/
- skip_fetch = (node->can_skip_fetch &&
+
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -586,7 +542,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
}
/* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
+ skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
(node->tbmres ? !node->tbmres->recheck : false) &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
@@ -656,8 +612,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
@@ -667,7 +621,6 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
node->initialized = false;
node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
- node->vmbuffer = InvalidBuffer;
node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -712,8 +665,6 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->vmbuffer != InvalidBuffer)
- ReleaseBuffer(node->vmbuffer);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
@@ -757,8 +708,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
- scanstate->vmbuffer = InvalidBuffer;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -770,7 +719,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
- scanstate->can_skip_fetch = false;
scanstate->worker_snapshot = NULL;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..3dfb19ec7d5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,16 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /*
+ * These fields are only used for bitmap scans for the "skip fetch"
+ * optimization. Bitmap scans needing no fields from the heap may skip
+ * fetching an all visible block, instead using the number of tuples per
+ * block reported by the bitmap to determine how many NULL-filled tuples
+ * to return.
+ */
+ Buffer rs_vmbuffer;
+ int rs_empty_tuples_pending;
+
/* these fields only used in page-at-a-time mode and for bitmap scans */
int rs_cindex; /* current tuple's index in vistuples */
int rs_ntuples; /* number of visible tuples on page */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5375dd7150f..c193ea5db43 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -62,6 +62,13 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+
+ /*
+ * At the discretion of the table AM, bitmap table scans may be able to
+ * skip fetching a block from the table if none of the table data is
+ * needed.
+ */
+ SO_CAN_SKIP_FETCH = 1 << 10,
} ScanOptions;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 00c75fb10e2..6fb4ec07c5f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1711,10 +1711,7 @@ typedef struct ParallelBitmapHeapState
* tbm bitmap obtained from child index scan(s)
* tbmiterator iterator for scanning current pages
* tbmres current-page data
- * can_skip_fetch can we potentially skip tuple fetches in this scan?
- * return_empty_tuples number of empty tuples to return
- * vmbuffer buffer for visibility-map lookups
- * pvmbuffer ditto, for prefetched pages
+ * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
* prefetch_iterator iterator for prefetching ahead of current page
@@ -1736,9 +1733,6 @@ typedef struct BitmapHeapScanState
TIDBitmap *tbm;
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
- bool can_skip_fetch;
- int return_empty_tuples;
- Buffer vmbuffer;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
--
2.37.2
v5-0004-BitmapPrefetch-use-prefetch-block-recheck-for-ski.patch (text/x-diff; charset=us-ascii)
From 3df428a0428b821ed1c19bb19a18e0d3b3d60a4a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:03:24 -0500
Subject: [PATCH v5 04/14] BitmapPrefetch use prefetch block recheck for skip
fetch
As of 7c70996ebf0949b142a9, BitmapPrefetch() used the recheck flag for
the current block to determine whether or not it could skip prefetching
the proposed prefetch block. It makes more sense for it to use the
recheck flag from the TBMIterateResult for the prefetch block instead.
See this [1] thread on hackers reporting the issue.
[1] https://www.postgresql.org/message-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ%2BopUrXDRXdcfwFZGA%40mail.gmail.com
---
src/backend/executor/nodeBitmapHeapscan.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2e4f87ea3a3..35ef26221ba 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,15 +484,9 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* skip this prefetch call, but continue to run the prefetch
* logic normally. (Would it be better not to increment
* prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
*/
-
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
@@ -543,7 +537,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
+ !tbmpre->recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
tbmpre->blockno,
&node->pvmbuffer));
--
2.37.2
v5-0005-Update-BitmapAdjustPrefetchIterator-parameter-typ.patch (text/x-diff; charset=us-ascii)
From 43d0dd9b5661617bf61562f8c4bc189d66fc62a3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 19:04:48 -0500
Subject: [PATCH v5 05/14] Update BitmapAdjustPrefetchIterator parameter type
to BlockNumber
BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only. This will
allow us to move away from using the TBMIterateResult outside of table
AM specific code.
---
src/backend/executor/nodeBitmapHeapscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 35ef26221ba..3439c02e989 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -55,7 +55,7 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
+ BlockNumber blockno);
static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
static inline void BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan);
@@ -242,7 +242,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
break;
}
- BitmapAdjustPrefetchIterator(node, tbmres);
+ BitmapAdjustPrefetchIterator(node, tbmres->blockno);
if (!table_scan_bitmap_next_block(scan, tbmres))
{
@@ -351,7 +351,7 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
*/
static inline void
BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
+ BlockNumber blockno)
{
#ifdef USE_PREFETCH
ParallelBitmapHeapState *pstate = node->pstate;
@@ -370,7 +370,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
+ if (tbmpre == NULL || tbmpre->blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
--
2.37.2
v5-0006-EXPLAIN-Bitmap-table-scan-also-count-no-visible-t.patch (text/x-diff; charset=us-ascii)
From c0f80d78a1d2a940a825beeb394f8cd025d260c0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 18:35:28 -0500
Subject: [PATCH v5 06/14] EXPLAIN Bitmap table scan also count no visible
tuple pages
Previously, bitmap heap scans only counted lossy and exact pages for
explain when there was at least one visible tuple on the page.
heapam_scan_bitmap_next_block() returned true only if there was a
"valid" page with tuples to be processed. However, the lossy and exact
page counters in EXPLAIN should count the number of pages represented in
a lossy or non-lossy way in the constructed bitmap, so it doesn't make
sense to omit pages without visible tuples.
---
src/backend/executor/nodeBitmapHeapscan.c | 15 ++++++++++-----
src/test/regress/expected/partition_prune.out | 4 +++-
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3439c02e989..75e896074bf 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -225,6 +225,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
+ bool valid;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -244,17 +246,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
+ valid = table_scan_bitmap_next_block(scan, tbmres);
if (tbmres->ntuples >= 0)
node->exact_pages++;
else
node->lossy_pages++;
+ if (!valid)
+ {
+ /* AM doesn't think this block is valid, skip */
+ continue;
+ }
+
+
/* Adjust the prefetch target */
BitmapAdjustPrefetchTarget(node);
}
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index b41950d923b..7b1b1e97033 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2812,6 +2812,7 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;
Index Cond: (a = 1)
-> Bitmap Heap Scan on ab_a1_b3 ab_a1_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
+ Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
-> Materialize (actual rows=1 loops=1)
@@ -2827,9 +2828,10 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;
Index Cond: (a = 1)
-> Bitmap Heap Scan on ab_a1_b3 ab_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
+ Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
-(34 rows)
+(36 rows)
table ab;
a | b
--
2.37.2
v5-0007-table_scan_bitmap_next_block-returns-lossy-or-exa.patch (text/x-diff; charset=us-ascii)
From 0f9b10773db6b406f2d9a481293a2fd1dfe0669d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 20:34:07 -0500
Subject: [PATCH v5 07/14] table_scan_bitmap_next_block() returns lossy or
exact
Future commits will remove the TBMIterateResult from BitmapHeapNext() --
pushing it into the table AM-specific code. So, the table AM must inform
BitmapHeapNext() whether or not the current block is lossy or exact for
the purposes of the counters used in EXPLAIN.
---
src/backend/access/heap/heapam_handler.c | 5 ++++-
src/backend/executor/nodeBitmapHeapscan.c | 10 +++++-----
src/include/access/tableam.h | 14 ++++++++++----
3 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c9b9b4c00f1..10c1c3b616b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2112,7 +2112,8 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ TBMIterateResult *tbmres,
+ bool *lossy)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
BlockNumber block = tbmres->blockno;
@@ -2240,6 +2241,8 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
+ *lossy = tbmres->ntuples < 0;
+
return ntup > 0;
}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 75e896074bf..054f745eeba 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -225,7 +225,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
for (;;)
{
- bool valid;
+ bool valid, lossy;
CHECK_FOR_INTERRUPTS();
@@ -246,12 +246,12 @@ BitmapHeapNext(BitmapHeapScanState *node)
BitmapAdjustPrefetchIterator(node, tbmres->blockno);
- valid = table_scan_bitmap_next_block(scan, tbmres);
+ valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
+ if (lossy)
node->lossy_pages++;
+ else
+ node->exact_pages++;
if (!valid)
{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c193ea5db43..8280035e39f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -796,6 +796,9 @@ typedef struct TableAmRoutine
* on the page have to be returned, otherwise the tuples at offsets in
* `tbmres->offsets` need to be returned.
*
+ * lossy indicates whether or not the block's representation in the bitmap
+ * is lossy or exact.
+ *
* XXX: Currently this may only be implemented if the AM uses md.c as its
* storage manager, and uses ItemPointer->ip_blkid in a manner that maps
* blockids directly to the underlying storage. nodeBitmapHeapscan.c
@@ -811,7 +814,8 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ struct TBMIterateResult *tbmres,
+ bool *lossy);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1952,14 +1956,16 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
* a bitmap table scan. `scan` needs to have been started via
* table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise.
+ * the page, true otherwise. lossy is set to true if bitmap is lossy for the
+ * selected block and false otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ struct TBMIterateResult *tbmres,
+ bool *lossy)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1970,7 +1976,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres);
+ tbmres, lossy);
}
/*
--
2.37.2
v5-0008-Reduce-scope-of-BitmapHeapScan-tbmiterator-local-.patch (text/x-diff; charset=us-ascii)
From 9df8ee50ea111a49a04397db0ff2b77d89eab7a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:17:47 -0500
Subject: [PATCH v5 08/14] Reduce scope of BitmapHeapScan tbmiterator local
variables
To simplify the diff of a future commit which will move the TBMIterators
into the scan descriptor, define them in a narrower scope now.
---
src/backend/executor/nodeBitmapHeapscan.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 054f745eeba..a639d6e7415 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -74,8 +74,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
@@ -88,10 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
tbmres = node->tbmres;
/*
@@ -108,6 +102,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
*/
if (!node->initialized)
{
+ TBMIterator *tbmiterator = NULL;
+ TBMSharedIterator *shared_tbmiterator = NULL;
+
if (!pstate)
{
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
@@ -116,7 +113,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
+ tbmiterator = tbm_begin_iterate(tbm);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -169,8 +166,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
- tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
+ shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
@@ -220,6 +216,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
+ node->tbmiterator = tbmiterator;
+ node->shared_tbmiterator = shared_tbmiterator;
node->initialized = true;
}
@@ -235,9 +233,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmres == NULL)
{
if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
+ node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
if (tbmres == NULL)
{
/* no more entries in the bitmap */
--
2.37.2
v5-0009-Remove-table_scan_bitmap_next_tuple-parameter-tbm.patch (text/x-diff; charset=us-ascii)
From 8a31b11113194b26526c3931d98e022f7f1d6603 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 12 Feb 2024 18:13:41 -0500
Subject: [PATCH v5 09/14] Remove table_scan_bitmap_next_tuple parameter tbmres
With the addition of the proposed streaming read API [1],
table_scan_bitmap_next_block() will no longer take a TBMIterateResult as
an input. Instead table AMs will be responsible for implementing a
callback for the streaming read API which specifies which blocks should
be prefetched and read.
Thus, it no longer makes sense to use the TBMIterateResult as a means of
communication between table_scan_bitmap_next_tuple() and
table_scan_bitmap_next_block().
Note that this parameter was unused by heap AM's implementation of
table_scan_bitmap_next_tuple().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam_handler.c | 1 -
src/backend/executor/nodeBitmapHeapscan.c | 2 +-
src/include/access/tableam.h | 12 +-----------
3 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 10c1c3b616b..a1ec50ab7a8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2248,7 +2248,6 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
static bool
heapam_scan_bitmap_next_tuple(TableScanDesc scan,
- TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index a639d6e7415..87991266931 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -301,7 +301,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* Attempt to fetch tuple from AM.
*/
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
+ if (!table_scan_bitmap_next_tuple(scan, slot))
{
/* nothing more to look at on this page */
node->tbmres = tbmres = NULL;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8280035e39f..8d7c800d157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -787,10 +787,7 @@ typedef struct TableAmRoutine
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
- * make sense to perform tuple visibility checks at this time). For some
- * AMs it will make more sense to do all the work referencing `tbmres`
- * contents here, for others it might be better to defer more work to
- * scan_bitmap_next_tuple.
+ * make sense to perform tuple visibility checks at this time).
*
* If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
* on the page have to be returned, otherwise the tuples at offsets in
@@ -821,15 +818,10 @@ typedef struct TableAmRoutine
* Fetch the next tuple of a bitmap table scan into `slot` and return true
* if a visible tuple was found, false otherwise.
*
- * For some AMs it will make more sense to do all the work referencing
- * `tbmres` contents in scan_bitmap_next_block, for others it might be
- * better to defer more work to this callback.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_tuple) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot);
/*
@@ -1989,7 +1981,6 @@ table_scan_bitmap_next_block(TableScanDesc scan,
*/
static inline bool
table_scan_bitmap_next_tuple(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
TupleTableSlot *slot)
{
/*
@@ -2001,7 +1992,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
- tbmres,
slot);
}
--
2.37.2
v5-0010-Make-table_scan_bitmap_next_block-async-friendly.patch (text/x-diff; charset=us-ascii)
From b89d1e2133e1959750c9081e27dfa21f4fa7e46b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 13 Feb 2024 10:57:07 -0500
Subject: [PATCH v5 10/14] Make table_scan_bitmap_next_block() async friendly
table_scan_bitmap_next_block() previously returned false if we did not
wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
This could happen when there were no visible tuples on the page or, due
to concurrent activity on the table, the block returned by the iterator
is past the end of the table recorded when the scan started.
This forced the caller to be responsible for determining if additional
blocks should be fetched and then for invoking
table_scan_bitmap_next_block() for these blocks.
It makes more sense for table_scan_bitmap_next_block() to be responsible
for finding a block that is not past the end of the table (as of the
time that the scan began) and for table_scan_bitmap_next_tuple() to
return false if there are no visible tuples on the page.
This also allows us to move responsibility for the iterator to table AM
specific code. This means handling invalid blocks is entirely up to
the table AM.
These changes will enable bitmapheapscan to use the future streaming
read API [1]. Table AMs will implement a streaming read API callback
returning the next block to fetch. In heap AM's case, the callback will
use the iterator to identify the next block to fetch. Since choosing the
next block will no longer be the responsibility of BitmapHeapNext(), the
streaming read control flow requires these changes to
table_scan_bitmap_next_block().
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
src/backend/access/heap/heapam_handler.c | 59 ++++++--
src/backend/executor/nodeBitmapHeapscan.c | 167 +++++++++-------------
src/include/access/relscan.h | 7 +
src/include/access/tableam.h | 68 ++++++---
src/include/nodes/execnodes.h | 9 +-
5 files changed, 168 insertions(+), 142 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a1ec50ab7a8..e038e60cd8f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2112,18 +2112,51 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres,
- bool *lossy)
+ bool *recheck, bool *lossy, BlockNumber *blockno)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber block = tbmres->blockno;
+ BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
+ TBMIterateResult *tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
+ *blockno = InvalidBlockNumber;
+ *recheck = true;
+
+ do
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (scan->shared_tbmiterator)
+ tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ else
+ tbmres = tbm_iterate(scan->tbmiterator);
+
+ if (tbmres == NULL)
+ {
+ /* no more entries in the bitmap */
+ Assert(hscan->rs_empty_tuples_pending == 0);
+ return false;
+ }
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+
+ /* Got a valid block */
+ *blockno = tbmres->blockno;
+ *recheck = tbmres->recheck;
+
/*
* We can skip fetching the heap page if we don't need any fields from the
* heap, and the bitmap entries don't need rechecking, and all tuples on
@@ -2142,16 +2175,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
return true;
}
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE isolation
- * though, as we need to examine all invisible tuples reachable by the
- * index.
- */
- if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
- return false;
+ block = tbmres->blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2243,7 +2267,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*lossy = tbmres->ntuples < 0;
- return ntup > 0;
+ /*
+ * Return true to indicate that a valid block was found and the bitmap is
+ * not exhausted. If there are no visible tuples on this page,
+ * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
+ * return false returning control to this function to advance to the next
+ * block in the bitmap.
+ */
+ return true;
}
static bool
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 87991266931..3be433ea6e1 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -73,8 +73,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
{
ExprContext *econtext;
TableScanDesc scan;
+ bool lossy;
TIDBitmap *tbm;
- TBMIterateResult *tbmres;
TupleTableSlot *slot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -86,7 +86,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
@@ -114,7 +113,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -167,7 +165,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
if (node->prefetch_maximum > 0)
@@ -216,56 +213,29 @@ BitmapHeapNext(BitmapHeapScanState *node)
extra_flags);
}
- node->tbmiterator = tbmiterator;
- node->shared_tbmiterator = shared_tbmiterator;
- node->initialized = true;
- }
-
- for (;;)
- {
- bool valid, lossy;
-
- CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
- {
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(node->tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(node->shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres->blockno);
+ scan->tbmiterator = tbmiterator;
+ scan->shared_tbmiterator = shared_tbmiterator;
- valid = table_scan_bitmap_next_block(scan, tbmres, &lossy);
+ node->initialized = true;
- if (lossy)
- node->lossy_pages++;
- else
- node->exact_pages++;
+ /* Get the first block. if none, end of scan */
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ return ExecClearTuple(slot);
- if (!valid)
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
+ if (lossy)
+ node->lossy_pages++;
+ else
+ node->exact_pages++;
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ BitmapAdjustPrefetchTarget(node);
+ }
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
+ for (;;)
+ {
+ while (table_scan_bitmap_next_tuple(scan, slot))
{
- /*
- * Continuing in previously obtained page.
- */
+ CHECK_FOR_INTERRUPTS();
#ifdef USE_PREFETCH
@@ -287,45 +257,48 @@ BitmapHeapNext(BitmapHeapScanState *node)
SpinLockRelease(&pstate->mutex);
}
#endif /* USE_PREFETCH */
- }
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- */
- BitmapPrefetch(node, scan);
-
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
+ /*
+ * We prefetch before fetching the current page. We expect that a
+ * future streaming read API will do this, so do it this way now
+ * for consistency. Also, this should happen only when we have
+ * determined there is still something to do on the current page,
+ * else we may uselessly prefetch the same page we are just about
+ * to request for real.
+ */
+ BitmapPrefetch(node, scan);
- /*
- * If we are using lossy info, we have to recheck the qual conditions
- * at every tuple.
- */
- if (tbmres->recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ /*
+ * If we are using lossy info, we have to recheck the qual
+ * conditions at every tuple.
+ */
+ if (node->recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- ExecClearTuple(slot);
- continue;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->bitmapqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ ExecClearTuple(slot);
+ continue;
+ }
}
+
+ /* OK to return this tuple */
+ return slot;
}
- /* OK to return this tuple */
- return slot;
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ break;
+
+ if (lossy)
+ node->lossy_pages++;
+ else
+ node->exact_pages++;
+
+ BitmapAdjustPrefetchIterator(node, node->blockno);
+ /* Adjust the prefetch target */
+ BitmapAdjustPrefetchTarget(node);
}
/*
@@ -599,12 +572,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
table_rescan(node->ss.ss_currentScanDesc, NULL);
/* release bitmaps and buffers if any */
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
@@ -612,13 +581,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->tbmiterator = NULL;
- node->tbmres = NULL;
node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_tbmiterator = NULL;
node->shared_prefetch_iterator = NULL;
node->pvmbuffer = InvalidBuffer;
+ node->recheck = true;
+ node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -649,28 +617,24 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
ExecEndNode(outerPlanState(node));
+
+ /*
+ * close heap scan
+ */
+ if (scanDesc)
+ table_endscan(scanDesc);
+
/*
* release bitmaps and buffers if any
*/
- if (node->tbmiterator)
- tbm_end_iterate(node->tbmiterator);
if (node->prefetch_iterator)
tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_tbmiterator)
- tbm_end_shared_iterate(node->shared_tbmiterator);
if (node->shared_prefetch_iterator)
tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->pvmbuffer != InvalidBuffer)
ReleaseBuffer(node->pvmbuffer);
-
- /*
- * close heap scan
- */
- if (scanDesc)
- table_endscan(scanDesc);
-
}
/* ----------------------------------------------------------------
@@ -703,8 +667,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->tbmiterator = NULL;
- scanstate->tbmres = NULL;
scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
@@ -713,10 +675,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_tbmiterator = NULL;
scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
+ scanstate->recheck = true;
+ scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..92b829cebc7 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -24,6 +24,9 @@
struct ParallelTableScanDescData;
+struct TBMIterator;
+struct TBMSharedIterator;
+
/*
* Generic descriptor for table scans. This is the base-class for table scans,
* which needs to be embedded in the scans of individual AMs.
@@ -40,6 +43,10 @@ typedef struct TableScanDescData
ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;
+ /* Only used for Bitmap table scans */
+ struct TBMIterator *tbmiterator;
+ struct TBMSharedIterator *shared_tbmiterator;
+
/*
* Information about type and behaviour of the scan, a bitmask of members
* of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8d7c800d157..2adead958cb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "nodes/tidbitmap.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -780,19 +781,14 @@ typedef struct TableAmRoutine
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
- * of a bitmap table scan. `scan` was started via table_beginscan_bm().
- * Return false if there are no tuples to be found on the page, true
- * otherwise.
+ * Prepare to fetch / check / return tuples from `blockno` as part of a
+ * bitmap table scan. `scan` was started via table_beginscan_bm(). Return
+ * false if the bitmap is exhausted and true otherwise.
*
* This will typically read and pin the target block, and do the necessary
* work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
* make sense to perform tuple visibility checks at this time).
*
- * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
- * on the page have to be returned, otherwise the tuples at offsets in
- * `tbmres->offsets` need to be returned.
- *
* lossy indicates whether or not the block's representation in the bitmap
* is lossy or exact.
*
@@ -811,8 +807,8 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres,
- bool *lossy);
+ bool *recheck, bool *lossy,
+ BlockNumber *blockno);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -950,9 +946,13 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
uint32 extra_flags)
{
+ TableScanDesc result;
uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE | extra_flags;
- return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result = rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+ result->shared_tbmiterator = NULL;
+ result->tbmiterator = NULL;
+ return result;
}
/*
@@ -1012,6 +1012,21 @@ table_beginscan_analyze(Relation rel)
static inline void
table_endscan(TableScanDesc scan)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_end(scan);
}
@@ -1022,6 +1037,21 @@ static inline void
table_rescan(TableScanDesc scan,
struct ScanKeyData *key)
{
+ if (scan->rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->shared_tbmiterator)
+ {
+ tbm_end_shared_iterate(scan->shared_tbmiterator);
+ scan->shared_tbmiterator = NULL;
+ }
+
+ if (scan->tbmiterator)
+ {
+ tbm_end_iterate(scan->tbmiterator);
+ scan->tbmiterator = NULL;
+ }
+ }
+
scan->rs_rd->rd_tableam->scan_rescan(scan, key, false, false, false, false);
}
@@ -1945,19 +1975,17 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
/*
- * Prepare to fetch / check / return tuples from `tbmres->blockno` as part of
- * a bitmap table scan. `scan` needs to have been started via
- * table_beginscan_bm(). Returns false if there are no tuples to be found on
- * the page, true otherwise. lossy is set to true if bitmap is lossy for the
- * selected block and false otherwise.
+ * Prepare to fetch / check / return tuples as part of a bitmap table scan.
+ * `scan` needs to have been started via table_beginscan_bm(). Returns false if
+ * there are no more blocks in the bitmap, true otherwise. lossy is set to true
+ * if bitmap is lossy for the selected block and false otherwise.
*
* Note, this is an optionally implemented function, therefore should only be
* used after verifying the presence (at plan time or such).
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres,
- bool *lossy)
+ bool *recheck, bool *lossy, BlockNumber *blockno)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1967,8 +1995,8 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
- tbmres, lossy);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck,
+ lossy, blockno);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6fb4ec07c5f..a59df51dd69 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1709,8 +1709,6 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
- * tbmres current-page data
* pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
@@ -1720,10 +1718,10 @@ typedef struct ParallelBitmapHeapState
* prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_tbmiterator shared iterator
* shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
+ * recheck do current page's tuples need recheck
* ----------------
*/
typedef struct BitmapHeapScanState
@@ -1731,8 +1729,6 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- TBMIterator *tbmiterator;
- TBMIterateResult *tbmres;
Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
@@ -1742,10 +1738,11 @@ typedef struct BitmapHeapScanState
int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_tbmiterator;
TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
+ bool recheck;
+ BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
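
To make the new contract easier to see outside of the diff context, here is a
tiny standalone mock (plain C, not PostgreSQL code; the mock_* names are
invented for illustration, and the lossy/exact counters and prefetch
bookkeeping are omitted) of the control flow BitmapHeapNext() follows after
this patch: the next-block call returns false only once the bitmap is
exhausted, and the per-block tuple loop sits inside the outer loop.

/*
 * Standalone sketch of the reworked BitmapHeapNext() loop shape.
 * Compile with any C99 compiler; prints two tuples for each of three blocks.
 */
#include <stdbool.h>
#include <stdio.h>

static int cur_block = -1;
static int cur_tuple;

/* Pretend bitmap: three blocks with two tuples each. */
static bool
mock_next_block(bool *recheck, bool *lossy, int *blockno)
{
    if (cur_block >= 2)
        return false;           /* bitmap exhausted: the only false case */
    cur_block++;
    cur_tuple = 0;
    *recheck = false;
    *lossy = false;
    *blockno = cur_block;
    return true;
}

static bool
mock_next_tuple(void)
{
    return cur_tuple++ < 2;     /* two visible tuples per block */
}

int
main(void)
{
    bool recheck, lossy;
    int  blockno;

    /* Get the first block; if there is none, the scan ends immediately. */
    if (!mock_next_block(&recheck, &lossy, &blockno))
        return 0;

    for (;;)
    {
        while (mock_next_tuple())
            printf("return tuple from block %d\n", blockno);

        if (!mock_next_block(&recheck, &lossy, &blockno))
            break;              /* bitmap exhausted: end of scan */
    }
    return 0;
}
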
v5-0011-Hard-code-TBMIterateResult-offsets-array-size.patch (text/x-diff; charset=us-ascii)
From c6518284a8c20aa4d9e3e2267ad2dfa0acb2aefa Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 20:13:43 -0500
Subject: [PATCH v5 11/14] Hard-code TBMIterateResult offsets array size
TIDBitmap's TBMIterateResult had a flexible-sized array of tuple offsets,
but the API always allocated MaxHeapTuplesPerPage OffsetNumbers.
Creating a fixed-size array of size MaxHeapTuplesPerPage is clearer for
the API user.
---
src/backend/nodes/tidbitmap.c | 29 +++++++----------------------
src/include/nodes/tidbitmap.h | 12 ++++++++++--
2 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index e8ab5d78fcc..d2bf8f44d50 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -40,7 +40,6 @@
#include <limits.h>
-#include "access/htup_details.h"
#include "common/hashfn.h"
#include "common/int.h"
#include "nodes/bitmapset.h"
@@ -48,14 +47,6 @@
#include "storage/lwlock.h"
#include "utils/dsa.h"
-/*
- * The maximum number of tuples per page is not large (typically 256 with
- * 8K pages, or 1024 with 32K pages). So there's not much point in making
- * the per-page bitmaps variable size. We just legislate that the size
- * is this:
- */
-#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
-
/*
* When we have to switch over to lossy storage, we use a data structure
* with one bit per page, where all pages having the same number DIV
@@ -67,7 +58,7 @@
* table, using identical data structures. (This is because the memory
* management for hashtables doesn't easily/efficiently allow space to be
* transferred easily from one hashtable to another.) Therefore it's best
- * if PAGES_PER_CHUNK is the same as MAX_TUPLES_PER_PAGE, or at least not
+ * if PAGES_PER_CHUNK is the same as MaxHeapTuplesPerPage, or at least not
* too different. But we also want PAGES_PER_CHUNK to be a power of 2 to
* avoid expensive integer remainder operations. So, define it like this:
*/
@@ -79,7 +70,7 @@
#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
/* number of active words for an exact page: */
-#define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
+#define WORDS_PER_PAGE ((MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1)
/* number of active words for a lossy chunk: */
#define WORDS_PER_CHUNK ((PAGES_PER_CHUNK - 1) / BITS_PER_BITMAPWORD + 1)
@@ -181,7 +172,7 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/*
@@ -222,7 +213,7 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+ TBMIterateResult output;
};
/* Local function prototypes */
@@ -390,7 +381,7 @@ tbm_add_tuples(TIDBitmap *tbm, const ItemPointer tids, int ntids,
bitnum;
/* safety check to ensure we don't overrun bit array bounds */
- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxHeapTuplesPerPage)
elog(ERROR, "tuple offset out of range: %u", off);
/*
@@ -692,12 +683,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
Assert(tbm->iterating != TBM_ITERATING_SHARED);
- /*
- * Create the TBMIterator struct, with enough trailing space to serve the
- * needs of the TBMIterateResult sub-struct.
- */
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -1467,8 +1453,7 @@ tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
* Create the TBMSharedIterator struct, with enough trailing space to
* serve the needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator));
istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..432fae52962 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -22,6 +22,7 @@
#ifndef TIDBITMAP_H
#define TIDBITMAP_H
+#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "utils/dsa.h"
@@ -41,9 +42,16 @@ typedef struct TBMIterateResult
{
BlockNumber blockno; /* page number containing tuples */
int ntuples; /* -1 indicates lossy result */
- bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
- OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+ bool recheck; /* should the tuples be rechecked? */
+
+ /*
+ * The maximum number of tuples per page is not large (typically 256 with
+ * 8K pages, or 1024 with 32K pages). So there's not much point in making
+ * the per-page bitmaps variable size. We just legislate that the size is
+ * this:
+ */
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResult;
/* function prototypes in nodes/tidbitmap.c */
--
2.37.2
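
As a concrete illustration of what the commit message above describes, here is
a minimal standalone sketch (plain C, not PostgreSQL code; MAX_TUPLES and the
struct names are invented stand-ins) contrasting the old flexible-array shape,
which could only live behind a pointer into over-allocated memory and had to
be the last member of the iterator struct, with the new fixed-size shape,
which is an ordinary struct that can be embedded or declared locally.

#include <stdlib.h>

#define MAX_TUPLES 256          /* invented stand-in for MaxHeapTuplesPerPage */

typedef unsigned short OffsetNumber;
typedef unsigned int BlockNumber;

/*
 * Old shape: a flexible array member, so a result could only live behind a
 * pointer into memory over-allocated by MAX_TUPLES offsets (in the real code
 * it had to be the last member of the TBMIterator/TBMSharedIterator).
 */
typedef struct OldResult
{
    BlockNumber  blockno;
    int          ntuples;
    OffsetNumber offsets[];
} OldResult;

/*
 * New shape: a fixed-size array, so the result is an ordinary struct that
 * can be embedded in another struct or declared as a local variable.
 */
typedef struct NewResult
{
    BlockNumber  blockno;
    int          ntuples;
    OffsetNumber offsets[MAX_TUPLES];
} NewResult;

int
main(void)
{
    /* Old API: the trailing offsets space had to be allocated explicitly. */
    OldResult *old = malloc(sizeof(OldResult) +
                            MAX_TUPLES * sizeof(OffsetNumber));

    /* New API: sizeof() already covers everything. */
    NewResult  res;

    if (old == NULL)
        return 1;
    old->ntuples = 0;
    res.ntuples = 0;
    free(old);
    return (int) res.ntuples;
}
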
v5-0012-Separate-TBM-Shared-Iterator-and-TBMIterateResult.patch (text/x-diff; charset=us-ascii)
From 602ad80e9045384c19387145bff41893945423ab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:23:41 -0500
Subject: [PATCH v5 12/14] Separate TBM[Shared]Iterator and TBMIterateResult
Remove the TBMIterateResult from the TBMIterator and TBMSharedIterator
and have tbm_[shared_]iterate() take a TBMIterateResult as a parameter.
This will allow multiple TBMIterateResults to exist concurrently
allowing asynchronous use of the TIDBitmap for prefetching, for example.
tbm_[shared]_iterate() now sets blockno to InvalidBlockNumber when the
bitmap is exhausted instead of returning NULL.
BitmapHeapScan callers of tbm_iterate make a TBMIterateResult locally
and pass it in.
Because GIN only needs a single TBMIterateResult, inline the matchResult
in the GinScanEntry to avoid having to separately manage memory for the
TBMIterateResult.
---
src/backend/access/gin/ginget.c | 48 +++++++++------
src/backend/access/gin/ginscan.c | 2 +-
src/backend/access/heap/heapam_handler.c | 32 +++++-----
src/backend/executor/nodeBitmapHeapscan.c | 33 +++++-----
src/backend/nodes/tidbitmap.c | 73 ++++++++++++-----------
src/include/access/gin_private.h | 2 +-
src/include/nodes/tidbitmap.h | 4 +-
7 files changed, 107 insertions(+), 87 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..3aa457a29e1 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -332,10 +332,22 @@ restartScanEntry:
entry->list = NULL;
entry->nlist = 0;
entry->matchBitmap = NULL;
- entry->matchResult = NULL;
entry->reduceResult = false;
entry->predictNumberResult = 0;
+ /*
+ * MTODO: is it enough to set blockno to InvalidBlockNumber? In all the
+ * places where we previously set matchResult to NULL, I just set blockno
+ * to InvalidBlockNumber. It seems like this should be okay because that
+ * is usually what we check before using the matchResult members. But it
+ * might be safer to zero out the offsets array. But that is expensive.
+ */
+ entry->matchResult.blockno = InvalidBlockNumber;
+ entry->matchResult.ntuples = 0;
+ entry->matchResult.recheck = true;
+ memset(entry->matchResult.offsets, 0,
+ sizeof(OffsetNumber) * MaxHeapTuplesPerPage);
+
/*
* we should find entry, and begin scan of posting tree or just store
* posting list in memory
@@ -374,6 +386,7 @@ restartScanEntry:
{
if (entry->matchIterator)
tbm_end_iterate(entry->matchIterator);
+ entry->matchResult.blockno = InvalidBlockNumber;
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -823,18 +836,19 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
{
/*
* If we've exhausted all items on this block, move to next block
- * in the bitmap.
+ * in the bitmap. tbm_iterate() sets matchResult->blockno to
+ * InvalidBlockNumber when the bitmap is exhausted.
*/
- while (entry->matchResult == NULL ||
- (entry->matchResult->ntuples >= 0 &&
- entry->offset >= entry->matchResult->ntuples) ||
- entry->matchResult->blockno < advancePastBlk ||
+ while ((!BlockNumberIsValid(entry->matchResult.blockno)) ||
+ (entry->matchResult.ntuples >= 0 &&
+ entry->offset >= entry->matchResult.ntuples) ||
+ entry->matchResult.blockno < advancePastBlk ||
(ItemPointerIsLossyPage(&advancePast) &&
- entry->matchResult->blockno == advancePastBlk))
+ entry->matchResult.blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
+ tbm_iterate(entry->matchIterator, &entry->matchResult);
- if (entry->matchResult == NULL)
+ if (!BlockNumberIsValid(entry->matchResult.blockno))
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
@@ -858,10 +872,10 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* We're now on the first page after advancePast which has any
* items on it. If it's a lossy result, return that.
*/
- if (entry->matchResult->ntuples < 0)
+ if (entry->matchResult.ntuples < 0)
{
ItemPointerSetLossyPage(&entry->curItem,
- entry->matchResult->blockno);
+ entry->matchResult.blockno);
/*
* We might as well fall out of the loop; we could not
@@ -875,27 +889,27 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
* Not a lossy page. Skip over any offsets <= advancePast, and
* return that.
*/
- if (entry->matchResult->blockno == advancePastBlk)
+ if (entry->matchResult.blockno == advancePastBlk)
{
/*
* First, do a quick check against the last offset on the
* page. If that's > advancePast, so are all the other
* offsets, so just go back to the top to get the next page.
*/
- if (entry->matchResult->offsets[entry->matchResult->ntuples - 1] <= advancePastOff)
+ if (entry->matchResult.offsets[entry->matchResult.ntuples - 1] <= advancePastOff)
{
- entry->offset = entry->matchResult->ntuples;
+ entry->offset = entry->matchResult.ntuples;
continue;
}
/* Otherwise scan to find the first item > advancePast */
- while (entry->matchResult->offsets[entry->offset] <= advancePastOff)
+ while (entry->matchResult.offsets[entry->offset] <= advancePastOff)
entry->offset++;
}
ItemPointerSet(&entry->curItem,
- entry->matchResult->blockno,
- entry->matchResult->offsets[entry->offset]);
+ entry->matchResult.blockno,
+ entry->matchResult.offsets[entry->offset]);
entry->offset++;
/* Done unless we need to reduce the result */
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..033d5253394 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -106,7 +106,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
ItemPointerSetMin(&scanEntry->curItem);
scanEntry->matchBitmap = NULL;
scanEntry->matchIterator = NULL;
- scanEntry->matchResult = NULL;
+ scanEntry->matchResult.blockno = InvalidBlockNumber;
scanEntry->list = NULL;
scanEntry->nlist = 0;
scanEntry->offset = InvalidOffsetNumber;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e038e60cd8f..022753e203a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2119,7 +2119,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult *tbmres;
+ TBMIterateResult tbmres;
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
@@ -2132,11 +2132,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
CHECK_FOR_INTERRUPTS();
if (scan->shared_tbmiterator)
- tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
+ tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
else
- tbmres = tbm_iterate(scan->tbmiterator);
+ tbm_iterate(scan->tbmiterator, &tbmres);
- if (tbmres == NULL)
+ if (!BlockNumberIsValid(tbmres.blockno))
{
/* no more entries in the bitmap */
Assert(hscan->rs_empty_tuples_pending == 0);
@@ -2151,11 +2151,11 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* isolation though, as we need to examine all invisible tuples
* reachable by the index.
*/
- } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
+ } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
/* Got a valid block */
- *blockno = tbmres->blockno;
- *recheck = tbmres->recheck;
+ *blockno = tbmres.blockno;
+ *recheck = tbmres.recheck;
/*
* We can skip fetching the heap page if we don't need any fields from the
@@ -2163,19 +2163,19 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
* the page are visible to our transaction.
*/
if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
+ !tbmres.recheck &&
+ VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
{
/* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
+ Assert(tbmres.ntuples >= 0);
Assert(hscan->rs_empty_tuples_pending >= 0);
- hscan->rs_empty_tuples_pending += tbmres->ntuples;
+ hscan->rs_empty_tuples_pending += tbmres.ntuples;
return true;
}
- block = tbmres->blockno;
+ block = tbmres.blockno;
/*
* Acquire pin on the target heap page, trading in any pin we held before.
@@ -2204,7 +2204,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres->ntuples >= 0)
+ if (tbmres.ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2213,9 +2213,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres.ntuples; curslot++)
{
- OffsetNumber offnum = tbmres->offsets[curslot];
+ OffsetNumber offnum = tbmres.offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2265,7 +2265,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
- *lossy = tbmres->ntuples < 0;
+ *lossy = tbmres.ntuples < 0;
/*
* Return true to indicate that a valid block was found and the bitmap is
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3be433ea6e1..74b92d4cbf4 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -344,9 +344,10 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
else if (prefetch_iterator)
{
/* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
+ tbm_iterate(prefetch_iterator, &tbmpre);
- if (tbmpre == NULL || tbmpre->blockno != blockno)
+ if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
return;
@@ -364,6 +365,8 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
}
else
{
+ TBMIterateResult tbmpre;
+
/* Release the mutex before iterating */
SpinLockRelease(&pstate->mutex);
@@ -376,7 +379,7 @@ BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
* case.
*/
if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
}
}
#endif /* USE_PREFETCH */
@@ -443,10 +446,12 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (node->prefetch_pages < node->prefetch_target)
{
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ TBMIterateResult tbmpre;
bool skip_fetch;
- if (tbmpre == NULL)
+ tbm_iterate(prefetch_iterator, &tbmpre);
+
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_iterate(prefetch_iterator);
@@ -462,13 +467,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
* prefetch_pages?)
*/
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
@@ -483,7 +488,7 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
{
while (1)
{
- TBMIterateResult *tbmpre;
+ TBMIterateResult tbmpre;
bool do_prefetch = false;
bool skip_fetch;
@@ -502,8 +507,8 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (!do_prefetch)
return;
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
+ tbm_shared_iterate(prefetch_iterator, &tbmpre);
+ if (!BlockNumberIsValid(tbmpre.blockno))
{
/* No more pages to prefetch */
tbm_end_shared_iterate(prefetch_iterator);
@@ -513,13 +518,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
/* As above, skip prefetch if we expect not to need page */
skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre->recheck &&
+ !tbmpre.recheck &&
VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
+ tbmpre.blockno,
&node->pvmbuffer));
if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
}
}
}
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index d2bf8f44d50..7d038c2018d 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -172,7 +172,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output;
};
/*
@@ -213,7 +212,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output;
};
/* Local function prototypes */
@@ -944,20 +942,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
/*
* tbm_iterate - scan through next page of a TIDBitmap
*
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan. Pages are guaranteed to be delivered in numerical
- * order. If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition. If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway. (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order. tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition. If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway. (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -985,6 +984,7 @@ tbm_iterate(TBMIterator *iterator)
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
+ Assert(tbmres);
if (iterator->schunkptr < tbm->nchunks)
{
PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -995,11 +995,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1015,16 +1015,17 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
@@ -1034,10 +1035,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1088,13 +1088,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1104,21 +1104,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
+ return;
}
/*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3013a44bae1..3b432263bb0 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -353,7 +353,7 @@ typedef struct GinScanEntryData
/* for a partial-match or full-scan query, we accumulate all TIDs here */
TIDBitmap *matchBitmap;
TBMIterator *matchIterator;
- TBMIterateResult *matchResult;
+ TBMIterateResult matchResult;
/* used for Posting list and one page in Posting tree */
ItemPointerData *list;
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 432fae52962..f000c1af28f 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -72,8 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
--
2.37.2
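
For readers skimming the thread, here is a small standalone sketch (plain C,
not PostgreSQL code; mock_iterate and MockIterateResult are invented, and the
Block* macros merely mimic the real ones) of the new calling convention: the
caller supplies the TBMIterateResult and loops until blockno comes back as
InvalidBlockNumber, instead of looping until the iterator returns NULL.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define BlockNumberIsValid(b) ((b) != InvalidBlockNumber)

typedef struct MockIterateResult
{
    BlockNumber blockno;
    int         ntuples;
} MockIterateResult;

/* Pretend bitmap with three pages; fills the caller-supplied result. */
static void
mock_iterate(int *state, MockIterateResult *res)
{
    if (*state < 3)
    {
        res->blockno = (BlockNumber) (*state)++;
        res->ntuples = 10;
    }
    else
        res->blockno = InvalidBlockNumber;      /* bitmap exhausted */
}

int
main(void)
{
    int                state = 0;
    MockIterateResult  res;     /* caller-owned, e.g. on the stack */

    for (;;)
    {
        mock_iterate(&state, &res);
        if (!BlockNumberIsValid(res.blockno))
            break;
        printf("page %u, %d tuples\n", res.blockno, res.ntuples);
    }
    return 0;
}
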
v5-0013-Streaming-Read-API.patch (text/x-diff; charset=us-ascii)
From 1b50526e266f2413e04572f8ea5007805e2f20c2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v5 13/14] Streaming Read API
---
contrib/pg_prewarm/pg_prewarm.c | 40 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/aio/Makefile | 14 +
src/backend/storage/aio/meson.build | 5 +
src/backend/storage/aio/streaming_read.c | 612 ++++++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 641 ++++++++++++++++-------
src/backend/storage/buffer/localbuf.c | 14 +-
src/backend/storage/meson.build | 1 +
src/include/storage/bufmgr.h | 45 ++
src/include/storage/streaming_read.h | 52 ++
src/tools/pgindent/typedefs.list | 3 +
11 files changed, 1218 insertions(+), 211 deletions(-)
create mode 100644 src/backend/storage/aio/Makefile
create mode 100644 src/backend/storage/aio/meson.build
create mode 100644 src/backend/storage/aio/streaming_read.c
create mode 100644 src/include/storage/streaming_read.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..1cc84bcb0c2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
+#include "storage/streaming_read.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
static PGIOAlignedBlock blockbuffer;
+struct pg_prewarm_streaming_read_private
+{
+ BlockNumber blocknum;
+ int64 last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_data)
+{
+ struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+ if (p->blocknum <= p->last_block)
+ return p->blocknum++;
+
+ return InvalidBlockNumber;
+}
+
/*
* pg_prewarm(regclass, mode text, fork text,
* first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_streaming_read_private p;
+ PgStreamingRead *pgsr;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+ &p,
+ 0,
+ NULL,
+ BMR_REL(rel),
+ forkNumber,
+ pg_prewarm_streaming_read_next);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+ pg_streaming_read_free(pgsr);
}
/* Close relation, release lock. */
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..71f2c4a70b6
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,612 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+ bool need_wait;
+ bool advice_issued;
+ BlockNumber blocknum;
+ int nblocks;
+ int per_buffer_data_index;
+ Buffer buffers[MAX_BUFFERS_PER_TRANSFER];
+ ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+ int max_ios;
+ int ios_in_progress;
+ int max_pinned_buffers;
+ int pinned_buffers;
+ int pinned_buffers_trigger;
+ int next_tail_buffer;
+ int ramp_up_pin_limit;
+ int ramp_up_pin_stall;
+ bool finished;
+ bool advice_enabled;
+ void *pgsr_private;
+ PgStreamingReadBufferCB callback;
+
+ BufferAccessStrategy strategy;
+ BufferManagerRelation bmr;
+ ForkNumber forknum;
+
+ /* Sometimes we need to buffer one block for flow control. */
+ BlockNumber unget_blocknum;
+ void *unget_per_buffer_data;
+
+ /* Next expected block, for detecting sequential access. */
+ BlockNumber seq_blocknum;
+
+ /* Space for optional per-buffer private data. */
+ size_t per_buffer_data_size;
+ void *per_buffer_data;
+
+ /* Circular buffer of ranges. */
+ int size;
+ int head;
+ int tail;
+ PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy)
+{
+ PgStreamingRead *pgsr;
+ int size;
+ int max_ios;
+ uint32 max_pinned_buffers;
+
+
+ /*
+ * Decide how many assumed I/Os we will allow to run concurrently. That
+ * is, advice to the kernel to tell it that we will soon read. This
+ * number also affects how far we look ahead for opportunities to start
+ * more I/Os.
+ */
+ if (flags & PGSR_FLAG_MAINTENANCE)
+ max_ios = maintenance_io_concurrency;
+ else
+ max_ios = effective_io_concurrency;
+
+ /*
+ * The desired level of I/O concurrency controls how far ahead we are
+ * willing to look ahead. We also clamp it to at least
+ * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+ * sized read, even when max_ios is zero.
+ */
+ max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+ /*
+ * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+ * least one, to keep our gating logic simple.
+ */
+ max_ios = Max(max_ios, 1);
+
+ /*
+ * Don't allow this backend to pin too many buffers. For now we'll apply
+ * the limit for the shared buffer pool and the local buffer pool, without
+ * worrying which it is.
+ */
+ LimitAdditionalPins(&max_pinned_buffers);
+ LimitAdditionalLocalPins(&max_pinned_buffers);
+ Assert(max_pinned_buffers > 0);
+
+ /*
+ * pgsr->ranges is a circular buffer. When it is empty, head == tail.
+ * When it is full, there is an empty element between head and tail. Head
+ * can also be empty (nblocks == 0), therefore we need two extra elements
+ * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+ * maximum possible number of occupied ranges of the smallest possible
+ * size of one.
+ */
+ size = max_pinned_buffers + 2;
+
+ pgsr = (PgStreamingRead *)
+ palloc0(offsetof(PgStreamingRead, ranges) +
+ sizeof(pgsr->ranges[0]) * size);
+
+ pgsr->max_ios = max_ios;
+ pgsr->per_buffer_data_size = per_buffer_data_size;
+ pgsr->max_pinned_buffers = max_pinned_buffers;
+ pgsr->pgsr_private = pgsr_private;
+ pgsr->strategy = strategy;
+ pgsr->size = size;
+
+ pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+ /*
+ * This system supports prefetching advice. As long as direct I/O isn't
+ * enabled, and the caller hasn't promised sequential access, we can use
+ * it.
+ */
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ (flags & PGSR_FLAG_SEQUENTIAL) == 0)
+ pgsr->advice_enabled = true;
+#endif
+
+ /*
+ * We start off building small ranges, but double that quickly, for the
+ * benefit of users that don't know how far ahead they'll read. This can
+ * be disabled by users that already know they'll read all the way.
+ */
+ if (flags & PGSR_FLAG_FULL)
+ pgsr->ramp_up_pin_limit = INT_MAX;
+ else
+ pgsr->ramp_up_pin_limit = 1;
+
+ /*
+ * We want to avoid creating ranges that are smaller than they could be
+ * just because we hit max_pinned_buffers. We only look ahead when the
+ * number of pinned buffers falls below this trigger number, or put
+ * another way, we stop looking ahead when we wouldn't be able to build a
+ * "full sized" range.
+ */
+ pgsr->pinned_buffers_trigger =
+ Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+ /* Space for the callback to store extra data along with each block. */
+ if (per_buffer_data_size)
+ pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+ return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_data_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb)
+{
+ PgStreamingRead *result;
+
+ result = pg_streaming_read_buffer_alloc_internal(flags,
+ pgsr_private,
+ per_buffer_data_size,
+ strategy);
+ result->callback = next_block_cb;
+ result->bmr = bmr;
+ result->forknum = forknum;
+
+ return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+ int result;
+
+ /*
+ * Find slot in the circular buffer of per-buffer data, without using the
+ * expensive % operator.
+ */
+ result = range->per_buffer_data_index + n;
+ if (result >= pgsr->max_pinned_buffers)
+ result -= pgsr->max_pinned_buffers;
+ Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+ return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+ return (char *) pgsr->per_buffer_data +
+ pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+ return get_per_buffer_data_by_index(pgsr,
+ get_per_buffer_data_index(pgsr,
+ range,
+ n));
+}
+
+/*
+ * Start reading the head range, and create a new head range. The new head
+ * range is returned. It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *head_range;
+ PgStreamingReadRange *new_head_range;
+ int nblocks_pinned;
+ int flags;
+
+ /* Caller should make sure we never exceed max_ios. */
+ Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+ /* Should only call if the head range has some blocks to read. */
+ head_range = &pgsr->ranges[pgsr->head];
+ Assert(head_range->nblocks > 0);
+
+ /*
+ * If advice hasn't been suppressed, this system supports it, and this
+ * isn't a strictly sequential pattern, then we'll issue advice.
+ */
+ if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+ flags = READ_BUFFERS_ISSUE_ADVICE;
+ else
+ flags = 0;
+
+
+ /* Start reading as many blocks as we can from the head range. */
+ nblocks_pinned = head_range->nblocks;
+ head_range->need_wait =
+ StartReadBuffers(pgsr->bmr,
+ head_range->buffers,
+ pgsr->forknum,
+ head_range->blocknum,
+ &nblocks_pinned,
+ pgsr->strategy,
+ flags,
+ &head_range->operation);
+
+ /* Did that start an I/O? */
+ if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+ {
+ head_range->advice_issued = true;
+ pgsr->ios_in_progress++;
+ Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+ }
+
+ /*
+ * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+ * but always at least one.
+ */
+ Assert(nblocks_pinned <= head_range->nblocks);
+ Assert(nblocks_pinned >= 1);
+ pgsr->pinned_buffers += nblocks_pinned;
+
+ /*
+ * Remember where the next block would be after that, so we can detect
+ * sequential access next time.
+ */
+ pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+ /*
+ * Create a new head range. There must be space, because we have enough
+ * elements for every range to hold just one block, up to the pin limit.
+ */
+ Assert(pgsr->size > pgsr->max_pinned_buffers);
+ Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+ if (++pgsr->head == pgsr->size)
+ pgsr->head = 0;
+ new_head_range = &pgsr->ranges[pgsr->head];
+ new_head_range->nblocks = 0;
+ new_head_range->advice_issued = false;
+
+ /*
+ * If we didn't manage to start the whole read above, we split the range,
+ * moving the remainder into the new head range.
+ */
+ if (nblocks_pinned < head_range->nblocks)
+ {
+ int nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+ head_range->nblocks = nblocks_pinned;
+
+ new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+ new_head_range->nblocks = nblocks_remaining;
+ }
+
+ /* The new range has per-buffer data starting after the previous range. */
+ new_head_range->per_buffer_data_index =
+ get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+ return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+ BlockNumber result;
+
+ if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+ {
+ /*
+ * If we had to unget a block, now it is time to return that one
+ * again.
+ */
+ result = pgsr->unget_blocknum;
+ pgsr->unget_blocknum = InvalidBlockNumber;
+
+ /*
+ * The same per_buffer_data element must have been used, and still
+ * contains whatever data the callback wrote into it. So we just
+ * sanity-check that we were called with the value that
+ * pg_streaming_unget_block() pushed back.
+ */
+ Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+ }
+ else
+ {
+ /* Use the installed callback directly. */
+ result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+ }
+
+ return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later. This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+ Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+ pgsr->unget_blocknum = blocknum;
+ pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+ PgStreamingReadRange *range;
+
+ /*
+ * If we're still ramping up, we may have to stall to wait for buffers to
+ * be consumed first before we do any more prefetching.
+ */
+ if (pgsr->ramp_up_pin_stall > 0)
+ {
+ Assert(pgsr->pinned_buffers > 0);
+ return;
+ }
+
+ /*
+ * If we're finished or can't start more I/O, then don't look ahead.
+ */
+ if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * We'll also wait until the number of pinned buffers falls below our
+ * trigger level, so that we have the chance to create a full range.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ do
+ {
+ BlockNumber blocknum;
+ void *per_buffer_data;
+
+ /* Do we have a full-sized range? */
+ range = &pgsr->ranges[pgsr->head];
+ if (range->nblocks == lengthof(range->buffers))
+ {
+ /* Start as much of it as we can. */
+ range = pg_streaming_read_start_head_range(pgsr);
+
+ /* If we're now at the I/O limit, stop here. */
+ if (pgsr->ios_in_progress == pgsr->max_ios)
+ return;
+
+ /*
+ * If we couldn't form a full range, then stop here to avoid
+ * creating small I/O.
+ */
+ if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+ return;
+
+ /*
+ * That might have only been partially started, but it always
+ * processes at least one block, so that'll do for now.
+ */
+ Assert(range->nblocks < lengthof(range->buffers));
+ }
+
+ /* Find per-buffer data slot for the next block. */
+ per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+ /* Find out which block the callback wants to read next. */
+ blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+ if (blocknum == InvalidBlockNumber)
+ {
+ /* End of stream. */
+ pgsr->finished = true;
+ break;
+ }
+
+ /*
+ * Is there a head range that we cannot extend, because the requested
+ * block is not consecutive?
+ */
+ if (range->nblocks > 0 &&
+ range->blocknum + range->nblocks != blocknum)
+ {
+ /* Yes. Start it, so we can begin building a new one. */
+ range = pg_streaming_read_start_head_range(pgsr);
+
+ /*
+ * It's possible that it was only partially started, and we have a
+ * new range with the remainder. Keep starting I/Os until we get
+ * it all out of the way, or we hit the I/O limit.
+ */
+ while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+ range = pg_streaming_read_start_head_range(pgsr);
+
+ /*
+ * We have to 'unget' the block returned by the callback if we
+ * don't have enough I/O capacity left to start something.
+ */
+ if (pgsr->ios_in_progress == pgsr->max_ios)
+ {
+ pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+ return;
+ }
+ }
+
+ /* If we have a new, empty range, initialize the start block. */
+ if (range->nblocks == 0)
+ {
+ range->blocknum = blocknum;
+ }
+
+ /* This block extends the range by one. */
+ Assert(range->blocknum + range->nblocks == blocknum);
+ range->nblocks++;
+
+ } while (pgsr->pinned_buffers + range->nblocks < pgsr->max_pinned_buffers &&
+ pgsr->pinned_buffers + range->nblocks < pgsr->ramp_up_pin_limit);
+
+ /* If we've hit the ramp-up limit, insert a stall. */
+ if (pgsr->pinned_buffers + range->nblocks >= pgsr->ramp_up_pin_limit)
+ {
+ /* Can't get here if an earlier stall hasn't finished. */
+ Assert(pgsr->ramp_up_pin_stall == 0);
+ /* Don't do any more prefetching until these buffers are consumed. */
+ pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+ /* Double it. It will soon be out of the way. */
+ pgsr->ramp_up_pin_limit *= 2;
+ }
+
+ /* Start as much as we can. */
+ while (range->nblocks > 0)
+ {
+ range = pg_streaming_read_start_head_range(pgsr);
+ if (pgsr->ios_in_progress == pgsr->max_ios)
+ break;
+ }
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+ pg_streaming_read_look_ahead(pgsr);
+
+ /* See if we have one buffer to return. */
+ while (pgsr->tail != pgsr->head)
+ {
+ PgStreamingReadRange *tail_range;
+
+ tail_range = &pgsr->ranges[pgsr->tail];
+
+ /*
+ * Do we need to perform an I/O before returning the buffers from this
+ * range?
+ */
+ if (tail_range->need_wait)
+ {
+ WaitReadBuffers(&tail_range->operation);
+ tail_range->need_wait = false;
+
+ /*
+ * We don't really know if the kernel generated a physical I/O
+ * when we issued advice, let alone when it finished, but it has
+ * certainly finished now because we've performed the read.
+ */
+ if (tail_range->advice_issued)
+ {
+ Assert(pgsr->ios_in_progress > 0);
+ pgsr->ios_in_progress--;
+ }
+ }
+
+ /* Are there more buffers available in this range? */
+ if (pgsr->next_tail_buffer < tail_range->nblocks)
+ {
+ int buffer_index;
+ Buffer buffer;
+
+ buffer_index = pgsr->next_tail_buffer++;
+ buffer = tail_range->buffers[buffer_index];
+
+ Assert(BufferIsValid(buffer));
+
+ /* We are giving away ownership of this pinned buffer. */
+ Assert(pgsr->pinned_buffers > 0);
+ pgsr->pinned_buffers--;
+
+ if (pgsr->ramp_up_pin_stall > 0)
+ pgsr->ramp_up_pin_stall--;
+
+ if (per_buffer_data)
+ *per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+ return buffer;
+ }
+
+ /* Advance tail to next range, if there is one. */
+ if (++pgsr->tail == pgsr->size)
+ pgsr->tail = 0;
+ pgsr->next_tail_buffer = 0;
+
+ /*
+ * If tail crashed into head, and head is not empty, then it is time
+ * to start that range.
+ */
+ if (pgsr->tail == pgsr->head &&
+ pgsr->ranges[pgsr->head].nblocks > 0)
+ pg_streaming_read_start_head_range(pgsr);
+ }
+
+ Assert(pgsr->pinned_buffers == 0);
+
+ return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+ Buffer buffer;
+
+ /* Stop looking ahead. */
+ pgsr->finished = true;
+
+ /* Unpin anything that wasn't consumed. */
+ while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+ ReleaseBuffer(buffer);
+
+ Assert(pgsr->pinned_buffers == 0);
+ Assert(pgsr->ios_in_progress == 0);
+
+ /* Release memory. */
+ if (pgsr->per_buffer_data)
+ pfree(pgsr->per_buffer_data);
+
+ pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..3b1b0ad99df 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
* and pin it so that no one can destroy it while this process
* is using it.
*
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ * two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
* ReleaseBuffer() -- unpin a buffer
*
* MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -472,10 +477,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
)
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
ForkNumber forkNum, BlockNumber blockNum,
- ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
+ ReadBufferMode mode, BufferAccessStrategy strategy);
static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
ForkNumber fork,
BufferAccessStrategy strategy,
@@ -501,7 +505,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static int SyncOneBuffer(int buf_id, bool skip_recently_used,
WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner);
static void AbortBufferIO(Buffer buffer);
@@ -782,7 +786,6 @@ Buffer
ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy)
{
- bool hit;
Buffer buf;
/*
@@ -795,15 +798,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
- /*
- * Read the buffer, and update pgstat counters to reflect a cache hit or
- * miss.
- */
- pgstat_count_buffer_read(reln);
- buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
- if (hit)
- pgstat_count_buffer_hit(reln);
+ buf = ReadBuffer_common(BMR_REL(reln),
+ forkNum, blockNum, mode, strategy);
+
return buf;
}
@@ -823,13 +820,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool permanent)
{
- bool hit;
-
SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
- return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
- RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
- mode, strategy, &hit);
+ return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+ RELPERSISTENCE_UNLOGGED),
+ forkNum, blockNum,
+ mode, strategy);
}
/*
@@ -995,35 +991,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
*/
if (buffer == InvalidBuffer)
{
- bool hit;
-
Assert(extended_by == 0);
- buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
- fork, extend_to - 1, mode, strategy,
- &hit);
+ buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
}
return buffer;
}
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK. The buffer must be already
+ * pinned. It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+ if (BufferIsLocal(buffer))
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ else
+ {
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ else
+ LockBufferForCleanup(buffer);
+ }
+
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ if (BufferIsLocal(buffer))
+ {
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ buf_state = LockBufHdr(bufHdr);
+ buf_state |= BM_VALID;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy)
{
- BufferDesc *bufHdr;
- Block bufBlock;
- bool found;
- IOContext io_context;
- IOObject io_object;
- bool isLocalBuf = SmgrIsTemp(smgr);
-
- *hit = false;
+ ReadBuffersOperation operation;
+ Buffer buffer;
+ int nblocks;
+ int flags;
/*
* Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,181 +1071,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
- return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
- forkNum, strategy, flags);
+ return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
- TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend);
+ nblocks = 1;
+ if (mode == RBM_ZERO_ON_ERROR)
+ flags = READ_BUFFERS_ZERO_ON_ERROR;
+ else
+ flags = 0;
+ if (StartReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ &nblocks,
+ strategy,
+ flags,
+ &operation))
+ WaitReadBuffers(&operation);
+ Assert(nblocks == 1); /* single block can't be short */
+
+ if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+ ZeroBuffer(buffer, mode);
+
+ return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ BufferAccessStrategy strategy,
+ bool *foundPtr)
+{
+ BufferDesc *bufHdr;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ Assert(blockNum != P_NEW);
+ Assert(bmr.smgr);
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /*
- * We do not use a BufferAccessStrategy for I/O of temporary tables.
- * However, in some cases, the "strategy" may not be NULL, so we can't
- * rely on IOContextForStrategy() to set the right IOContext for us.
- * This may happen in cases like CREATE TEMPORARY TABLE AS...
- */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
- if (found)
- pgBufferUsage.local_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.local_blks_read++;
}
else
{
- /*
- * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
- * not currently in memory.
- */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
- bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found, io_context);
- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;
}
- /* At this point we do NOT hold any locks. */
+ TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend);
- /* if it was already in the buffer pool, we're done */
- if (found)
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ if (isLocalBuf)
+ {
+ bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+ if (*foundPtr)
+ pgBufferUsage.local_blks_hit++;
+ }
+ else
+ {
+ bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+ strategy, foundPtr, io_context);
+ if (*foundPtr)
+ pgBufferUsage.shared_blks_hit++;
+ }
+ if (bmr.rel)
+ {
+ /*
+ * While pgBufferUsage's "read" counter isn't bumped unless we reach
+ * WaitReadBuffers() (so, not for hits, and not for buffers that are
+ * zeroed instead), the per-relation stats always count them.
+ */
+ pgstat_count_buffer_read(bmr.rel);
+ if (*foundPtr)
+ pgstat_count_buffer_hit(bmr.rel);
+ }
+ if (*foundPtr)
{
- /* Just need to update stats before we exit */
- *hit = true;
VacuumPageHit++;
pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageHit;
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ }
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
- * on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
+}
- return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks. On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary. If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed. Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers(). In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forkNum,
+ BlockNumber blockNum,
+ int *nblocks,
+ BufferAccessStrategy strategy,
+ int flags,
+ ReadBuffersOperation *operation)
+{
+ int actual_nblocks = *nblocks;
+
+ if (bmr.rel)
+ {
+ bmr.smgr = RelationGetSmgr(bmr.rel);
+ bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
}
- /*
- * if we have gotten to this point, we have allocated a buffer for the
- * page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
- * if it's a shared buffer.
- */
- Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
+ operation->bmr = bmr;
+ operation->forknum = forkNum;
+ operation->blocknum = blockNum;
+ operation->buffers = buffers;
+ operation->nblocks = actual_nblocks;
+ operation->strategy = strategy;
+ operation->flags = flags;
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ operation->io_buffers_len = 0;
- /*
- * Read in the page, unless the caller intends to overwrite it and just
- * wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ for (int i = 0; i < actual_nblocks; ++i)
{
- instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+ bool found;
- smgrread(smgr, forkNum, blockNum, bufBlock);
+ buffers[i] = PrepareReadBuffer(bmr,
+ forkNum,
+ blockNum + i,
+ strategy,
+ &found);
- pgstat_count_io_op_time(io_object, io_context,
- IOOP_READ, io_start, 1);
+ if (found)
+ {
+ /*
+ * Terminate the read as soon as we get a hit. It could be a
+ * single buffer hit, or it could be a hit that follows a readable
+ * range. We don't want to create more than one readable range,
+ * so we stop here.
+ */
+ actual_nblocks = operation->nblocks = *nblocks = i + 1;
+ }
+ else
+ {
+ /* Extend the readable range to cover this block. */
+ operation->io_buffers_len++;
+ }
+ }
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
+ if (operation->io_buffers_len > 0)
+ {
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
{
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ /*
+ * In theory we should only do this if PrepareReadBuffer() had to
+ * allocate new buffers above. That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity. Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
}
+
+ /* Indicate that WaitReadBuffers() should be called. */
+ return true;
}
+ else
+ {
+ return false;
+ }
+}
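
As a reading aid for the StartReadBuffers()/WaitReadBuffers() contract documented above, here is a hypothetical caller, not part of the patch; read_range() and its parameters are invented for illustration. It reads a contiguous range and copes with short reads (a returned *nblocks smaller than requested) by looping:

static void
read_range(Relation rel, BlockNumber start, int n)
{
	BlockNumber blkno = start;

	while (n > 0)
	{
		Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
		ReadBuffersOperation operation;
		int			nblocks = Min(n, MAX_BUFFERS_PER_TRANSFER);

		/* On return, nblocks may be smaller than what we asked for. */
		if (StartReadBuffers(BMR_REL(rel), buffers, MAIN_FORKNUM, blkno,
							 &nblocks, NULL, 0, &operation))
			WaitReadBuffers(&operation);

		for (int i = 0; i < nblocks; i++)
			ReleaseBuffer(buffers[i]);	/* the caller owns the pins */

		blkno += nblocks;
		n -= nblocks;
	}
}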
- /*
- * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
- * content lock before marking the page as valid, to make sure that no
- * other backend sees the zeroed page before the caller has had a chance
- * to initialize it.
- *
- * Since no-one else can be looking at the page contents yet, there is no
- * difference between an exclusive lock and a cleanup-strength lock. (Note
- * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
- * they assert that the buffer is already valid.)
- */
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+ if (BufferIsLocal(buffer))
{
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+ BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+ return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
+ else
+ return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+ BufferManagerRelation bmr;
+ Buffer *buffers;
+ int nblocks;
+ BlockNumber blocknum;
+ ForkNumber forknum;
+ bool isLocalBuf;
+ IOContext io_context;
+ IOObject io_object;
+
+ /*
+ * Currently operations are only allowed to include a read of some range,
+ * with an optional extra buffer that is already pinned at the end. So
+ * nblocks can be at most one more than io_buffers_len.
+ */
+ Assert((operation->nblocks == operation->io_buffers_len) ||
+ (operation->nblocks == operation->io_buffers_len + 1));
+ /* Find the range of the physical read we need to perform. */
+ nblocks = operation->io_buffers_len;
+ if (nblocks == 0)
+ return; /* nothing to do */
+
+ buffers = &operation->buffers[0];
+ blocknum = operation->blocknum;
+ forknum = operation->forknum;
+ bmr = operation->bmr;
+
+ isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
}
else
{
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ io_context = IOContextForStrategy(operation->strategy);
+ io_object = IOOBJECT_RELATION;
}
- VacuumPageMiss++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss;
+ /*
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out to be not true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
+ */
+ if (isLocalBuf)
+ pgBufferUsage.local_blks_read += nblocks;
+ else
+ pgBufferUsage.shared_blks_read += nblocks;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- found);
+ for (int i = 0; i < nblocks; ++i)
+ {
+ int io_buffers_len;
+ Buffer io_buffers[MAX_BUFFERS_PER_TRANSFER];
+ void *io_pages[MAX_BUFFERS_PER_TRANSFER];
+ instr_time io_start;
+ BlockNumber io_first_block;
- return BufferDescriptorGetBuffer(bufHdr);
+ /*
+ * Skip this block if someone else has already completed it. If an
+ * I/O is already in progress in another backend, this will wait for
+ * the outcome: either done, or something went wrong and we will
+ * retry.
+ */
+ if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ {
+ /*
+ * Report this as a 'hit' for this backend, even though it must
+ * have started out as a miss in PrepareReadBuffer().
+ */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ true);
+ continue;
+ }
+
+ /* We found a buffer that we need to read in. */
+ io_buffers[0] = buffers[i];
+ io_pages[0] = BufferGetBlock(buffers[i]);
+ io_first_block = blocknum + i;
+ io_buffers_len = 1;
+
+ /*
+ * How many neighboring-on-disk blocks can we scatter-read into
+ * other buffers at the same time? In this case we don't wait if we
+ * see an I/O already in progress. We already hold BM_IO_IN_PROGRESS
+ * for the head block, so we should get on with that I/O as soon as
+ * possible. We'll come back to this block again, above.
+ */
+ while ((i + 1) < nblocks &&
+ WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ {
+ /* Must be consecutive block numbers. */
+ Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+ BufferGetBlockNumber(buffers[i]) + 1);
+
+ io_buffers[io_buffers_len] = buffers[++i];
+ io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+ }
+
+ io_start = pgstat_prepare_io_time(track_io_timing);
+ smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ io_buffers_len);
+
+ /* Verify each block we read, and terminate the I/O. */
+ for (int j = 0; j < io_buffers_len; ++j)
+ {
+ BufferDesc *bufHdr;
+ Block bufBlock;
+
+ if (isLocalBuf)
+ {
+ bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ }
+ else
+ {
+ bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+ bufBlock = BufHdrGetBlock(bufHdr);
+ }
+
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ memset(bufBlock, 0, BLCKSZ);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ io_first_block + j,
+ relpath(bmr.smgr->smgr_rlocator, forknum))));
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ if (isLocalBuf)
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ }
+ else
+ {
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ }
+
+ /* Report I/Os as completing individually. */
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+ bmr.smgr->smgr_rlocator.locator.spcOid,
+ bmr.smgr->smgr_rlocator.locator.dbOid,
+ bmr.smgr->smgr_rlocator.locator.relNumber,
+ bmr.smgr->smgr_rlocator.backend,
+ false);
+ }
+
+ VacuumPageMiss += io_buffers_len;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ }
}
/*
- * BufferAlloc -- subroutine for ReadBuffer. Handles lookup of a shared
- * buffer. If no buffer exists already, selects a replacement
- * victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers. Handles lookup of a shared
+ * buffer. If no buffer exists already, selects a replacement victim and
+ * evicts the old page, but does NOT read in new page.
*
* "strategy" can be a buffer replacement strategy object, or NULL for
* the default strategy. The selected buffer's usage_count is advanced when
@@ -1224,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
- * set true. Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true. Otherwise, *foundPtr is set false.
*
* io_context is passed as an output parameter to avoid calling
* IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called StartReadBuffers() but not yet WaitReadBuffers().
*/
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return buf;
@@ -1364,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* We can only get here if (a) someone else is still reading in
- * the page, or (b) a previous read attempt failed. We have to
- * wait for any active read attempt to finish, and then set up our
- * own read attempt if the page is still not BM_VALID.
- * StartBufferIO does it all.
+ * the page, (b) a previous read attempt failed, or (c) someone
+ * called StartReadBuffers() but not yet WaitReadBuffers().
*/
- if (StartBufferIO(existing_buf_hdr, true))
- {
- /*
- * If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
+ *foundPtr = false;
}
return existing_buf_hdr;
@@ -1408,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to obtain the right to
- * start I/O. If StartBufferIO returns false, then someone else managed
- * to read it before we did, so there's nothing left for BufferAlloc() to
- * do.
+ * Buffer contents are currently invalid.
*/
- if (StartBufferIO(victim_buf_hdr, true))
- *foundPtr = false;
- else
- *foundPtr = true;
+ *foundPtr = false;
return victim_buf_hdr;
}
@@ -1770,7 +1994,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
@@ -2035,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
buf_state &= ~BM_VALID;
UnlockBufHdr(existing_hdr, buf_state);
- } while (!StartBufferIO(existing_hdr, true));
+ } while (!StartBufferIO(existing_hdr, true, false));
}
else
{
@@ -2058,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
LWLockRelease(partition_lock);
/* XXX: could combine the locked operations in it with the above */
- StartBufferIO(victim_buf_hdr, true);
+ StartBufferIO(victim_buf_hdr, true, false);
}
}
@@ -2373,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/*
- * If we previously pinned the buffer, it must surely be valid.
+ * If we previously pinned the buffer, it is likely to be valid, but
+ * it may not be if StartReadBuffers() was called and
+ * WaitReadBuffers() hasn't been called yet. We'll check by loading
+ * the flags without locking. This is racy, but it's OK to return
+ * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+ * it'll see that it's now valid.
*
* Note: We deliberately avoid a Valgrind client request here.
* Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
* that the buffer page is legitimately non-accessible here. We
* cannot meddle with that.
*/
- result = true;
+ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
}
ref->refcount++;
@@ -3450,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, false))
return;
/* Setup error traceback support for ereport() */
@@ -5185,9 +5414,15 @@ WaitIO(BufferDesc *buf)
*
* Returns true if we successfully marked the buffer as I/O busy,
* false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend. In that case, false indicates either that the I/O was already
+ * finished, or is still in progress. This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
*/
static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
@@ -5200,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
UnlockBufHdr(buf, buf_state);
+ if (nowait)
+ return false;
WaitIO(buf);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
}
/* see LimitAdditionalPins() */
-static void
+void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
uint32 max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
/*
* In contrast to LimitAdditionalPins() other backends don't play a role
- * here. We can allow up to NLocBuffer pins in total.
+ * here. We can allow up to NLocBuffer pins in total, but it might not be
+ * initialized yet so read num_temp_buffers.
*/
- max_pins = (NLocBuffer - NLocalPinnedBuffers);
+ max_pins = (num_temp_buffers - NLocalPinnedBuffers);
if (*additional_pins >= max_pins)
*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('aio')
subdir('buffer')
subdir('file')
subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
#ifndef BUFMGR_H
#define BUFMGR_H
+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
/*
* prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers(). Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+ /* Parameters passed in to StartReadBuffers(). */
+ BufferManagerRelation bmr;
+ Buffer *buffers;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+ int nblocks;
+ BufferAccessStrategy strategy;
+ int flags;
+
+ /* Range of buffers, if we need to perform a read. */
+ int io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+ Buffer *buffers,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int *nblocks,
+ BufferAccessStrategy strategy,
+ int flags,
+ ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
/* in buf_init.c */
extern void InitBufferPool(void);
extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected. Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet. This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+ void *pgsr_private,
+ void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+ void *pgsr_private,
+ size_t per_buffer_private_size,
+ BufferAccessStrategy strategy,
+ BufferManagerRelation bmr,
+ ForkNumber forknum,
+ PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
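
To show how the declarations above fit together: the callback hands back block numbers one at a time, InvalidBlockNumber ends the stream, and the consumer simply drains pinned buffers in the order the callback produced them. A minimal hypothetical consumer, not part of the patch set; the demo_* names and the sequential-scan shape are invented for illustration:

typedef struct
{
	BlockNumber next;
	BlockNumber nblocks;
} demo_state;

static BlockNumber
demo_next_block(PgStreamingRead *pgsr, void *pgsr_private,
				void *per_buffer_private)
{
	demo_state *s = (demo_state *) pgsr_private;

	if (s->next >= s->nblocks)
		return InvalidBlockNumber;	/* stream is exhausted */
	return s->next++;
}

static void
demo_scan(Relation rel)
{
	demo_state	state = {0, RelationGetNumberOfBlocks(rel)};
	PgStreamingRead *pgsr;
	Buffer		buf;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
										  &state,
										  0,	/* no per-buffer data */
										  NULL, /* default strategy */
										  BMR_REL(rel),
										  MAIN_FORKNUM,
										  demo_next_block);

	/* Buffers come back pinned; a real caller would inspect each page. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
		ReleaseBuffer(buf);

	pg_streaming_read_free(pgsr);
}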
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf2..cfb58cf4836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
PgStat_TableStatus
PgStat_TableXactStatus
PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
PgXmlErrorContext
PgXmlStrictness
Pg_finfo_record
@@ -2267,6 +2269,7 @@ ReInitializeDSMForeignScan_function
ReScanForeignScan_function
ReadBufPtrType
ReadBufferMode
+ReadBuffersOperation
ReadBytePtrType
ReadExtraTocPtrType
ReadFunc
--
2.37.2
v5-0014-BitmapHeapScan-uses-streaming-read-API.patch (text/x-diff; charset=us-ascii)
From e2267faf0fed9006fb0b437406737f1644d582c2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Feb 2024 21:04:18 -0500
Subject: [PATCH v5 14/14] BitmapHeapScan uses streaming read API
Remove all of the prefetching code from BitmapHeapScan and rely on the
streaming read API's prefetching instead. The heap table AM implements a
streaming read callback which uses the bitmap iterator to get the next
valid block that the streaming read API needs to fetch.
---
src/backend/access/heap/heapam.c | 68 +++++
src/backend/access/heap/heapam_handler.c | 88 +++---
src/backend/executor/nodeBitmapHeapscan.c | 336 +---------------------
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 22 +-
src/include/nodes/execnodes.h | 19 --
6 files changed, 117 insertions(+), 420 deletions(-)
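
The resulting shape of heapam_scan_bitmap_next_block() in heapam_handler.c can be hard to reconstruct from the interleaved +/- lines in the diff below, so here is a condensed paraphrase. This is not the literal patched code; error paths and per-tuple bookkeeping are omitted:

static bool
heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck, bool *lossy)
{
	HeapScanDesc hscan = (HeapScanDesc) scan;
	void	   *io_private;
	TBMIterateResult *tbmres;

	/* Trade in the previous page's pin and ask the stream for the next. */
	if (BufferIsValid(hscan->rs_cbuf))
		ReleaseBuffer(hscan->rs_cbuf);
	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr,
														&io_private);

	if (BufferIsInvalid(hscan->rs_cbuf))
	{
		/* Bitmap exhausted; only queued empty tuples may remain. */
		*recheck = false;
		return hscan->rs_empty_tuples_pending > 0;
	}

	tbmres = io_private;		/* TBMIterateResult filled in by the callback */
	*recheck = tbmres->recheck;
	*lossy = tbmres->ntuples < 0;
	/* ... collect the page's visible tuples into rs_vistuples as before ... */
	return true;
}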
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b93f243c282..c965048af60 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -115,6 +115,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static BlockNumber bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -335,6 +337,22 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
if (key != NULL && scan->rs_base.rs_nkeys > 0)
memcpy(scan->rs_base.rs_key, key, scan->rs_base.rs_nkeys * sizeof(ScanKeyData));
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+ {
+ if (scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
+ scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+ scan,
+ sizeof(TBMIterateResult),
+ scan->rs_strategy,
+ BMR_REL(scan->rs_base.rs_rd),
+ MAIN_FORKNUM,
+ bitmapheap_pgsr_next);
+
+
+ }
+
/*
* Currently, we only have a stats counter for sequential heap scans (but
* e.g for bitmap scans the underlying bitmap index scans will be counted,
@@ -955,6 +973,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
+ scan->rs_pgsr = NULL;
scan->rs_vmbuffer = InvalidBuffer;
scan->rs_empty_tuples_pending = 0;
@@ -1093,6 +1112,9 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
+ if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN && scan->rs_pgsr)
+ pg_streaming_read_free(scan->rs_pgsr);
+
pfree(scan);
}
@@ -10250,3 +10272,49 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
CheckForSerializableConflictOut(relation, xid, snapshot);
}
+
+static BlockNumber
+bitmapheap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+ void *per_buffer_data)
+{
+ TBMIterateResult *tbmres = per_buffer_data;
+ HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (hdesc->rs_base.shared_tbmiterator)
+ tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+ else
+ tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+ /* no more entries in the bitmap */
+ if (!BlockNumberIsValid(tbmres->blockno))
+ return InvalidBlockNumber;
+
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend). We don't take this optimization in SERIALIZABLE
+ * isolation though, as we need to examine all invisible tuples
+ * reachable by the index.
+ */
+ if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+ continue;
+
+ if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+ !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->rs_vmbuffer))
+ {
+ hdesc->rs_empty_tuples_pending += tbmres->ntuples;
+ continue;
+ }
+
+ return tbmres->blockno;
+ }
+
+ /* not reachable */
+ Assert(false);
+}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 022753e203a..9727613e87f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2111,79 +2111,65 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
*/
static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, bool *lossy, BlockNumber *blockno)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck, bool *lossy)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
+ void *io_private;
BlockNumber block;
Buffer buffer;
Snapshot snapshot;
int ntup;
- TBMIterateResult tbmres;
+ TBMIterateResult *tbmres;
+
+ Assert(hscan->rs_pgsr);
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- *blockno = InvalidBlockNumber;
*recheck = true;
- do
+ /* Release buffer containing previous block. */
+ if (BufferIsValid(hscan->rs_cbuf))
{
- CHECK_FOR_INTERRUPTS();
+ ReleaseBuffer(hscan->rs_cbuf);
+ hscan->rs_cbuf = InvalidBuffer;
+ }
- if (scan->shared_tbmiterator)
- tbm_shared_iterate(scan->shared_tbmiterator, &tbmres);
- else
- tbm_iterate(scan->tbmiterator, &tbmres);
+ hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->rs_pgsr, &io_private);
- if (!BlockNumberIsValid(tbmres.blockno))
+ if (BufferIsInvalid(hscan->rs_cbuf))
+ {
+ if (BufferIsValid(hscan->rs_vmbuffer))
{
- /* no more entries in the bitmap */
- Assert(hscan->rs_empty_tuples_pending == 0);
- return false;
+ ReleaseBuffer(hscan->rs_vmbuffer);
+ hscan->rs_vmbuffer = InvalidBuffer;
}
/*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend). We don't take this optimization in SERIALIZABLE
- * isolation though, as we need to examine all invisible tuples
- * reachable by the index.
+ * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
+ * all empty tuples at the end instead of emitting them per block we
+ * skip fetching. This is necessary because the streaming read API
+ * will only return TBMIterateResults for blocks actually fetched.
+ * When we skip fetching a block, we keep track of how many empty
+ * tuples to emit at the end of the BitmapHeapScan. We do not recheck
+ * all NULL tuples.
*/
- } while (!IsolationIsSerializable() && tbmres.blockno >= hscan->rs_nblocks);
+ *recheck = false;
+ return hscan->rs_empty_tuples_pending > 0;
+ }
- /* Got a valid block */
- *blockno = tbmres.blockno;
- *recheck = tbmres.recheck;
+ Assert(io_private);
- /*
- * We can skip fetching the heap page if we don't need any fields from the
- * heap, and the bitmap entries don't need rechecking, and all tuples on
- * the page are visible to our transaction.
- */
- if (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmres.recheck &&
- VM_ALL_VISIBLE(scan->rs_rd, tbmres.blockno, &hscan->rs_vmbuffer))
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres.ntuples >= 0);
- Assert(hscan->rs_empty_tuples_pending >= 0);
+ tbmres = io_private;
- hscan->rs_empty_tuples_pending += tbmres.ntuples;
+ Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
- return true;
- }
+ *recheck = tbmres->recheck;
- block = tbmres.blockno;
+ hscan->rs_cblock = tbmres->blockno;
+ hscan->rs_ntuples = tbmres->ntuples;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- block);
- hscan->rs_cblock = block;
+ block = tbmres->blockno;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
@@ -2204,7 +2190,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres.ntuples >= 0)
+ if (tbmres->ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2213,9 +2199,9 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres.ntuples; curslot++)
+ for (curslot = 0; curslot < tbmres->ntuples; curslot++)
{
- OffsetNumber offnum = tbmres.offsets[curslot];
+ OffsetNumber offnum = tbmres->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
@@ -2265,7 +2251,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
Assert(ntup <= MaxHeapTuplesPerPage);
hscan->rs_ntuples = ntup;
- *lossy = tbmres.ntuples < 0;
+ *lossy = tbmres->ntuples < 0;
/*
* Return true to indicate that a valid block was found and the bitmap is
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 74b92d4cbf4..c5a482cc175 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -54,11 +54,6 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
@@ -90,14 +85,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/*
* If we haven't yet performed the underlying index scan, do it, and begin
* the iteration over the bitmap.
- *
- * For prefetching, we use *two* iterators, one for the pages we are
- * actually scanning and another that runs ahead of the first for
- * prefetching. node->prefetch_pages tracks exactly how many pages ahead
- * the prefetch iterator is. Also, node->prefetch_target tracks the
- * desired prefetch distance, which starts small and increases up to the
- * node->prefetch_maximum. This is to avoid doing a lot of prefetching in
- * a scan that stops after a few tuples because of a LIMIT.
*/
if (!node->initialized)
{
@@ -113,15 +100,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbm = tbm;
tbmiterator = tbm_begin_iterate(tbm);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
}
else
{
@@ -144,20 +122,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
* multiple processes to iterate jointly.
*/
pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
@@ -165,14 +129,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
/* Allocate a private iterator and attach the shared state to it */
shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
/*
@@ -219,16 +175,13 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->initialized = true;
/* Get the first block. if none, end of scan */
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy))
return ExecClearTuple(slot);
if (lossy)
node->lossy_pages++;
else
node->exact_pages++;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- BitmapAdjustPrefetchTarget(node);
}
for (;;)
@@ -237,37 +190,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
{
CHECK_FOR_INTERRUPTS();
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-
- /*
- * We prefetch before fetching the current pages. We expect that a
- * future streaming read API will do this, so do it this way now
- * for consistency. Also, this should happen only when we have
- * determined there is still something to do on the current page,
- * else we may uselessly prefetch the same page we are just about
- * to request for real.
- */
- BitmapPrefetch(node, scan);
-
/*
* If we are using lossy info, we have to recheck the qual
* conditions at every tuple.
@@ -288,17 +210,13 @@ BitmapHeapNext(BitmapHeapScanState *node)
return slot;
}
- if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy, &node->blockno))
+ if (!table_scan_bitmap_next_block(scan, &node->recheck, &lossy))
break;
if (lossy)
node->lossy_pages++;
else
node->exact_pages++;
-
- BitmapAdjustPrefetchIterator(node, node->blockno);
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
}
/*
@@ -322,215 +240,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- BlockNumber blockno)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult tbmpre;
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno) || tbmpre.blockno != blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- TBMIterateResult tbmpre;
-
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult tbmpre;
- bool skip_fetch;
-
- tbm_iterate(prefetch_iterator, &tbmpre);
-
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbm_shared_iterate(prefetch_iterator, &tbmpre);
- if (!BlockNumberIsValid(tbmpre.blockno))
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (scan->rs_flags & SO_CAN_SKIP_FETCH &&
- !tbmpre.recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre.blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre.blockno);
- }
- }
- }
-#endif /* USE_PREFETCH */
-}
-
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
*/
@@ -576,22 +285,12 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (node->ss.ss_currentScanDesc)
table_rescan(node->ss.ss_currentScanDesc, NULL);
- /* release bitmaps and buffers if any */
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
+ /* release bitmaps if any */
if (node->tbm)
tbm_free(node->tbm);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
- node->shared_prefetch_iterator = NULL;
- node->pvmbuffer = InvalidBuffer;
node->recheck = true;
- node->blockno = InvalidBlockNumber;
ExecScanReScan(&node->ss);
@@ -630,16 +329,10 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
table_endscan(scanDesc);
/*
- * release bitmaps and buffers if any
+ * release bitmaps if any
*/
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
}
/* ----------------------------------------------------------------
@@ -672,19 +365,13 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
scanstate->tbm = NULL;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
scanstate->worker_snapshot = NULL;
scanstate->recheck = true;
- scanstate->blockno = InvalidBlockNumber;
/*
* Miscellaneous initialization
@@ -724,13 +411,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
/*
@@ -814,14 +494,10 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
return;
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
-
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -853,11 +529,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3dfb19ec7d5..1cad9c04f01 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
#include "storage/dsm.h"
#include "storage/lockdefs.h"
#include "storage/shm_toc.h"
+#include "storage/streaming_read.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -72,6 +73,9 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
+ /* Streaming read control object for scans supporting it */
+ PgStreamingRead *rs_pgsr;
+
/*
* These fields are only used for bitmap scans for the "skip fetch"
* optimization. Bitmap scans needing no fields from the heap may skip
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2adead958cb..1a7b9db8b40 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -792,23 +792,11 @@ typedef struct TableAmRoutine
* lossy indicates whether or not the block's representation in the bitmap
* is lossy or exact.
*
- * XXX: Currently this may only be implemented if the AM uses md.c as its
- * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
- * blockids directly to the underlying storage. nodeBitmapHeapscan.c
- * performs prefetching directly using that interface. This probably
- * needs to be rectified at a later point.
- *
- * XXX: Currently this may only be implemented if the AM uses the
- * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
- * perform prefetching. This probably needs to be rectified at a later
- * point.
- *
* Optional callback, but either both scan_bitmap_next_block and
* scan_bitmap_next_tuple need to exist, or neither.
*/
- bool (*scan_bitmap_next_block) (TableScanDesc scan,
- bool *recheck, bool *lossy,
- BlockNumber *blockno);
+ bool (*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck,
+ bool *lossy);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1984,8 +1972,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
* used after verifying the presence (at plan time or such).
*/
static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
- bool *recheck, bool *lossy, BlockNumber *blockno)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck, bool *lossy)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1995,8 +1982,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
- return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck,
- lossy, blockno);
+ return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck, lossy);
}
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a59df51dd69..d41a3e134d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,8 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
* mutex mutual exclusion for the prefetching variable
* and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1695,10 +1692,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,16 +1703,10 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * pvmbuffer buffer for visibility-map lookups of prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* worker_snapshot snapshot for parallel worker
* recheck do current page's tuples need recheck
@@ -1729,20 +1717,13 @@ typedef struct BitmapHeapScanState
ScanState ss; /* its first field is NodeTag */
ExprState *bitmapqualorig;
TIDBitmap *tbm;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
Snapshot worker_snapshot;
bool recheck;
- BlockNumber blockno;
} BitmapHeapScanState;
/* ----------------
--
2.37.2
Hi,
I haven't looked at the code very closely yet, but I decided to do some
basic benchmarks to see if/how this refactoring affects behavior.
Attached is a simple .sh script that
1) creates a table with one of a couple basic data distributions
(uniform, linear, ...), with an index on top
2) runs a simple query with a where condition matching a known fraction
of the table (0 - 100%), and measures duration
3) the query is forced to use a bitmap scan by disabling the other scan
types (roughly as sketched after this list)
4) there are a couple of parameters the script varies (work_mem, parallel
workers, ...), and the script drops caches etc.
5) I only have results for a table with 1M rows, which is ~320MB, so not
huge. I'm running this for a larger data set, but that will take time.
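For illustration, each timed run boils down to something roughly like the
following (just a sketch - the table/column names and values are
placeholders, not taken from the attached script):

  SET enable_seqscan = off;          -- force the bitmap scan
  SET enable_indexscan = off;        -- (enable_bitmapscan stays on)
  SET enable_indexonlyscan = off;
  SET work_mem = '4MB';              -- one of the varied parameters
  SET effective_io_concurrency = 0;  -- also varied
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM test_table WHERE a BETWEEN 0 AND 10000;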
I did this on my two "usual" machines - i5 and xeon. Both have flash
storage, although the i5 is SATA and the xeon has NVMe. I won't share the
raw results because the CSV is like 5MB - ping me off-list if you need the
file, ofc.
Attached is a PDF summarizing the results as a pivot table, with results
for "master" and "patched" builds. The interesting bit is the last
column, which shows whether the patch makes the query faster (green) or
slower (red).
The results seem pretty mixed, on both machines. If you focus on the
uncached results (pages 4 and 8-9), there are runs that are much faster
(by a factor of 2-5x) as well as runs that are slower by a similar factor.
Of course, these results are with forced bitmap scans, so the question
is whether those regressions even matter - maybe the planner would pick a
different scan type, making the changes less severe. So I logged the
"optimal plan" for each run, tracking the scan type the optimizer would
really pick without all the enable_* GUCs. The -optimal.pdf shows only
the results for runs where the optimal plan uses a bitmap scan. And yes,
while the impact of the changes (in either direction) is reduced, it's
still very much there.
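(To be clear, the "optimal plan" here just means what the planner picks
once the scan types are no longer forced, roughly along these lines, with
the same placeholder names as above:

  RESET enable_seqscan;
  RESET enable_indexscan;
  RESET enable_indexonlyscan;
  EXPLAIN SELECT * FROM test_table WHERE a BETWEEN 0 AND 10000;
)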
What's a bit surprising to me is that these regressions affect runs with
effective_io_concurrency=0 in particular, which traditionally means no
prefetching / async stuff at all. I've perceived the patch mostly as
refactoring, so I did not really expect such a massive impact on these cases.
So I wonder if the refactoring means we're actually doing some amount of
prefetching even with e_i_c=0. I'm not sure that'd be great - I assume
people have valid reasons to disable prefetching ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
bitmapscan-results.pdfapplication/pdf; name=bitmapscan-results.pdfDownload
����v��3��(s�&�fm����v�Vf��mlXz���v(�{�V[9�g;�B���a �@g7�P�5��.�����^����� ���������;�P+�`��TvY]�N
��v�cNFE��u�����71�$��?��B�
c��{����.�b�y�F���5��D�P���QE�+��Vh� >/bg}�����k��J�G]4��b����l�����X�8`1VFz�A1��k��{q�����Bl�v�@j�6������*�]H0��5����;?�K��E_%��~�����,41���7[��O����{-��w���J\��������^�>?a"�A%���QV{�=�PZ���Sh��e{��h~�X�]�h���k�
��&�gTx]�HJ�����r�cQ�C�S��Y��v��a��E�l��d{���x/�J��x��5�{�������t���5!��k
������C�u���<;�{�x6�ed���u]�B���s����Yf�^v6i�`��b�����Xh;��j�6�0=�X=��B���}��a}�)e�V�{�����CTP��^�HZ�AW���=8��^w�v�qW�.!��l��B���h��"�M�8�{�n��F�"�u*�B���
m�br��cE�{���r$�� B�y����Z�WMC�x��]5�H��^}Zfb�!u���v0�jm�a�G]�a��m����=��H,{,m�v2�3��=8�#�]Vk�{����O�b
�.��a��{��>�E����,�P��^3����B{�{���a���.,���^O�E�)���R� ��k�6s��T��I�{��zx���3`����g��k,>�H�W`H��{�������%"�B���c�]ojyb{]#b�E�,��~�s��"R�IY��v�i�9q�����^a��B�e��Y]�K$F�/=�����k���9%�~A��f���c�~�:�&m[���A���"�NA�d���C��
���= &�{]o�n��X�]c���k�=�1A��/6kR�����Xz�����fm���w_,��.�1��/6�{�����e8�.����^?tEL�wXz]�C(��u6��A~�6u�M�6�{��������Qmg�=C���a��'����72��;"�BC�����T�Q] D
�������n�Y�04 ����~��G�����?�.+�T��^w�R����F�.�!�r�C����z}����I{�]cw������~������vP�l#��h#O��4k;���B��<=]�C.���4��q����v����%��m��}�t���ba����~���eN������k�6�c���eN�.�����^�R���6��� ^Q,k;g�B.4j����i��Z���m����Xh;�j����?��.�2h�vN�7��Q�K�����p�Pk[��9�um�a}���Z�m��������&gy�Z�rk�����2����^�7d�8�tq��?�>�����Ii��t`1F'�P���^���ctb{��z�t��AJj�v�Gj}�G���A
,���1��M,���Zh����v�V���ctb�W{���Qt���n����������l���U���xP�j�5�����"���AV{��u| �3��Sl��e{��k+����"4k���^;���N(��.���d�>�z�K K������{����W��������l�����}���]�fm��i�b���������l����,6<!�F�.�Zu�%���^?e1L��;����Xhb����m^lh_u7i�`��r�9
B�tI�A���
=0k�����{��Vn�s��Z2R�I��^��Z~p}��<bXB-���Z���7�6�,�P;j�Eq��6���ja�u�EB�r6:��.^[u�'xD����6�} =�.\3@��� Jt�>�8�u��1�6
�^+3x���r�c�����l�u����*��X��Y��}��Js��j9R�I��^;�~��C�E�+4���PW�QK9��D���
mg]<C����@���F�����k�����h����.�
���c/mm}�`���@R�I��^+��1��Bc�As��vM
�R��7*������>�J��g������������^;��XN��)w��{��z���4�Ph;c�j��
v�W�C6�>������CP����<e#�{=�k����CP�
mg{��Vj����*1uq�Q��
�^+�1SGP��A�$�{����o|]h��A�������\z]h�l��[���^�t��
{�z��AB�te�A�g��n��9�um�ag}���Z��3���\��������^����F����a���@R�6�j���6��8�uq�A�6-�{���7�������0��������:������>��z49 *��.4�j��o��
r}�Y�������p�����g���Y������+��^t�����~o�p����UN���d{�����#� �^��m����V��$�1S���m����l~�`���@R����g�����c���@�fm'���Vj�;�c^�
m���~��� ���-��#o��K}�0���>`�{����~�:p��p��)�����}�.�A,(H��"�J��^���7������M��{���W�x��ydB����Vf3-�����H��ca�P;j�+����<�.�|a�u���t���,� �3�
��+���.�3��o�^+4�ul�M�G]���bx�h�l��(K|�On���OnR94uY�Q���Jmz@>]�g�������=�K �.�#��y�Z����3(��.���H�>����z&�y]�G&�3�Z����!�k���[���ov&z]�G,����b�ro�Y�*���Z�����K�VmN"�l���k����
.��Yg���q[��<� M�B�6.�{��G��A�K 5k���^;jc�`��]��NvLi��V�Xx`�6�j���X�sb}�Y������kw��c 7k�,�H���� t�W�u��o�H;hsS��\u�(xM����k��{1G��$����C����
�:�&m��~�����Ph;��j��G,��4l�<�&�{��zhnN(���4H��Z��[�8�u�Q}����=��_��)�:@�,�P+�1����e������k7r���S�W]� �����tE��u6�u ��v��j��\"��.�#���d{��z0�nTP]�G
��V���;����(��k���'�#EW�{�t�7q$��8���|a��@�n�O�.�[u1nx�N��z�7���T�Xh;�g�`o�!�#EW�%;v�j�=x����j,��;}��l-��.�"�G�d{�����-1*��.�JvMj���cA��5\�.:2����^O����1��R�p
mg���v���M���:�&m��V��s�I�`��cF��^;j�� �]:f���4�>������ ��c��^���9�t��Q}���Z������O�y��-��1�,�P�5r8�u��Ag}���Z����HI���|��3aV{�a��Z���H�S�EcFY��v������h��I����>���G���������{���s�m*����"$�{��z��LR9Lt!�Q�g��|������^�<�>������H�eT�.Fh��^?j�S�^(C,;�.i2L���a�Hl�l�B� -�P+���2F�&�fmG���~��XE,���i2L���5�X�3��o�g�d{���`���1�K���<C�����5MF�����k�'��n�Lb�tU���1��R���]�8�����"�m����v�CW4u
M�6�{����X:]�d�l��b��jT8]���!-}�]cg�c���&B����1����n�5������K��j*G�.�"�Nhy�����a���;�k����=x4�2�Z�UN�G�d{��w�����N�B��,s�z�J�>,r��d {�][��>]Ze���x�>��z��b9>um�a����5��O�Vu�7{��l,G�.t"��gy�Z;����IW:��<<�$�{=�+'�GX`Q����R�P+���FM�R�6�^;���}I��uNF��^��V�A]�d�����>����CP�:�
m����V����A}��<�dXB�^S9 u���vT�3������6��:�k���R�P��z(��$]��j:���#K������NX8`W]`�������i��N�I[��{��:��P8��:(���k=�$H�N��/m����7��1(|��. _�^��TrP1<A]�QV{�w��{R�ag}��6�j��f�4n����GY��G�d{��rj����
��)K�
L��Z��
/��Sl�����7[�(��mG�<C��c#��b���#���k��c6�r�������>��c��.�f�y�����������t
����,�P���9�e�Q������k��B�&��Q��:���<H$�{�?k��_QdT�YuTP����z����:�&m���7u�>I��1�Z����V���$(��,u2���B��n1�e��Q����1�G��[,��/���K�{��"��-����<C�fc9�e�U������k���m[���;��Ye�u�'xzI��z�UV���4(�bh�vd�j��;���N�I{������iP�Y�U@���
=����/0pO����J(b��j2�B��)c���L��.�R�yv����5�Xb��t����,�P��Z9�e[�����k��v��~-��E���l��������t �~��d{�������)6i�`��b�OdXOYJV`���G���*���%3
��V���
����'�K��`T��S�����7��,&+���X�{����I���,k�O�K��{
��~CPV�4���]2G�A��
l��jb���{t��0�0��Y��v�ckN*#�$�l�����#_�6k���^�����E�l��d{��ov�Q J� ��
�������_V��Y_l�����1(���.�`2����+�Y�fTh;W�j��:CE��^��T�Y�fTh;�d�[M�@�%lt��
{����8d
�a��8�g��m�:����B��&}���U�!S��e]�X�-����e)N{���a3,�i2�Z��~a/��Sj����k�M�K� ��
L�����r��z2�B��&}��l,�,(+��>c�{����X�Y�U`�g�{��
Y��u1|x�F��z���lmT������9��z����:�&m��V��36���"K�
J��Z��]�
����fm}��h��u�HP9@eAYA��^y���/����GnK�]gj���{EP������v���UPg}��v�Gj�;���Y���9��'n��^��hj��^V�6k���^;l��Z�{Y�V`��.�{����;1�eA�Q�6�j����2��r���2�d{��z��A1��,)3L���a��8eE�A�6����o������/6����#�^���������1��/��u����1,�,��v��i�<����N�I�!�����������<d���C�(4� B����>�H���<� t`�(�����>�H���@'v�G��>�%�o�X�tP�)@�� eyF�7�q�:���Y8
;c���:��is��Z�Un��}d{���@����J��K�����j�7q��R6�&m�������Pz]�f�l��B��;0�u%�a���M�P;l0r7A*��.d3J���QV�Z�t!������)&����ThR�g����9�c^�t��
{���us��G!��m����A*�����$&��q��*6r�� �9��m���A������u������%��[�el���^+���8�u�A��\�P����Q������\�P���8�u!�q}����=����_��
m��<C��������l������kml9�(�z�N<�IW����<R%�{=����������Ph;�e�B[�r��:6�&m��V�����K��K�����v����x��(}��
����{����|�c_��ML��Z��.�(���Xh�3,�V�� J��+��<C�no����l������k�����A�k�VmN���1����?������l�fm���kG����/�M�r��.�{�[K�� @�� �N��{P�j������^W�u����G�Q�X�y]@Gh�Y��v����A_+����#&�{������-�
:�,�H��X�z]Ag�Y_l��Z�G7��+�V]8
Y���~��q*�l�������0��k�F����wf����i��Z����Q�����cr��~��r���9�fm����vm=v��a�t��a���GZ[L }���~��{�:���~
J�����<C���n����Xh;)��{O��~T/�1p��g�d{��r����t��Y�Y9s���/��M�b�6�^+�u/8F�.$���y����9d1�u��a��#z�P��j9�u� ��vH��w���f����[�{]����_�m��<C����!~]1h�Y�-��Vh��~]1���Y���l�� +��]`T�]0h�l��R���o�,��Sf����ke��Ya��bAB��T�g�����Bn��{
��V��[�K��k
���V���
�����K��l+�<V(��b��\�g��7��.����X�{��`������0r����!=�����u{��.�Ph;h�B�=�z]i��-���
m.P������B��@�P+�`~����2�����P;��K������B��@c��k*����4����^���9�uY$���,�P+t?�q���Hc��Na�u���������_}����u�������������?��_����k��uy����{����Y��*����+�v�������J����?�J�����]%�����:��@%����[V�/���e%����[V�/���e%������y��<�����cm��Y��k�k����c���e�D<�Q������os�E�z|�������������_���y~�j�?�n��������?�m��c|�m�������o��O��o������W�<�q<uq������T���l����r��Q����]Z���S�.�^}����:U��1<�v����v
����{vi�)����>W�Gl�)���}�#jLMK�����{�z�?������a��_��|�?>e�E�����/q������o>�����u��s��b������A�s�k�R_{�?>���3?�w����>��B�wG�y��L��9�Q����k�����3����v��U���}<����v��K=�(�yn���E� �r��F_�7\�/��I74\�g��~���/�=O���?O��n��n����{����1 ��������U>�G�A�0�<����c�������3�k���%?[�����9[r��`�=u^���M�������r?��S�����[j������r�q��?N�������v��W�����#�t��xO�SJ���1|=<���E�"��zK�R��<B���y�J��fuE!|k�?z�<B���y���{�A�����??#/��3���7^�G���'pE!(���Z�| E!(����q��y�M�Q��
y�����J�\�#����<B���y�[j�G�|=�p�K-����2�p�K-����2�Pq�'��E!|�����y��]j�G]�o&�\���1��_��x������ej��Y�gh�qz��J�������+��5up�d&����9�s������t�� �����X6�R���� �����Z�n/�����+����_��������K^��t���oz�W������7��2�0u����~(�
A��d*�^�*T���U��{yuBP�=�*_�*��U
�'�|���2�>��?��Y��)�y�gY���5�p�Y���uu������(���
Y����Y��S�"�p�;-�
���gnu�eV�V�Zf*��zV�R��Ii�U���������?����Y��K=��,�
U�z��"�Pu�'��}���_�Uf�.�rV��Q%�����B���~>�5~���o������#����;,�O���d���O/+�C���?�)���l)��K.R�����F�8�p)����eA�W�"K���O����C���.�\��1����3�~�Aq���c�}t�����=��W#(�\�/�!Q�wdC*�^��T�=�aYdC*�^^c�����j�
���"
���+(��/�E6$(��[�j������E6$(����2>���]dC����y���������~����?���$�����B�������T
���|=R�v�3 ���2Rq�'��E��F���
�����"������n��K6v��/7e@���r�'��0�����1x��E����/�b)vgl
����oIr��&9*O�����o*y�1�
;�o�S�^���=�##,��s��n}o���{z�m�1B�}�9Xb���.;���;�c������v�����������&r��������"���#EP+�j��V���r�.�����mk)�Z��SQ��?��AT����"E���Ga)�p���"���i�D��/dy,E��W��f�"�����>}�i[� .|�K�|(�"���)�Z��S�:�"Ep�-R7�QK��F���n�H����AT�w���_.����y��/���u�k��d��������"�_��c��A�C_J��C�O�o'6l�;�Y���f�_���{���O����+=�������>�]c�
�F���@�$�/Nx���sb��r����!������.7����27p�{cn�cf�)�jn (���@�����J��s�r/��roY>ps�En (��w\�
�e�@T��wl��
^s'';En (��������9_8X��
E�~�`�"7>��W21En ,���_�N�b�������
T
��������[�h��x����[�h�����S�"7Pq��s7zr+B�G���N��<�:�/o��_�����p�W�e��j���k7>���nQ����|���=�6���m"������z,"�G���j%�>�����sa���W��Eq��H����"���k������(�����{;��y�(�{:c����������>o,
�.W�A��d*�^�T���-��{9[�����j_��r��Q���$
�e�Ae��/��E�+ NN��lA��/,)�A��J���:X� |������������+ nu�e��R��l��N����E�l������Zp�-����+ *n�z� (������p�@�-����������J�dh���/�c��j��W�'{�e6%�l��5�6[���+�����j�n����p~]�w�lK
*%���`y��C�^ay�JG��'��
���� n.�y����a��P��g��M;��B^����[� >>��_4W�h�J�%OP)�r��R��<A�����o������ ������'
��
�������e� (��+3�"O>�{�A��F��y��Y�������_������B�<A���y�[�]�'�|=Op�-����2OPq��WT���c��}�e���p�b�'���3���~���]7������<A/$�,O�)y���+z��`"����h�
Y�J���`>��?/������?� �a�7$*=�r� t����d� ��
����O'�����{;����7K�[�%n-w�?����,7������s�����{�h�������BT������{9Q)�r"(���"j�MX"(���zE"�k\d!����B�����E"|�vdY���_���Yd!����2�-���^w���kY���~�+��q=Q)�z�V'Zf!*_�B��F�,��n��B��F�,D���������G_�����������{!�]7��V8Y�a���~���2��+@2�)y{u��[,yJb���=
�xsyOC���4TJ�a���u����O_]iG �\���p�s.R��=XS��? $Sa}O���DX�
� �[_KA�[�� n-����l���DAT��j�^M����(��{u[C���X�P+����Q���?+Q�N��DAT��
�(������"Q?�[H����=�E� *��+sLK��������y�%
���D��[$�u�E��^Z$
�u�E��^/Z$
j��r��^7Z$
jn������[��>���DA��o�C!~�.:�w��M�������DA���-Q�-�����@�+���������� �E�����@����npI��5������y.I����% q}��u�{����9�������|�a�}o�/L����$��r���90�?��^�e����{9P)�r>�R��|@P�=��J���/D��
>�![������S��g|a�E������2������|@����������rw��q�"�2p�-���//������[�h������J���7zvEB�{��E��pp�O4X>��F����|a�E����u���~�r�a�n����2��F� ;����/_��-��8e�j
9;{-���O��u��/\��M#��e��%�O��{
�0}��C�rO���s������k �{���AX��o��%�rO'���a|F?i^����5��{9kP)�r��R���A����T
��� (���n�5
�'k|���"k�[V�O����E� (��+G:Y���?��d��A���_�������r�c�5����Y�J����:�2kp�-����"kP)�������F��A����
����z?��w��Y��)_8��o���C�n�����Y�����a
E� ,��.�"kP)��� E�`S���5����s{�TJ>������%s�A����B���V������SoK��^?� .��9�L����*�������Z� ,��^����.����|�|���[R�r/�
*���TE��R��TAP�=��J��SA��,0
�'U|a�~�*��=���)_8O�HE�z��2U�.08Y�"U>��Wv�����u����W�
nu�e��V_W�
*_O��E�T��n�H��F�T��n�LT���TA8��H�����4e� |������
7z�I
7z���2UPq��W������+����J���>���E����#EP�1�Ovd���N/�*%��"��|�@�Z��� 0E~G^?� .����L������AX��+�"��9X� ,��q7�K�Z��'t?OD>f��������x&bT��3�2��<T���y��{y�B���g"F��r&b���g"F�r&bT�/�g�E�!(����E�!|����E�!|�������Z������:���+��"�~W��VZf*_�<��D��C�����{�h�y������[�h�y����G����3�g|��E�!|����u�p���o�v���W^�Qa,���*u�)���p&b��
g"����3����|�
J� *%_?�V�
)�J�7� n~�v&b\��w�����g0 ��g"���a����Z��r����, qk�H����{1Q�yG��V��DA�����Z�W�(����DA���K��o��|����[�D�����-Q?�[v3DE�z��KDE�qe��%
�g�����E� .���kb�D��.�H�
��(�����{�h�(���Z��V���|�W�[57z��mK����DA��o91~��$
��|>oY$
������q,��|��(�����������q.�T���Z�9����q�����@���3<��g�3cr�I�������t��=����#�8� ~��w�w�yow��.7������������������y
W�A���*�^�T�����{9�{���Z���A�����o�#!*����g���}����
�e> |��? �|@��n������I��qe�J�������J����:�2p�-����"p�-�7z=|���xp�p&b����s�������=��|@P��W�������e�B��^XM�-���?J�[�E�[�o ������^����^���-7~������=���{������+����O��a��oL�������oY����T�q����WSA���*��<}�R�r/�*�^N�o��%������o9� *����"E|O� r��&E� |���-����� Xe� (���u�)��Y� ���)�����M����)�[}h�"����)�[�h�"�x��)�[�h�"����)����"��mE� |��\�?�?�������#b�u��_X�V�*n����.Ra���jP����\+P)�����
T�����u��@����j%�>�����E^ �t�~�A\�����f ,����
�����XV���kY���M���������
|DY_����Ro�
T���p�R���@���Y���{T
��
�'+|��2+|���3�g#A����/]f���rBc����J.����~���2+�fN���������
T
������Y�[�h����Y�[�h�����Y�����Y���O�Y��)����-�"DE��p��F/��(�a�W� .��=���A�a��w�"������/��d�T���|A�����J�p�a���/R�?W�����AX�
��������1��������:ps}?���_]���������������������x�������b��0~���xLq�����.n�j�7��<����I����
���O��������|�s����^�!����+|60t��S�z��� l7����|�>�n�)�����g�k������������������aZ��O�z��[�z\�������_9 �r*a��\Y�P�T*��r~�_�Tnu�eN�R��s'-�Rq�'3EN��a�l9���=�(>}���=���g��u�������^��9{DC����������K�������������=�P�7=��L{74=��U)E�e������cC7N]�+��sn�+��.����J�e���U���/�|����TJ>���\��u�\L�������4>����Tg��qw��C�o��_s�����d�x��a��W�XN&,���)$"��p�y�Q�w$"j�����DD�NGKKDD��|zVo����������A���Y$"��\xE""*��������DDT��_}��B� _������C�����C��?���,���J���q��|=��������/�����|W$�u�E��^�Z$�u�E��^�Z$jn�d���C�(.L�-����X�!v�@��C�V>� ������?]O>�y���<����55C��^�X�a�����G��W�`H\��{-�����-�i��0:;��B����R0�p{��!�J>�R�9�Z�r���;�������������j|~�
Vt�;F,{�{����;�O����A-�^���'������6�$����(��� �>P")R{cl����E����%Q�8��E9N���^4����
�:����lq�� ��= n�#���g-�b�� �&��k��8�_�@�����w��]��^c�M]�e��$ |��jT�#!N��N�Mq��S��o���$��WuZ��: I�������dX�I��(
y���N�X��C��4���8��[��$<����dl'!I��B0u�S��0���r��0�r��D�r�����l�/�pr�-)jC2r�K��r��Cx ��������F��<�L�~es�� �$ �>'���T��b�?��r�Lx �1/�e�"��Vs���x{{!y�
@m1�m(`��T��_)�P ���,���?@��aa( ��n����V�����^:M
�D@���Z�&B7.���������Do�:���!q�DhM�j"��T5Z���IV-�k1�a7U������GT10��s��iZ-�������_
aB&���DH����*~|�a�F��3��7���e��3�
U�b��6����������Sq�.�x�����(t�&���w��^sT8d�z���)`��gZ2��nh�Dx
���0
�r�����SO��Joz�� �O!���p[<��\R �!~�S�@���z
|�tx
������6�o���M��
�pS�{��/$p�8�������������/��/ ��/�S-�4Xm3�Bk~U!I��P���dX��B�b�K
i��N�S���6'���_g�x�_��t� ���g��y
��<P1��gL|<>H�]�B7�r��0�kf~�������dg\�p�Op�����������h���������pr}~���5�oC�3n�N���d�++NN��A��d���Q�;(0��ch��/�������������������n�Hd6�mI��&���hI��Q]1�k���
�GxY_s��U��v��]�1Flzc|I���:�Ab��w:L
>!�r������$���b{W5���������EoZ�"��E�X���a�eV�������Yj-���]�1n���(�'��]�q6�nc����t`���h�,����7��Yj�w �"�e�;�bTd��>�Y���?
�{�� ����:@��������f�\�x��?cL+~F�`��[|�����`����W�o??\�3�G����\\��GF6ZG>8�e���r�%�0?�G��8
���4]�kQ ���H���
�AC�v�� ����� ��QRF��F.yk�T�E�q�w9S��[��S,
�D�U_,
@����/���_��+���_��� q��O�����K7�rF��S��]��hM��T��Wu*Z�:I�-�n�?������_���T��w*�[��0��7��q*p{�
��u�M�
<!���8I:5.��S��S�������Ro�x����[i��8���\�0�M���r<������qi"#C@g�E}4���D�y����#@{�D4�l���,���qY�y���r�_p:O�R�~0[���k��m���� �P�� ���.~F�x�� ����Z���Q� ~����R� 7�ha ~���E�����~�`����Fi�x�_Eq[�(@�����5��Q��W�(H���!�����$����;����c\���(�(����Q���b@�( ��(�Q� S�1��� ���iX��2��b���$���xI�}�������A}|������4��AB�GNd�r3�Qr�N����AB��:t�9�Nd���� �sRw&����K�A�]o8��� ����g����i�py{��(��-FB����B�� �#�=F����j$`�{.`��Wa�H�8����0 }3�o�a$�&�C�H ��i�!F�MFB��<���s������H�HhM�j$��U5��� If�v�#!I��� �#c\��FF��'!FB�Z}#!M������nA���������Z�Y���@H2k��3����|�l����<�>
sX �~������z��������EfH����\n?�B3Y|����+$���0j���g�hy�0���,���[H���-4��z���7�$j���pmo�-�p[��o��- �^��F��[� � �[�8�\R ����-�n�q�^�@��!%�-L��uI���i�!�Br���o!Iv������B�X��If�/)$��c� ������1��/����Q6V-��y]��,�B�Z����[��j|?>H���$�����m����Lf������dV�j"Z0�����������f��8�7�07���NlC���? ����w������a=$�zk�����`=$d��d;9�)$�z����,�9�f����?�I^W���|����I��'��_#���$���k�c�&q���� pK?I���� |
�c>_��\/'����Ir�0&S��A�[?�����'�'d���V�y�'9������c�5���������V������k�&V�5��j�6,��5��Z,�_�cl�>_��l��_�����kpjm����h��8�Z�y ����I��/�s��0�����:.JdQ�_g�_�������E��lL��_�x��zQ"��w/J�?zD>VFW�������gd���b�i��3���32��2#�~F;y��|���9|�N��� ��!��w)0�p;�R��N��-
8���Y�9`�[�48���k�x�v.����h4��{�%D�RL��tc�D�����tn�����[9r��U<�����Ck^U�!I�v�FofU�!I�����V�3d�Z����MG���� ��7���p�cd�s����9� YII=�$�.�mA�sH���03<�$*�����[��fd+��������������( ���� .����
�{&�+d3w�t`n���<{�����Qw��Qw��]!{��$8��]� �-&A��M����m1 ��2E��=&F�x/� �\���&�yU�-&����j�&o�b ���&�M�� ���i�)&�W�,P1 Z����yUM�$��&A�Y�� ��UL���h��1�OvQ� �l�[�I���j-�e.k��BV[7��e�g���� ������qf]�4s�2k�n��If5������$wr�&�(�x��FUB��P6&�>�o�B����&GE�1�5�b"4s���$���a#�����&�m�^n��;'��gn�uncH��x ��0
^p��W���\8 p���z ac��x �m=�%`�W-E��y ����%�&��x ��;�=�K���y���9S�������%��U�Z��z Ib������^B�Z�������%���n�Qni�q^����KHSk�\/S�1r����ZEy�f��q�K��$��/�k���x �v��/��|z�����U �)$Hyf�;$�������Y���o4��!�����GJ��SL�c
&C�b�G���+�a3��"��We
�QV���u���D��jH��� ��"��-����'��=VF�e�%y[��(V�������D�&�?��� ����a���j5�Q� ��K�X
_M�jH���'�7�����X�jH2�=�"K�E�X
n��16
�j�(m�j�8����S��'2O�����V������j�����,�!Q����'�'!2�D�J�G�L�;0J!0bN�F4_�g�!�7�'���y��?��L$�R�s�DBnp&���4]H�~�~�����F8>������p'��������p��M|���������}��C�~�����m)��`�d\�-%G����F!n�
6l�E���[�eG����ca�u��B���tK�Bo���0R�5���H���rg��;�!��u�����A�N�\/��,o/�oh
���e�
�7�����W�F��m�d��n������q��~vJ�~oj=L��1L1L�y����*���6a�p�',1U8��������pU�kN�iv������
:���8�J?���z����FF�O�������q��?c�������swu�h#��+��s�
��9��{�9�z|G�G3w\���^g��|F�G���6���4��F0[�������4�~��>q�?�j$` �1a$`,�"�@�gE�@�dh��� �~6>����}�� ���#S�D�Z�<�]F��w��i2Z�� �yU��$��FBkfU#�#��)�10���� �zp�H ���T#3�S��:����u=5��o������n�������� ���;�"�m(�#!!�;8��������r���m�G�l��Y��� ��K���z��*�a`�w^f���=���YD37����^\g��lB�Y��\��c�&T�, f�Y�pm� ������{VW�b` VF;�00��Vj ze�1V� ��2$�@?�?P�f��e������;�ib ���(f���W�1�� �������m��U5Z�����U��$��}"��,�����)�wAQ� #n��� c^o��f��N�L1 �T��Y�h��8�f������4��B��0�uDQ��ep�� � a�_��]������yr�o������ �0��W�% � �]y��0����vrn`|��C*z����
� �~wI���I�a��^i����_$\� �-�A��]�B���\1 n��T]D�k��0�d�5 t�k ���z[]@w]1��a���k��}-���:\<�����8���$��n���W�5Hk�Z��If���5����I�5xxJ�gD]�x�k�1���� zmT��k ��q��5@�q5�8�f�\���mn�{����I�O�� �K��`����_��0�8]*Fs���LKs��+�2�L���]}�W����Wh�_�P��r�C*�3��Y��������
�m�z��+$j��>"=}%s[I��WH��� �_�S+�k�_Q/�T_c�TE�����0Z�W ��()V_�7_��[����}C}@��/��+ z_�F��\/�/��+L�Wx����$���B.D�����
If�}���*���S��+`��o��+`����+ zm�/s�i�gpc��q6�>;�i6�:]l�W�n�"_!��F���$���|v:I��8V@8��M��lC�c�EF._/{!9����Ok�����z[��C������pm�y��3�f�!a/ �n[\g���
{�U{�|��_�����r�����M5��]A�\�o#3}�"��}�����
�]��]��6��"�pG��$uQ��7���p,�H;����o��Z?����<���1�/` ��������F������W_!r`4'�p-�3k�ij��E��inu.$?��k�� �?��"���^$0|�$����S$q6nm��[�Lx���[g���)��]�->��F!���y��$8M�NE��������!�o@�d��BJ����� Rr����1����8���������O���a����"���a�p����E��Y�p7,��w�"M��a�p���oC5W
[�#������b+$\�*���
�������� l|[#�1U�+����
��)�@����(�n�q>�V�m���V� �V���n+��l�+�.!b+ �6���Z��
������fHs����;�����|���&W�V�(�3��
��z[� �5.����hc�b+`���7�� ��c�������?H�#i���$���q����G����_�!Yq�{B� �������,�+��6�84s�1�'��A!�68�����!�^n8���>$����pp��
��c��6'T�1 f�c�pm� �
���� ��1 �1?A�p}B�:��9��HD���L"��7�c@O�3�Q���s�B� ��'���c����)��W��uuZ�:��U�$���in��$����sY~�� cl���c�Q��_�`����� zk���c�y��1 ���:pzy{���h�U�$���A� I��{�b$Q1���yr��~X�kVABn�
�1��A�������aO~H~��I��9&��W.��y��.@n�"�qf��vr�vL����]�p�x�X���09"��3H��g�Qh�e �� ���J� ��3�e �`��:}� ���+���`���4� ���V� �q����`�-� ���Q����o��7������\h���$������{Ir-�X<|.���3��x�z!�z�\��� zk���`�6:!�g ��Q��z����d��E�� b$�����}�dVk�w���h��\���&����p���4��B��_��X�&�����ua,$��0����0�kwrL��H�v�0z��r�-���fwrl��������_r�N�H�02�k,��X n��@����hc�#l�!a,p�����a,z�|�c�#]��c���/�Zm9������"��)�������s+[�����"a,d �6z3�Yj����jYr��~.����� ��r=����q���c����P�2���1�7J��2B��[.#d��(�7!K��>%q![�!��?����!�|!#��>����BF.��
��y�����B�0�7����n t��n 0����\�fs}���)����^�/�G���|}~��5�����_"���������u�_| ��tk$��a�/���/����)��7��"�F�X����]x�vl_�qq��`�x���%���k�%�r�Y�/��X��H��/��V��Hr������#/�D�\��
�����F��E(���x�_���k�%P��#����]�X#m}��Xn���'�0���H2u�������|���MG���*�8uX8��Uq��?�0�W~H��?&Fzl�&������st<��� �����0<
��k�4�[n~�r��
O�[�`��@���1{���3�|h���u}`�� ��A1hh�@������������d�5O|@75g�H?����5g��C�6.�u��0�S�E���������MS"z����U}�$��S"zs����U|�����C�\�B�r����S�q����+r�; �iJ����^
������m�p��$m�2'�,�e��g�z8�_9p�����s\rD{$FgG^��&\��~gG~�����+�$e������h^o��������C���a|h� �5��b$\�<