Avoiding superfluous buffer locking during nbtree backwards scans
Attached patch teaches nbtree backwards scans to avoid needlessly
relocking a previously read page/buffer at the point where we need to
consider reading the next page (the page to the left).
Currently, we do this regardless of whether or not we already decided
to end the scan, back when we read the page within _bt_readpage. We'll
relock the page we've just read (and just unlocked), just to follow
its left link, while making sure to deal with concurrent page splits
correctly. But why bother relocking the page, or with thinking about
concurrent splits, if we can already plainly see that we cannot
possibly need to find matching tuples on the left sibling page?
The patch just adds a simple precheck, which works in the obvious way.
Seems like this was a missed opportunity for commit 2ed5b87f96.
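To make the shape of the change concrete, here's a deliberately simplified toy model (not the actual nbtree code -- names like step_backwards and the lock counter are illustrative only). The point is that _bt_readpage already decided moreLeft while it held the lock, so a precheck lets us skip the relock entirely when the scan is over:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a leaf page: just a pointer to its left sibling. */
typedef struct Page
{
    struct Page *left;
} Page;

/*
 * One backwards scan step, taken after the page has been read and
 * unlocked.  "moreLeft" is the end-of-scan decision already made while
 * the page was still locked.  Returns the next page to read (NULL means
 * the scan is over) and counts simulated lock acquisitions in *nlocks.
 */
static Page *
step_backwards(Page *cur, bool moreLeft, bool precheck, int *nlocks)
{
    if (precheck && !moreLeft)
        return NULL;            /* patched behavior: end without relocking */

    (*nlocks)++;                /* relock the just-read page... */
    if (!moreLeft)
        return NULL;            /* ...only to find the scan was already over */
    return cur->left;           /* follow the left link */
}
```

With the precheck, ending the scan costs zero extra lock acquisitions; without it, we pay one lock per scan end, which is exactly the superfluous acquisition the patch removes.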
On HEAD, the following query requires 24 buffer hits (I'm getting a
standard/forward index scan for this):
select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid asc;
However, we don't see that with the backwards scan variant:
select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid desc;
We actually see 30 buffer hits for this on HEAD. Whereas with the
patch, both variants get only 24 buffer hits -- there's parity, at
least in cases like these two.
Note that we only "achieve parity" here because we happen to be using
a SAOP, requiring multiple primitive index scans, each of which ends
with its own superfluous lock acquisition. Backwards scans remain at a
disadvantage with regard to buffer locks acquired in other cases --
cases that happen to involve scanning several sibling leaf pages in
sequence (no change there).
It's probably possible to fix those more complicated cases too. But
that would require a significantly more complicated approach. I
haven't touched existing comments in _bt_readnextpage that contemplate
this possibility.
--
Peter Geoghegan
Attachments:
v1-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-lock.patch
On Fri, 5 Jul 2024 at 18:48, Peter Geoghegan <pg@bowt.ie> wrote:
Attached patch teaches nbtree backwards scans to avoid needlessly
relocking a previously read page/buffer at the point where we need to
consider reading the next page (the page to the left).
+1, LGTM.
This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
+1, LGTM.
This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.
Pushed just now. Thanks for the review!
--
Peter Geoghegan
On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
+1, LGTM.
This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.
Here's a new patch that further improves the situation, so that we
don't try to re-lock the buffer we just accessed when we're stepping
backward in index scans, reducing buffer lock operations in the common
case by 1/2.
It also further decreases the number of code differences between
forward and backward scans in _bt_steppage and _bt_readnextpage, with
mostly only small differences remaining in the code paths shared
between the two scan modes.
The change in lwlock.c is to silence a warning when LWLOCK_STATS is enabled.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
I've validated my results by compiling with LWLOCK_STATS enabled (e.g.
#define LWLOCK_STATS), and testing backward index scans, e.g.
CREATE TABLE test AS SELECT generate_series(1, 1000000) as num;
CREATE INDEX ON test (num);
VACUUM (FREEZE) test;
\c -- reconnect to get fresh lwlock stats
SELECT COUNT(num ORDER BY num DESC) FROM test;
\c -- reconnect to dump stats of previous session
Before this patch I consistently got `BufferContent 0xYYYYYYYY: shacq
2` in the logs, but with this patch that has been decreased to
`BufferContent 0xYYYYYYYY: shacq 1`
Attachments:
v1-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
On Mon, 19 Aug 2024 at 13:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
+1, LGTM.
This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.
Here's a new patch that further improves the situation, so that we
don't try to re-lock the buffer we just accessed when we're stepping
backward in index scans, reducing buffer lock operations in the common
case by 1/2.
Attached is an updated version of the patch, now v2, which fixes some
assertion failures for parallel plans by passing the correct
parameters to _bt_parallel_release for forward scans.
With the test setup below, it unifies the number of buffer accesses
between forward and backward scans:
CREATE TABLE test AS
SELECT generate_series(1, 1000000) as num,
'' j;
CREATE INDEX ON test (num);
VACUUM (FREEZE) test;
SET enable_seqscan = off; SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(j ORDER BY num DESC) -- or ASC
FROM test;
The test case will have an Index Scan, which in DESC case is backward.
Without this patch, the Index Scan will have 7160 accesses for the ASC
ordering, but 9892 in the DESC case (an increase of 2732,
approximately equal to the number of leaf pages in the index), while
with this patch, the IndexScan will have 7160 buffer accesses for both
ASC and DESC ordering.
In my previous mail I used buffer lock stats from an index-only scan
as proof of the patch working. It's been pointed out to me that an
IndexScan is easier to extract this data from, as it drops the pin on
the page after getting some results from a page.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v2-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
On Fri, 30 Aug 2024 at 21:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Mon, 19 Aug 2024 at 13:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
+1, LGTM.
This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.
Here's a new patch that further improves the situation, so that we
don't try to re-lock the buffer we just accessed when we're stepping
backward in index scans, reducing buffer lock operations in the common
case by 1/2.
Attached is an updated version of the patch, now v2, which fixes some
assertion failures for parallel plans by passing the correct
parameters to _bt_parallel_release for forward scans.
I noticed I attached an older version of the patch which still had 1
assertion failure case remaining (thanks cfbot), so here's v3 which
solves that problem.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
On Mon, Sep 02, 2024 at 01:31:55PM +0200, Matthias van de Meent wrote:
I noticed I attached an older version of the patch which still had 1
assertion failure case remaining (thanks cfbot), so here's v3 which
solves that problem.
Peter G., this is in your area of expertise. Could you look at what
Matthias has proposed?
--
Michael
On Fri, Jul 5, 2024 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
On HEAD, the following query requires 24 buffer hits (I'm getting a
standard/forward index scan for this):
select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid asc;
However, we don't see that with the backwards scan variant:
select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid desc;
We actually see 30 buffer hits for this on HEAD. Whereas with the
patch, both variants get only 24 buffer hits -- there's parity, at
least in cases like these two.
I'll now present a similar example that shows the remaining issue that
my patch (which became my commit 3f44959f) didn't address, and how the
issue is fixed by Matthias' more recent follow-up patch:
$ pgbench -i -s 1
Now run:
explain (analyze, buffers, costs off, summary off)
select
  abalance
from
  pgbench_accounts
order by aid;
With a warm cache, this query gets a full index scan requiring 1915
buffer hits. However, we still see worse performance for an equivalent
backwards scan:
explain (analyze, buffers, costs off, summary off)
select
  abalance
from
  pgbench_accounts
order by aid desc;
On master, the latter query gets 2188 buffer hits -- so clearly my
commit 3f44959f left money on the table. These extra buffer hits just
don't make sense.
Matthias' patch (I tested
v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch) will
result in both variants requiring 1915 buffer hits. (Plus, of course,
query execution should be faster with Matthias' new approach.)
Though we do need special extra steps for dealing with concurrent page
splits during backwards scans (steps involving a reread of the
original page when its sibling link turns out to be stale), there
hasn't been a concurrent split here -- which is why I find the
inconsistency to be unnatural. Matthias' patch fixes the problem by
passing _bt_walk_left the left link that will now have been stashed in
the local scan state back when the page was read by _bt_readpage, so
that it (_bt_walk_left) can optimistically *start* there (and *not*
start at the page that has already been read, as on the master
branch).
Of course, the left link might be stale, but it was always inherently
necessary for _bt_walk_left to be able to deal with that. Technically
we're being more optimistic about it now, but that optimism is
extremely well justified (concurrent page splits are rare). Arguably,
this latest patch from Matthias actually makes the code simpler. It
makes the backwards scan case more similar to the forwards scan case.
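A loose model of that optimistic step (purely illustrative -- the real _bt_walk_left also copes with page deletion, half-dead pages, and multiple concurrent splits, and its stale-link fallback rereads and relocks the original page rather than peeking at memory as done here): start at the stashed left link, detect staleness by checking that the candidate still points back at the page we just read, and only fall back to re-deriving the link in the rare stale case:

```c
#include <assert.h>

/* Toy pages addressed by block number; -1 means "no sibling". */
typedef struct MPage
{
    int left;                   /* left sibling's block number */
    int right;                  /* right sibling's block number */
} MPage;

/*
 * Return the block to read next when stepping left from lastcurr.
 * stashed_left is the left link copied into local scan state back when
 * lastcurr was read; it may have gone stale if lastcurr's left sibling
 * was split in the meantime.
 */
static int
walk_left(const MPage *pages, int stashed_left, int lastcurr)
{
    if (stashed_left >= 0 && pages[stashed_left].right == lastcurr)
        return stashed_left;            /* common case: link still good */

    /* stale link: fall back to re-deriving the left sibling */
    return pages[lastcurr].left;
}
```

In the common no-split case this never touches the already-read page again, which is where the saved buffer lock comes from.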
This is another missed opportunity for commit 2ed5b87f96, I'd say --
that 2015 commit left things in a slightly odd in-between state, which
this patch fully corrects.
Note: my example was deliberately constructed to use an index scan
backwards, rather than an index-only scan backwards. While both types
of backwards scan will benefit in the same way (less buffer lock
traffic), you'll only see a reduction in buffers hit for an
EXPLAIN(ANALYZE, BUFFERS) involving a plain index scan backwards. This
is due to the fact that the mechanism added by commit 2ed5b87f96 will
only drop a leaf page buffer pin early for plain index scans, never
for index-only scans. My example also wouldn't make the difference
apparent with an unlogged table, for the same reason -- this
difference isn't really important, but seems worth noting to avoid any
confusion.
A nice side benefit of this approach is that it'll make it easier to
add better coverage for _bt_walk_left. The case where _bt_walk_left
notices a stale left link will still be very rare (we're really not
being all that optimistic in doing things this way), but it'll now be
easier to hit on purpose.
The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
looks reasonable to me. I don't think that there are any impediments
to committing it sometime this week.
--
Peter Geoghegan
On Tue, Oct 8, 2024 at 12:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
looks reasonable to me. I don't think that there are any impediments
to committing it sometime this week.
I see significant benefits for parallel index-only backwards scans
with this patch. It's not that hard to see about a 10% decrease in
execution time. I imagine that there are disproportionate benefits for
the parallel case because the work we're avoiding is work that takes
place while a backend has seized the scan. Not bad!
I've been looking at this patch in more detail. I think that you
(Matthias) have gone too far in the direction of making the backwards
scan handling within _bt_readnextpage match the forwards scan
handling. You fail to update blkno when we have to deal with a
concurrent split/deletion that necessitates reading from another blkno
(when the correct left sibling page to read from changes) -- v3 still
used the potentially-stale blkno passed by _bt_readnextpage's caller.
Attached v4 deals with that issue. It also goes further in the
direction of simplifying the code in and around _bt_readnextpage.
The code in this area has always seemed to me to be quite a lot more
complicated than it really needed to be. Now seems like a good time to
refactor and simplify. It's likely that we'll need to do more work
nearby to enable prefetching during plain index scans, without
regressing the kill_prior_tuple/_bt_killitems mechanism -- Tomas
Vondra is currently working on that. The mechanism in question is
closely tied to the stuff that I'm simplifying now, particularly the
_bt_drop_lock_and_maybe_pin/TID-recycle-safety business.
The main simplification new to v4 is that v4 isolates the need to call
_bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and
its new _bt_readfirstpage "sibling" function. The functions have
similar preconditions, and identical postconditions -- all of which
are now clearly documented.
It's not just _bt_drop_lock_and_maybe_pin that's only called from
these 2 functions in v4. All calls to _bt_readpage (plus their
associated PredicateLockPage calls) are now also isolated to those
same 2 functions. This shift enables removing the
_bt_parallel_readpage function entirely -- the point at the top of
_bt_first where _bt_parallel_readpage used to be called can now call
_bt_readnextpage directly instead. ISTM that having a separate
_bt_parallel_readpage function never made much sense -- it was
papering over the fact that the interfaces it uses are so confusing.
v4 also demotes _bt_steppage to a helper function that sets up a call
to _bt_readnextpage. Note that the 2017 parallel index scan commit
split _bt_steppage in two: it became the current _bt_steppage code,
plus the current _bt_readnextpage code. Making that relationship
explicit seems much clearer. I'm essentially finishing off the process
of splitting _bt_steppage up.
Side note: all of these simplifications were enabled by making the
preconditions for calling _bt_readnextpage work independently of the
scan direction (which was present in earlier versions of the patch
from Matthias). We now always do a
"BTScanPosUnpinIfPinned(so->currPos)" within _bt_steppage, so it's
easy to talk clearly about the preconditions/postconditions for
_bt_readnextpage without directly considering its _bt_steppage caller
(still the main caller, including during parallel index scans).
Performance improvement bonus: We do that
"BTScanPosUnpinIfPinned(so->currPos)" at a slightly different point in
_bt_steppage in v4, as compared to master, even during forwards scans:
we consistently unpin *before* the required next call to
_bt_parallel_seize takes place (unless, of course, it's a serial scan,
which'll never need to call _bt_parallel_seize). Presumably this is
why I see a ~5% increase in throughput for parallel index-only forward
(standard) scans with this v4 -- all parallel scans (backwards and
forwards alike) now receive at least some benefit from this patch. It
makes a certain amount of sense; we should do as little as possible
during any window where our backend has seized the scan.
I do need to do more testing, but the direction of v4 seems promising.
--
Peter Geoghegan
Attachments:
v4-0001-Optimize-and-simplify-nbtree-backwards-scans.patch
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
The main simplification new to v4 is that v4 isolates the need to call
_bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and
its new _bt_readfirstpage "sibling" function. The functions have
similar preconditions, and identical postconditions -- all of which
are now clearly documented.
Worked on this a little more today. Attached is v5, which goes even
further than v4 did. Highlights:
* Completely gets rid of _bt_initialize_more_data by moving its code
into the new _bt_readfirstpage function.
* Adds similar "initialize moreLeft/moreRight" code to the
corresponding point within _bt_readnextpage (that is, at the start of
_bt_readnextpage) -- meaning I moved a bit more code from _bt_steppage
into _bt_readnextpage.
Again, _bt_readnextpage and the new _bt_readfirstpage function are
intended to be "sibling" functions (while _bt_steppage is demoted to a
helper that sets up a call to _bt_readfirstpage). The matching
handling of currPos initialization within _bt_readnextpage and the
_bt_readfirstpage pushes this further than in v4.
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
It doesn't really make sense that (up until now) we've trusted the
shared state that we seized from the parallel scan to kinda be in sync
with the local currPos state (at least its moreLeft/moreRight need to
be in sync, to avoid confusion within _bt_readnextpage). This allowed
me to remove several parallel-only BTScanPosInvalidate calls from
_bt_readnextpage.
This "currPos must be kept in sync with parallel scan state" confusion
is exemplified by the following code snippet (which FWIW was already
removed by earlier revisions of the patch), that appears in the
backwards scan block of _bt_readnextpage on master:
"""
/*
* Should only happen in parallel cases, when some other backend
* advanced the scan.
*/
if (so->currPos.currPage != blkno)
{
BTScanPosUnpinIfPinned(so->currPos);
so->currPos.currPage = blkno;
}
"""
Why should the parallel scan have any concern for what currPos is, in general?
* Now nbtree has only one PredicateLockPage call, inside _bt_readpage.
This saves us an extra BufferGetBlockNumber call. (v4 pushed
PredicateLockPage calls down one layer, v5 pushes them down another
layer still.)
* Improved precondition/postcondition comments, and generally cleaner
separation between _bt_steppage and _bt_readnextpage.
--
Peter Geoghegan
Attachments:
v5-0001-Optimize-and-simplify-nbtree-backwards-scans.patch
On Fri, Oct 11, 2024 at 7:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v5
Now I'm attaching a v6, which further polishes things. Current plan is to
commit something very close to this in the next day or two.
v6 is mostly just further comment polishing. But it also merges together
the two existing _bt_readnextpage loops (for forward and backward scan
directions) into one single loop that handles everything. This is
definitely a win for brevity, and might also be a win for clarity --
but I'm not 100% sure which way I prefer it just yet.
I'll need to make a final decision on this loop merging business
before committing. Anybody else have an opinion, either way?
--
Peter Geoghegan
Attachments:
v6-0001-Optimize-and-simplify-nbtree-backwards-scans.patch
On Wed, 16 Oct 2024 at 20:03, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 11, 2024 at 7:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v5
Now I'm attaching a v6, which further polishes things. Current plan is to
commit something very close to this in the next day or two.
v6 is mostly just further comment polishing. But it also merges together
the two existing _bt_readnextpage loops (for forward and backward scan
directions) into one single loop that handles everything. This is
definitely a win for brevity, and might also be a win for clarity --
but I'm not 100% sure which way I prefer it just yet.
(patch v6)
- BlockNumber btps_scanPage;     /* latest or next page to be scanned */
+ BlockNumber btps_nextScanPage; /* next page to be scanned */
+ BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
+                                 * btps_scanPage */
This reference to btps_scanPage in the comment needs to be updated to
its new name.
(from your v5 mail)
* Now nbtree has only one PredicateLockPage call, inside _bt_readpage.
This saves us an extra BufferGetBlockNumber call. (v4 pushed
PredicateLockPage calls down one layer, v5 pushes them down another
layer still.)
(and seen in patch v6)
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+     if (ScanDirectionIsForward(dir))
+         _bt_parallel_release(scan, so->currPos.nextPage,
+                              so->currPos.currPage);
+     else
+         _bt_parallel_release(scan, so->currPos.prevPage,
+                              so->currPos.currPage);
+ }
+
+ PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
I'm a little bit concerned about this change: I'm not familiar with
predicate locks, PLP, or anything related to SSI, but previously we
called PLP in parallel scans while we still had hold of the scan,
while we now call PLP only after letting other backends do things,
allowing PLPs for this scan to happen concurrently with other backends
of this scan. In principle, that looks like an improvement in
parallelism by reducing the work done in the critical path.
However, before, we would always get predicate locks in ascending (or
descending for backward scans) value order, but now that strict
keyspace order access has been released, an unfortunately timed
descheduling of the process between _bt_p_release and PLP's guts means
lock acquisition would be only approximately in index leaf page
order. If lock acquisition order matters in the predicate lock
system, then this will likely be a new cause of lock
conflicts/deadlocks.
If that's a real issue (by which I mean predicate locking is indeed
sensitive to acquisition order) then this call to PredicateLockPage
must happen before _bt_p_release, so that users don't get conflicts
caused only by bad timing issues in single-directional index scans.
Apart from these two comments, LGTM.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Thu, Oct 17, 2024 at 5:13 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I'm a little bit concerned about this change: I'm not familiar with
predicate locks, PLP, or anything related to SSI, but previously we
called PLP in parallel scans while we still had hold of the scan,
while we now call PLP only after letting other backends do things,
allowing PLPs for this scan to happen concurrently with other backends
of this scan. In principle, that looks like an improvement in
parallelism by reducing the work done in the critical path.
It sort of is, but not really. FWIW, there were two thoughts behind this:
1. Try to avoid gratuitous BufferGetBlockNumber() calls (use a stashed
BlockNumber instead), since nbtree has a tendency to do too many of
those (something that Andres complained about not long ago).
2. Try to do as little as possible while the backend has seized the
scan, on general principle. If only to set a good example going
forward -- loudly advertising when the scan is seized should also help.
I doubt it matters very much if we call PredicateLockPage() while the
parallel scan is seized -- I wouldn't expect avoiding it (waiting
until the parallel scan has been released to call PredicateLockPage)
will measurably help performance. But I can't see how it could
possibly be incorrect, either.
There is no reason to think that the precise order that we call
PredicateLockPage() for each page matters, in general. Just as it
doesn't matter if the index is scanned forwards or backwards.
Apart from these two comments, LGTM.
Pushed something very close to v6 several hours ago.
Thanks
--
Peter Geoghegan
Hi, thanks for working on these improvements.
I noticed an unexpected behavior where the index scan continues
instead of stopping, even when it detects that there are no tuples
that match the conditions. (I observed this while reviewing the skip
scan patch, though it isn't directly related to this issue.)
On 2024-10-12 08:29, Peter Geoghegan wrote:
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
IIUC, the above change is the root cause. Commit 1bd4bc8 adds a reset
of the currPos state in _bt_parallel_seize(). However, this change can
overwrite currPos.moreRight, which should be preserved before calling
_bt_readnextpage().
* Test case
-- Prepare
DROP TABLE IF EXISTS test;
CREATE TABLE test (smallint smallint, bool bool);
INSERT INTO test (SELECT -20000+i%40000, random()>0.5 FROM
generate_series(1, 1_000_000) s(i));
CREATE INDEX test_smallint ON test (smallint);
VACUUM ANALYZE test;
-- Check the number of pages of the index
=# SELECT relpages FROM pg_class WHERE relname = 'test_smallint';
relpages
----------
937
(1 row)
-- Test
=# SET max_parallel_workers = 0;
=# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT COUNT(*) FROM test WHERE
smallint < -10000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=5170.23..5170.24 rows=1 width=8) (actual
time=71.352..71.402 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=934
-> Gather (cost=5170.01..5170.22 rows=2 width=8) (actual
time=71.344..71.395 rows=1 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 0
Buffers: shared hit=934
-> Partial Aggregate (cost=4170.01..4170.02 rows=1 width=8)
(actual time=71.199..71.199 rows=1 loops=1)
Output: PARTIAL count(*)
Buffers: shared hit=934
-> Parallel Index Only Scan using test_smallint on
public.test (cost=0.42..3906.27 rows=105495 width=0) (actual
time=0.062..49.137 rows=250000 loops=1)
Output: "smallint"
Index Cond: (test."smallint" < '-10000'::integer)
Heap Fetches: 0
Buffers: shared hit=934
-- This is not the result I expected. Almost all relpages are
-- being read to retrieve only 25% of the tuples. Without commit
-- 1bd4bc8, the number was '236' in my environment.
Planning Time: 0.105 ms
Execution Time: 71.454 ms
(18 rows)
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Thu, Nov 7, 2024 at 5:44 AM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2024-10-12 08:29, Peter Geoghegan wrote:
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
IIUC, the above change is the root cause. The commit 1bd4bc8 adds a
reset of
the currPos state in _bt_parallel_seize(). However, this change can
overwrite
currPos.moreRight which should be preserved before calling
_bt_readnextpage().
I can easily recreate the problem using your test case. I agree with
your diagnosis. Thanks for the report!
We could fix this by going back to how things were before -- don't
unset moreLeft/moreRight within _bt_parallel_seize. But that doesn't
seem like a good idea to me. It would be weird to depend on the
currPos state from the last _bt_readpage call, even after seizing the
scan for the next page in line. If the scan has already finished
(according to moreLeft/moreRight), then why even try to seize the scan
once again? Why not just end the scan instead (including its call to
_bt_parallel_done)?
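A minimal sketch of that decision, under hypothetical names (the real logic lives in nbtree's parallel scan code, and the real position state is BTScanPosData): consult the worker's local moreLeft/moreRight flags first, and only bother seizing the shared scan when they say there could be more pages in the current direction:

```c
#include <assert.h>
#include <stdbool.h>

/* Local (per-worker) scan position flags, modeled on BTScanPosData. */
typedef struct LocalPos
{
    bool moreLeft;              /* could matching tuples exist to the left? */
    bool moreRight;             /* ...or to the right? */
} LocalPos;

/*
 * Should this worker seize the shared parallel scan for the next page,
 * or can it simply end its scan based on purely local state?
 */
static bool
need_seize_scan(const LocalPos *pos, bool forward)
{
    return forward ? pos->moreRight : pos->moreLeft;
}
```

When this returns false, the worker can end the scan (including calling _bt_parallel_done) without ever touching the shared state for another page.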
* Test case
-- Without commit 1bd4bc8,
the number was '236' in my environment.
Planning Time: 0.105 ms
Execution Time: 71.454 ms
(18 rows)
With the attached patch, I get back to the good/correct behavior.
I'm not sure that this is the exact right way to fix the bug, but it
seems like the right general approach.
--
Peter Geoghegan
Attachments:
v1-0001-Fix-confusion-with-nbtree-parallel-scan-currPos-s.patch
On Thu, Nov 7, 2024 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
It would be weird to depend on the
currPos state from the last _bt_readpage call, even after seizing the
scan for the next page in line. If the scan has already finished
(according to moreLeft/moreRight), then why even try to seize the scan
once again? Why not just end the scan instead (including its call to
_bt_parallel_done)?
The problem with what I said here was that v1 didn't go far enough in
this direction. v1 would still needlessly seize the scan whenever
_bt_steppage's blkno was involved. This was still correct, I think --
but it wasn't clear.
Attached revision v2 pushes the fix further in this direction. It
explicitly documents that parallel _bt_readnextpage callers are
allowed to use their so->currPos state (including
blkno/nextPage/prevPage) to help _bt_readnextpage to decide whether it
can end the scan right away. However, parallel index scan callers
still won't be allowed to use this same so->currPos state to decide
what page to go to next -- _bt_readnextpage really will need to seize
the scan to get an authoritative blkno to make that part safe
(actually, one parallel scan _bt_readnextpage caller still seizes the
scan for itself, which is the one caller where _bt_readnextpage fully
trusts caller's blkno + lastcurrblkno).
I think that this approach is a win for readability. It makes the
parallel case closer to the serial case, which is generally a good
thing: it allows _bt_steppage to not care at all about whether or not
the scan is a parallel scan. More importantly, it makes the exact
extent to which parallel scans rely on their so->currPos state a great
deal clearer. That was the underlying problem here all along:
confusion about using the so->currPos state to end the scan vs. using
the so->currPos state to decide which leaf page to visit next. These
are two different things, and only the first one is generally safe.
One problem that my v2 approach created (at least when I first started
work on it) was that the new _bt_readnextpage loop code was
ill-prepared for a blkno that's P_NONE when we seize the scan -- the
part of the loop where that's checked for came first, before the scan
is seized. I could have fixed that new problem by adding an extra
defensive is-blkno-P_NONE check to _bt_readnextpage. But that's not
what I settled on for v2. It makes more sense to just stop allowing
the parallel scan's btps_nextScanPage to become P_NONE in the first
place. The fact that we ever allowed that never made much sense to me
-- why wouldn't we just stop the scan as soon as we reach the point
where the "next block" is P_NONE, which is only possible on the
rightmost and leftmost page? So that's what we do. It's simpler all
around.
--
Peter Geoghegan
Attachments:
v2-0001-Fix-confusion-with-nbtree-parallel-scan-currPos-s.patch
On 2024-11-08 07:42, Peter Geoghegan wrote:
Attached revision v2 pushes the fix further in this direction. It
explicitly documents that parallel _bt_readnextpage callers are
allowed to use their so->currPos state (including
blkno/nextPage/prevPage) to help _bt_readnextpage to decide whether it
can end the scan right away. However, parallel index scan callers
still won't be allowed to use this same so->currPos state to decide
what page to go to next -- _bt_readnextpage really will need to seize
the scan to get an authoritative blkno to make that part safe
(actually, one parallel scan _bt_readnextpage caller still seizes the
scan for itself, which is the one caller where _bt_readnextpage fully
trusts caller's blkno + lastcurrblkno).
Thank you! I've confirmed that the v2 patch fixed the bug. As you
mentioned, I also feel that the v2 patch is now easier to understand.
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?
@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
        /*
         * Should not mark parallel scan done when there's still a pending
-        * primitive index scan (defensive)
+        * primitive index scan
         */
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
Thank you! I've confirmed that the v2 patch fixed the bug. As you
mentioned, I also feel that the v2 patch is now easier to understand.
Great. I pushed something quite similar to v2 just now. Thanks!
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?
@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
        /*
         * Should not mark parallel scan done when there's still a pending
-        * primitive index scan (defensive)
+        * primitive index scan
         */
That's a good question.
Prior to this bugfix, the check of so->needPrimScan from within
_bt_parallel_done() was defensive; it wasn't strictly necessary. It
could have been "Assert(!so->needPrimScan)" instead (I guess that I
didn't make it into an assert like this because _bt_parallel_done
worked a little like this, prior to Postgres 17, when we had a
primscan counter instead of the current so->needPrimScan flag). But
that's no longer the case with the bugfix in place; the
so->needPrimScan check really is strictly necessary now.
It's hard to see why this is. Notice that _bt_parallel_seize() will
just return false when another primitive index scan is required. Prior
to this bugfix, we'd seize the parallel scan within _bt_steppage,
which could only succeed when "!so->needPrimScan" (which was actually
verified by an assertion that just got removed). With this bug fix,
nothing stops the so->needPrimScan flag from still being set from back
when we called _bt_readpage for the so->currPos we're using. And so,
as I said, the check of so->needPrimScan from within
_bt_parallel_done() became strictly necessary (not just defensive) --
since so->needPrimScan definitely can be set when we call
_bt_parallel_done, and we definitely don't want to *really* end the
whole top-level scan when it is set (we must never confuse the end of
one primitive index scan with the end of the whole top-level parallel
index scan).
--
Peter Geoghegan
On 2024-11-09 03:40, Peter Geoghegan wrote:
On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?
@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
        /*
         * Should not mark parallel scan done when there's still a pending
-        * primitive index scan (defensive)
+        * primitive index scan
         */
That's a good question.
Prior to this bugfix, the check of so->needPrimScan from within
_bt_parallel_done() was defensive; it wasn't strictly necessary. It
could have been "Assert(!so->needPrimScan)" instead (I guess that I
didn't make it into an assert like this because _bt_parallel_done
worked a little like this, prior to Postgres 17, when we had a
primscan counter instead of the current so->needPrimScan flag). But
that's no longer the case with the bugfix in place; the
so->needPrimScan check really is strictly necessary now.
It's hard to see why this is. Notice that _bt_parallel_seize() will
just return false when another primitive index scan is required. Prior
to this bugfix, we'd seize the parallel scan within _bt_steppage,
which could only succeed when "!so->needPrimScan" (which was actually
verified by an assertion that just got removed). With this bug fix,
nothing stops the so->needPrimScan flag from still being set from back
when we called _bt_readpage for the so->currPos we're using. And so,
as I said, the check of so->needPrimScan from within
_bt_parallel_done() became strictly necessary (not just defensive) --
since so->needPrimScan definitely can be set when we call
_bt_parallel_done, and we definitely don't want to *really* end the
whole top-level scan when it is set (we must never confuse the end of
one primitive index scan with the end of the whole top-level parallel
index scan).
I understand, thanks to your explanation.
Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
_bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is
called with so->needPrimScan=true. Prior to this bugfix,
_bt_parallel_seize() was called after _bt_readpage() set
so->needPrimScan=true, and it just returned false without calling
_bt_parallel_done().
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Sun, Nov 10, 2024 at 9:53 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
I understand, thanks to your explanation.
Cool.
Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
_bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is
called with so->needPrimScan=true. Prior to this bugfix,
_bt_parallel_seize() was called after _bt_readpage() set
so->needPrimScan=true, and it just returned false without calling
_bt_parallel_done().
You influenced me to add something about this to my follow-up commit caca6d8d:
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2230,8 +2230,9 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
!so->currPos.moreRight : !so->currPos.moreLeft))
{
/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
+ Assert(so->currPos.currPage == lastcurrblkno && !seized);
BTScanPosInvalidate(so->currPos);
- _bt_parallel_done(scan);
+ _bt_parallel_done(scan); /* iff !so->needPrimScan */
return false;
}
I added "iff !so->needPrimScan" to draw attention to the fact that we
don't necessarily really end the parallel scan when _bt_parallel_done
is called.
--
Peter Geoghegan