index prefetching
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
Motivation
----------
Imagine we have a huge table (much larger than RAM), with an index, and
that we're doing a regular index scan (e.g. using a btree index). We
first walk the index to the leaf page, read the item pointers from the
leaf page and then start issuing fetches from the heap.
The index access is usually pretty cheap, because non-leaf pages are
very likely cached, so we may do one I/O for the leaf page. But the
fetches from the heap are likely very expensive - unless the table is
clustered, we'll do a random I/O for each item pointer. Easily ~200 or
more I/O requests per leaf page. The problem is index scans do these
requests synchronously at the moment - we get the next TID, fetch the
heap page, process the tuple, continue to the next TID etc.
That is slow and can't really leverage the bandwidth of modern storage,
which requires longer queues. This patch aims to improve this by async
prefetching.
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly prefetching was implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
But there are three shortcomings in this logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
2) Our estimates / planning are not perfect, so we may easily pick an
index scan instead of a bitmap scan. It'd be nice to limit the damage a
bit by still prefetching.
3) There are queries that can't do a bitmap scan (at all, or because
it's hopelessly inefficient). Consider queries that require ordering, or
queries by distance with GiST/SP-GiST index.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
The easiest thing would be to just do prefetching from the btree code.
But then I realized there's no particular reason why other index types
(except for GIN, which only allows bitmap scans) couldn't do prefetching
too. We could have a copy in each AM, of course, but that seems sloppy
and also a violation of layering. After all, bitmap heap scans do prefetch
from the executor, so AM seems way too low level.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
So what I did is introduce an IndexPrefetch struct, which is part of
IndexScanDesc, maintaining all the info about prefetching for that
particular scan - current/maximum distance, progress, etc.
It also contains two AM-specific callbacks (get_range and get_block),
which return the valid range of indexes (into the internal array) and
the block number for a given index.
This mostly does the trick, although index_prefetch() is still called
from the amgettuple() functions. That seems wrong - we should call it
from indexam.c right after calling amgettuple.
Problems / Open questions
-------------------------
There are a couple of issues I ran into. I'll try to list them in the order
of importance (most serious ones first).
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
I'm also not entirely sure the way this interfaces with the AM (through
the get_range / get_block callbacks) is very elegant. It did the trick,
but it seems a bit cumbersome. I wonder if someone has a better/nicer
idea how to do this ...
3) prefetch distance
I think we can do various smart things about the prefetch distance.
The current code does about the same thing bitmap scans do - it starts
with distance 0 (no prefetching), and then simply ramps the distance up
until the maximum value from get_tablespace_io_concurrency(). Which is
either effective_io_concurrency, or per-tablespace value.
I think we could be a bit smarter, and also consider e.g. the estimated
number of matching rows (but we shouldn't be too strict, because it's
just an estimate). We could also track some statistics for each scan and
use that during a rescans (think index scan in a nested loop).
But the patch doesn't do any of that now.
4) per-leaf prefetching
The code only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access the heap. And if that happens, this leads to the bizarre situation
that an IOS is slower than a regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
Benchmarks
----------
1) OLTP
For OLTP, I tested different queries with various index types, on data
sets constructed to have a certain number of matching rows, forcing
different types of query plans (bitmap, index, seqscan).
The data sets are ~34GB, which is much more than the available RAM (8GB).
For example for BTREE, we have a query like this:
SELECT * FROM btree_test WHERE a = $v
with data matching 1, 10, 100, ..., 100000 rows for each $v. The results
look like this:
    rows   bitmapscan    master   patched   seqscan
       1         19.8      20.4      18.8   31875.5
      10         24.4      23.8      23.2   30642.4
     100         27.7      40.0      26.3   31871.3
    1000         45.8     178.0      45.4   30754.1
   10000        171.8    1514.9     174.5   30743.3
  100000       1799.0   15993.3    1777.4   30937.3
This says that the query takes ~31s with a seqscan, 1.8s with a bitmap
scan and 16s with an index scan (on master). With the prefetching patch,
it takes ~1.8s, i.e. about the same as the bitmap scan.
I don't know where exactly would the plan switch from index scan to
bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd
bet most of the cases would do plain index scan.
For a query with ordering:
SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n
the results look a bit different:
    rows   bitmapscan    master   patched   seqscan
       1      52703.9      19.5      19.5   31145.6
      10      51208.1      22.7      24.7   30983.5
     100      49038.6      39.0      26.3   32085.3
    1000      53760.4     193.9      48.4   31479.4
   10000      56898.4    1600.7     187.5   32064.5
  100000      50975.2   15978.7    1848.9   31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
The results for other index types (HASH, GiST, SP-GiST) follow roughly
the same pattern. See the attached PDF for more charts, and [1] for the
complete results.
Benchmark / TPC-H
-----------------
I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
  query   serial   parallel
      1     101%        99%
      2     119%       100%
      3     100%        99%
      4     101%       100%
      5     101%       100%
      6      12%        99%
      7     100%       100%
      8      52%        67%
     10     102%       101%
     11     100%        72%
     12     101%       100%
     13     100%       101%
     14      13%       100%
     15     101%       100%
     16      99%        99%
     17      95%       101%
     18     101%       106%
     19      30%        40%
     20      99%       100%
     21     101%       100%
     22     101%       107%
The percentage is (timing patched / timing master), so <100% means
faster and >100% means slower.
How the queries are affected depends on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.
There are a couple of regressions too. I believe those are due to doing
too much prefetching in some cases, and the heuristics mentioned earlier
should eliminate most of them.
regards
[1]: https://github.com/tvondra/index-prefetch-tests
[2]: https://github.com/tvondra/postgres/tree/dev/index-prefetch
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.
In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
4) per-leaf prefetching
The code is restricted only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and
tell it that we processed 1 leaf page, even though it was actually 5 leaf
pages (maybe there would also have to be restrictions for the markpos stuff).
the results look a bit different:
    rows   bitmapscan    master   patched   seqscan
       1      52703.9      19.5      19.5   31145.6
      10      51208.1      22.7      24.7   30983.5
     100      49038.6      39.0      26.3   32085.3
    1000      53760.4     193.9      48.4   31479.4
   10000      56898.4    1600.7     187.5   32064.5
  100000      50975.2   15978.7    1848.9   31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
--
Peter Geoghegan
On 6/8/23 20:56, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:
Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
There are a couple more ~2008 threads mentioning prefetching, bitmap
scans and even regular index scans (like [2]). None of them even
mentions SAOP stuff at all.
[1]: /messages/by-id/871wa17vxb.fsf@oxford.xeocode.com
[2]: /messages/by-id/87wsnnz046.fsf@oxford.xeocode.com
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe disable prefetching for
SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
Yeah, although all the tests were done with a random table generated
like this:
insert into btree_test select $d * random(), md5(i::text)
from generate_series(1, $ROWS) s(i)
So it's damn random anyway. Although maybe it's random even for the
bitmap case, so maybe if the SAOP had some sort of locality, that'd be
an advantage for the bitmap scan. But what would such a table look like?
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)
select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"? Wasn't the agreement at
the unconference session that we'd not tweak costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
If we pick index scan and enable prefetching, causing a regression (e.g.
for the SAOP with locality), that'd be bad. But how is that related to
viability of index scans over bitmap index scans?
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.

In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
There are two reasons why I think the AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
4) per-leaf prefetching
The code is restricted to prefetching items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.

I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
the results look a bit different:
rows bitmapscan master patched seqscan
1 52703.9 19.5 19.5 31145.6
10 51208.1 22.7 24.7 30983.5
100 49038.6 39.0 26.3 32085.3
1000 53760.4 193.9 48.4 31479.4
10000 56898.4 1600.7 187.5 32064.5
100000 50975.2 15978.7 1848.9 31587.1

This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).

Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).

Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
Right, I forgot to mention that benefit. Although, that'd only happen if
we actually choose index scans in more places, which I guess would
require tweaking the costing model ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.

So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
So if you're planning to work on this for PG17, collaborating on it
would be great.

For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)

select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
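A tiny model of that "small lru array" idea might look like this (a hypothetical Python sketch for illustration only, not code from the patch; the class and method names are invented):

```python
from collections import OrderedDict

class PrefetchLRU:
    """Small LRU of recently prefetched block numbers, used to skip
    redundant prefetch requests for heap blocks we just issued."""

    def __init__(self, size=8):
        self.size = size
        self.recent = OrderedDict()

    def should_prefetch(self, block):
        if block in self.recent:
            self.recent.move_to_end(block)   # refresh recency
            return False                     # recently issued; skip it
        self.recent[block] = True
        if len(self.recent) > self.size:
            self.recent.popitem(last=False)  # evict the oldest entry
        return True
```

The point of the fixed small size is that index scans tend to hit the same heap block in short runs, so even a handful of slots catches most duplicates without unbounded memory.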
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
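As an illustration, the sort-then-restore idea can be modeled in a few lines (a hypothetical Python sketch; `mini_bitmap_fetch` and the `read_block` callback are invented for the example and are not from the patch):

```python
def mini_bitmap_fetch(tids, read_block):
    """Fetch heap tuples for one leaf page's TIDs, touching each heap
    block exactly once, while still returning results in index order.

    tids: list of (block, offset) pairs in logical/index order.
    read_block: callable standing in for a heap page read ("pin").
    """
    # Visit TIDs grouped by block so each block is read exactly once.
    order = sorted(range(len(tids)), key=lambda i: tids[i])
    results = [None] * len(tids)
    cached_block, cached_page = None, None
    for i in order:
        blk, off = tids[i]
        if blk != cached_block:            # one read per distinct block
            cached_block, cached_page = blk, read_block(blk)
        results[i] = cached_page[off]      # slot back into index order
    return results
```

The scratch `order` array is what restores the original logical order, so the executor above never notices that the heap accesses happened out of order.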
I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.
Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.
Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LP_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?
While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.
We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan, and bitmap
index scan.
(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows matching a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)
Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.

I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.
Wasn't the agreement at the unconference session that we'd not tweak
costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.
One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.
The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Definitely.
There are two reasons why I think the AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.
I think this is acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
--
Peter Geoghegan
Attachments:
pg_type_typname_nsp_index_index_example_log.txt (text/plain)
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
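The prefetch-at-insertion approach described above could be modeled like this (a hypothetical Python sketch using a binary heap in place of the pairing heap; not the actual GiST code):

```python
import heapq

def distance_scan_with_prefetch(candidates, prefetch):
    """Model of prefetching distance-query items when they are pushed
    onto the ordering heap, before they come back out in distance order.

    candidates: iterable of (distance, heap_block) pairs in scan order.
    prefetch: callable standing in for an async prefetch request.
    """
    heap = []
    for distance, heap_block in candidates:
        prefetch(heap_block)                   # issue prefetch at push time
        heapq.heappush(heap, (distance, heap_block))
    while heap:                                # yield in distance order
        yield heapq.heappop(heap)
```

The trade-off is visible here: prefetches go out in insertion order, which may be arbitrarily far ahead of (or behind) the distance-ordered consumption, so the prefetch distance is not really under control - but it is enough to demonstrate the benefit.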
I think it'd be perfectly fair to just not tackle distance queries for now.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
4) per-leaf prefetching
The code is restricted to prefetching items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
That should be easy to do, right?
Benchmark / TPC-H
-----------------

I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):

query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%

The percentage is (timing patched / master), so <100% means faster, >100%
means slower.

The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.

My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.

There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
Greetings,
Andres Freund
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:
create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;
This gives me a table with 4 pages, and an index with 2 pages.
The following query selects about half of the rows from the table:
select * from bitmap_parity_test where randkey < 500;
If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.
I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 5 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.
Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.
(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10) from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)
--
Peter Geoghegan
On 6/9/23 02:06, Andres Freund wrote:
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.

I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.
My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering about is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
4) per-leaf prefetching
The code is restricted to prefetching items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.

That should be easy to do, right?
It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway so it seems like it wouldn't add a lot
of overhead either.
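A sketch of the visibility-aware prefetch idea (a hypothetical Python model; in the real C code this would be a visibility-map check along the lines of VM_ALL_VISIBLE before issuing PrefetchBuffer, and the function names here are invented):

```python
def prefetch_for_index_only_scan(blocks, all_visible, prefetch):
    """For an index-only scan, only prefetch heap blocks that will
    actually be visited, i.e. those whose visibility-map bit is unset.

    blocks: heap block numbers referenced by the current leaf page.
    all_visible: callable standing in for the VM lookup.
    prefetch: callable standing in for an async prefetch request.
    """
    issued = []
    for blk in blocks:
        if not all_visible(blk):   # a heap fetch will be needed here
            prefetch(blk)
            issued.append(blk)
    return issued
```

Since the scan must consult the VM for each block anyway to decide whether a heap fetch is needed, doing the same check at prefetch time should add little overhead, as noted above.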
Benchmark / TPC-H
-----------------

I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):

query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%

The percentage is (timing patched / master), so <100% means faster, >100%
means slower.

The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.

My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.

There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/9/23 01:38, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.

So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.

What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Did you mean parallel index scan or bitmap index scan?
But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why would it make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
So if you're planning to work on this for PG17, collaborating on it
would be great.

For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
OK
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)

select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in unexpected ways (say,
seem like they should improve but don't).
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:
[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
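To make the reorder-then-restore idea concrete, here is a hedged sketch in Python (all names are hypothetical; the real thing would be C inside the index scan machinery). It sorts a leaf page's TIDs by heap block, issues one prefetch per distinct block, and still returns tuples in the original index-wise order:

```python
# Hypothetical sketch of per-leaf-page TID reordering: prefetch each
# distinct heap block once, but preserve the index-wise output order.
# A TID is modeled as a (block, offset) pair.

def process_leaf_page(tids, prefetch, fetch_tuple):
    # Sort TID positions by heap block so all TIDs pointing into the
    # same heap page are handled together (one prefetch/pin each).
    order = sorted(range(len(tids)), key=lambda i: tids[i][0])

    results = [None] * len(tids)
    last_block = None
    for i in order:
        block = tids[i][0]
        if block != last_block:
            prefetch(block)  # issue a posix_fadvise-style hint, once per block
            last_block = block
        results[i] = fetch_tuple(tids[i])

    # results is already in the original logical/index order
    return results
```

Per the LIMIT concern, tuples fetched eagerly this way from a block may of course turn out to be wasted work if the scan stops early.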
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.

I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.

Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.

Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LP_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.

We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan, and bitmap
index scan.

(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows matching a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)

Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
I'm not sure I understand all the nuance here, but the thing I take away
is to add tests with different levels of correlation, and probably also
some multi-column indexes.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.

I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.

Wasn't the agreement at the unconference session that we'd not tweak
costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?

I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.

One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.

The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
Agreed.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.

Definitely.
There are two aspects why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.

Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.

Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
OK. Thanks for all the comments.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

Did you mean parallel index scan or bitmap index scan?
I meant parallel index scan (also parallel bitmap index scan). Note
that nbtree parallel index scans have special ScalarArrayOpExpr
handling code.
ScalarArrayOpExpr is kind of special -- it is simultaneously one big
index scan (to the executor), and lots of small index scans (to
nbtree). Unlike the queries that you've looked at so far, which really
only have one plausible behavior at execution time, there are many
ways that ScalarArrayOpExpr index scans can be executed at runtime --
some much faster than others. The nbtree implementation can in
principle reorder how it processes ranges from the key space (i.e.
each range of array elements) with significant flexibility.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
Hopeless might have been too strong of a word. More like it'd fall far
short of what is possible to do with a ScalarArrayOpExpr with a given
high end server.
The quality of the implementation (including prefetching) could make a
huge difference to how well we make use of the available hardware
resources. A really high quality implementation of ScalarArrayOpExpr +
prefetching can keep the system busy with useful work, which is less
true with other types of queries, which have inherently less
predictable I/O (and often have less I/O overall). What could be more
amenable to predicting I/O patterns than a query with a large IN()
list, with many constants that can be processed in whatever order
makes sense at runtime?
What I'd like to do with ScalarArrayOpExpr is to teach nbtree to
coalesce together those "small index scans" into "medium index scans"
dynamically, where that makes sense. That's the main part that's
missing right now. Dynamic behavior matters a lot with
ScalarArrayOpExpr stuff -- that's where the challenge lies, but also
where the opportunities are. Prefetching builds on all that.
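As a toy illustration of what coalescing "small index scans" into "medium index scans" might mean (this is purely a hypothetical sketch, not the actual nbtree design), one could group sorted IN() constants whose values are close enough that they plausibly land on the same leaf page, and descend the tree once per group instead of once per constant:

```python
# Hypothetical sketch: group sorted ScalarArrayOpExpr constants into
# ranges, so nearby values are served by one index descent instead of
# one descent per value. "gap" stands in for whatever heuristic decides
# that two values likely share a leaf page.

def coalesce_saop_keys(values, gap=10):
    groups = []
    for v in sorted(set(values)):
        if groups and v - groups[-1][-1] <= gap:
            groups[-1].append(v)   # extend the current "medium scan"
        else:
            groups.append([v])     # start a new scan
    # each group would become one range scan over [min(g), max(g)]
    return [(g[0], g[-1]) for g in groups]
```

So an IN (999, 1000, 1001, 1002, 5000) list would become two descents rather than five; the real decision would of course have to happen dynamically, based on what the leaf pages actually contain.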
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
If I can figure out a way of getting ScalarArrayOpExpr to visit each
leaf page exactly once, that might be enough to make things work
really well most of the time. Maybe it won't even be necessary to
coordinate very much, in the end. Unsure.
I've already done a lot of work that tries to minimize the chances of
regular (non-ScalarArrayOpExpr) queries accessing more than a single
leaf page, which will help your strategy of just prefetching items
from a single leaf page at a time -- that will get you pretty far
already. Consider the example of the tenk2_hundred index from the
bt_page_items documentation. You'll notice that the high key for the
page shown in the docs (and every other page in the same index) nicely
makes the leaf page boundaries "aligned" with natural keyspace
boundaries, due to suffix truncation. That helps index scans to access
no more than a single leaf page when accessing any one distinct
"hundred" value.
We are careful to do the right thing with the "boundary cases" when we
descend the tree, too. This _bt_search behavior builds on the way that
suffix truncation influences the on-disk structure of indexes. Queries
such as "select * from tenk2 where hundred = ?" will each return 100
rows spread across almost as many heap pages. That's a fairly large
number of rows/heap pages, but we still only need to access one leaf
page for every possible constant value (every "hundred" value that
might be specified as the ? in my point query example). It doesn't
matter if it's the leftmost or rightmost item on a leaf page -- we
always descend to exactly the correct leaf page directly, and we
always terminate the scan without having to move to the right sibling
page (we check the high key before going to the right page in some
cases, per the optimization added by commit 29b64d1d).
The same kind of behavior is also seen with the TPC-C line items
primary key index, which is a composite index. We want to access the
items from a whole order in one go, from one leaf page -- and we
reliably do the right thing there too (though with some caveats about
CREATE INDEX). We should never have to access more than one leaf page
to read a single order's line items. This matters because it's quite
natural to want to access whole orders with that particular
table/workload (it's also unnatural to only access one single item
from any given order).
Obviously there are many queries that need to access two or more leaf
pages, because that's just what needs to happen. My point is that we
*should* only do that when it's truly necessary on modern Postgres
versions, since the boundaries between pages are "aligned" with the
"natural boundaries" from the keyspace/application. Maybe your testing
should verify that this effect is actually present, though. It would
be a shame if we sometimes messed up prefetching that could have
worked well due to some issue with how page splits divide up items.
CREATE INDEX is much less smart about suffix truncation -- it isn't
capable of the same kind of tricks as nbtsplitloc.c, even though it
could be taught to do roughly the same thing. Hopefully this won't be
an issue for your work. The tenk2 case still works as expected with
CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the
TPC-C line items PK will leave the index with some "orders" (or
whatever the natural grouping of things is) that span more than a
single leaf page, which is undesirable, and might hinder your
prefetching work. I wouldn't mind fixing that if it turned out to hurt
your leaf-page-at-a-time prefetching patch. Something to consider.
We can fit at most 17 TPC-C orders on each order line PK leaf page.
Could be as few as 15. If we do the wrong thing with prefetching for 2
out of every 15 orders then that's a real problem, but is still subtle enough
to easily miss with conventional benchmarking. I've had a lot of success
with paying close attention to all the little boundary cases, which is why
I'm kind of zealous about it now.
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.

I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I get that it could be invasive. I have the sense that just pinning
the same heap page more than once in very close succession is just the
wrong thing to do, with or without prefetching.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:

[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
Yeah, that's possible. But isn't that par for the course? Any
optimization that involves speculation (including all prefetching)
comes with similar risks. They can be managed.
I don't think that we'd literally order by TID...we wouldn't change
the order that each heap page was *initially* pinned. We'd just
reorder the tuples minimally using an approach that is sufficient to
avoid repeated pinning of heap pages during processing of any one leaf
page's heap TIDs. ISTM that the risk of wasting work is limited to
wasting cycles on processing extra tuples from a heap page that we
definitely had to process at least one tuple from already. That
doesn't seem particularly risky, as speculative optimizations go. The
downside is bounded and well understood, while the upside could be
significant.
I really don't have that much confidence in any of this just yet. I'm
not trying to make this project more difficult. I just can't help but
notice that the order that index scans end up pinning heap pages
already has significant problems, and is sensitive to things like
small amounts of heap fragmentation -- maybe that's not a great basis
for prefetching. I *really* hate any kind of sharp discontinuity,
where a minor change in an input (e.g., from minor amounts of heap
fragmentation) has outsized impact on an output (e.g., buffers
pinned). Interactions like that tend to be really pernicious -- they
lead to bad performance that goes unnoticed and unfixed because the
problem effectively camouflages itself. It may even be easier to make
the conservative (perhaps paranoid) assumption that weird nasty
interactions will cause harm somewhere down the line...why take a
chance?
I might end up prototyping this myself. I may have to put my money
where my mouth is. :-)
--
Peter Geoghegan
On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans
At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fit into PG's I/O model, and both of us did much
review work just to find working common ground that PG could plug into.
Linux POSIX advisories were completely different from Solaris's async
model, the other OS used for validation that the feature worked, with the
hope being that designing against two APIs would be better than just
focusing on Linux. Since that foundation was all so brittle and limited,
scope was limited to just the heap scan, since it seemed to have the best
return on time invested given the parts of async I/O that did and didn't
scale as expected.
As I remember it, the idea was to get the basic feature out the door and
gather feedback about things like whether the effective_io_concurrency knob
worked as expected before moving onto other prefetching. Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work. I think it's just that no one ever got back to it.
I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests. To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages. Here's that one as a pgbench script:
\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;
And then you use '-Dmultiplier=10' or such to crank it up. Database 4X
RAM, multiplier=25 with 16 clients is my starting point on it when I want
to saturate storage. Anything that lets me bring those numbers down would
be valuable.
--
Greg Smith greg.smith@crunchydata.com
Director of Open Source Strategy
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
Yes, I meant that.
4) per-leaf prefetching
The code is restricted to prefetching items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.

Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:

for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.

I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
Greetings,
Andres Freund
On 6/10/23 22:34, Andres Freund wrote:
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?

Yes, I meant that.
Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.
4) per-leaf prefetching
The code is restricted to prefetching items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.

Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).

I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:

for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.

I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.

Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.

Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I added entry to
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.
--
Tomasz Rybak, Debian Developer <serpent@debian.org>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
One of the reasons IMHO is that in the bitmap scan before starting the
heap fetch TIDs are already sorted in heap block order. So it is
quite obvious that once we prefetch a heap block most of the
subsequent TIDs will fall on that block i.e. each prefetch will
satisfy many immediate requests. OTOH, in the index scan the I/O
request is very random so we might have to prefetch many blocks even
for satisfying the request for TIDs falling on one index page. I
agree that prefetching with an index scan will definitely help in
reducing the random I/O, but my guess is that thinking of
prefetching with a bitmap scan appears more natural, and that would
have been one of the reasons for implementing this only for a bitmap
scan.
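To illustrate the point: with TIDs sorted in heap block order (the bitmap case), each distinct heap block needs to be prefetched only once, while TIDs in index-key order keep revisiting blocks. A toy sketch (illustrative only, not Postgres code), assuming we only remember the most recently prefetched block:

```python
# Toy illustration: count prefetch requests needed for a stream of heap
# blocks, remembering only the most recently prefetched block (roughly
# what simple sequential dedup buys you).

def count_prefetches(blocks):
    issued = 0
    last = None
    for b in blocks:
        if b != last:
            issued += 1     # new block: issue a prefetch request
            last = b
        # else: TID falls on the block we just prefetched - free
    return issued

index_order = [7, 3, 7, 9, 3, 7, 9, 3]   # heap blocks in index-key order
bitmap_order = sorted(index_order)        # same TIDs, heap block order
```

Here the index-order stream needs 8 prefetches but the block-sorted stream only 3, which matches the intuition that one prefetch satisfies many immediate requests in the bitmap case.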
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
I have results from the new extended round of prefetch tests. I've
pushed everything to
https://github.com/tvondra/index-prefetch-tests-2
There are scripts I used to run this (run-*.sh), raw results and various
kinds of processed summaries (pdf, ods, ...) that I'll mention later.
As before, this tests a number of query types:
- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))
It's probably futile to go through details of all the tests - it's
easier to go through the (hopefully fairly readable) shell scripts.
But in principle, it runs some simple queries while varying both the data
set and workload:
- data set may be random, sequential or cyclic (with different length)
- the number of matches per value differs (i.e. equality condition may
match 1, 10, 100, ..., 100k rows)
- forces a particular scan type (indexscan, bitmapscan, seqscan)
- each query is executed twice - first run (right after restarting DB
and dropping caches) is uncached, second run should have data cached
- the query is executed 5x with different parameters (so 10x in total)
This is tested with three basic data sizes - fits into shared buffers,
fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and
20GB (i5) / 40GB (xeon).
Note: xeon has 64GB RAM, so technically the largest scale fits into RAM.
But that should not matter, thanks to the drop-caches and restart.
I also attempted to pin the backend to a particular core, in an effort to
eliminate scheduling-related noise. It's mostly what taskset does, but I
did that from extension (https://github.com/tvondra/taskset) which
allows me to do that as part of the SQL script.
For the results, I'll talk about the v1 patch (as submitted here) first.
I'll use the PDF results in the "pdf" directory which generally show a
pivot table by different test parameters, comparing the results by
different parameters (prefetching on/off, master/patched).
Feel free to do your own analysis from the raw CSV data, ofc.
For example, this:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf
shows how the prefetching affects timing for point queries with
different numbers of matches (1 to 100k). The numbers are timings for
master and patched build. The last group is (patched/master), so the
lower the number the better - 50% means patch makes the query 2x faster.
There's also a heatmap, with green=good, red=bad, which makes it easier
to spot cases that got slower/faster.
The really interesting stuff starts on page 7 (in this PDF), because the
first couple pages are "cached" (so it's more about measuring overhead
when prefetching has no benefit).
Right on page 7 you can see a couple cases with a mix of slower/faster
results, roughly in the +/- 30% range. However, this is unrelated to
the patch, because those are results for bitmapheapscan.
For indexscans (page 8), the results are invariably improved - the more
matches the better (up to ~10x faster for 100k matches).
Those were results for the "cyclic" data set. For random data set (pages
9-11) the results are pretty similar, but for "sequential" data (11-13)
the prefetching is actually harmful - there are red clusters, with up to
500% slowdowns.
I'm not going to explain the summary for SAOP queries
(https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf),
the story is roughly the same, except that there are more tested query
combinations (because we also vary the pattern in the IN() list - number
of values etc.).
So, the conclusion from this is - generally very good results for random
and cyclic data sets, but pretty bad results for sequential. But even
for the random/cyclic cases there are combinations (especially with many
matches) where prefetching doesn't help or even hurts.
The only way to deal with this is (I think) a cheap way to identify and
skip inefficient prefetches, essentially by doing two things:
a) remembering more recently prefetched blocks (say, 1000+) and not
prefetching them over and over
b) ability to identify sequential patterns, where kernel readahead seems
to do a pretty good job already (although I heard some disagreement)
I've been thinking about how to do this - doing (a) seems pretty hard,
because on the one hand we want to remember a fair number of blocks, and
on the other we want the "did we prefetch X?" check to be very cheap. So
a hash table seems nice. OTOH we want to expire "old" blocks and only
keep the most recent ones, and a hash table doesn't really support that.
Perhaps there is a great data structure for this, not sure. But after
thinking about this I realized we don't need perfect accuracy - false
positives/negatives are fine. It's fine to forget we already prefetched
block X and prefetch it again, or to skip a prefetch because we
mistakenly think we already issued it. It's not a matter of correctness,
just a matter of efficiency - after all, we can't know if the block is
still in memory, we only know if we prefetched it fairly recently.
This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU
cache that's small enough to be searched linearly (say, 8 blocks). And
we have many of them (e.g. 128), so that in total we can remember 1024
block numbers. Now, every block number is mapped to a single LRU by
hashing, as if we had a hash table
index = hash(blockno) % 128
and we only use that one LRU to track this block. It's tiny, so we can
search it linearly.
To expire prefetched blocks, there's a counter incremented every time we
prefetch a block, and we store it in the LRU with the block number. When
checking the LRU we ignore old entries (with counter more than 1000
values back), and we also evict/replace the oldest entry if needed.
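To make the idea concrete, here is a minimal C sketch of such a "hash table of tiny LRU caches" - all names and constants here are made up for illustration, not taken from the actual patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LRU_COUNT   128     /* number of tiny LRU caches */
#define LRU_SIZE      8     /* entries per LRU, searched linearly */
#define RECENCY    1000     /* entries older than this many prefetches expire */

typedef struct CacheEntry
{
    uint32_t block;         /* prefetched block number */
    uint64_t stamp;         /* counter value when prefetched (0 = empty slot) */
} CacheEntry;

typedef struct PrefetchCache
{
    CacheEntry lrus[LRU_COUNT][LRU_SIZE];
    uint64_t counter;       /* incremented on every prefetch request */
} PrefetchCache;

/*
 * Returns true if the block was prefetched recently (so the caller should
 * skip the prefetch); otherwise records the block and returns false.
 */
static bool
recently_prefetched(PrefetchCache *cache, uint32_t block)
{
    /* hashing picks exactly one tiny LRU to search/update */
    CacheEntry *lru = cache->lrus[block % LRU_COUNT];
    int oldest = 0;

    cache->counter++;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* recent enough hit => refresh the stamp and skip the prefetch */
        if (lru[i].stamp != 0 && lru[i].block == block &&
            cache->counter - lru[i].stamp <= RECENCY)
        {
            lru[i].stamp = cache->counter;
            return true;
        }
        if (lru[i].stamp < lru[oldest].stamp)
            oldest = i;
    }

    /* not found (or expired) - evict the oldest entry, remember this block */
    lru[oldest].block = block;
    lru[oldest].stamp = cache->counter;
    return false;
}
```

Note how false positives/negatives are harmless here: an evicted or expired entry merely causes a redundant prefetch, never a wrong result.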
This seems to work pretty well for the first requirement, but it doesn't
allow identifying the sequential pattern cheaply. To do that, I added a
tiny queue with a couple entries that can be checked to see whether the
last couple requests were sequential.
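The sequential check can be sketched like this (again a hedged illustration with made-up names, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEQ_HISTORY 4       /* how many recent block requests to remember */

typedef struct SeqDetect
{
    uint32_t blocks[SEQ_HISTORY];   /* circular buffer of recent blocks */
    uint64_t nrequests;             /* total requests seen so far */
} SeqDetect;

/*
 * Record a request and return true when the last SEQ_HISTORY requests form
 * a run of consecutive blocks - in which case kernel readahead likely
 * covers it, and explicit prefetching can be skipped.
 */
static bool
is_sequential(SeqDetect *d, uint32_t block)
{
    d->blocks[d->nrequests++ % SEQ_HISTORY] = block;

    if (d->nrequests < SEQ_HISTORY)
        return false;               /* not enough history yet */

    for (int i = 1; i < SEQ_HISTORY; i++)
    {
        /* compare each adjacent pair, newest to oldest */
        uint32_t prev = d->blocks[(d->nrequests - i - 1) % SEQ_HISTORY];
        uint32_t cur = d->blocks[(d->nrequests - i) % SEQ_HISTORY];

        if (cur != prev + 1)
            return false;
    }
    return true;
}
```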
And this is what the attached 0002+0003 patches do. There are PDF with
results for this build prefixed with "patch-v3" and the results are
pretty good - the regressions are largely gone.
It's even clearer in the PDFs comparing the impact of the two patches:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf
Which simply shows the "speedup heatmap" for the two patches, and the
"v3" heatmap has much less red regression clusters.
Note: The comparison-point.pdf summary has another group of columns
illustrating if this scan type would be actually used, with "green"
meaning "yes". This provides additional context, because e.g. for the
"noisy bitmapscans" it's all white, i.e. without setting the GUcs the
optimizer would pick something else (hence it's a non-issue).
Let me know if the results are not clear enough (I tried to cover the
important stuff, but I'm sure there's a lot of details I didn't cover),
or if you think some other summary would be better.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0003-ignore-seq-patterns-add-stats-v3.patchtext/x-patch; charset=UTF-8; name=0003-ignore-seq-patterns-add-stats-v3.patchDownload+96-1
0002-more-elaborate-prefetch-cache-v3.patchtext/x-patch; charset=UTF-8; name=0002-more-elaborate-prefetch-cache-v3.patchDownload+141-71
0001-index-prefetch-poc-v1.patchtext/x-patch; charset=UTF-8; name=0001-index-prefetch-poc-v1.patchDownload+796-43
Hi,
attached is a v4 of the patch, with a fairly major shift in the approach.
Until now the patch very much relied on the AM to provide information
about which blocks to prefetch next (based on the current leaf index page).
This seemed like a natural approach when I started working on the PoC,
but over time I ran into various drawbacks:
* a lot of the logic is at the AM level
* can't prefetch across the index page boundary (have to wait until the
next index leaf page is read by the indexscan)
* doesn't work for distance searches (gist/spgist)
After thinking about this, I decided to ditch this whole idea of
exchanging prefetch information through an API, and make the prefetching
almost entirely in the indexam code.
The new patch maintains a queue of TIDs (read from index_getnext_tid),
with up to effective_io_concurrency entries - calling getnext_slot()
adds a TID at the queue tail, issues a prefetch for the block, and then
returns TID from the queue head.
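Roughly, the queue behaves like this sketch - here the index scan is modeled as a plain zero-terminated array of TIDs and all names are hypothetical (the real code works with ItemPointers and issues the prefetch via PrefetchBuffer()):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE 16       /* stands in for effective_io_concurrency */

typedef struct TidQueue
{
    uint64_t tids[QUEUE_SIZE];
    int head;               /* next position to read from */
    int tail;               /* next position to insert at */
    int nprefetched;        /* demo counter of issued prefetch requests */
} TidQueue;

/*
 * Return the next TID to process. "scan" models index_getnext_tid() as a
 * zero-terminated array; *scanpos tracks the scan's position.
 */
static uint64_t
tid_queue_next(TidQueue *q, const uint64_t *scan, int *scanpos)
{
    /* top up the queue, issuing a prefetch for each newly added TID */
    while (q->tail - q->head < QUEUE_SIZE && scan[*scanpos] != 0)
    {
        q->tids[q->tail++ % QUEUE_SIZE] = scan[(*scanpos)++];
        q->nprefetched++;   /* the real code prefetches the heap block here */
    }

    if (q->head == q->tail)
        return 0;           /* queue drained, scan complete */

    return q->tids[q->head++ % QUEUE_SIZE];
}
```

The point is that prefetches run up to QUEUE_SIZE items ahead of the TID actually being returned for processing.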
Maintaining the queue is up to index_getnext_slot() - it can't be done
in index_getnext_tid(), because then it'd affect IOS (and prefetching
heap would mostly defeat the whole point of IOS). And we can't do that
above index_getnext_slot(), because that already fetches the heap page.
I still think prefetching for IOS is doable (and desirable), in mostly
the same way - except that we'd need to maintain the queue from some
other place, as IOS doesn't do index_getnext_slot().
FWIW there's also the "index-only filters without IOS" patch [1]https://commitfest.postgresql.org/43/4352/ which
switches even regular index scans to index_getnext_tid(), so maybe
relying on index_getnext_slot() is a lost cause anyway.
Anyway, this has the nice consequence that it makes AM code entirely
oblivious of prefetching - there's no need for a new API, we just get TIDs as
before, and the prefetching magic happens after that. Thus it also works
for searches ordered by distance (gist/spgist). The patch got much
smaller (about 40kB, down from 80kB), which is nice.
I ran the benchmarks [2] with this v4 patch, and the results for the
"point" queries are almost exactly the same as for v3. The SAOP part is
still running - I'll add those results in a day or two, but I expect
similar outcome as for point queries.
regards
[1]: https://commitfest.postgresql.org/43/4352/
[2]: https://github.com/tvondra/index-prefetch-tests-2/
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch-v4.patchtext/x-patch; charset=UTF-8; name=index-prefetch-v4.patchDownload+648-33
Here's a v5 of the patch, rebased to current master and fixing a couple
compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug
messages). No other changes compared to v4.
cfbot also reported a failure on windows in pg_dump [1], but it seems
pretty strange:
[11:42:48.708] ------------------------------------- 8<
-------------------------------------
[11:42:48.708] stderr:
[11:42:48.708] # Failed test 'connecting to an invalid database: matches'
The patch does nothing related to pg_dump, and the test works perfectly
fine for me (I don't have a windows machine, but 32-bit and 64-bit linux
both work fine).
regards
[1]: https://cirrus-ci.com/task/6398095366291456
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch-v5.patchtext/x-patch; charset=UTF-8; name=index-prefetch-v5.patchDownload+650-33
Hi,
Attached is a v6 of the patch, which rebases v5 (just some minor
bitrot), and also does a couple changes which I kept in separate patches
to make it obvious what changed.
0001-v5-20231016.patch
----------------------
Rebase to current master.
0002-comments-and-minor-cleanup-20231012.patch
----------------------------------------------
Various comment improvements (remove obsolete ones, clarify a bunch of
other comments, etc.). I tried to explain the reasoning why some places
disable prefetching (e.g. in catalogs, replication, ...), explain how
the caching / LRU works etc.
0003-remove-prefetch_reset-20231016.patch
-----------------------------------------
I decided to remove the separate prefetch_reset parameter, so that all
the index_beginscan() methods only take a parameter specifying the
maximum prefetch target. The reset was added early on, when the prefetch
happened much lower in the AM code, at the index page level, and the
reset happened when moving to the next index page. But now that the
prefetch has moved to the executor, this doesn't make much sense - the
resets happen on rescans, and it seems right to just reset to 0 (just
like for bitmap heap scans).
0004-PoC-prefetch-for-IOS-20231016.patch
----------------------------------------
This is a PoC adding the prefetch to index-only scans too. At first that
may seem rather strange, considering eliminating the heap fetches is the
whole point of IOS. But if the pages are not marked as all-visible (say,
the most recent part of the table), we may still have to fetch them. In
which case it'd be easy to see cases where IOS is slower than a regular
index scan (with prefetching).
The code is quite rough. It adds a separate index_getnext_tid_prefetch()
function, adding prefetching on top of index_getnext_tid(). I'm not sure
it's the right pattern, but it's pretty much what index_getnext_slot()
does too, except that it also does the fetch + store to the slot.
Note: There's a second patch adding index-only filters, which switches
the regular index scans from index_getnext_slot() to _tid() too.
The prefetching then happens only after checking the visibility map (if
requested). This part definitely needs improvements - for example
there's no attempt to reuse the VM buffer, which I guess might be expensive.
index-prefetch.pdf
------------------
Attached is also a PDF with results of the same benchmark I did before,
comparing master vs. patched with various data patterns and scan types.
It's not 100% comparable to earlier results as I only ran it on a
laptop, and it's a bit noisier too. The overall behavior and conclusions
are however the same.
I was specifically interested in the IOS behavior, so I added two more
cases to test - indexonlyscan and indexonlyscan-clean. The first is the
worst-case scenario, with no pages marked as all-visible in VM (the test
simply deletes the VM), while indexonlyscan-clean is the good-case (no
heap fetches needed).
The results mostly match the expected behavior, particularly for the
uncached runs (when the data is expected to not be in memory):
* indexonlyscan (i.e. bad case) - About the same results as
"indexscans", with the same speedups etc. Which is a good thing
(i.e. IOS is not unexpectedly slower than regular indexscans).
* indexonlyscan-clean (i.e. good case) - Seems to have mostly the same
performance as without the prefetching, except for the low-cardinality
runs with many rows per key. I haven't checked what's causing this,
but I'd bet it's the extra buffer lookups/management I mentioned.
I noticed there's another prefetching-related patch [1] from Thomas
Munro. I haven't looked at it yet, so hard to say how much it interferes
with this patch. But the idea looks interesting.
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch.pdfapplication/pdf; name=index-prefetch.pdfDownload+1-9
0001-v5-20231012.patchtext/x-patch; charset=UTF-8; name=0001-v5-20231012.patchDownload+650-34
0002-comments-and-minor-cleanup-20231012.patchtext/x-patch; charset=UTF-8; name=0002-comments-and-minor-cleanup-20231012.patchDownload+273-121
0003-remove-prefetch_reset-20231012.patchtext/x-patch; charset=UTF-8; name=0003-remove-prefetch_reset-20231012.patchDownload+52-73
0004-PoC-prefetch-for-IOS-20231012.patchtext/x-patch; charset=UTF-8; name=0004-PoC-prefetch-for-IOS-20231012.patchDownload+195-11
Hi,
Here's a new WIP version of the patch set adding prefetching to indexes,
exploring a couple alternative approaches. After the 2023/10/16 version
of the patch, I happened to have an off-list discussion with Andres, and he
suggested to try a couple things, and there's a couple more things I
tried on my own too.
Attached is the patch series starting with the 2023/10/16 patch, and
then trying different things in separate patches (discussed later). As
usual, there's also a bunch of benchmark results - due to size I'm
unable to attach all of them here (the PDFs are pretty large), but you
can find them at (with all the scripts etc.):
https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23
I'll attach only a couple small PNG with highlighted speedup/regression
patterns, but it's unreadable and more of a pointer to the PDF.
A quick overview of the patches
-------------------------------
v20231124-0001-prefetch-2023-10-16.patch
- same as the October 16 patch, with only minor comment tweaks
v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch
- removes custom cache of recently prefetched blocks, replaces it
simply by calling PrefetchBuffer (which checks shared buffers)
v20231124-0003-check-page-cache-using-preadv2.patch
- adds a check using preadv2(RWF_NOWAIT) to check if the whole
page is in page cache
v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch
- adds back a small LRU cache to identify sequential patterns
(based on benchmarks of 0002/0003 patches)
v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
v20231124-0006-poc-reuse-vm-information.patch
- optimizes the visibilitymap handling when prefetching for IOS
(to deal with the overhead in the all-visible cases)
v20231124-0007-20231016-reworked.patch
- returns back to the 20231016 patch, but this time with the VM
optimizations in patches 0005/0006 (in retrospect I might have
simply moved 0005+0006 right after 0001, but the patch evolved
differently - shouldn't matter here)
Now, let's talk about the patches one by one ...
PrefetchBuffer + preadv2 (0002+0003)
------------------------------------
After I posted the patch in October, I happened to have an off-list
discussion with Andres, and he suggested to try ditching the local cache
of recently prefetched blocks, and instead:
1) call PrefetchBuffer (which checks if the page is in shared buffers,
and skips the prefetch if it's already there)
2) if the page is not in shared buffers, use preadv2(RWF_NOWAIT) to
check if it's in the kernel page cache
Doing (1) is trivial - PrefetchBuffer() already does the shared buffer
check, so 0002 simply removes the custom cache code.
Doing (2) needs a bit more code to actually call preadv2() - 0003 adds
FileCached() to fd.c, smgrcached() to smgr.c, and then calls it from
PrefetchBuffer() right before smgrprefetch(). There's a couple loose
ends (e.g. configure should check if preadv2 is supported), but in
principle I think this is generally correct.
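For context, the page-cache probe can be sketched along these lines - a Linux-only illustration of the preadv2(RWF_NOWAIT) trick; the helper name is made up and this is not the patch's actual code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Probe whether a block of the file is present in the kernel page cache,
 * without blocking on disk I/O. Needs Linux >= 4.14 for RWF_NOWAIT on
 * buffered files. The function name is hypothetical.
 */
static bool
block_in_page_cache(int fd, off_t offset, size_t blocksz)
{
    char buf[8192];
    struct iovec iov;

    iov.iov_base = buf;
    iov.iov_len = (blocksz <= sizeof(buf)) ? blocksz : sizeof(buf);

    /* with RWF_NOWAIT the read fails with EAGAIN instead of waiting */
    return preadv2(fd, &iov, 1, offset, RWF_NOWAIT) >= 0;
}
```

One caveat visible even in this sketch: a successful probe actually copies the data out, which is part of why it is not as cheap as a pure "is it cached?" query.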
Unfortunately, these changes led to a bunch of clear regressions :-(
Take a look at the attached point-4-regressions-small.png, which is page
5 from the full results PDF [1][2]. As before, I plotted this as a huge
pivot table with various parameters (test, dataset, prefetch, ...) on
the left, and (build, nmatches) on the top. So each column shows timings
for a particular patch and query returning nmatches rows.
After the pivot table (on the right) is a heatmap, comparing timings for
each build to master (the first couple of columns). As usual, the
numbers are "timing compared to master" so e.g. 50% means the query
completed in 1/2 the time compared to master. Color coding is simple
too, green means "good" (speedup), red means "bad" (regression). The
higher the saturation, the bigger the difference.
I find this visualization handy as it quickly highlights differences
between the various patches. Just look for changes in red/green areas.
In the point-4-regressions-small.png image, you can see three areas of
clear regressions, either compared to the master or the 20231016 patch.
All of this is for "uncached" runs, i.e. after instance got restarted
and the page cache was dropped too.
The first regression is for bitmapscan. The first two builds show no
difference compared to master - which makes sense, because the 20231016
patch does not touch any code used by bitmapscan, and the 0002 patch
simply uses PrefetchBuffer as is. But then 0003 adds preadv2 to it, and
the performance immediately sinks, with timings being ~5-6x higher for
queries matching 1k-100k rows.
The patches 0005/0006 can't possibly improve this, because the
visibilitymap is entirely unrelated to bitmapscans, and so is the small
LRU used to detect sequential patterns.
The indexscan regression #1 shows a similar pattern, but in the opposite
direction - indexscan cases massively improved by the 20231016 patch
(and even after just using PrefetchBuffer) revert back to master with
0003 (adding preadv2). Ditching the preadv2 restores the gains (the last
build's results are nicely green again).
The indexscan regression #2 is interesting too, and it illustrates the
importance of detecting sequential access patterns. It shows that as
soon as we call PrefetchBuffer() directly, the timings increase to maybe
2-5x compared to master. That's pretty terrible. Once the small LRU
cache used to detect sequential patterns is added back, the performance
recovers and regression disappears. Clearly, this detection matters.
Unfortunately, the LRU can't do anything for the two other regressions,
because those are on random/cyclic patterns, so the LRU won't work
(certainly not for the random case).
preadv2 issues?
---------------
I'm not entirely sure if I'm using preadv2 somehow wrong, but it doesn't
seem to perform terribly well in this use case. I decided to do some
microbenchmarks, measuring how long it takes to do preadv2 when the
pages are [not] in cache etc. The C files are at [3].
preadv2-test simply reads file twice, first with NOWAIT and then without
it. With clean page cache, the results look like this:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
preadv2 NOWAIT time 78472 us calls 131072 hits 0 misses 131072
preadv2 WAIT time 9849082 us calls 131072 hits 131072 misses 0
and then, if you run it again with the file still being in page cache:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
preadv2 NOWAIT time 258880 us calls 131072 hits 131072 misses 0
preadv2 WAIT time 213196 us calls 131072 hits 131072 misses 0
This is pretty terrible, IMO. It says that if the pages are not in cache,
the preadv2 calls take ~80ms in total. Which is very cheap, compared to
the total read time (so if we can speed that up by prefetching, it's
worth it).
But if the file is already in cache, it takes ~260ms, and actually
exceeds the time needed to just do preadv2() without the NOWAIT flag.
AFAICS the problem is preadv2() doesn't just check if the data is
available, it also copies the data and all that. But even if we only ask
for the first byte, it's still way more expensive than with empty cache:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 1
preadv2 NOWAIT time 119751 us calls 131072 hits 131072 misses 0
preadv2 WAIT time 208136 us calls 131072 hits 131072 misses 0
There's also a fadvise-test microbenchmark that just does fadvise all
the time, and even that is way cheaper than using preadv2(NOWAIT) in
both cases:
no cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise time 631686 us calls 131072 hits 0 misses 0
preadv2 time 207483 us calls 131072 hits 131072 misses 0
cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise time 79874 us calls 131072 hits 0 misses 0
preadv2 time 239141 us calls 131072 hits 131072 misses 0
So that's 300ms vs. 500ms in the cached case (the difference in the
no-cache case is even more significant).
It's entirely possible I'm doing something wrong, or maybe I just think
about this the wrong way, but I can't quite imagine this being useful
for this use case - at least not with reasonably good local storage.
Maybe it could help for slow/remote storage, or something?
For now, I think the right approach is to go back to the cache of
recently prefetched blocks. What I liked about the preadv2 approach is
that it knows exactly what is currently in the page cache, while the
local cache is just an approximate record of recently prefetched blocks.
And preadv2 also knows about stuff prefetched by other backends, while
the local cache is private to the particular backend (or even to the
particular scan node). But the local cache seems to perform much better,
so there's that.
LRU cache of recent blocks (0004)
---------------------------------
The importance of this optimization is clearly visible in the regression
image mentioned earlier - the "indexscan regression #2" shows that the
sequential pattern regresses with the 0002+0003 patches, but once the
small LRU cache is introduced back and used to skip prefetching for
sequential patterns, the regression disappears. Ofc, this is part of the
original 20231016 patch, so going back to that version naturally
includes this.
visibility map optimizations (0005/0006)
----------------------------------------
Earlier benchmark results showed a bit annoying regression for
index-only scans that don't need prefetching (i.e. with all pages
all-visible). There was quite a bit of inefficiency because both the
prefetcher and IOS code accessed the visibilitymap independently, and
the prefetcher did that in a rather inefficient way. These patches make
the prefetcher more efficient by reusing buffer, and also share the
visibility info between prefetcher and the IOS code.
I'm sure this needs more work / cleanup, but the regression is mostly
gone, as illustrated by the attached point-0-ios-improvement-small.png.
layering questions
------------------
Aside from the preadv2() question, the main open question remains to be
the "layering", i.e. which code should be responsible for prefetching.
At the moment all the magic happens in indexam.c, in index_getnext_*
functions, so that all callers benefit from prefetching.
But as mentioned earlier in this thread, indexam.c seems to be the wrong
layer, and I think I agree. The problem is - the prefetching needs to
happen in index_getnext_* so that all index_getnext_* callers benefit
from it. We could do that in the executor for index_getnext_tid(), but
that's a bit weird - it'd work for index-only scans, but the primary
target is regular index scans, which calls index_getnext_slot().
However, it seems it'd be good if the prefetcher and the executor code
could exchange/share information more easily. Take for example the
visibilitymap stuff in IOS in patches 0005/0006. I made it work, but it
sure looks inconvenient, partially due to the split between executor and
indexam code.
The only idea I have is to have the prefetcher code somewhere in the
executor, but then pass it to index_getnext_* functions, either as a new
parameter (with NULL => no prefetching), or maybe as a field of scandesc
(but that seems wrong, to point from the desc to something that's
essentially a part of the executor state).
There's also the thing that the prefetcher is part of IndexScanDesc, but
it really should be in the IndexScanState. That's weird, but mostly down
to my general laziness.
regards
[1]: https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/pdf/point.pdf
[2]: https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/png/point-4.png
[3]: https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23/preadv-tests
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
point-0-ios-improvement-small.pngimage/png; name=point-0-ios-improvement-small.pngDownload+3-5
point-4-regressions-small.pngimage/png; name=point-4-regressions-small.pngDownload+4-1
v20231124-0001-prefetch-2023-10-16.patchtext/x-patch; charset=UTF-8; name=v20231124-0001-prefetch-2023-10-16.patchDownload+975-33
v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patchtext/x-patch; charset=UTF-8; name=v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patchDownload+52-411
v20231124-0003-check-page-cache-using-preadv2.patchtext/x-patch; charset=UTF-8; name=v20231124-0003-check-page-cache-using-preadv2.patchDownload+103-1
v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patchtext/x-patch; charset=UTF-8; name=v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patchDownload+167-1
v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patchtext/x-patch; charset=UTF-8; name=v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patchDownload+17-6
v20231124-0006-poc-reuse-vm-information.patchtext/x-patch; charset=UTF-8; name=v20231124-0006-poc-reuse-vm-information.patchDownload+56-24
v20231124-0007-20231016-reworked.patchtext/x-patch; charset=UTF-8; name=v20231124-0007-20231016-reworked.patchDownload+195-129
Hi,
Here's a simplified version of the patch series, with two important
changes from the last version shared on 2023/11/24.
Firstly, it abandons the idea of using preadv2() to check the page cache. This
initially seemed like a great way to check if prefetching is needed, but
in practice it seems so expensive it's not really beneficial (especially
in the "cached" case, which is where it matters most).
Note: There's one more reason not to want to rely on preadv2() that I
forgot to mention - it's a Linux-specific thing. I wouldn't mind using
it to improve already-acceptable behavior, but it doesn't seem like a
great idea if performance without it would be poor.
Secondly, this reworks multiple aspects of the "layering".
Until now, the prefetching info was stored in IndexScanDesc and
initialized in indexam.c in the various "beginscan" functions. That was
obviously wrong - IndexScanDesc is just a description of what the scan
should do, not a place where execution state (which the prefetch queue
is) should be stored. IndexScanState (and IndexOnlyScanState) is a more
appropriate place, so I moved it there.
This also means the various "beginscan" functions don't need any changes
(i.e. not even get prefetch_max), which is nice. Because the prefetch
state is created/initialized elsewhere.
But there's a layering problem that I don't know how to solve - I don't
see how we could make indexam.c entirely oblivious to the prefetching,
and move it entirely to the executor. Because how else would you know
what to prefetch?
With index_getnext_tid() I can imagine fetching TIDs ahead, stashing
them into a queue, and prefetching based on that. That's kinda what the
patch does, except that it does it from inside index_getnext_tid(). But
that does not work for index_getnext_slot(), because that already reads
the heap tuples.
We could say prefetching only works for index_getnext_tid(), but that
seems a bit weird, because regular index scans use index_getnext_slot().
(There's a patch to evaluate filters on the index, which switches index
scans to index_getnext_tid(), so that'd make prefetching work too, but
I'd ignore that here.) There are other index_getnext_slot() callers, and
accepting that prefetching does not work for those places seems wrong
(e.g. execIndexing/execReplication would benefit from prefetching, I
think).
The patch just adds a "prefetcher" argument to index_getnext_*(), and
the prefetching still happens there. I guess we could move most of the
prefether typedefs/code somewhere, but I don't quite see how it could be
done in executor entirely.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company