PATCH: Using BRIN indexes for sorted output
Hi,
There have been a couple discussions about using BRIN indexes for
sorting - in fact this was mentioned even in the "Improving Indexing
Performance" unconference session this year (don't remember by whom).
But I haven't seen any patches, so here's one.
The idea is that we can use information about ranges to split the table
into smaller parts that can be sorted in smaller chunks. For example if
you have a tiny 2MB table with two ranges, with values in [0,100] and
[101,200] intervals, then it's clear we can sort the first range, output
tuples, and then sort/output the second range.
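To illustrate the idea, a tiny sketch in Python (purely illustrative - the ranges list and its layout are made up, not the patch's actual data structures):

```python
# Illustrative only: emit sorted output range by range, assuming the
# BRIN ranges do not overlap, so each chunk can be sorted independently.

def brin_sort(ranges):
    """ranges: list of (minval, maxval, tuples) summaries, non-overlapping."""
    # Process ranges in order of their minimum value; since the ranges
    # don't overlap, each sorted chunk can be emitted immediately.
    for _, _, tuples in sorted(ranges, key=lambda r: r[0]):
        yield from sorted(tuples)

out = list(brin_sort([(101, 200, [150, 101, 200]),
                      (0, 100, [42, 7, 99])]))
print(out)  # [7, 42, 99, 101, 150, 200]
```

Overlapping ranges are what make the real node more complicated.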
The attached patch builds "BRIN Sort" paths/plans, closely resembling
index scans, only for BRIN indexes. And this special type of index scan
does what was mentioned above - incrementally sorts the data. It's a bit
more complicated because of overlapping ranges, ASC/DESC, NULL etc.
This is disabled by default, using a GUC enable_brinsort (you may need
to tweak other GUCs to disable parallel plans etc.).
A trivial example, demonstrating the benefits:
create table t (a int) with (fillfactor = 10);
insert into t select i from generate_series(1,10000000) s(i);
create index t_a_idx on t using brin (a);
First, a simple LIMIT query:
explain (analyze, costs off) select * from t order by a limit 10;
QUERY PLAN
------------------------------------------------------------------------
Limit (actual time=1879.768..1879.770 rows=10 loops=1)
-> Sort (actual time=1879.767..1879.768 rows=10 loops=1)
Sort Key: a
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on t
(actual time=0.007..1353.110 rows=10000000 loops=1)
Planning Time: 0.083 ms
Execution Time: 1879.786 ms
(7 rows)
The same query, with enable_brinsort = on:
QUERY PLAN
------------------------------------------------------------------------
Limit (actual time=1.217..1.219 rows=10 loops=1)
-> BRIN Sort using t_a_idx on t
(actual time=1.216..1.217 rows=10 loops=1)
Sort Key: a
Planning Time: 0.084 ms
Execution Time: 1.234 ms
(5 rows)
That's a pretty nice improvement - of course, this is thanks to the data
being perfectly sequential, and the difference can be made almost
arbitrarily large by making the table smaller/larger. Similarly, if the
table gets less sequential (making the ranges overlap), the BRIN plan
will get more expensive. Feel free to experiment with other data sets.
However, it's not just LIMIT queries that can improve - consider a sort
of the whole table:
test=# explain (analyze, costs off) select * from t order by a;
QUERY PLAN
-------------------------------------------------------------------------
Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
Sort Key: a
Sort Method: external merge Disk: 117528kB
-> Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
Planning Time: 0.110 ms
Execution Time: 3766.825 ms
(6 rows)
The same query, with enable_brinsort = on:
test=# explain (analyze, costs off) select * from t order by a;
QUERY PLAN
----------------------------------------------------------------------------------
BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=10000000
loops=1)
Sort Key: a
Planning Time: 0.073 ms
Execution Time: 2939.324 ms
(4 rows)
Right - not a huge difference, but still a nice 25% speedup, mostly due
to not having to spill data to disk and sorting smaller amounts of data.
There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.
1) The paths are created in build_index_paths() because that's what
creates index scans (which the new path resembles). But that is expected
to produce IndexPath, not BrinSortPath, so it's not quite correct.
Should be somewhere "higher" I guess.
2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
FIRST/LAST does not really matter for them. The patch just generates
paths for all 4 combinations (or tries to). Maybe there's a better way.
3) I'm not quite sure the separation of responsibilities between
opfamily and opclass is optimal. I added a new amproc, but maybe this
should be split differently. At the moment only minmax indexes have
this, but adding this to minmax-multi should be trivial.
4) The state changes in nodeBrinSort are a bit confusing. It works, but
may need cleanup and refactoring. Ideas welcome.
5) The costing is essentially just plain cost_index. I have some ideas
about BRIN costing in general, which I'll post in a separate thread (as
it's not specific to this patch).
6) At the moment this only picks one of the index keys, specified in the
ORDER BY clause. I think we can generalize this to multiple keys, but
thinking about multi-key ranges was a bit too much for me. The good
thing is this nicely combines with IncrementalSort.
7) Only plain index keys for the ORDER BY keys, no expressions. Should
not be hard to fix, though.
8) Parallel version is not supported, but I think it shouldn't be
possible. Just make the leader build the range info, and then let the
workers acquire/sort ranges and merge them by Gather Merge.
9) I was also thinking about leveraging other indexes to quickly
eliminate ranges that need to be sorted. The node does evaluate the
filter, of course, but only after reading the tuple from the range. But imagine
we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
that case we might skip many ranges entirely. Essentially like a bitmap
index scan does, except that building the bitmap incrementally with BRIN
is trivial - you can quickly check if a particular range matches or not.
With other indexes (e.g. btree) you essentially need to evaluate the
filter completely, and only then you can look at the bitmap. Which seems
rather against the idea of this patch, which is about low startup cost.
Of course, the condition might be very selective, but then you probably
can just fetch the matching tuples and do a Sort.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Allow-BRIN-indexes-to-produce-sorted-output-20221015.patch (text/x-patch, +2779 -2)
On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
I am still going over the patch.
Minor: for #8, I guess you meant `it should be possible`.
Cheers
On 10/15/22 15:46, Zhihong Yu wrote:
...
8) Parallel version is not supported, but I think it shouldn't be
possible. Just make the leader build the range info, and then let the
workers to acquire/sort ranges and merge them by Gather Merge.
...
Hi,
I am still going over the patch.
Minor: for #8, I guess you meant `it should be possible`.
Yes, I meant to say it should be possible. Sorry for the confusion.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
For brin_minmax_ranges, looking at the assignment to gottuple and
reading gottuple, it seems variable gottuple can be omitted - we can check
tup directly.
+ /* Maybe mark the range as processed. */
+ range->processed |= mark_processed;
`Maybe` can be dropped.
For brinsort_load_tuples(), do we need to check for interrupts inside the
loop?
Similar question for subsequent methods involving loops, such
as brinsort_load_unsummarized_ranges.
Cheers
On 10/15/22 14:33, Tomas Vondra wrote:
Hi,
...
There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.
...
I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is 32TB table, with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges, it'd still be ~1GB, which is quite a lot.
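The arithmetic, spelled out (assuming 8kB pages and the ~32B per-range struct mentioned above):

```python
# Worst case: 32TB table, 1-page BRIN ranges, ~32 bytes of info per range.
page = 8 * 1024
table = 32 * 2**40
struct_size = 32

ranges_1page = table // page                   # ~4.3 billion ranges
print(ranges_1page * struct_size / 2**30)      # 128.0 (GiB of range info)

ranges_default = table // (128 * page)         # default 128-page ranges
print(ranges_default * struct_size / 2**30)    # 1.0 (GiB)
```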
We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.
The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.
The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.
A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.
This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).
Imagine ranges with these minval/maxval values, sorted by maxval:
[101,200]
[201,300]
[301,400]
[150,500]
and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.
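To make the problem concrete, a small sketch (hypothetical code, not the patch's API) that checks which later ranges intersect the current slice:

```python
# Ranges sorted by maxval, processed in 2-range slices; which ranges in
# later slices intersect the current slice's maxval boundary?

ranges = [(101, 200), (201, 300), (301, 400), (150, 500)]  # (minval, maxval)
by_maxval = sorted(ranges, key=lambda r: r[1])

slice_ranges = by_maxval[:2]                 # first slice: (101,200), (201,300)
boundary = max(r[1] for r in slice_ranges)   # 300

# Any later range with minval <= boundary intersects the current slice.
intersecting = [r for r in by_maxval[2:] if r[0] <= boundary]
print(intersecting)  # [(150, 500)]
```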
We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?
So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to
- request the next range (sorted by maxval)
- request the intersecting ranges one by one (sorted by minval)
Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively?
That way, there is no intersection among the ranges.
bq. can store the info into a tuplestore/tuplesort
Wouldn't this involve disk accesses which may reduce the effectiveness of
BRIN sort ?
Cheers
On 10/16/22 03:36, Zhihong Yu wrote:
...
Hi,
For brin_minmax_ranges, looking at the assignment to gottuple and
reading gottuple, it seems variable gottuple can be omitted - we can
check tup directly.

+ /* Maybe mark the range as processed. */
+ range->processed |= mark_processed;

`Maybe` can be dropped.
No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.
For brinsort_load_tuples(), do we need to check for interrupts inside
the loop ?
Similar question for subsequent methods involving loops, such
as brinsort_load_unsummarized_ranges.
We could/should, although most of the loops should be very short.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/16/22 16:01, Zhihong Yu wrote:
...
Hi,
In your example involving [150,500], can this range be broken down into
4 ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.
Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?
Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.
It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from the BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from the "spill" tuplestore, compare to maxval
6. load all tuples from unsummarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
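For illustration, here's a condensed sketch of steps 4-8 for the ASC case in Python (names and data structures are mine, not the patch's; NULL handling and unsummarized ranges are omitted, and everything is kept in memory):

```python
# Illustrative sketch: ranges may overlap; tuples above the current
# maxval boundary are "spilled" and reloaded at a later boundary.

def brin_sort_asc(ranges):
    """ranges: list of (minval, maxval, tuples) summaries; may overlap.
    Yields all tuples in ascending order, sorting one chunk at a time."""
    pending = sorted(ranges, key=lambda r: r[1])   # step 2: sort by maxval
    processed = [False] * len(pending)
    spill = []                                     # tuples above the boundary

    for i, (_, maxval, tuples) in enumerate(pending):
        sortbuf = []
        if not processed[i]:                       # step 4: next range by maxval
            processed[i] = True
            sortbuf += tuples
        # step 5: reload spilled tuples that fall under the new boundary
        sortbuf += [t for t in spill if t <= maxval]
        spill = [t for t in spill if t > maxval]
        # step 7: load intersecting ranges (minval within the boundary)
        for j in range(i + 1, len(pending)):
            jmin, _, jtuples = pending[j]
            if not processed[j] and jmin <= maxval:
                processed[j] = True
                sortbuf += [t for t in jtuples if t <= maxval]
                spill += [t for t in jtuples if t > maxval]
        # step 8: sort this chunk and emit it
        yield from sorted(sortbuf)

out = list(brin_sort_asc([(101, 200, [150, 199]),
                          (201, 300, [250, 201]),
                          (150, 500, [150, 480, 300])]))
print(out)  # [150, 150, 199, 201, 250, 300, 480]
```

Note how the wide [150,500] range gets loaded at the first boundary, with its high tuples spilled and emitted only at later boundaries.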
bq. can store the info into a tuplestore/tuplesort
Wouldn't this involve disk accesses which may reduce the effectiveness
of BRIN sort ?
Yes, it might. But the question is whether the result is still faster
than alternative plans (e.g. seqscan+sort), and those are likely to do
even more I/O.
Moreover, for "regular" cases this shouldn't be a significant issue,
because the stuff will fit into work_mem and so there'll be no I/O. But
it'll handle those extreme cases gracefully.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
I forgot to mention one important issue in my list yesterday, and that's
memory consumption.
TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.
The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables. It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on. It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
regards, tom lane
On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
On 10/16/22 16:01, Zhihong Yu wrote:
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
<tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
wrote:On 10/15/22 14:33, Tomas Vondra wrote:
Hi,
...
There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.6)...
I forgot to mention one important issue in my list yesterday, and
that's
memory consumption. The way the patch is coded now, the new BRIN
support
function (brin_minmax_ranges) produces information about *all*
ranges in
one go, which may be an issue. The worst case is 32TB table, with
1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of
~32B
structs, so this would require ~128GB of RAM. With the default
128-page
ranges, it's still be ~1GB, which is quite a lot.
We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billionranges
is going to be "not cheap" so this is likely pretty rare). But we
should
not introduce new nodes that ignore work_mem, so we need a way to
deal
with such cases somehow.
The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality ishard.
The wrinkle here is that counting just the range info (BrinRange
struct)
does not include the values for by-reference types. We could use
average
width - that's just an estimate, though.
A more comprehensive solution seems to be to allow requesting chunks
of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can onlyhandle
100 at a time, we'd do 10 loops, each requesting 100 ranges.
This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice),but
very low minval (thus intersecting with ranges in the current slice).
Imagine ranges with these minval/maxval values, sorted by maxval:
[101,200]
[201,300]
[301,400]
[150,500]and let's say we can only process 2-range slices. So we'll get the
first
two, but both of them intersect with the very last range.
We could always include all the intersecting ranges into the slice,
but
what if there are too many very "wide" ranges?
So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to
- request the next range (sorted by maxval)
- request the intersecting ranges one by one (sorted by minval)
Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
a plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
In your example involving [150,500], can this range be broken down into
4 ranges, ending in 200, 300, 400 and 500, respectively?
That way, there is no intersection among the ranges.
Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?
Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.
It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
Hi,
Thanks for the quick reply.
I don't have a good answer w.r.t. splitting the page range [0,127] now.
Let me think more about it.
The 10-step flow (subject to changes down the road) should either be
given in the description of the patch or written as a comment inside the
code. This would help people grasp the concept much faster.
BTW "splill" seems to be a typo - I assume you meant "spill".
Cheers
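For illustration only, the 10-step flow above might be reduced to a toy Python model (made-up names; plain lists stand in for the tuplesort and the "splill" tuplestore; NULL handling and non-summarized ranges are omitted):

```python
# Toy model of the maxval-ordered flow: take the next range by maxval, pull
# spilled tuples that now fit under the new threshold, load newly
# intersecting ranges (minval <= threshold), sort, output.
def brinsort_maxval(ranges, tuples_for):
    by_maxval = sorted(ranges, key=lambda r: r[1])
    by_minval = sorted(ranges, key=lambda r: r[0])
    spill, out, next_overlap = [], [], 0
    for rng in by_maxval:
        threshold = rng[1]                 # values above this may come later
        batch = [v for v in spill if v <= threshold]   # step 5
        spill = [v for v in spill if v > threshold]
        # step 7: load ranges intersecting the current one (minval <=
        # current maxval); next_overlap remembers how far we already got
        # in the minval-sorted array, so every range is loaded only once
        while (next_overlap < len(by_minval)
               and by_minval[next_overlap][0] <= threshold):
            for v in tuples_for(by_minval[next_overlap]):
                (batch if v <= threshold else spill).append(v)
            next_overlap += 1
        out.extend(sorted(batch))          # step 8: sort the batch, output
    out.extend(sorted(spill))
    return out

# tuples_for is a hypothetical per-range tuple source.
data = {(0, 10): [7, 0, 10], (5, 20): [20, 5, 9], (15, 30): [30, 15]}
print(brinsort_maxval(data.keys(), lambda r: data[r]))
```

Each range is loaded exactly once, when its minval first falls under the current threshold.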
On 10/16/22 16:41, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
I forgot to mention one important issue in my list yesterday, and that's
memory consumption.
TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.
Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.
As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.
I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?
I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:
1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
sample rows vs. ranges)
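For instance, statistic (4) could in principle be derived like this (a toy O(N^2) sketch over (minval, maxval) pairs; not actual ANALYZE code, which would need something smarter for many ranges):

```python
# Hypothetical sketch of stat (4): the average number of other ranges each
# range intersects with. Quadratic for clarity only.
def avg_overlaps(ranges):
    n = len(ranges)
    if n < 2:
        return 0.0
    total = 0
    for i, (lo_a, hi_a) in enumerate(ranges):
        for lo_b, hi_b in ranges[i + 1:]:
            if lo_a <= hi_b and lo_b <= hi_a:
                total += 2  # each intersection counts for both ranges
    return total / n

print(avg_overlaps([(0, 100), (101, 200), (150, 250), (190, 1000)]))  # 1.5
```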
I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for this patch).
The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables. It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.
Yes, the example was a trivial "happy case" example. Obviously, the
performance degrades as the data becomes more random (with wider
ranges), forcing the BRIN Sort to read / sort more tuples.
But let's see an example with less correlated data, say, like this:
create table t (a int) with (fillfactor = 10);
insert into t select i + 10000 * random()
from generate_series(1,10000000) s(i);
With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:
1) select * from t order by a;
seqscan+sort: 4437 ms
brinsort: 4233 ms
2) select * from t order by a limit 10;
seqscan+sort: 1859 ms
brinsort: 4 ms
If you increase the random factor from 10000 to 100000 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.
Only at 5000000, where we pretty much read 1/2 the table because the
ranges intersect, do we get the same timing as the seqscan (for the
LIMIT query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.
Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).
It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/16/22 16:42, Zhihong Yu wrote:
...
I don't have a good answer w.r.t. splitting the page range [0,127] now.
Let me think more about it.
Sure, no problem.
The 10-step flow (subject to changes down the road) should either be
given in the description of the patch or written as a comment inside
the code.
This would help people grasp the concept much faster.
True. I'll add it to the next version of the patch.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, 16 Oct 2022 at 16:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
I think the comparison against full table scan seems appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.
A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.
I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".
If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.
Kind regards,
Matthias van de Meent
First of all, it's really great to see that this is being worked on.
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):
1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
- Remove Rs' from Rs
- store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
- Remove Ts' from Ts
- Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
- Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
- remove Rs' from Rs
- assign Rs' min value to max_sorted
- store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
- remove Rs' from Rs
- store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7
This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
Kind regards,
Matthias van de Meent
PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that led to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.
[0]: /messages/by-id/CAEze2Wi9=Bay_=rTf8Z6WPgZ5V0tDOayszQJJO=R_9aaHvr+Tg@mail.gmail.com
On 10/16/22 22:17, Matthias van de Meent wrote:
First of all, it's really great to see that this is being worked on.
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows us to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Consider these ranges sorted by maxval
range #1 [0,100]
range #2 [101,200]
range #3 [150,250]
...
range #1000000 [190,1000000000]
processing the range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 is overlapping too, so we need to load the tuples too. But
there may be other ranges (in arbitrary distance) also overlapping.
So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.
The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care for new matches since the
previous range, so we can remember the index to the array and start from
it. And we can stop once the minval exceeds the maxval for the range in
the first step. Because we'll only sort tuples up to that point.
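In toy Python form (made-up names; ranges as (minval, maxval) pairs; not the patch's actual C code), the two-array trick looks like:

```python
# Sketch of the two-array scheme: one list sorted by maxval (processing
# order), one sorted by minval (overlap detection). 'start' remembers how
# far into the minval-sorted list we already looked, so each step only
# scans the newly-reachable entries and stops at the current maxval.
def overlapping(by_minval, maxval, start):
    """Return newly matching ranges with minval <= maxval, resuming at 'start'."""
    found = []
    i = start
    while i < len(by_minval) and by_minval[i][0] <= maxval:
        found.append(by_minval[i])
        i += 1
    return found, i  # i becomes 'start' for the next range

ranges = [(0, 100), (101, 200), (150, 250), (190, 1000)]
by_maxval = sorted(ranges, key=lambda r: r[1])
by_minval = sorted(ranges, key=lambda r: r[0])

start = 0
for rng in by_maxval:
    batch, start = overlapping(by_minval, rng[1], start)
    print(rng, "->", batch)  # the batch includes rng itself on first match
```

Each range shows up in exactly one batch, so the total work over all steps stays linear in the number of ranges.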
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[0,1000]
[500,1500]
[1001,2000]
...
We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).
So why sort these tuples at all? Imagine an imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplesort, we're now sorting 11x the amount of data.
Or maybe I just don't understand what you mean.
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):
1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
- Remove Rs' from Rs
- store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
- Remove Ts' from Ts
- Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
- Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
- remove Rs' from Rs
- assign Rs' min value to max_sorted
- store the tuples of Rs' range in Ts
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.
12. while the next item Rs' from Rs has a min value of max_sorted:
- remove Rs' from Rs
- store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7
This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
I'm not a tuplesort expert, but my assumption is that it's better to
sort smaller amounts of rows - which is why the patch sorts only the
rows it knows it can actually output.
Kind regards,
Matthias van de Meent
PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that led to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.
Thanks for reminding me, I'll take a look before the next CF.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 10/16/22 22:17, Matthias van de Meent wrote:
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows us to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.
For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[...]
Or maybe I just don't understand what you mean.
If we sort the ranges by minval like this:
1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]
Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the amount of tuples in the tuplesort to the minimum amount needed to
output any specific value.
If the ranges are sorted and loaded by maxval, like your algorithm expects:
1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]
We need to load all ranges into the sort before it could start
emitting any tuples, as all ranges overlap with the first range.
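A toy model of this minval-driven scheme (plain Python, not the proposed API; a heap stands in for the restartable tuplesort, and tuples_for is a hypothetical per-range tuple source):

```python
import heapq

# Toy model of the minval-ordered scheme: ranges sorted by minval; a value
# is safe to emit once it is strictly below the next not-yet-loaded
# range's minval.
def brinsort_minval(ranges, tuples_for):
    ranges = sorted(ranges, key=lambda r: r[0])
    heap, out = [], []          # heap = stand-in for restartable tuplesort
    for i, rng in enumerate(ranges):
        for v in tuples_for(rng):
            heapq.heappush(heap, v)
        # boundary: minval of the next unloaded range (+inf after the last)
        bound = ranges[i + 1][0] if i + 1 < len(ranges) else float("inf")
        while heap and heap[0] < bound:
            out.append(heapq.heappop(heap))
    return out

# Hypothetical data; each range's values stay within [minval, maxval].
data = {(0, 10): [7, 0, 10], (5, 20): [20, 5, 9], (15, 30): [30, 15]}
print(brinsort_minval(data.keys(), lambda r: data[r]))
```

The heap only ever holds tuples from ranges whose minval is below the current boundary, mirroring the "minimum amount needed to output any specific value" property described above.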
[algo]
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.
max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.
There is a small issue in my algorithm where I use <= for yielding
values where it should be <, in which case the initialization of
max_sorted to NULL is then incorrect, but apart from that I don't think
there are any issues with the base algorithm.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
I'm not a tuplesort expert, but my assumption is that it's better to
sort smaller amounts of rows - which is why the patch sorts only the
rows it knows it can actually output.
I see that the two main differences between our designs are in
answering these questions:
- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?
For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.
For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.
Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.
Kind regards,
Matthias van de Meent
On 10/17/22 16:00, Matthias van de Meent wrote:
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 10/16/22 22:17, Matthias van de Meent wrote:
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows us to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.
D'oh! I think you're right, it should be possible to do this with only
a sort by minval. And it might actually be a better way to do that.
I think I chose the "maxval" ordering because it seemed reasonable.
Looking at the current range and using the maxval as the threshold
seemed reasonable. But it leads to a bunch of complexity with the
intersecting ranges, and I never reconsidered this choice. Silly me.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.
For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).
FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.
I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although, maybe that could be tracked by tracking the tuplesort size of
the last sort.
Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with separate tuplestore. There's enough
complexity in the patch already, I think. The only thing we'll need with
the minval ordering is the ability to "peek ahead" to the next minval
(which is going to be the threshold used to route values either to
tuplesort or tuplestore).
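That routing could be as simple as this toy sketch (hypothetical names; the peeked-ahead next minval acts as the threshold):

```python
# Toy model of threshold routing: values below the next range's minval go
# to the "tuplesort" (safe to sort and emit now); the rest go to the
# "tuplestore" spill, to be reconsidered once the threshold advances.
def route(values, next_minval):
    to_sort, to_spill = [], []
    for v in values:
        (to_sort if v < next_minval else to_spill).append(v)
    return to_sort, to_spill

s, spill = route([3, 57, 12, 49, 80], next_minval=50)
print(sorted(s), spill)  # [3, 12, 49] [57, 80]
```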
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[...]
Or maybe I just don't understand what you mean.
If we sort the ranges by minval like this:
1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]
Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the amount of tuples in the tuplesort to the minimum amount needed to
output any specific value.
If the ranges are sorted and loaded by maxval, like your algorithm expects:
1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]
We need to load all ranges into the sort before it could start
emitting any tuples, as all ranges overlap with the first range.
Right, thanks - I get this now.
[algo]
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.There is a small issue in my algorithm where I use <= for yielding
values where it should be <, where initialization of max_value to NULL
is then be incorrect, but apart from that I don't think there are any
issues with the base algorithm.The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.I'm not a tuplesort expert, but my assumption it's better to sort
smaller amounts of rows - which is why the patch sorts only the rows it
knows it can actually output.
I see that the two main differences between our designs are in
answering these questions:
- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?
For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.
Right, those are very good points.
For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.
Right. I wonder if these are actually complementary approaches, and
we could/should pick between them depending on how many rows we expect
to consume.
My focus was LIMIT queries, so I favored the approach with the lowest
startup cost. I haven't quite planned for this to work so well even in
full-sort cases. That kinda surprised me (I wonder if the very large
tuplesorts - compared to work_mem - would hurt this, though).
Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.
Right. I think I mentioned this in my post [1], where I also envisioned
some sort of iterative approach. And I think you're right the approach
with ordering by minval is naturally more suitable because it just
consumes the single sequence of ranges.
regards
[1]: /messages/by-id/1a7c2ff5-a855-64e9-0272-1f9947f8a558@enterprisedb.com
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
here's an updated/reworked version of the patch, on top of the "BRIN
statistics" patch as 0001 (because some of the stuff is useful, but we
can ignore this part in this thread).
Warning: I realized the new node is somewhat broken when it comes to
projection and matching the indexed column, most likely because the
targetlists are wired/processed incorrectly or something like that. So
when experimenting with this, just index the first column of the table
and don't do anything requiring a projection. I'll get this fixed, but
I've been focusing on the other stuff. I'm not particularly familiar
with this tlist/project stuff, so any help is welcome.
The main change in this version is the adoption of multiple ideas
suggested by Matthias in his earlier responses.
Firstly, this changes how the index opclass passes information to the
executor node. Instead of using a plain array, we now use a tuplesort.
This addresses the memory consumption issues with a large number of
ranges, and it also simplifies the sorting etc. which is now handled by
the tuplesort. The support procedure simply fills a tuplesort and then
hands it over to the caller (more or less).
Secondly, instead of ordering the ranges by maxval, this orders them by
minval (as suggested by Matthias), which greatly simplifies the code
because we don't need to detect overlapping ranges etc.
More precisely, the ranges are sorted to get this ordering
- not yet summarized ranges
- ranges sorted by (minval, blkno)
- all-nulls ranges
This particular ordering is beneficial for the algorithm, which does two
passes over the ranges. For the NULLS LAST case (i.e. the default), we
do this:
- produce tuples with non-NULL values, ordered by the value
- produce tuples with NULL values (arbitrary order)
And each of these phases does a separate pass over the ranges (I'll get
to that in a minute). And the ordering is tailored to this.
Note: For DESC we'd sort by maxval, and for NULLS FIRST the phases would
happen in the opposite order, but those are details. Let's assume ASC
ordering with NULLS LAST, unless stated otherwise.
The idea here is that all not-summarized ranges always need to be
processed, both in the NULL pass and the non-NULL pass, which happen as
two separate passes over the ranges.
The all-null ranges don't need to be processed during the non-NULL pass,
and we can terminate this pass early once we hit the first null-only
range. So placing them last helps with this.
The regular ranges are ordered by minval, as dictated by the algorithm
(which is now described in nodeBrinSort.c comment), but we also sort
them by blkno to make this a bit more sequential (but this only matters
for ranges with the same minval, and that's probably rare, but the extra
sort key is also cheap so why not).
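The three-group ordering could be expressed as a sort key along these
lines (an illustrative Python sketch; the dict-based range representation
is mine, not the patch's actual structs):

```python
def range_sort_key(r):
    """Order ranges for the ASC / NULLS LAST pass:
    group 0: not-yet-summarized ranges (needed by both passes),
    group 1: regular ranges, ordered by (minval, blkno),
    group 2: all-nulls ranges (lets the non-NULL pass stop early)."""
    if not r["summarized"]:
        return (0, 0, r["blkno"])
    if r["all_nulls"]:
        return (2, 0, r["blkno"])
    return (1, r["minval"], r["blkno"])

ranges = [
    {"blkno": 0,  "summarized": True,  "all_nulls": False, "minval": 10},
    {"blkno": 4,  "summarized": False, "all_nulls": False, "minval": None},
    {"blkno": 8,  "summarized": True,  "all_nulls": True,  "minval": None},
    {"blkno": 12, "summarized": True,  "all_nulls": False, "minval": 10},
]
ordered = sorted(ranges, key=range_sort_key)
```

Here the not-summarized range sorts first, the two regular ranges with
equal minval fall back to blkno order, and the all-nulls range sorts last.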
I mentioned we do two separate passes - one for non-NULL values, one for
NULL values. That may be somewhat inefficient, because in extreme cases
we might end up scanning the whole table twice (imagine BRIN ranges
where each range has both regular values and NULLs). It might be
possible to do all of this in a single pass, at least in some cases -
for example while scanning ranges, we might stash NULL rows into a
tuplestore, so that the second pass is not needed. That assumes there
are not too many such rows (otherwise we might need to write and then
read many rows, outweighing the cost of just doing two passes). This
should be possible to estimate/cost fairly well, I think, and the
comment in nodeBrinSort actually presents some ideas about this.
And we can't do that for the NULLS FIRST case, because if we stash the
non-NULL rows somewhere, we won't be able to do the "incremental" sort,
i.e. we might just do regular Sort right away. So I've kept this simple
approach with two passes for now.
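The stash-the-NULLs idea for the NULLS LAST case could be sketched like
this (illustrative Python; the plain list stands in for a tuplestore that
would spill to disk):

```python
def scan_nulls_last_single_pass(rows):
    """Single pass for NULLS LAST: feed non-NULL values through to the
    (incremental) sort, stashing NULL rows on the side so no second
    pass over the table is needed. This pays off only when NULLs are
    rare; otherwise writing and re-reading the stash may cost more
    than simply rescanning the ranges.
    """
    null_stash = []                 # stand-in for a tuplestore
    for v in rows:
        if v is None:
            null_stash.append(v)
        else:
            yield v                 # feeds the incremental sort phase
    yield from null_stash           # NULLS LAST, arbitrary order
```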
This still uses the approach with spilling tuples to a tuplestore, and
only sorting rows that we know are safe to output. I still think this is
a good approach, for the reasons I explained before, but changing this
is not hard so we can experiment.
There's however a related question - how quickly should we increment the
minval value, serving as a watermark? One option is to go to the next
distinct minval value - but that may result in an excessive number of tiny
sorts, because the number of ranges and rows between the old and new minval
values tends to be small. Another negative consequence is that this may
cause a lot of spilling (and re-spilling), because we only consume a tiny
number of rows from the tuplestore after incrementing the watermark.
Or we can do larger steps, skipping some of the minval values, so that
more rows qualify for the sort. Of course, too large a step means we'll
exceed work_mem and switch to an on-disk sort, which we probably don't
want. Also, this may be the wrong thing to do for LIMIT queries, that
only need a couple rows, and a tiny sort is fine (because we won't do
too many of them).
Patch 0004 introduces a new GUC called brinsort_watermark_step, that can
be used to experiment with this. By default it's set to '1' which means
we simply progress to the next minval value.
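The trade-off can be illustrated with a toy simulation of watermark
stepping (Python sketch, not patch code; `step` plays the role of
brinsort_watermark_step, and ranges are simplified to (minval, values)
pairs):

```python
def brin_sort_rounds(ranges, step):
    """Simulate BRIN Sort rounds: each round advances the watermark by
    `step` distinct minvals, sorts the rows that qualify, and emits
    them. Returns the output order plus the size of each per-round
    sort - larger steps mean fewer but bigger sorts.
    """
    ranges = sorted(ranges, key=lambda r: r[0])
    minvals = sorted({mv for mv, _ in ranges})
    out, sort_sizes = [], []
    spilled, next_range = [], 0
    for i in range(step, len(minvals) + step, step):
        watermark = minvals[i] if i < len(minvals) else None
        # Load every range whose minval falls below the new watermark.
        while next_range < len(ranges) and (
                watermark is None or ranges[next_range][0] < watermark):
            spilled.extend(ranges[next_range][1])
            next_range += 1
        # Sort only rows below the watermark; the rest stay spilled.
        ready = [v for v in spilled if watermark is None or v < watermark]
        spilled = [v for v in spilled
                   if watermark is not None and v >= watermark]
        sort_sizes.append(len(ready))
        out.extend(sorted(ready))
    return out, sort_sizes
```

Both step sizes produce the same sorted output; they only differ in how
many rows each round's sort has to handle.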
Then 0005 tries to customize this based on statistics - we estimate the
number of rows each minval increment is expected to "add", and then
pick a step value that won't overflow work_mem. This happens in
create_brinsort_plan, and the comment explains the main weakness - the
way the number of rows is estimated is somewhat naive, as it just
divides reltuples by number of ranges. But I have a couple ideas about
what statistics we might collect, explained in 0001 in the comment at
brin_minmax_stats.
But there's another option - we can tune the step based on past sorts.
If we see the sorts are doing on-disk sort, maybe try doing smaller
steps. Patch 0006 implements a very simple variant of this. There's a
couple ideas about how it might be improved, mentioned in the comment at
brinsort_adjust_watermark_step.
There's also patch 0003, which extends the EXPLAIN output with a
counters tracking the number of sorts, counts of on-disk/in-memory
sorts, space used, number of rows sorted/spilled, and so on. This is
useful when analyzing e.g. the effect of higher/lower watermark steps,
discussed in the preceding paragraphs.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Allow-index-AMs-to-build-and-use-custom-sta-20221022.patch
0002-Allow-BRIN-indexes-to-produce-sorted-output-20221022.patch
0003-wip-brinsort-explain-stats-20221022.patch
0004-wip-multiple-watermark-steps-20221022.patch
0005-wip-adjust-watermark-step-20221022.patch
0006-wip-adaptive-watermark-step-20221022.patch
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b). I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
Why is this conditional on the condition that was just Assert()ed ?
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+              bool partial_path)
It'd be nice to refactor existing code to avoid this part being so
duplicative.
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
Isn't that somewhat expensive ?
If that's known, then it'd be good to say that in the documentation.
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,
I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit the new code paths.
Also, note that indxpath.c had the var initialized to true.
+ attno = (i + 1);
+ nranges = (nblocks / pagesPerRange);
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
I'm curious why you have parentheses in these places ?
+#ifndef NODEBrinSort_H
+#define NODEBrinSort_H
NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
This needed a bunch of work to pass any of the regression tests -
even with the feature set to off.
. meson.build needs the same change as the corresponding ./Makefile.
. guc missing from postgresql.conf.sample
. brin_validate.c is missing support for the opr function.
I gather you're planning on changing this part (?) but this allows to
pass tests for now.
. mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
. Uninitialized catalog attribute.
. Some typos in your other patches: "heuristics heuristics". ste.
lest (least).
--
Justin
Attachments:
0001-Allow-index-AMs-to-build-and-use-custom-statistics.patch
0002-f-Allow-index-AMs-to-build-and-use-custom-statistics.patch
0003-Allow-BRIN-indexes-to-produce-sorted-output.patch
0004-f-brinsort.patch
On 10/24/22 06:32, Justin Pryzby wrote:
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b). I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).
Well, the costing is very crude at the moment - at the moment it's
pretty much just a copy of the existing BRIN costing. And the cost is
likely going to increase, because brinsort needs to do regular BRIN
bitmap scan (more or less) and then also a sort (which is an extra cost,
of course). So if it works now, I don't see why brinsort would break it.
Moreover, if you don't have ORDER BY in the query, I don't see why we
would create a brinsort at all.
But if you could test this once the costing gets improved, that'd be
very valuable.
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
Why is this conditional on the condition that was just Assert()ed ?
Yeah, that's a mistake, due to how the code evolved.
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+              bool partial_path)
It'd be nice to refactor existing code to avoid this part being so
duplicative.
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
Isn't that somewhat expensive ?
If that's known, then it'd be good to say that in the documentation.
Some of this is probably a residue from create_indexscan_path and may
not be needed for this new node.
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,
I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit the new code paths.
Also, note that indxpath.c had the var initialized to true.
Good point.
+ attno = (i + 1);
+ nranges = (nblocks / pagesPerRange);
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
I'm curious why you have parentheses in these places ?
Not sure, it seemed more readable when writing the code I guess.
+#ifndef NODEBrinSort_H
+#define NODEBrinSort_H
NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
Yeah, stupid search/replace on the indescan code, which was used as a
starting point.
This needed a bunch of work to pass any of the regression tests -
even with the feature set to off.
. meson.build needs the same change as the corresponding ./Makefile.
. guc missing from postgresql.conf.sample
. brin_validate.c is missing support for the opr function.
I gather you're planning on changing this part (?) but this allows to
pass tests for now.
. mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
. Uninitialized catalog attribute.
. Some typos in your other patches: "heuristics heuristics". ste.
lest (least).
Thanks, I'll get this fixed. I've posted the patch as a PoC to showcase
it and gather some feedback, I should have mentioned it's incomplete in
these ways.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company