MDAM techniques and Index Skip Scan patch
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.
I don't necessarily expect to come away from this discussion with a
much better high level architecture for the patch, or any kind of
deeper insight, or even a frame of reference for further discussion. I
just think that we ought to *try* to impose some order on this stuff.
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things. Many other MDAM techniques also seem highly appealing.
Much of the MDAM stuff is for data warehousing use-cases, while skip
scan/loose index scan is seen as more of an OLTP thing. But they are
still related, clearly.
I'd like to also talk about another patch that ISTM had that same
quality -- it was also held back by high-level design uncertainty. Back in 2018,
Tom abandoned a patch that transformed a star-schema style query with
left outer joins on dimension tables with OR conditions, into an
equivalent query that UNIONs together 2 distinct queries [3][4].
Believe it or not, I am now reminded of that patch by the example of
"IN() Lists", from page 5 of the paper. We see this example SQL query:
SELECT date, item_class, store, sum(total_sales)
FROM sales
WHERE date between '06/01/95' and '06/30/95' and
item_class IN (20,35,50) and
store IN (200,250)
GROUP BY dept, date, item_class, store;
Granted, this SQL might not seem directly relevant to Tom's patch at
first -- there is no join for the optimizer to even try to eliminate,
which was the whole basis of Jim Nasby's original complaint, which is
what spurred Tom to write the patch in the first place. But hear me
out: there is still a fact table (the sales table) with some
dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the
table (and this SQL query) drives discussion of an optimization
involving transforming a predicate with many ORs (which is explicitly
said to be logically/semantically equivalent to the IN() lists from
the query). They transform the query into a bunch of disjunct clauses
that can easily be independently executed, and combined at the end
(see also "General OR Optimization" on page 6 of the paper).
Also...I'm not entirely sure that the intended underlying "physical
plan" is truly free of join-like scans. If you squint just right, you
might see something that you could think of as a "physical join" (at
least very informally). The whole point of this particular "IN()
Lists" example is that we get to the following, for each distinct
"dept" and "date" in the table:
dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250
There are 2400 such accesses in total after transformation -- imagine
additional lines like these, for every distinct combination of dept
and date (only for those dates that actually had sales, which they
enumerate up-front), for store 200 and 250, and item_class 20, 35, and
50. Even 2400 index probes will
be much faster than a full table scan, given that this is a large fact
table. The "sales" table is a clustered index whose keys are on the
columns "(dept, date, item_class, store)", per note at the top of page
4. The whole point is to avoid having any secondary indexes on this
fact table, without getting a full scan. We can just probe the primary
key 2400 times instead, following this transformation. No need for
secondary indexes.
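The cross-product composition that yields those probe keys can be sketched like so (a toy illustration with hypothetical distinct dept/date values, not the paper's actual algorithm):

```python
from itertools import product

# Hypothetical distinct values; MDAM enumerates the dept/date values by
# probing the index up-front, only for combinations that actually exist.
depts = [1, 2]
dates = ['06/04/95', '06/05/95']
item_classes = [20, 35, 50]   # from item_class IN (20, 35, 50)
stores = [200, 250]           # from store IN (200, 250)

# Each combination becomes one probe of the (dept, date, item_class, store)
# primary key -- no secondary indexes needed.
probes = list(product(depts, dates, item_classes, stores))
print(len(probes))  # 2 * 2 * 3 * 2 = 24 probes for this toy input
```

Scale the per-column value sets up and you get the paper's 2400 accesses; the structure of the composition is the same.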
The plan can be thought of as a DAG, at least informally. This is also
somewhat similar to what Tom was thinking about back in 2018. Tom had
to deduplicate rows during execution (IIRC using a UNION style ad-hoc
approach that sorted on TIDs), whereas I think that they can get away
with skipping that extra step. Page 7 says "MDAM removes duplicates
before reading the data, so it does not have to do any post read
operations to accomplish duplicate elimination (a common problem with
OR optimization)".
My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.
Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a clause that
looks like this: "WHERE date between -inf AND +inf" -- which is not so
different from the original sales SQL query example that I have
highlighted. We don't tend to think of queries like this (like my
sales query) as in any way related to skip-scan, because we don't
imagine that there is any skipping going on. But maybe we should
recognize the similarities.
BTW, these imaginary -inf/+inf values seem to me to be just like the
sentinel values already used inside nbtree, for pivot tuples -- they
have explicit -inf values for truncated suffix key columns, and you
can think of a rightmost page as having a +inf high key, per the L&Y
paper. Wearing my B-Tree hat, I don't see much difference between
imaginary -inf/+inf values, and values from the BETWEEN "date" range
from the example SQL query. I have in the past wondered if
_bt_get_endpoint() should have been implemented that way -- we could
go through _bt_search() instead, and get rid of that code. All we need
is insertion scan keys that can explicitly contain the same -inf/+inf
sentinel values. Maybe this also allows us to get rid of
BTScanInsertData.nextkey semantics (not sure offhand).
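To make the idea concrete, here is a minimal sketch (my own toy model, not nbtree code) of treating an omitted prefix column as the unbounded interval, so every scan-key column is uniformly a range:

```python
import math

NEG_INF, POS_INF = -math.inf, math.inf

# Each column of the insertion scan key is a (low, high) interval; a column
# with no qual just gets the unbounded interval, per the MDAM trick of
# pretending "WHERE col BETWEEN -inf AND +inf".
scan_keys = [
    (NEG_INF, POS_INF),     # dept: no predicate at all
    (19950601, 19950630),   # date range, encoded as integers for simplicity
]

def key_matches(row, keys):
    return all(lo <= col <= hi for col, (lo, hi) in zip(row, keys))

print(key_matches((7, 19950615), scan_keys))  # True: any dept qualifies
print(key_matches((7, 19950701), scan_keys))  # False: date out of range
```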
Another more concrete concern about the patch series comes from the
backwards scan stuff. This is added by a later patch in the patch
series, "v39-0004-Extend-amskip-implementation-for-Btree.patch". It
strikes me as a bad thing that we cannot just do leaf-page-at-a-time
processing, without usually needing to hold a pin on the leaf page.
After all, ordinary backwards scans manage to avoid that today, albeit
by using trickery inside _bt_walk_left(). MDAM-style "Maintenance of
Index Order" (as described on page 8) seems like a good goal for us
here. I don't like the idea of doing ad-hoc duplicate TID elimination
inside nbtree, across calls made from the executor (whether it's
during backwards skip scans, or at any other time). Not because it
seems to go against the approach taken by the MDAM paper (though it
does); just because it seems kludgy. (I think that Tom felt the same
way about the TID deduplication stuff in his own patch back in 2018,
too.)
Open question: What does all of this MDAM business mean for
ScalarArrayOpExpr, if anything?
I freely admit that I could easily be worrying over nothing here. But
if I am, I'd really like to know *why* that's the case.
[1]: http://vldb.org/conf/1995/P710.PDF
[2]: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
[3]: /messages/by-id/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489@BlueTreble.com
[4]: /messages/by-id/14593.1517581614@sss.pgh.pa.us
--
Peter Geoghegan
Great to see some interest in the skip scan patch series again!
Like many difficult patches, the skip scan patch is not so much troubled by
problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan meshes
with existing designs, as well as future designs -- particularly designs for
other MDAM techniques. I've started this thread to have a big picture
conversation about how to think about these things. Many other MDAM
techniques also seem highly appealing.
I think it is good to have this discussion. In my opinion, Postgres could make really good use of some of the described MDAM techniques.
Much of the MDAM stuff is for data warehousing use-cases, while skip
scan/loose index scan is seen as more of an OLTP thing. But they are still
related, clearly.
FWIW I think skip scan is very much related to data warehousing use-cases - which is why the TimescaleDB people in your [2] reference already implemented a simple form of it for their extension. Skip scan is a really useful feature for large data sets. However, I agree it is only one part of the bigger MDAM picture.
My general concern is that the skip scan patch may currently be structured in
a way that paints us into a corner, MDAM-wise.
One of the concerns I raised before was that the patch may take too simplistic an approach in some places, which would make it difficult to adopt more complex optimizations in the future. One concrete example can be illustrated by a different query on the sales table of the paper's example:
SELECT DISTINCT dept, date FROM sales WHERE item_class = 100
This should skip with a prefix of (dept, date). Suppose we're at (dept, date) = (1, 2021-01-01) and skipping to the next prefix. The patch implements only what the MDAM paper describes as the 'probing' step: it finds the beginning of the next prefix, for example (dept, date, item_class) = (1, 2021-01-02, 1). From there onwards, it just scans the index until it finds item_class=100. What it should do, however, is first 'probe' for the next prefix value and then skip directly to (1, 2021-01-02, 100), skipping item_class 1-99 altogether. Without this, skip scan can have quite unpredictable performance, because it sometimes ends up scanning through most of the index where it should be skipping.
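The two-step skip described here can be sketched against a toy sorted index (hypothetical data; bisect stands in for a B-tree descent):

```python
import bisect

# Toy index on (dept, date, item_class); dates encoded as small ints.
# Query: SELECT DISTINCT dept, date ... WHERE item_class = 100
index = [
    (1, 1, 1), (1, 1, 100), (1, 2, 1), (1, 2, 50), (1, 2, 100), (2, 1, 100),
]

def next_match(idx, cur_prefix, item_class):
    # Step 1 ("probe"): find the first entry past the current (dept, date).
    pos = bisect.bisect_right(idx, cur_prefix + (float('inf'),))
    if pos == len(idx):
        return None
    next_prefix = idx[pos][:2]
    # Step 2: seek directly to (next_prefix..., item_class), skipping all
    # lower item_class values within that prefix altogether.
    pos = bisect.bisect_left(idx, next_prefix + (item_class,))
    if pos < len(idx) and idx[pos] == next_prefix + (item_class,):
        return idx[pos]
    return None  # a real implementation would loop on to the next prefix

print(next_match(index, (1, 1), 100))  # (1, 2, 100): item_class 1-99 skipped
```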
A while ago, I spent quite some time trying to come up with an implementation that works in this more general case. The nice thing is that with such a more generic implementation, you get almost all the features from the MDAM paper almost for free. I focused on the executor code, not on the planner code - for the DISTINCT skip part, the planner code is very similar to the original patch, and I hacked in a way to make it choose a 'skip scan' for non-DISTINCT queries too, for testing purposes. For this discussion about MDAM, the planner part is less relevant. There's still a lot of discussion and work to be done on the planner side too, but I think that is largely independent of the executor work.
I originally posted the more generic patch in [1], together with some technical considerations. That was quite a while ago, so it no longer applies to master; I've therefore attached a rebased version. Unfortunately, it's still based on an older version of the UniqueKeys patch - but since that patch is all planner machinery as well, it doesn't matter so much for the discussion about the executor code either.
I believe if we want something that fits better with future MDAM use cases, we should take a closer look at the executor code of this patch to drive this discussion. The logic is definitely more complex than the original patch, however I believe it is also more flexible and future-proof.
Another more concrete concern about the patch series comes from the
backwards scan stuff. This is added by a later patch in the patch series, "v39-
0004-Extend-amskip-implementation-for-Btree.patch". It strikes me as a bad
thing that we cannot just do leaf-page-at-a-time processing, without usually
needing to hold a pin on the leaf page.
After all, ordinary backwards scans manage to avoid that today, albeit by
using trickery inside _bt_walk_left(). MDAM-style "Maintenance of Index
Order" (as described on page 8) seems like a good goal for us here. I don't
like the idea of doing ad-hoc duplicate TID elimination inside nbtree, across
calls made from the executor (whether it's during backwards skip scans, or at
any other time). Not because it seems to go against the approach taken by
the MDAM paper (though it does); just because it seems kludgy. (I think that
Tom felt the same way about the TID deduplication stuff in his own patch
back in 2018,
too.)
It's good to mention that the patch I attached does proper 'leaf-page-at-a-time' processing, so it avoids the problem you describe with v39. It is instead implemented in the same way as a "regular" index scan - we process the full leaf page and store the matched tuples in local state. If a DISTINCT scan wants to skip, we first check our local state to see whether the skip is possible within the matched tuples from the current page. That way we avoid double work, as well as the need to look at the same page again.
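A sketch of that check (my own toy model, not the actual patch code): before descending from the root again, look for the next prefix among the tuples already matched from the current leaf page:

```python
# Tuples already matched from the current leaf page, in index order.
current_page = [(1, 5), (1, 7), (2, 1), (2, 9)]

def skip_within_page(page, cur_dept):
    # If the next distinct prefix is already on this page, no re-descent
    # from the root is needed.
    for tup in page:
        if tup[0] > cur_dept:
            return tup
    return None  # next prefix is beyond this page: re-descend the tree

print(skip_within_page(current_page, 1))  # (2, 1): found locally
print(skip_within_page(current_page, 2))  # None: must descend again
```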
Open question: What does all of this MDAM business mean for
ScalarArrayOpExpr, if anything?
This is a really interesting combination actually. I think, ideally, you'd probably get rid of it and provide full support for that with the 'skip' based approach (essentially the ScalarArrayOpExpr seems to do some form of skipping already - it transforms x IN (1,2,3) into 3 separate index scans for x).
However, even without doing any work on it, it actually interacts nicely with the skip based approach.
As an example, here are some queries based on the 'sales' table of the paper, with some data in it (18M rows total; see the sales_query.sql attachment for the full example):
-- terminology from paper: "intervening range predicate"
select date, sum(total_sales)
from sales
where dept between 2 and 3 and date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250
group by dept, date
;
Patch: Execution Time: 0.368 ms
Master: Execution Time: 416.792 ms
-- terminology from paper: "missing key predicate"
select date, sum(total_sales)
from sales
where date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250
group by dept, date
;
Patch: Execution Time: 0.667 ms
Master: Execution Time: 654.684 ms
-- terminology from paper: "IN lists"
-- this is similar to the query from your example with an IN list
-- in the current patch, this query is done with a skip scan with skip prefix (dept, date) and then the ScalarArrayOp for item_class IN (20, 30, 50)
select date, sum(total_sales)
from sales
where date between '2021-01-05' and '2021-01-10' and item_class in (20, 30, 50) and store=250
group by dept, date
;
Patch: Execution Time: 1.767 ms
Master: Execution Time: 629.792 ms
The other mentioned MDAM optimizations in the paper (NOT =, general OR) are not implemented but don't seem to be conflicting at all with the current implementation - they seem to be just a matter of transforming the expressions into the right form.
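Those transformations amount to rewriting each predicate as an ordered set of disjoint intervals, roughly like this (a toy sketch over integer keys; the names are mine, not the patch's):

```python
import math

def not_equals(v):
    # x <> v becomes x < v OR x > v: two disjoint intervals (integer keys).
    return [(-math.inf, v - 1), (v + 1, math.inf)]

def merge_or(intervals):
    # General OR: union the intervals, sorted and coalesced, so the scan
    # can visit them strictly in index order.
    out = []
    for lo, hi in sorted(intervals):
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

print(not_equals(5))                       # [(-inf, 4), (6, inf)]
print(merge_or([(1, 3), (2, 6), (9, 9)]))  # [(1, 6), (9, 9)]
```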
From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series, which focuses on the planner. v2-0001 is Dmitry's patch that extends it to make it possible to use UniqueKeys for the skip scan. Both of these are unfortunately still older patches, but because they are planner machinery they are less relevant to the discussion about the executor here.
Patch v2-0002 contains most of my work and introduces all the executor logic for the skip scan and hooks up the planner for DISTINCT queries to use the skip scan.
Patch v2-0003 is a planner hack that makes the planner pick a skip scan on virtually every possibility. This also enables the skip scan on the queries above that don't have a DISTINCT but where it could be useful.
-Floris
[1]: /messages/by-id/c5c5c974714a47f1b226c857699e8680@opammb0561.comp.optiver.com
On Thu, Oct 21, 2021 at 07:16:00PM -0700, Peter Geoghegan wrote:
My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.
Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a clause that
looks like this: "WHERE date between -inf AND +inf" -- which is not so
different from the original sales SQL query example that I have
highlighted. We don't tend to think of queries like this (like my
sales query) as in any way related to skip-scan, because we don't
imagine that there is any skipping going on. But maybe we should
recognize the similarities.
To avoid potential problems with extensibility in this sense, the
implementation needs to explicitly work with sets of disjoint intervals
of values instead of simple prefix size, one set of intervals per scan
key. An interesting idea; it doesn't seem to be a big change in terms of
the patch itself.
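Under that representation, "skipping" reduces to finding the next value covered by a scan key's interval set -- something like this hypothetical sketch:

```python
# One ordered set of disjoint intervals per scan key; an IN list is just
# a set of single-value intervals.
item_class = [(20, 20), (35, 35), (50, 50)]   # item_class IN (20, 35, 50)

def next_covered(value, intervals):
    for lo, hi in intervals:   # intervals are sorted and disjoint
        if value < lo:
            return lo          # jump forward to the start of the interval
        if value <= hi:
            return value       # already inside an interval
    return None                # past the last interval: key exhausted

print(next_covered(21, item_class))  # 35: skip everything in between
print(next_covered(51, item_class))  # None
```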
Hi Peter,
On 10/21/21 22:16, Peter Geoghegan wrote:
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.
Thanks for starting this thread !
The Index Skip Scan patch could affect a lot of areas, so keeping MDAM
in mind is definitely important.
However, I think the key part to progress on the "core" functionality
(B-tree related changes) is to get the planner functionality in place
first. Hopefully we can make progress on that during the November
CommitFest based on Andy's patch.
Best regards,
Jesper
Hi,
On Sat, Oct 23, 2021 at 07:30:47PM +0000, Floris Van Nee wrote:
From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series,
which focuses on the planner. v2-0001 is Dmitry's patch that extends it to
make it possible to use UniqueKeys for the skip scan. Both of these are
unfortunately still older patches, but because they are planner machinery
they are less relevant to the discussion about the executor here. Patch
v2-0002 contains most of my work and introduces all the executor logic for
the skip scan and hooks up the planner for DISTINCT queries to use the skip
scan. Patch v2-0003 is a planner hack that makes the planner pick a skip
scan on virtually every possibility. This also enables the skip scan on the
queries above that don't have a DISTINCT but where it could be useful.
The patchset doesn't apply anymore:
http://cfbot.cputube.org/patch_36_1741.log
=== Applying patches on top of PostgreSQL commit ID a18b6d2dc288dfa6e7905ede1d4462edd6a8af47 ===
=== applying patch ./v2-0001-Extend-UniqueKeys.patch
[...]
patching file src/include/optimizer/paths.h
Hunk #2 FAILED at 299.
1 out of 2 hunks FAILED -- saving rejects to file src/include/optimizer/paths.h.rej
Could you send a rebased version? In the meantime I will change the status on
the cf app to Waiting on Author.
Could you send a rebased version? In the meantime I will change the status
on the cf app to Waiting on Author.
Attached a rebased version.
On Thu, Jan 13, 2022 at 03:27:08PM +0000, Floris Van Nee wrote:
Could you send a rebased version? In the meantime I will change the status
on the cf app to Waiting on Author.
Attached a rebased version.
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1], waiting for reviews since the last May
* an alternative one posted here from Floris
[1]: /messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost
Hi,
On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1], waiting for reviews since the last May
* an alternative one posted here from Floris
Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
It's nice to have the alternative approach threads linked in the commit fest,
but it seems that the cfbot will use the most recent attachments as the only
patchset, thus leaving the "original" one untested.
I'm not sure of what's the best approach in such situation. Maybe creating a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?
On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
Hi,
On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1], waiting for reviews since the last May
* an alternative one posted here from Floris
Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
It's nice to have the alternative approach threads linked in the commit fest,
but it seems that the cfbot will use the most recent attachments as the only
patchset, thus leaving the "original" one untested.
I'm not sure of what's the best approach in such situation. Maybe creating a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?
I don't mind having all of the alternatives under the same CF item, only
one being tested seems to be only a small limitation of cfbot.
Hi,
On 2022-01-22 22:37:19 +0100, Dmitry Dolgov wrote:
On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
Hi,
On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1], waiting for reviews since the last May
* an alternative one posted here from Floris
Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
It's nice to have the alternative approach threads linked in the commit fest,
but it seems that the cfbot will use the most recent attachments as the only
patchset, thus leaving the "original" one untested.
I'm not sure of what's the best approach in such situation. Maybe creating a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?
I don't mind having all of the alternatives under the same CF item, only
one being tested seems to be only a small limitation of cfbot.
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feeback or whatnot.
Either way, currently this patch fails on cfbot due to a new GUC:
https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
https://cirrus-ci.com/task/5134905372835840
Greetings,
Andres Freund
On Mon, Mar 21, 2022 at 06:34:09PM -0700, Andres Freund wrote:
I don't mind having all of the alternatives under the same CF item, only
one being tested seems to be only a small limitation of cfbot.
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feeback or whatnot.
On the other hand even for patches with dependencies (i.e. the patch A
depends on the patch B) different CF items cause a lot of confusion for
reviewers. I guess for various flavours of the same patch it would be
even worse. But I don't have a strong opinion here.
Either way, currently this patch fails on cfbot due to a new GUC:
https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
https://cirrus-ci.com/task/5134905372835840
This seems to be easy to solve.
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feeback or whatnot.
I have the impression that this thread is getting some value from
having a CF entry, as a multi-person collaboration where people are
trading ideas and also making progress that no one wants to mark as
returned, but it's vexing for people managing the CF because it's not
really proposed for 15. Perhaps what we lack is a new status, "Work
In Progress" or something?
Peter Geoghegan <pg@bowt.ie> writes:
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things.
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here. I have done that, and here
are a few modest suggestions.
1. Usually I'm in favor of doing this sort of thing in an index AM
agnostic way, but here I don't see much point. All of the ideas at
stake rely fundamentally on having a lexicographically-ordered multi
column index; but we don't have any of those except btree, nor do
I think we're likely to get any soon. This motivates the general
tenor of my remarks below, which is "do it in access/nbtree/ not in
the planner".
2. The MDAM paper Peter cited is really interesting. You can see
fragments of those ideas in our existing btree code, particularly in
the scan setup stuff that detects redundant or contradictory keys and
determines a scan start strategy. The special handling we implemented
awhile ago for ScalarArrayOp index quals is also a subset of what they
were talking about. It seems to me that if we wanted to implement more
of those ideas, the relevant work should almost all be done in nbtree
proper. The planner would need only minor adjustments: btcostestimate
would have to be fixed to understand the improvements, and there are
some rules in indxpath.c that prevent us from passing "too complicated"
sets of indexquals to the AM, which would need to be relaxed or removed
altogether.
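For a flavor of the preprocessing being discussed, here is a toy model (mine, far simpler than _bt_preprocess_keys(), and ignoring < vs <= strictness) of collapsing redundant quals on one column and detecting contradictory ones:

```python
import math

def combine_quals(quals):
    lo, hi = -math.inf, math.inf
    for op, v in quals:
        if op in ('>', '>='):
            lo = max(lo, v)        # redundant keys: keep only the tightest
        elif op in ('<', '<='):
            hi = min(hi, v)
        elif op == '=':
            lo, hi = max(lo, v), min(hi, v)
    return (lo, hi) if lo <= hi else None  # None: contradictory, scan is empty

print(combine_quals([('>', 3), ('>', 5), ('<=', 9)]))  # (5, 9)
print(combine_quals([('>', 5), ('<', 2)]))             # None
```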
3. "Loose" indexscan (i.e., sometimes re-descend from the tree root
to find the next index entry) is again something that seems like it's
mainly nbtree's internal problem. Loose scan is interesting if we
have index quals for columns that are after the first column that lacks
an equality qual, otherwise not. I've worried in the past that we'd
need planner/statistical support to figure out whether a loose scan
is likely to be useful compared to just plowing ahead in the index.
However, that seems to be rendered moot by the idea used in the current
patchsets, ie scan till we find that we'll have to step off the current
page, and re-descend at that point. (When and if we find that that
heuristic is inadequate, we could work on passing some statistical data
forward. But we don't need any in the v1 patch.) Again, we need some
work in btcostestimate to understand how the index scan cost will be
affected, but I still don't see any pressing need for major planner
changes or plan tree contents changes.
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case. I wouldn't even bother with that for the
initial patch. When we do get around to it, it still doesn't need
major planner support, I think --- again fixing the cost estimation
is the bulk of the work. Munro's original 2014 patch showed that we
don't really need all that much to get the planner to build such a
plan; the problem is to convince it that that plan will be cheap.
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value. The part of the planner that actually needs
serious thought is btcostestimate, which has been woefully neglected in
both of the current patchsets.
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start. This issue might become more
pressing if the work becomes significantly more complicated/expensive,
which these ideas might cause. But it's a refinement that could be
left for later --- and in any case, the responsibility would still
fundamentally be nbtree's. I don't think the planner would do more
than call some AM routine that could add decoration to an IndexScan
plan node.
Now ... where did I put my flameproof vest?
regards, tom lane
Hi,
On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case.
It's something causing plenty of pain in production environments... Obviously
it'd be even better if the optimization also triggered in cases like
SELECT some_indexed_col FROM blarg GROUP BY some_indexed_col
which seems to be what ORMs like to generate.
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start.
It does show up in simple-index-lookup heavy workloads. Not as a major thing,
but it's there. And it's just architecturally displeasing :)
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?
The latter does show up a bit more heavily in profiles than nbtree specific
setup, and given that it's generic executor type stuff, seems even more
amenable to being moved to plan time.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start.
It does show up in simple-index-lookup heavy workloads. Not as a major thing,
but it's there. And it's just architecturally displeasing :)
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?
Didn't really have specifics in mind. The key stumbling block is
that some (not all) of the work depends on knowing the specific
values of the indexqual comparison keys, so while you could do
that work in advance for constant keys, you'd still have to be
prepared to do work at scan start for non-constant keys. I don't
have a clear idea about how to factorize that effectively.
A couple of other random ideas in this space:
* I suspect that a lot of this work overlaps with the efforts that
btcostestimate makes along the way to getting a cost estimate.
So it's interesting to wonder whether we could refactor so that
btcostestimate is integrated with this hypothetical plan-time key
preprocessing and doesn't duplicate work.
* I think that we run through most or all of that preprocessing
logic even for internal catalog accesses, where we know darn well
how the keys are set up. We ought to think harder about how we
could short-circuit pointless work in those code paths.
I don't think any of this is an essential prerequisite to getting
something done for loose index scans, which ISTM ought to be the first
point of attack for v16. Loose index scans per se shouldn't add much
to the key preprocessing costs. But these ideas likely would be
useful to look into before anyone starts on the more complicated
preprocessing that would be needed for the ideas in the MDAM paper.
regards, tom lane
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things.
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here. I have done that, and here
are a few modest suggestions.
Thanks. To make sure I understand your proposal better, I have a couple
of questions:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seem to be essential) -- is that what you mean?
Dmitry Dolgov <9erthalion6@gmail.com> writes:
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seem to be essential) -- is that what you mean?
My point is that for pure loose scans --- that is, just optimizing a scan,
not doing AM-based duplicate-row-elimination --- you do not need to pass
any new data to btree at all. It can infer what to do on the basis of the
set of index quals it's handed.
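Tom's point that btree can infer the strategy from the quals alone can be illustrated with a deliberately simplified sketch (invented names; not actual nbtree code): the "omitted prefix" is just the run of leading index columns that lack equality quals, provided some later column does have one.

```python
# Given the index's column order and the set of columns that have equality
# quals, report the omitted prefix that a skip strategy would step through.
# No extra planner-supplied data is needed.

def omitted_prefix(index_columns, equality_cols):
    """Return leading index columns lacking equality quals, if some later
    column does have one; otherwise return [] (no skipping applicable)."""
    prefix = []
    for col in index_columns:
        if col in equality_cols:
            if prefix:
                # Equality qual after a gap: a skip strategy applies.
                return prefix
            # Contiguous leading equalities: a plain index scan is fine.
        else:
            prefix.append(col)
    return []  # no equality qual after the gap: nothing to gain by skipping
```

For example, an index on (a, b) with the single qual b = 42 yields the omitted prefix [a], while a qual on a alone yields nothing to skip.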
The bigger picture here is that I think the reason this patch series has
failed to progress is that it's too scattershot. You need to pick a
minimum committable feature and get that done, and then you can move on
to the next part. I think the minimum committable feature is loose scans,
which will require a fair amount of work in access/nbtree/ but very little
new planner code, and will be highly useful in their own right even if we
never do anything more.
In general I feel that the UniqueKey code is a solution looking for a
problem, and that treating it as the core of the patchset is a mistake.
We should be driving this work off of what nbtree needs to make progress,
and not building more infrastructure elsewhere than we have to. Maybe
we'll end up with something that looks like UniqueKeys, but I'm far from
convinced of that.
regards, tom lane
On Wed, Mar 23, 2022 at 05:32:46PM -0400, Tom Lane wrote:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seem to be essential) -- is that what you mean?
My point is that for pure loose scans --- that is, just optimizing a scan,
not doing AM-based duplicate-row-elimination --- you do not need to pass
any new data to btree at all. It can infer what to do on the basis of the
set of index quals it's handed.
The bigger picture here is that I think the reason this patch series has
failed to progress is that it's too scattershot. You need to pick a
minimum committable feature and get that done, and then you can move on
to the next part. I think the minimum committable feature is loose scans,
which will require a fair amount of work in access/nbtree/ but very little
new planner code, and will be highly useful in their own right even if we
never do anything more.
In general I feel that the UniqueKey code is a solution looking for a
problem, and that treating it as the core of the patchset is a mistake.
We should be driving this work off of what nbtree needs to make progress,
and not building more infrastructure elsewhere than we have to. Maybe
we'll end up with something that looks like UniqueKeys, but I'm far from
convinced of that.
I see. I'll need some thinking time about what it may look like (will
probably return with more questions).
The CF item could be set to RwF; what would you say, Jesper?
On 3/23/22 18:22, Dmitry Dolgov wrote:
The CF item could be set to RwF, what would you say, Jesper?
We want to thank the community for the feedback that we have received
over the years for this feature. Hopefully a future implementation can
use Tom's suggestions to get closer to a committable solution.
Here is the last CommitFest entry [1] for the archives.
RwF
[1]: https://commitfest.postgresql.org/37/1741/
Best regards,
Dmitry & Jesper
On Tue, Mar 22, 2022 at 4:06 PM Andres Freund <andres@anarazel.de> wrote:
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?
The latter does show up a bit more heavily in profiles than nbtree specific
setup, and given that it's generic executor type stuff, seems even more
amenable to being moved to plan time.
When I was working on the patch series that became the nbtree Postgres
12 work, this came up. At one point I discovered that using palloc0()
for the insertion scankey in _bt_first() was a big problem with nested
loop joins -- it became a really noticeable bottleneck with one of my
test cases. I independently discovered what Tom must have figured out
back in 2005, when he committed d961a56899. This led to my making the
new insertion scan key structure (BTScanInsertData) not use dynamic
allocation. So _bt_first() is definitely performance critical for
certain types of queries.
We could get rid of dynamic allocations for BTStackData in
_bt_first(), perhaps. The problem is that there is no simple,
reasonable proof of the maximum height of a B-tree, even though a
B-tree with more than 7 or 8 levels seems extraordinarily unlikely.
You could also invent a slow path (maybe do what we do in
_bt_insert_parent() in the event of a concurrent root page split/NULL
stack), but that runs into the problem of being awkward to test, and
pretty ugly. It's doable, but I wouldn't do it unless there was a
pretty noticeable payoff.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
We could get rid of dynamic allocations for BTStackData in
_bt_first(), perhaps. The problem is that there is no simple,
reasonable proof of the maximum height of a B-tree, even though a
B-tree with more than 7 or 8 levels seems extraordinarily unlikely.
Start with a few entries preallocated, and switch to dynamically
allocated space if there turn out to be more levels than that,
perhaps? Not sure if it's worth the trouble.
In any case, what I was on about is _bt_preprocess_keys() and
adjacent code. I'm surprised that those aren't more expensive
than one palloc in _bt_first. Maybe that logic falls through very
quickly in simple cases, though.
regards, tom lane
On Tue, Mar 22, 2022 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here.
Thanks for taking a look!
"5.5 Exploiting Key Prefixes" and "5.6 Ordered Retrieval" from "Modern
B-Tree Techniques" are also good, BTW.
The terminology in this area is a mess. MySQL calls
SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
that you're talking about skip scan when you say "loose index scan".
Skip scan is where the SQL query omits a prefix of the index's
columns -- one or more leading columns lack an equality qual while a
later column has one. Pretty sure that MySQL/InnoDB can't do that -- it can only
"skip" to the extent required to make
SELECT-DISTINCT-that-matches-an-index perform well, but that's about
it.
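To pin the term down with a runnable toy (a Python model of an index on (a, b); none of the names here are Postgres code): for WHERE b = 42 with no qual on a, the scan re-descends once per distinct value of a rather than walking every entry, which is the essence of the MDAM-style skip.

```python
# Skip scan over a sorted list of (a, b) tuples standing in for a composite
# index. Each distinct value of a costs one "re-descent" (binary search),
# instead of one comparison per index entry.

from bisect import bisect_left, bisect_right

def skip_scan(index, b_target):
    """Return matching (a, b_target) entries and the number of re-descents."""
    results, probes = [], 0
    pos = 0
    while pos < len(index):
        a = index[pos][0]                        # current distinct a value
        lo = bisect_left(index, (a, b_target))   # "re-descend" to (a, target)
        hi = bisect_right(index, (a, b_target))
        probes += 1
        results.extend(index[lo:hi])
        pos = bisect_right(index, (a, float("inf")))  # skip past this a
    return results, probes
```

With 5 distinct values of a and 100 values of b each, the skip scan does 5 probes instead of examining 500 entries, while still returning results in index order.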
It might be useful for somebody to go write a "taxonomy of MDAM
techniques", or a glossary. The existing "Loose indexscan" Postgres
wiki page doesn't seem like enough. Something very high level and
explicit, with examples, just so we don't end up talking at cross
purposes too much.
1. Usually I'm in favor of doing this sort of thing in an index AM
agnostic way, but here I don't see much point. All of the ideas at
stake rely fundamentally on having a lexicographically-ordered
multi-column index; but we don't have any of those except btree, nor do
I think we're likely to get any soon. This motivates the general
tenor of my remarks below, which is "do it in access/nbtree/ not in
the planner".
That was my intuition all along, but I didn't quite have the courage
to say so -- sounds too much like something that an optimizer
dilettante like me would be expected to say. :-)
Seems like one of those things where lots of high level details
intrinsically need to be figured out on-the-fly, at execution time,
rather than during planning. Perhaps it'll be easier to correctly
determine that a skip scan plan is the cheapest in practice than to
accurately cost skip scan plans. If the only alternative is a
sequential scan, then perhaps a very approximate cost model will work
well enough. It's probably way too early to tell right now, though.
I've worried in the past that we'd
need planner/statistical support to figure out whether a loose scan
is likely to be useful compared to just plowing ahead in the index.
I don't expect to be able to come up with a structure that leaves no
unanswered questions about future MDAM work -- it's not realistic to
expect everything to just fall into place. But that's okay. Just
having everybody agree on roughly the right conceptual model is the
really important thing. That now seems quite close, which I count as
real progress.
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case. I wouldn't even bother with that for the
initial patch.
I absolutely agree. I wondered about that myself in the past. My best
guess is that a certain segment of users are familiar with
SELECT-DISTINCT-that-matches-an-index from MySQL. And so to some
extent application frameworks evolved in a world where that capability
existed. IIRC Jesper once said that Hibernate relied on this
capability.
It's probably a lot easier to implement
SELECT-DISTINCT-that-matches-an-index if you have the MySQL storage
engine model, with concurrency control that's typically based on
two-phase locking. I think that MySQL does some amount of
deduplication in its executor here -- and *not* in what they call the storage
engine. This is exactly what I'd like to avoid in Postgres; as I said
"Maintenance of Index Order" (as the paper calls it) seems important,
and not something to be added later on. Optimizer and executor layers
that each barely know the difference between a skip scan and a full
index scan seems like something we might actually want to aim for,
rather than avoid. Teaching nbtree to transform quals into ranges
sounds odd at first, but it seems like the right approach now, on
balance -- that's the only *good* way to maintain index order.
(Maintaining index order is needed to avoid needing or relying on
deduplication in the executor proper, which is even inappropriate in
an implementation of SELECT-DISTINCT-that-matches-an-index IMO.)
--
Peter Geoghegan
On Mon, Mar 28, 2022 at 5:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In any case, what I was on about is _bt_preprocess_keys() and
adjacent code. I'm surprised that those aren't more expensive
than one palloc in _bt_first. Maybe that logic falls through very
quickly in simple cases, though.
I assume that it doesn't really appear in very simple cases (which are
also the common cases). But delaying the scan setup work until execution time
does seem ugly. That's probably a good enough reason to refactor.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
The terminology in this area is a mess. MySQL calls
SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
that you're talking about skip scan when you say "loose index scan".
Skip scan is where there is an omitted prefix of columns in the SQL
query -- omitted columns after the first column that lack an equality
qual.
Right, that's the case I had in mind --- apologies if my terminology
was faulty. btree can actually handle such a case now, but what it
fails to do is re-descend from the tree root instead of plowing
forward in the index to find the next matching entry.
It might be useful for somebody to go write a "taxonomy of MDAM
techniques", or a glossary.
+1. We at least need to be sure we all are using these terms
the same way.
regards, tom lane
On Mon, Mar 28, 2022 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Right, that's the case I had in mind --- apologies if my terminology
was faulty. btree can actually handle such a case now, but what it
fails to do is re-descend from the tree root instead of plowing
forward in the index to find the next matching entry.
KNNGIST seems vaguely related to what we'd build for nbtree skip scan.
GiST index scans are "inherently loose", though. KNNGIST uses
a pairing heap/priority queue, which seems like the kind of thing
nbtree skip scan can avoid.
+1. We at least need to be sure we all are using these terms
the same way.
Yeah, there are *endless* opportunities for confusion here.
--
Peter Geoghegan