Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

Started by Peter Geoghegan over 2 years ago · 78 messages · pgsql-hackers

I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.

Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.

The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.

SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.

Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.
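The dynamic choice described above can be sketched in miniature (a toy model, not the actual nbtree code; `advance_array_keys_locally` and the single-key comparison are hypothetical simplifications of what the real state machine does with full scan keys):

```python
# Hypothetical sketch of the decision between advancing the array keys
# "locally" (stay on the current leaf page) and "globally" (start a new
# primitive index scan with another _bt_first/_bt_search descent).

def advance_array_keys_locally(page_high_key, next_array_key):
    """Return True when the next set of array keys still falls within
    the current leaf page's key space, so no new descent of the index
    is needed to continue the scan.

    page_high_key  -- upper bound of the current leaf page's key space
    next_array_key -- the next array constant in line to be materialized
    """
    return next_array_key <= page_high_key

# Example: scanning an index with a IN (1, 2, 3).  If the current leaf
# page's high key covers values up to 5, all three would-be primitive
# index scans collapse into a single pass over that one page.
assert advance_array_keys_locally(page_high_key=5, next_array_key=2) is True
assert advance_array_keys_locally(page_high_key=5, next_array_key=9) is False
```

The real code compares full insertion scan keys against the page high key, but the shape of the decision is the same: cheap comparison first, index descent only when the comparison says the next keys are far away.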

This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses
(or other clauses we can safely treat as "equality constraints") on
each of the leading columns from a composite index -- all while
preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list
query involving a scan of a single column index. Obviously, it's very
inefficient for the nbtree code to use 100 primitive index scans when
1 is sufficient. But that's not really why I'm pursuing this project.
My real goal is to implement (or to enable the implementation of) a
whole family of useful techniques for multi-column indexes. I call
these "MDAM techniques", after the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1] -- MDAM is short for "multidimensional
access method". In the context of the paper, "dimension" refers to
dimensions in a decision support system.

The most compelling cases for the patch all involve multiple index
columns with multiple SAOP clauses (especially where each column
represents a separate "dimension", in the DSS sense). It's important
that index sort order be preserved whenever possible, too. Sometimes this is
directly useful (e.g., because the query has an ORDER BY), but it's
always indirectly needed, on the nbtree side (when the optimizations
are applicable at all). The new nbtree code now has special
requirements surrounding SAOP search type scan keys with composite
indexes. These requirements make changes in the optimizer all but
essential.

Index order
===========

As I said, there are cases where preserving index order is immediately
and obviously useful, in and of itself. Let's start there.

Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX

pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;

With the patch, this query gets 13 buffer hits. On the master branch,
it gets 1377 buffer hits -- which exceeds the number you'll get from a
sequential scan by about 4x. No coaxing was required to get the
planner to produce this plan on the master branch. Almost all of the
savings shown here are related to heap page buffer hits -- the nbtree
changes don't directly help in this particular example (strictly
speaking, you only need the optimizer changes to get this result).

Obviously, the immediate reason why the patch wins by so much is
because it produces a plan that allows the LIMIT to terminate the scan
far sooner. Benoit Tigeot (CC'd) happened to run into this issue
organically -- that was also due to heap hits, a LIMIT, and so on. As
luck would have it, I stumbled upon his problem report (in the
Postgres slack channel) while I was working on this patch. He produced
a fairly complete test case, which was helpful [3]. This example is
more or less just a distillation of his test case, designed to be easy
for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial
factor, and where index page hits are the problem. This query uses an
index-only scan, both on master and with the patch (same index as
before):

select count(*), two, four, twenty
from tenk1
where two in (0, 1) and four in (1, 2, 3, 4) and
twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,14,15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes
intuitive sense, since this query is highly unselective -- it's
approaching the selectivity of the query "select count(*) from tenk1".
The simple count(*) query gets 19 buffer hits for its own index-only
scan, confirming that the patch managed to skip only one or two leaf
pages in the complicated "group by" variant of the count(*) query.
Overall, the GroupAggregate plan used by the patch is slower than the
simple count(*) case (despite touching fewer pages). But both plans
have *approximately* the same execution cost, which makes sense, since
they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query.
This is almost as many hits as a sequential scan would require -- even
though there are precisely zero heap accesses needed by the underlying
index-only scan. As with the first example, no planner coaxing was
required to get this outcome on master. It is inherently very
difficult to predict how selective a query like this will be using
conventional statistics. But that's not actually the problem in this
example -- the planner gets that part right, on this occasion. The
real problem is that there is a multiplicative factor to worry about
on master, when executing multiple SAOPs. That makes it almost
impossible to predict the number of pages we'll pin. With the patch,
by contrast, scans with multiple SAOPs are often fairly similar to
scans that happen to have just one, on the leading column.

With the patch, it is simply impossible for an SAOP index scan to
visit any single leaf page more than once. Just like a conventional
index scan. Whereas right now, on master, using more than one SAOP
clause for a multi column index seems to me to be a wildly risky
proposition. You can easily have cases that work just fine on master,
while only slight variations of the same query see costs explode
(especially likely with a LIMIT). ISTM that there is significant value
in having a pretty accurate idea of the worst case in the planner.

Giving nbtree the ability to skip or not skip dynamically, based on
actual conditions in the index (not on statistics), seems like it has
a lot of potential as a way of improving performance *stability*.
Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to
make a negligible difference to the overall cost profile of scans. Our
policy is to not charge an I/O cost for those pages. Plus, the number
of internal page accesses is dramatically reduced (it's just not
guaranteed that there won't be any repeat accesses for internal pages,
is all).

Note also: there are hard-to-pin-down interactions between the
immediate problem on the nbtree side, and the use of filter quals
rather than true index quals, where the use of index quals is possible
in principle. Some problematic cases see excessive amounts of heap
page hits only (as with my first example query). Other problematic
cases see excessive amounts of index page hits, with little to no
impact on heap page hits at all (as with my second example query).
Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array
keys locally" during a scan (the nbtree work in item 1) actually
relies on the optimizer work in item 2 -- it's not just a question of
unlocking the potential of the nbtree work. Now I'll discuss those
aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to
queries, in a way that resembles the complaint addressed by historical
bugfix commit 807a40c5. This incorrect behavior (if the optimizer were
to permit it) would only be seen when there are multiple
arrays/columns, and an inequality on a leading column -- just like
with that historical bug. (It works both ways, though -- the nbtree
changes also make the optimizer changes safe by limiting the worst
case, which would otherwise be too much of a risk to countenance. You
can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to
differentiate between the following two cases when building an index
path in indxpath.c:

* Unsafe: Cases where it's fundamentally unsafe to treat
multi-column-with-SAOP-clause index paths as returning tuples in a
useful sort order.

For example, the test case committed as part of that bugfix involves
an inequality, so it continues to be treated as unsafe.

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went
further than it really had to.

Those cases get to use the optimization, and usually get to have
useful path keys.

My optimizer changes are very kludgey. I came up with various ad-hoc
rules to distinguish between the safe and unsafe cases, without ever
really placing those changes into some kind of larger framework. That
was enough to validate the general approach in nbtree, but it
certainly has problems -- glaring problems. The biggest problem of all
may be my whole "safe vs unsafe" framing itself. I know that many of
the ostensibly unsafe cases are in fact safe (with the right
infrastructure in place), because the MDAM paper says just that. The
optimizer can't support inequalities right now, but the paper
describes how to support "NOT IN( )" lists -- clearly an inequality!
The current ad-hoc rules are at best incomplete, and at worst are
addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".
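The cartesian-product behavior described above can be sketched in a few lines (a toy model; `dnf_single_value_predicates` is a hypothetical name, not a Postgres function):

```python
# Toy model of the "next array keys" state machine: materialize
# MDAM-style DNF single value predicates as the cartesian product of
# each column's sorted, deduplicated array of constants.
from itertools import product

def dnf_single_value_predicates(*arrays):
    """Yield each set of array keys in index keyspace order."""
    normalized = [sorted(set(arr)) for arr in arrays]
    # itertools.product is lazy, so the full cartesian product never
    # has to be materialized up front.
    yield from product(*normalized)

# WHERE two IN (0, 1) AND four IN (2, 1) becomes four disjuncts,
# emitted in keyspace order:
keys = list(dnf_single_value_predicates([0, 1], [2, 1]))
assert keys == [(0, 1), (0, 2), (1, 1), (1, 2)]
```

Each emitted tuple corresponds to one MDAM-style single value predicate; producing them in keyspace order is what lets the scan process them as a series of non-overlapping primitive index scans.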

In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).

The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]. Much work remains here.

Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key
predicates". MDAM describes how to implement this technique by
applying the same principles that it applies everywhere else: it
proposes a scheme that converts predicates from CNF to DNF. With just
a little extra logic required to do index probes to feed the
DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by
inventing a new placeholder clause that can be composed alongside
ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs
can be composed together in the patch already. I'm thinking of a type
of clause that makes the nbtree code materialize a set of "array keys"
for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes
(perhaps static approaches would be better for types like boolean,
which the paper contemplates). It should be possible to teach the
_bt_advance_array_keys state machine to generate those values in
approximately the same fashion as it already does for
ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a
localized fashion, allowing everything else to continue to work in the
same way without any special concern. This separation of concerns is a
nice consequence of the way that the MDAM design really leverages
preprocessing/DNF for everything.

Both types of clauses can be treated as part of a general class of
ScalarArrayOpExpr-like clauses. Making the rules around
"composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the
most compelling MDAM technique. It's also not especially challenging
to implement on top of everything else. It really isn't that special.
Right now I'm focussed on the big picture, in any case. I want to
emphasize the very general nature of these techniques. Although I'm
focussed on SAOPs in the short term, many queries that don't make use
of SAOPs should ultimately see similar benefits. For example, the
paper also describes transformations that apply to BETWEEN/range
predicates. We might end up needing a third type of expression for
those. They're all just DNF single value predicates, under the hood.

Thoughts?

[1]: http://vldb.org/conf/1995/P710.PDF
[2]: /messages/by-id/2587523.1647982549@sss.pgh.pa.us
[3]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d
--
Peter Geoghegan

Attachments:

v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +919 -85)
#2 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#1)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote:

I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.

Cool.

Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.

The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.

SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.

Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.

This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

[...]
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose Index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc...
With an index column limit of 32, that's quite a lot of memory
potentially needed to execute the statement.
So, this begs the question: does this patch have the same issue? Does
it fail with OOM, does it gracefully fall back to the old behaviour
when the clauses are too complex to linearize/compose/fold into the
btree ordering clauses, or are scan keys dynamically constructed using
just-in-time- or generator patterns?
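The multiplicative blow-up is easy to quantify (a quick arithmetic illustration of the worst case described above, not a claim about the patch's actual memory use):

```python
# Number of distinct array-key sets = product of the number of
# constants per column, not their sum.
from math import prod

conditions_per_column = [2] * 8      # a..h, each with IN (1, 2)
assert prod(conditions_per_column) == 256      # 2^8 key sets

conditions_per_column = [4] * 8      # four constants per column
assert prod(conditions_per_column) == 65536    # 4^8 key sets
```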

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#3 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#2)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?

It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.

That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particular that this
usually works across leaf pages.

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.
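The "skipping works both ways" point can be sketched as a binary search over the (sorted, deduplicated) array constants (a hypothetical simplification; the real code advances full sets of array keys, not single values):

```python
# Rather than stepping through every key set one by one, the state
# machine can advance straight to the first array key >= the next
# index tuple, and the scan can likewise skip to the first tuple >=
# the current key set.
from bisect import bisect_left

def advance_keys_to_tuple(sorted_keys, index_tuple):
    """Return the first array key >= index_tuple, or None if exhausted."""
    i = bisect_left(sorted_keys, index_tuple)
    return sorted_keys[i] if i < len(sorted_keys) else None

# 1000 array constants, but the next tuple on the page is 700: one
# binary search skips past ~350 non-matching key sets at once.
keys = list(range(0, 2000, 2))        # 0, 2, 4, ..., 1998
assert advance_keys_to_tuple(keys, 700) == 700
assert advance_keys_to_tuple(keys, 701) == 702
assert advance_keys_to_tuple(keys, 9999) is None
```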

--
Peter Geoghegan

#4 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?

It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.

Hmm, then I had a mistaken understanding of what we do in _bt_readpage
with _bt_saveitem.

That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particular that this
usually works across leaf pages.

In a NestedLoop(inner=seqscan, outer=indexscan), the index gets
repeatedly scanned from the root, right? It seems that right now, we
copy matching index entries into a local cache (that is deleted on
amrescan), then we drop our locks and pins on the buffer, and then
start returning values from our local cache (in _bt_saveitem).
We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.
The worst case overhead of this would be 2 _bt_compares (to check if
the value is supposed to be fully located on the cached leaf page)
plus one memcpy( , , BLCKSZ) in the previous loop. With some smart
heuristics (e.g. page fill factor, number of distinct values, and
whether we previously hit this same leaf page in the previous scan of
this Node) we can probably also reduce this overhead to a minimum if
the joined keys are not correllated, but accellerate the query
significantly when we find out they are correllated.

Of course, in the cases where we'd expect very few distinct join keys
the planner would likely put a Memoize node above the index scan, but
for mostly unique join keys I think this could save significant
amounts of time, if only on buffer pinning and locking.

I guess I'll try to code something up when I have the time, as it
sounds not quite exactly related to your patch but an interesting
improvement nonetheless.

Kind regards,

Matthias van de Meent

#5 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#4)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.

In general, SAOP execution/MDAM performs "duplicate elimination before
it reads the data" by sorting and deduplicating the arrays up front.
While my patch sometimes elides a primitive index scan, primitive
index scans are already disjuncts that are combined to create what can
be considered one big index scan (that's how the planner and executor
think of them). The patch takes that one step further by recognizing
that it could quite literally be one big index scan in some cases (or
fewer, larger scans, at least). It's a natural incremental
improvement, as opposed to inventing a new kind of index scan. If
anything the patch makes SAOP execution more similar to traditional
index scans, especially when costing them.

Like InnoDB style loose index scan (for DISTINCT and GROUP BY
optimization), block nested loop join would require inventing a new
type of index scan. Both of these other two optimizations involve the
use of semantic information that spans multiple levels of abstraction.
Loose scan requires duplicate elimination (that's the whole point),
while IIUC block nested loop join needs to "simulate multiple inner
index scans" by deliberately returning duplicates for each would-be
inner index scan. These are specialized things.

To be clear, I think that all of these ideas are reasonable. I just
find it useful to classify these sorts of techniques according to
whether or not the index AM API would have to change or not, and the
general nature of any required changes. MDAM can do a lot of cool
things without requiring any revisions to the index AM API, which
should allow it to play nice with everything else (index path clause
safety issues notwithstanding).

--
Peter Geoghegan

#6 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. These cutoffs would usually be ones for
which the information is not stored in the index, such as "SELECT
user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the
scan would go through the user_id index and skip to the next user_id
value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

In the one page on "Loose indexscan", it refers to MySQL's "loose
index scan" documentation, which does handle groupings, and this was
targeted with the previous, mislabeled, "Index skipscan" patchset.
However, crucially, it also refers to other databases' Index Skip Scan
documentation, which document and implement this approach of 'skipping
to the next potential key range to get efficient non-prefix qual
results', giving me a false impression that those two features are one
and the same when they are not.

It seems like I'll have to wait a bit longer for the functionality of
Loose Index Scans.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all.

Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization up front, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods).
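To make the "no up-front materialization" point concrete, here is a minimal, hypothetical sketch (invented names, not code from the patch) of odometer-style enumeration: the full cartesian product of array keys is visited lazily, holding only one cursor per column of state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch (not the patch's actual code): advance a set of
 * per-column array-key cursors "odometer style", so that the cartesian
 * product of array keys is enumerated lazily with O(ncols) state rather
 * than materialized up front.  Returns false once the product is
 * exhausted. */
bool
advance_array_keys(int *cur, const int *nkeys, int ncols)
{
    for (int col = ncols - 1; col >= 0; col--)
    {
        if (++cur[col] < nkeys[col])
            return true;        /* advanced this column; done */
        cur[col] = 0;           /* wrapped; carry into next column */
    }
    return false;               /* every combination already emitted */
}

/* Count the combinations reachable from the all-zeroes start state. */
int
count_combinations(const int *nkeys, int ncols, int *cur)
{
    int count = 1;

    for (int col = 0; col < ncols; col++)
        cur[col] = 0;
    while (advance_array_keys(cur, nkeys, ncols))
        count++;
    return count;
}
```

For the eight-column "a IN (1, 2) AND ... AND h IN (1, 2)" example above, this visits all 2^8 = 256 combinations while never holding more than eight integers of cursor state.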

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#7Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.

My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.

See also the attached prototype and loosely coded patch. It passes
tests, but it might not be without bugs.

The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that. Those two values - num_rescans and
num_samepage - are used as heuristics for the following:

If 50% or more of rescans hit the same page as the end location of the
previous scan, we start saving the scan's end location's buffer into
the BTScanOpaque, so that the next _bt_first can check whether that
page might be the right leaf page, and if so, immediately go to that
buffer instead of descending the tree - saving one tree descent in the
process.

Further optimizations of this mechanism could easily be implemented by
e.g. only copying the min/max index tuples instead of the full index
page, reducing the overhead at scan end.
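The heuristic described above might be sketched like this (hypothetical names and thresholds as described, not the prototype's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the rescan heuristic: track how often a
 * rescan's _bt_search lands on the page where the previous scan ended,
 * and start caching the scan-end page once 50% or more of rescans have
 * done so. */
typedef struct RescanStats
{
    int num_rescans;    /* rescans observed so far */
    int num_samepage;   /* rescans that landed on the previous end page */
} RescanStats;

void
record_rescan(RescanStats *stats, bool hit_same_page)
{
    stats->num_rescans++;
    if (hit_same_page)
        stats->num_samepage++;
}

/* Should the scan save its end location's page for the next rescan? */
bool
should_cache_end_page(const RescanStats *stats)
{
    return stats->num_rescans > 0 &&
        stats->num_samepage * 2 >= stats->num_rescans;
}
```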

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignore (application/octet-stream, +141 −7)
#8Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#6)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

A lot of people (myself included) were confused on this point for
quite a while. To make matters even more confusing, one of the really
compelling cases for the MDAM design is scans that feed into
GroupAggregates -- preserving index sort order for naturally big index
scans will tend to enable it. One of my examples from the start of
this thread showed just that. (It just so happened that that example
was faster because of all the "skipping" that nbtree *wasn't* doing
with the patch.)

Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization upfront, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods)

I get why you asked. I might have asked the same question.

As I said, the MDAM paper has *surprisingly* little to say about
B-Tree executor stuff -- it's almost all just describing the
preprocessing/transformation process. It seems as if optimizations
like the one from my patch were considered too obvious to talk about
and/or out of scope by the authors. Thinking about the MDAM paper like
that was what made everything fall into place for me. Remember,
"missing key predicates" isn't all that special.

--
Peter Geoghegan

#9Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#8)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

A lot of people (myself included) were confused on this point for
quite a while.

I've taken the liberty of updating the "Loose indexscan" wiki page
[0], adding detail that Loose indexscans are distinct from Skip scans,
and showing some high-level distinguishing properties.
I also split the TODO entry for `` "loose" or "skip" scan `` into two,
and added links to the relevant recent threads so that it's clear
these are different (and that some previous efforts may have had a
confusing name).

I hope this will reduce the chance of future confusion between the two
different approaches to improving index scan performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://wiki.postgresql.org/wiki/Loose_indexscan

#10Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Peter Geoghegan (#1)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

Hi, all!

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).

The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]. Much work remains here.

Honestly, I'm just reading and delving into this thread and other topics
related to it, so excuse me if I ask you a few obvious questions.

I noticed that you are going to add the CNF->DNF transformation at the
index construction stage. If I understand correctly, you will rewrite
the restrictinfo node, changing boolean "AND" expressions to "OR"
expressions, but would it be possible to apply such a procedure
earlier? Otherwise, I suppose you could face the problem of incorrect
selectivity estimation and, consequently, incorrect cardinality
estimation.
I can't clearly understand at what stage it becomes clear that such a
transformation needs to be applied.

--
Regards,
Alena Rybakina
Postgres Professional

#11Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#7)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.

Yeah, I see what you mean. Seems related, even though what you've
shown in your prototype patch doesn't seem like it fits into my
taxonomy very neatly.

(BTW, I was a little confused by the use of the term "endpoint" at
first, since there is a function that uses that term to refer to a
descent of the tree that happens without any insertion scan key. This
path is used whenever the best we can do in _bt_first is to descend to
the rightmost or leftmost page.)

The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that.

I can see one advantage that block nested loop join would retain here:
it does block-based accesses on both sides of the join. Since it
"looks ahead" on both sides of the join, more repeat accesses are
likely to be avoided.

Not too sure how much that matters in practice, though.

--
Peter Geoghegan

#12Peter Geoghegan
pg@bowt.ie
In reply to: Alena Rybakina (#10)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina
<lena.ribackina@yandex.ru> wrote:

I noticed that you are going to add CNF->DNF transformation at the index
construction stage. If I understand correctly, you will rewrite
restrictinfo node,
change boolean "AND" expressions to "OR" expressions, but would it be
possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The
code you're talking about is really just checking that every index
path has clauses that we know that nbtree can handle. That's a big,
ugly modularity violation -- many of these details are quite specific
to the nbtree index AM (in theory we could have other index AMs that
are amsearcharray).

At most, v1 of the patch makes greater use of an existing
transformation that takes place in the nbtree index AM, as it
preprocesses scan keys for these types of queries (it's not inventing
new transformations at all). This is a slightly creative
interpretation, too. Tom's commit 9e8da0f7 didn't actually say
anything about CNF/DNF.

Otherwise I suppose you
could face the problem of
incorrect selectivity of the calculation and, consequently, the
cardinality calculation?

I can't think of any reason why that should happen as a direct result
of what I have done here. Multi-column index paths + multiple SAOP
clauses are not a new thing. The number of rows returned does not
depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The
costing definitely has several problems right now.

It also isn't necessarily okay that it's "just as good as before" if
it turns out that it needs to be better now. But I don't see why it
would be. (Actually, my hope is that selectivity estimation might be
*less* important as a practical matter with the patch.)

I can't clearly understand at what stage it is clear that the such a
transformation needs to be applied?

I don't know either.

I think that most of this work needs to take place in the nbtree code,
during preprocessing. But it's not so simple. There is a mutual
dependency between the code that generates index paths in the planner
and nbtree scan key preprocessing. The planner needs to know what
kinds of index paths are possible/safe up-front, so that it can choose
the fastest plan (the fastest that the index AM knows how to execute
correctly). But, there are lots of small annoying nbtree
implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the
planner can initialize some preprocessing early. I think that we
require a "two way conversation" between the planner and the index AM.

--
Peter Geoghegan

#13Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values). This approach was necessary for fairly
mundane reasons (it limits the amount of work required while holding a
buffer lock), but it also solves quite a few other problems that I
find far more interesting.
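As a rough illustration of the "next index tuple" approach (a hypothetical sketch, not the patch's actual code), advancing the array keys amounts to a lower-bound binary search of the sorted array keys using the value just read from an index tuple:

```c
#include <assert.h>

/* Hypothetical sketch of "next index tuple" array key advancement:
 * given the value just read from an index tuple, binary search the
 * (sorted, ascending) array keys for the first key >= that value,
 * implicitly skipping any keys that the scan has already passed in
 * the index.  Returns the new array position, or nkeys when the
 * arrays are exhausted. */
int
advance_array_key(const int *keys, int nkeys, int tupval)
{
    int lo = 0,
        hi = nkeys;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] < tupval)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;  /* first key >= tupval; caller continues or repositions */
}
```

Because the search is driven by the tuple value, many array keys can be skipped in one step; that is the sense in which "skipping works both ways".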

It's easy to imagine the state machine from v2 of the patch being
extended for skip scan. My approach "abstracts away" the arrays. For
skip scan, it would more or less behave as if the user had written a
query "WHERE a in (<Every possible value for this column>) AND b = 5
... " -- without actually knowing what the so-called array keys for
the high-order skipped column are (not up front, at least). We'd only
need to track the current "array key" for the scan key on the skipped
column, "a". The state machine would notice when the scan had reached
the next-greatest "a" value in the index (whatever that might be), and
then make that the current value. Finally, the state machine would
effectively instruct its caller to consider repositioning the scan via
a new descent of the index. In other words, almost everything for skip
scan would work just like regular SAOPs -- and any differences would
be well encapsulated.
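A toy model of that skip scan behaviour (hypothetical code with invented names; the real patch would operate on B-Tree pages, not flat arrays):

```c
#include <assert.h>

/* Hypothetical sketch: scan a sorted (a, b) "index" as if column "a"
 * had the array keys "every possible value of a".  When the scan
 * reaches the next-greatest "a", that becomes the current implicit
 * array key; once no further "b = bval" matches are possible within
 * the current "a" group, we skip ahead to the next group (where the
 * real patch would consider a fresh descent of the tree). */
typedef struct
{
    int a;
    int b;
} IndexTuple;

int
skip_scan_count(const IndexTuple *tuples, int ntuples, int bval)
{
    int matches = 0;
    int i = 0;

    while (i < ntuples)
    {
        int cur_a = tuples[i].a;    /* current implicit "array key" */

        /* scan within this "a" group; tuples are sorted on (a, b) */
        while (i < ntuples && tuples[i].a == cur_a && tuples[i].b <= bval)
        {
            if (tuples[i].b == bval)
                matches++;
            i++;
        }
        /* skip the remainder of the group: no more matches possible */
        while (i < ntuples && tuples[i].a == cur_a)
            i++;
    }
    return matches;
}
```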

But it's not just skip scan. This approach also enables thinking of
SAOP index scans (using nbtree) as just another type of indexable
clause, without any special restrictions (compared to true indexable
operators such as "=", say). Particularly in the planner. That was
always the general thrust of teaching nbtree about SAOPs, from the
start. But it's something that should be totally embraced IMV. That's
just what the patch proposes to do.

In particular, the patch now:

1. Entirely removes the long-standing restriction on generating path
keys for index paths with SAOPs, even when there are inequalities on a
high order column present. You can mix SAOPs together with other
clause types, arbitrarily, and everything still works and works
efficiently.

For example, the regression test expected output for this query/test
(from bugfix commit 807a40c5) is updated by the patch, as shown here:

 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                                      QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY
('{1001,3000}'::integer[])))
-(4 rows)
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column
here (thousand) uses an inequality, a particularly tricky case. Now
it's an index scan, much like any other, with no particular
restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys,
which is built on the same state machine, with only minor differences
to deal with column values "appearing out of key space order".

3. Simplifies the optimizer side of things by consistently avoiding
filter quals (except when it's truly unavoidable). The optimizer
doesn't even consider alternative index paths with filter quals for
lower-order SAOP columns, because they have no possible advantage
anymore. On the other hand, as we saw already, upthread, filter quals
have huge disadvantages. By always using true index quals, we
automatically avoid any question of getting excessive amounts of heap
page accesses just to eliminate non-matching rows. AFAICT we don't
need to make a trade-off here.

The first version of the patch added some crufty code to the
optimizer, to account for various restrictions on sort order. This
revised version actually removes existing cruft from the same place
(indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown
for item 1. Bugfix commit 807a40c5 (which added the test query in
question) dealt with an oversight in the then-recent original nbtree
SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index
scans with an inequality on their leading column, we cannot be sure
that the output will appear in the same order as the order that one
big continuous index scan returns rows in. We can only expect to
maintain the illusion that we're doing one continuous index scan when
individual primitive index scans access earlier columns via the
equality strategy -- we need "equality constraints".

In practice, the optimizer (indxpath.c) is very conservative (more
conservative than it really needs to be) when it comes to trusting the
index scan to output rows in index order, in the presence of SAOPs.
All of that now seems totally unnecessary. Again, I don't see a need
to make a trade-off here.

My observation about this query (and others like it) is: why not
literally perform one continuous index scan instead (not multiple
primitive index scans)? That is strictly better, given all the
specifics here. Once we have a way to do that (which the nbtree
executor work listed under item 2 provides), it becomes safe to assume
that the tuples will be output in index order -- there is no illusion
left to preserve. Who needs an illusion that isn't actually helping
us? We actually do less I/O by using this strategy, for the usual
reasons (we can avoid repeating index page accesses).

A more concrete benefit of the non-required-scankeys stuff can be seen
by running Benoit Tigeot's test case [1]https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491 -- Peter Geoghegan with v2. He had a query like
this:

SELECT * FROM docs
WHERE status IN ('draft', 'sent') AND
sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;

And, his test case had an index on "sent_at DESC NULLS LAST,
sender_reference, status". This variant was a weak spot for v1.

v2 of the patch is vastly more efficient here, since we don't have to
go to the heap to eliminate non-matching tuples -- that can happen in
the index AM instead. This can easily be 2x-3x faster on a warm cache,
and have *hundreds* of times fewer buffer accesses (which Benoit
verified with an early version of this v2). All because we now require
vastly less heap access -- the quals are fairly selective here, and we
have to scan hundreds of leaf pages before the scan can terminate.
Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one
of my 3 items. The immediate problem that the query presents us with
on the master branch is the problem of filter quals that require heap
accesses to do visibility checks (a problem that index quals can never
have). That makes it tempting to credit my item 3. But you can't
really have item 3 without also having items 1 and 2. Taken together,
they eliminate all possible downsides from using index quals.

That high level direction (try to have one good choice for the
optimizer) seems important to me. Both for this project, and in
general.

Other changes in v2:

* Improved costing, that takes advantage of the fact that nbtree now
promises to not repeat any leaf page accesses (unless the scan is
restarted or the direction of the scan changes). This makes the worst
case far more predictable, and more related to selectivity estimation
-- you can't scan more pages than you have in the whole index. Just
like with every other sort of index scan.

* Support for parallel index scans.

The existing approach to array keys for parallel index scans has been
adapted to work with individual primitive index scans, not individual
array keys. I haven't tested this very thoroughly just yet, but it
seems to work well enough already. I think that it's important to not
have very much variation between parallel and serial index scans,
which I seem to have mostly avoided.

[1]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491
--
Peter Geoghegan

Attachments:

v2-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1462 −287)
#14Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#13)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).

Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.

No notable changes here compared to v2.

--
Peter Geoghegan

Attachments:

v3-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/x-patch, +1506 −310)
#15Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#14)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).

Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.

Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".

Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being. The
commit in question is rather poorly documented, and it's not
immediately clear how to integrate it with my work. I just want to
make sure that there's a testable patch available.

--
Peter Geoghegan

Attachments:

v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1516 −352)
#16Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".

Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being.

Attached is v5, which deals with the conflict with the optimization
added by Alexander Korotkov's commit e0b1ee17 sensibly: the
optimization is now only disabled in cases with array scan keys.
(It'd be very hard to make it work with array scan keys, since an
important principle for my patch is that we can change search-type
scan keys right in the middle of any _bt_readpage() call).

v5 also fixes a longstanding open item for the patch: we no longer
call _bt_preprocess_keys() with a buffer lock held, which was a bad
idea at best, and unsafe (due to the syscache lookups within
_bt_preprocess_keys) at worst. A new, minimal version of the function
(called _bt_preprocess_keys_leafbuf) is called at the same point
instead. That change, combined with the array binary search stuff
(which was added back in v2), makes the total amount of work performed
with a buffer lock held totally reasonable in all cases. It's even
okay in extreme or adversarial cases with many millions of array keys.

Making this _bt_preprocess_keys_leafbuf approach work has a downside:
it requires that _bt_preprocess_keys be a little less aggressive about
removing redundant scan keys, in order to meet certain assumptions
held by the new _bt_preprocess_keys_leafbuf function. Essentially,
_bt_preprocess_keys must now worry about current and future array key
values when determining redundancy among scan keys -- not just the
current array key values. _bt_preprocess_keys knows nothing about
SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict
1:1 correspondence between the number of primitive index scans and the
number of array keys (actually, the number of distinct combinations of
array keys). Obviously that's no longer the case with the patch
(that's the whole point of the patch).

It's easiest to understand how elimination of redundant quals needs to
work in v5 by way of an example. Consider the following query:

select count(*), two, four, twenty, hundred
from
tenk1
where
two in (0, 1) and four in (1, 2, 3)
and two < 1;

Notice that "two" appears in the where clause twice. First it appears
as an SAOP, and then as an inequality. Right now, on HEAD, the
primitive index scan where the SAOP's scankey is "two = 0" renders
"two < 1" redundant. However, the subsequent primitive index scan
where "two = 1" does *not* render "two < 1" redundant. This has
implications for the mechanism in the patch, since the patch will
perform one big primitive index scan for all array constants, with
only a single _bt_preprocess_keys call at the start of its one and
only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf
calls once we reach the leaf level).

The compromise that I've settled on in v5 is to teach
_bt_preprocess_keys to *never* treat "two < 1" as redundant with such
a query -- even though there is some squishy sense in which "two < 1"
is indeed still redundant (for the first SAOP key of value 0). My
approach is reasonably well targeted in that it mostly doesn't affect
queries that don't need it. But it will add cycles to some badly
written queries that wouldn't have had them in earlier Postgres
versions. I'm not entirely sure how much this matters, but my current
sense is that it doesn't matter all that much. This is the kind of
thing that is hard to test and poorly tested, so simplicity is even
more of a virtue than usual.
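The rule can be illustrated with a small, hypothetical check (invented names, not the patch's actual code): an inequality scan key can only be discarded as redundant when it is implied by *every* array key value, not just the current one:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical illustration of the redundancy rule described above:
 * a "< bound" inequality on the same column as a SAOP is only safely
 * redundant if every array key satisfies it.  With "two IN (0, 1) AND
 * two < 1", the key 0 satisfies "two < 1" but the key 1 does not, so
 * the inequality must be kept. */
bool
inequality_redundant_for_all_keys(const int *arraykeys, int nkeys,
                                  int ltbound)
{
    for (int i = 0; i < nkeys; i++)
    {
        if (!(arraykeys[i] < ltbound))
            return false;   /* some array key still needs the check */
    }
    return true;            /* "< bound" implied by every array key */
}
```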

Note that the changes to _bt_preprocess_keys in v5 *don't* affect how
we determine if the scan has contradictory quals, which is generally
more important. With contradictory quals, _bt_first can avoid reading
any data from the index. OTOH eliminating redundant quals (i.e. the
thing that v5 *does* change) merely makes evaluating index quals less
expensive by preprocessing away unneeded scan keys. In other words,
while it's possible that the approach taken by v5 will add CPU cycles
in a small number of cases, it should never result in more page
accesses.

--
Peter Geoghegan

Attachments:

v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1484 -323)
#17 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#16)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan <pg@bowt.ie> wrote:

> On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
>> Attached is v4, which applies cleanly on top of HEAD. This was needed
>> due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
>> keys required for directional scan in B-tree".
>>
>> Unfortunately I have more or less dealt with the conflicts on HEAD by
>> disabling the optimization from that commit, for the time being.
>
> Attached is v5, which deals with the conflict with the optimization
> added by Alexander Korotkov's commit e0b1ee17 sensibly: the
> optimization is now only disabled in cases with array scan keys.
> (It'd be very hard to make it work with array scan keys, since an
> important principle for my patch is that we can change search-type
> scan keys right in the middle of any _bt_readpage() call).

I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?
If not, I'll probably be able to discover my own understanding of the
mechanism used in the patch, but if there is a framework to build that
understanding on (rather than having to build it from scratch) that'd
be greatly appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#18 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#17)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> I'm planning on reviewing this patch tomorrow, but in an initial scan
> through the patch I noticed there's little information about how the
> array keys state machine works in this new design. Do you have a more
> toplevel description of the full state machine used in the new design?

This is an excellent question. You're entirely right: there isn't
enough information about the design of the state machine.

In v1 of the patch, from all the way back in July, the "state machine"
advanced in the hackiest way possible: via repeated "incremental"
advancement (using logic from the function that we call
_bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that
until the function I'm now calling _bt_tuple_before_array_skeys()
eventually reported that the array keys were now sufficiently
advanced. v2 greatly improved matters by totally overhauling
_bt_advance_array_keys(): it was taught to use binary searches to
advance the array keys, with limited remaining use of "incremental"
array key advancement.

However, version 2 (and all later versions to date) have somewhat
wonky state machine transitions, in one important respect: calls to
the new _bt_advance_array_keys() won't always advance the array keys
to the maximum extent possible (possible while still getting correct
behavior, that is). There were still various complicated scenarios
involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD
scan keys that use BTEqualStrategyNumber), where one single call to
_bt_advance_array_keys() would advance the array keys to a point that
was still < caller's tuple. AFAICT this didn't cause wrong answers to
queries (that would require failing to find a set of exactly matching
array keys where a matching set exists), but it was kludgey. It was
sloppy in roughly the same way as the approach in my v1 prototype was
sloppy (just to a lesser degree).

I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review. The upcoming version will have simple
preconditions and postconditions for the function that advances the
array key state machine (the new _bt_advance_array_keys). These are
enforced by assertions at the start and end of the function. So the
rules for the state machine become crystal clear and fairly easy to
keep in your head (e.g., tuple must be >= required array keys on entry
and <= required array keys on exit, the array keys must always either
advance by one increment or be completely exhausted for the top-level
scan in the current scan direction).
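
As a rough illustration of those invariants (hypothetical Python, not the actual _bt_advance_array_keys C code), a binary-search-based advancement step with the precondition and postcondition enforced as assertions might look like:

```python
import bisect

def advance_required_array(sorted_keys, cur_idx, tuple_value):
    """Advance an ascending required array key past tuple_value with one
    binary search, instead of repeated one-element "incremental" steps.

    Returns the new array index, or None once the array keys are
    exhausted for the scan direction."""
    # Precondition: the caller's tuple must be >= the current array key.
    assert tuple_value >= sorted_keys[cur_idx]
    new_idx = bisect.bisect_left(sorted_keys, tuple_value, lo=cur_idx)
    if new_idx == len(sorted_keys):
        return None  # keys exhausted for the top-level scan
    # Postcondition: the tuple must be <= the newly selected array key.
    assert tuple_value <= sorted_keys[new_idx]
    return new_idx
```

For example, with array keys [10, 20, 500, 600] and a tuple value of 510, a single call advances from index 0 straight to index 3 (key 600), rather than stepping through 20 and 500 one at a time.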

Unsurprisingly, I found that adding and enforcing these invariants led
to a simpler and more general design within _bt_advance_array_keys.
That code is still the most complicated part of the patch, but it's
much less of a bag of tricks. Another reason for you to hold off for a
few more days.

--
Peter Geoghegan

#19 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#18)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:

> On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>
>> I'm planning on reviewing this patch tomorrow, but in an initial scan
>> through the patch I noticed there's little information about how the
>> array keys state machine works in this new design. Do you have a more
>> toplevel description of the full state machine used in the new design?
>
> This is an excellent question. You're entirely right: there isn't
> enough information about the design of the state machine.
>
> I should be able to post v6 later this week. My current plan is to
> commit the other nbtree patch first (the backwards scan "boundary
> cases" one from the ongoing CF) -- since I saw your review earlier
> today. I think that you should probably wait for this v6 before
> starting your review.

Okay, thanks for the update, then I'll wait for v6 to be posted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#20 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#19)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> I should be able to post v6 later this week. My current plan is to
>> commit the other nbtree patch first (the backwards scan "boundary
>> cases" one from the ongoing CF) -- since I saw your review earlier
>> today. I think that you should probably wait for this v6 before
>> starting your review.
>
> Okay, thanks for the update, then I'll wait for v6 to be posted.

On second thought, I'll just post v6 now (there won't be conflicts
against the master branch once the other patch is committed anyway).

Highlights:

* Major simplifications to the array key state machine, already
described by my recent email.

* Added preprocessing of "redundant and contradictory" array elements
to _bt_preprocess_array_keys().

This special preprocessing pass just for array keys ("preprocessing
preprocessing") within _bt_preprocess_array_keys() turns this query
into a no-op:

select * from tab where a in (180, 345) and a in (230, 300); -- contradictory

Similarly, it can make this query attempt only a single primitive
index scan, for "230":

select * from tab where a in (180, 230) and a in (230, 300); -- has
redundancies, plus some individual elements contradict each other

This duplicates some of what _bt_preprocess_keys can do already. But
_bt_preprocess_keys can only do this stuff at the level of individual
array elements/primitive index scans. Whereas this works "one level
up", allowing preprocessing to see the full picture rather than just
seeing the start of one particular primitive index scan. It explicitly
works across array keys, saving repeat work inside
_bt_preprocess_keys. That could really add up with thousands of array
keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys
already does something like this, to deal with SAOP inequalities such
as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising
that this obvious optimization wasn't part of the original nbtree SAOP
patch.)
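
The cross-array part of that preprocessing amounts to taking the set intersection of the equality arrays on the same index column, once, before the scan begins. A minimal sketch (hypothetical Python; the real _bt_preprocess_array_keys works on sorted scan-key arrays in C):

```python
def merge_equality_arrays(*arrays):
    """Merge several equality SAOP arrays on the same index column into
    one sorted, de-duplicated array: their set intersection. An empty
    result means the quals are contradictory, so the scan is a no-op."""
    merged = set(arrays[0])
    for arr in arrays[1:]:
        merged &= set(arr)  # only elements present in every array survive
    return sorted(merged)
```

With the queries above, merge_equality_arrays([180, 345], [230, 300]) yields an empty array (contradictory quals, no index access needed), while merge_equality_arrays([180, 230], [230, 300]) leaves just [230], a single primitive index scan.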

This reminds me: you might want to try breaking the patch by coming up
with adversarial cases, Matthias. The patch needs to be able to deal
with absurdly large amounts of array keys reasonably well, because it
proposes to normalize passing those to the nbtree code. It's
especially important that the patch never takes too much time to do
something (e.g., binary searching through array keys) while holding a
buffer lock -- even with very silly adversarial queries.

So, for example, queries like this one (specifically designed to
stress the implementation) *need* to work reasonably well:

with a as (
  select i from generate_series(0, 500000) i
)
select
  count(*), thousand, tenthous
from
  tenk1
where
  thousand = any (array[(select array_agg(i) from a)]) and
  tenthous = any (array[(select array_agg(i) from a)])
group by
  thousand, tenthous
order by
  thousand, tenthous;

(You can run this yourself after the regression tests finish, of course.)

This takes about 130ms on my machine, hardly any of which takes place
in the nbtree code with the patch (think tens of microseconds per
_bt_readpage call, at most) -- the plan is an index-only scan that
gets only 30 buffer hits. On the master branch, it's vastly slower --
1000025 buffer hits. The query as a whole takes about 3 seconds there.

If you have 3 or 4 SAOPs (with a composite index that has as many
columns) you can quite easily DOS the master branch, since the planner
makes a generic assumption that each of these SAOPs will have only 10
elements. The planner makes the same assumption with the patch
applied, but with one important difference: it no longer matters to
nbtree. The cost of
scanning each index page should be practically independent of the
total size of each array, at least past a certain point. Similarly,
the maximum cost of an index scan should be approximately fixed: it
should be capped at the cost of a full index scan (with the added cost
of these relatively expensive quals still capped, still essentially
independent of array sizes past some point).

I notice that if I remove the "thousand = any (array[(select
array_agg(i) from a)]) and" line from the adversarial query, executing
the resulting query still gets 30 buffer hits with the patch -- though
it only takes 90ms this time (it's faster for reasons that likely have
less to do with nbtree overheads than you'd think). This is just
another way of getting roughly the same full index scan. That's a
completely new way of thinking about nbtree SAOPs from a planner
perspective (also from a user's perspective, I suppose).

It's important that the planner's new optimistic assumptions about the
cost profile of SAOPs (that it can expect reasonable
performance/access patterns with wildly unreasonable/huge/arbitrarily
complicated SAOPs) always be met by nbtree -- no repeat index page
accesses, no holding a buffer lock for more than (say) a small
fraction of 1 millisecond (no matter the complexity of the query), and
possibly other things I haven't thought of yet.

If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.

--
Peter Geoghegan

Attachments:

v6-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1700 -273)