Adding skip scan (including MDAM style range skip scan) to nbtree

Started by Peter Geogheganalmost 2 years ago153 messageshackers

pg@bowt.ie

almost 2 years ago

Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).

In effect, a skip scan treats this composite index on '(a, b)' as if
it was a series of subindexes -- one subindex per distinct value in
'a'. We can exhaustively "search every subindex" using an index qual
that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND
b = 5" would behave.

This approach might be much less efficient than an index scan that can
use an index on 'b' alone, but skip scanning can still be orders of
magnitude faster than a sequential scan. The user may very well not
have a dedicated index on 'b' alone, for whatever reason.

Note that the patch doesn't just target these simpler "skip leading
index column omitted from the predicate" cases. It's far more general
than that -- skipping attributes (or what the patch refers to as skip
arrays) can be freely mixed with SAOPs/conventional arrays, in any
order you can think of. They can also be combined with inequalities to
form range skip arrays.

This patch is a direct follow-up to the Postgres 17 work that became
commit 5bf748b8. Making everything work well together is an important
design goal here. I'll talk more about that further down, and will
show a benchmark example query that'll give a good general sense of
the value of the patch with these more complicated cases.

A note on terminology
=====================

The terminology in this area has certain baggage. Many of us will
recall the patch that implemented loose index scan. That patch also
dubbed itself "skip scan", but that doesn't seem right to me (it's at
odds with how other RDBMSs describe features in this area). I would
like to address the issues with the terminology in this area now, to
avoid creating further confusion.

When I use the term "skip scan", I'm referring to a feature that's
comparable to the skip scan features from Oracle and MySQL 8.0+. This
*isn't* at all comparable to the feature that MySQL calls "loose index
scan" -- don't confuse the two features.

Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).

My patch is an additive thing, that speeds up what we'd currently
refer to as full index scans (as well as range index scans that
currently do a "full scan" of a range/subset of an index). These index
paths/index scans can no longer really be called "full index scans",
of course, but they're still logically the same index paths as before.

MDAM and skip scan
==================

As I touched on already, the patch actually implements a couple of
related optimizations. "Skip scan" might be considered one out of the
several optimizations from the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1]https://vldb.org/conf/1995/P710.PDF -- Peter Geoghegan -- the paper describes skip scan under
its "Missing Key Predicate" subsection. I collectively refer to the
optimizations from the paper as the "MDAM techniques".

Alternatively, you could define these MDAM techniques as each
implementing some particular flavor of skip scan, since they all do
rather similar things under the hood. In fact, that's how I've chosen
to describe things in my patch: it talks about skip scan, and about
range skip scan, which is considered a minor variant of skip scan.
(Note that the term "skip scan" is never used in the MDAM paper.)

MDAM is short for "multidimensional access method". In the context of
the paper, "dimension" refers to dimensions in a decision support
system. These dimensions are represented by low cardinality columns,
each of which appear in a large composite B-Tree index. The emphasis
in the paper (and for my patch) is DSS and data warehousing; OLTP apps
typically won't benefit as much.

Note: Loose index scan *isn't* described by the paper at all. I also
wouldn't classify loose index scan as one of the MDAM techniques. I
think of it as being in a totally different category, due to the way
that it applies semantic information. No MDAM technique will ever
apply high-level semantic information about what is truly required by
the plan tree, one level up. And so my patch simply teaches nbtree to
find the most efficient way of navigating through an index, based
solely on information that is readily available to the scan. The same
principles apply to all of the other MDAM techniques; they're
basically all just another flavor of skip scan (that do some kind of
clever preprocessing/transformation that enables reducing the scan to
a series of disjunctive accesses, and that could be implemented using
the new abstraction I'm calling skip arrays).

The paper more or less just applies one core idea, again and again.
It's surprising how far that one idea can take you. But it is still
just one core idea (don't overlook that point).

Range skip scan
---------------

To me, the most interesting MDAM technique is probably one that I
refer to as "range skip scan" in the patch. This is the technique that
the paper introduces first, in its "Intervening Range Predicates"
subsection. The best way of explaining it is through an example (you
could also just read the paper, which has an example of its own).

Imagine a table with just one index: a composite index on "(pdate,
customer_id)". Further suppose we have a query such as:

SELECT * FROM payments WHERE pdate BETWEEN '2024-01-01' AND
'2024-01-30' AND customer_id = 5; -- both index columns (pdate and
customer_id) appear in predicate

The patch effectively makes the nbtree code execute the index scan as
if the query had been written like this instead:

SELECT * FROM payments WHERE pdate = ANY ('2024-01-01', '2024-01-02',
..., '2024-01-30') AND customer_id = 5;

The use of a "range skip array" within nbtree allows the scan to skip
when that makes sense, locating the next date with customer_id = 5
each time (we might skip over many irrelevant leaf pages each time).
The scan must also *avoid* skipping when it *doesn't* make sense.

As always (since commit 5bf748b8 went in), whether and to what extent
we skip using array keys depends in large part on the physical
characteristics of the index at runtime. If the tuples that we need to
return are all clustered closely together, across only a handful of
leaf pages, then we shouldn't be skipping at all. When skipping makes
sense, we should skip constantly.

I'll discuss the trade-offs in this area a little more below, under "Design".

Using multiple MDAM techniques within the same index scan (includes benchmark)
------------------------------------------------------------------------------

I recreated the data in the MDAM paper's "sales" table by making
inferences from the paper. It's very roughly the same data set as the
paper (close enough to get the general idea across). The table size is
about 51GB, and the index is about 25GB (most of the attributes from
the table are used as index columns). There is nothing special about
this data set -- I just thought it would be cool to "recreate" the
queries from the paper, as best I could. Thought that this approach
might make my points about the design easier to follow.

The index we'll be using for this can be created via: "create index
mdam_idx on sales_mdam_paper(dept, sdate, item_class, store)". Again,
this is per the paper. It's also the order that the columns appear in
every WHERE clause in every query from the paper.

(That said, the particular column order from the index definition
mostly doesn't matter. Every index column is a low cardinality column,
so unless the order used completely obviates the need to skip a column
that would otherwise need to be skipped, such as "dept", the effect on
query execution time from varying column order is in the noise.
Obviously that's very much not how users are used to thinking about
composite indexes.)

The MDAM paper has numerous example queries, each of which builds on
the last, adding one more complication each time -- each of which is
addressed by another MDAM technique. The query I'll focus on here is
an example query that's towards the end of the paper, and so combines
multiple techniques together -- it's the query that appears in the "IN
Lists" subsection:

select
dept,
sdate,
item_class,
store,
sum(total_sales)
from
sales_mdam_paper
where
-- omitted: leading "dept" column from composite index
sdate between '1995-06-01' and '1995-06-30'
and item_class in (20, 35, 50)
and store in (200, 250)
group by dept, sdate, item_class, store
order by dept, sdate, item_class, store;

On HEAD, when we run this query we either get a sequential scan (which
is very slow) or a full index scan (which is almost as slow). Whereas
with the patch, nbtree will execute the query as a succession of a few
thousand very selective primitive index scans (which usually only scan
one leaf page, though some may scan two neighboring leaf pages).

Results: The full index scan on HEAD takes about 32 seconds. With the
patch, the query takes just under 52ms to execute. That works out to
be about 630x faster with the patch.

See the attached SQL file for full details. It provides all you'll
need to recreate this test result with the patch.

Nobody would put up with such an inefficient full index scan in the
first place, so the behavior on HEAD is not really a sensible baseline
-- 630x isn't very meaningful. I could have come up with a case that
showed an even larger improvement if I'd felt like it, but that
wouldn't have proven anything.

The important point is that the patch makes a huge composite index
like the one I've built for this actually make sense, for the first
time. So we're not so much making something faster as enabling a whole
new approach to indexing -- particularly for data warehousing use
cases. The way that Postgres DBAs choose which indexes they'll need to
create is likely to be significantly changed by this optimization.

I'll break down how this is possible. This query makes use of 3
separate MDAM techniques:

1. A "simple" skip scan (on "dept").

2. A "range" skip scan (on "sdate").

3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing
new here, except that nbtree needs these regular SAOP arrays to roll
over the higher-order skip arrays to trigger moving on to the next
dept/date.)

Internally, we're just doing a series of several thousand distinct
non-overlapping accesses, in index key space order (so as to preserve
the appearance of one continuous index scan). These accesses starts
out like this:

dept=INT_MIN, date='1995-06-01', item_class=20, store=200
(Here _bt_advance_array_keys discovers that the actual lowest dept
is 1, not INT_MIN)
dept=1, date='1995-06-01', item_class=20, store=200
dept=1, date='1995-06-01', item_class=20, store=250
dept=1, date='1995-06-01', item_class=35, store=200
dept=1, date='1995-06-01', item_class=35, store=250
...

(Side note: as I mentioned, each of the two "store" values usually
appear together on the same leaf page in practice. Arguably I should
have shown 2 lines/accesses here (for "dept=1"), rather than showing
4. The 4 "dept=1" lines shown required only 2 primitive index
scans/index descents/leaf page reads. Disjunctive accesses don't
necessarily map 1:1 with primitive/physical index scans.)

About another ten thousand similar accesses occur (omitted for
brevity). Execution of the scan within nbtree finally ends with these
primitive index scans/accesses:
...
dept=100, date='1995-06-30', item_class=50, store=250
dept=101, date='1995-06-01', item_class=20, store=200
STOP

There is no "dept=101" entry in the index (the highest department in
the index happens to be 100). The index scan therefore terminates at
this point, having run out of leaf pages to scan (we've reached the
rightmost point of the rightmost leaf page, as the scan attempts to
locate non-existent dept=101 tuples).

Design
======

Since index scans with skip arrays work just like index scans with
regular arrays (as of Postgres 17), naturally, there are no special
restrictions. Associated optimizer index paths have path keys, and so
could (just for example) appear in a merge join, or feed into a group
aggregate, while avoiding a sort node. Index scans that skip could
also feed into a relocatable cursor.

As I mentioned already, the patch adds a skipping mechanism that is
purely an additive thing. I think that this will turn out to be an
important enabler of using the optimizations, even when there's much
uncertainty about how much they'll actually help at runtime.

Optimizer
---------

We make a broad assumption that skipping is always to our advantage
during nbtree preprocessing -- preprocessing generates as many skip
arrays as could possibly make sense based on static rules (rules that
don't apply any kind of information about data distribution). Of
course, skipping isn't necessarily the best approach in all cases, but
that's okay. We only actually skip when physical index characteristics
show that it makes sense. The real decisions about skipping are all
made dynamically.

That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.

What do you call an index scan where 90% of all index tuples are 1 of
only 3 distinct values, while the remaining 10% of index tuples are
all perfectly unique in respect of a leading column? Clearly the best
strategy when skipping using the leading column to "use skip scan for
90% of the index, and use a conventional range scan for the remaining
10%". Skipping generally makes sense, but we legitimately need to vary
our strategy *during the same index scan*. It makes no sense to think
of skip scan as a discrete sort of index scan.

I have yet to prove that always having the option of skipping (even
when it's very unlikely to help) really does "come for free" -- for
now I'm just asserting that that's possible. I'll need proof. I expect
to hear some principled skepticism on this point. It's probably not
quite there in this v1 of the patch -- there'll be some regressions (I
haven't looked very carefully just yet). However, we seem to already
be quite close to avoiding regressions from excessive/useless
skipping.

Extensible infrastructure/support functions
-------------------------------------------

Currently, the patch only supports skip scan for a subset of all
opclasses -- those that have the required support function #6, or
"skip support" function. This provides the opclass with (among other
things) a way to increment the current skip array value (or to
decrement it, in the case of backward scans). In practice we only have
this for a handful of discrete integer (and integer-ish) types. Note
that the patch currently cannot skip for an index column that happens
to be text. Note that even this v1 supports skip scans that use
unsupported types, provided that the input opclass of the specific
columns we'll need to skip has support.

The patch should be able to support every type/opclass as a target for
skipping, regardless of whether an opclass support function happened
to be available. That could work by teaching the nbtree code to have
explicit probes for the next skip array value in the index, only then
combining that new value with the qual from the input scan keys/query.
I've put that off for now because it seems less important -- it
doesn't really affect anything I've said about the core design, which
is what I'm focussing on for now.

It makes sense to use increment/decrement whenever feasible, even
though it isn't strictly necessary (or won't be, once the patch has
the required explicit probe support). The only reason to not apply
increment/decrement opclass skip support (that I can see) is because
it just isn't practical (this is generally the case for continuous
types). While it's slightly onerous to have to invent all this new
opclass infrastructure, it definitely makes sense.

There is a performance advantage to having skip arrays that can
increment through each distinct possible indexable value (this
increment/decrement stuff comes from the MDAM paper). The MDAM
techniques inherently work best when "skipping" columns of discrete
types like integer and date, which is why the paper has examples that
all look like that. If you look at my example query and its individual
accesses, you'll realize why this is so.

Thoughts?

[1]: https://vldb.org/conf/1995/P710.PDF -- Peter Geoghegan
--
Peter Geoghegan

Aleksander Alekseev

aleksander@timescale.com

almost 2 years ago

In reply to: Peter Geoghegan (#1)

Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Hi Peter,

Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).

[...]

Thoughts?

Many thanks for working on this. I believe it is an important feature
and it would be great to deliver it during the PG18 cycle.

I experimented with the patch and here are the results I got so far.

Firstly, it was compiled on Intel MacOS and ARM Linux. All the tests
pass just fine.

Secondly, I tested the patch manually using a release build on my
Raspberry Pi 5 and the GUCs that can be seen in [1]https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh.

Test 1 - simple one.

```
CREATE TABLE test1(c char, n bigint);
CREATE INDEX test1_idx ON test1 USING btree(c,n);

INSERT INTO test1
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
```

Test 2 - a more complicated one.

```
CREATE TABLE test2(c1 char, c2 char, n bigint);
CREATE INDEX test2_idx ON test2 USING btree(c1,c2,n);

INSERT INTO test2
SELECT chr(ascii('a') + random(0,2)) AS c1,
chr(ascii('a') + random(0,2)) AS c2,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test2 WHERE n > 900_000_000;
```

Test 3 - to see how it works with covering indexes.

```
CREATE TABLE test3(c char, n bigint, s text DEFAULT 'text_value' || n);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);

INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT s FROM test3 WHERE n < 1000;
```

In all the cases the patch worked as expected.

I noticed that with the patch we choose Index Only Scans for Test 1
and without the patch - Parallel Seq Scan. However the Parallel Seq
Scan is 2.4 times faster. Before the patch the query takes 53 ms,
after the patch - 127 ms. I realize this could be just something
specific to my hardware and/or amount of data.

Do you think this is something that was expected or something worth
investigating further?

I haven't looked at the code yet.

[1]: https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh

--
Best regards,
Aleksander Alekseev

Adding skip scan (including MDAM style range skip scan) to nbtree

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: