Adding skip scan (including MDAM style range skip scan) to nbtree

Started by Peter Geoghegan, over 1 year ago (153 messages)

Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).

In effect, a skip scan treats this composite index on '(a, b)' as if
it was a series of subindexes -- one subindex per distinct value in
'a'. We can exhaustively "search every subindex" using an index qual
that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND
b = 5" would behave.
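To make that concrete, here's a toy Python sketch (an illustration of the idea, not the patch's implementation): a sorted list of (a, b) tuples stands in for the composite index, and we descend into each distinct-'a' "subindex" in turn rather than reading every tuple.

```python
import bisect

# Toy stand-in for a composite index on (a, b): a sorted list of keys.
index = sorted((a, b) for a in (1, 2, 3) for b in range(10))

def skip_scan_eq(index, b_target):
    """Return all (a, b) entries with b == b_target by descending
    into each distinct-'a' "subindex", as if the qual had been
    WHERE a = ANY(<every 'a' value>) AND b = b_target."""
    results = []
    pos = 0
    while pos < len(index):
        a = index[pos][0]                          # current distinct 'a'
        # Descend straight to (a, b_target), skipping irrelevant keys.
        pos = bisect.bisect_left(index, (a, b_target), pos)
        while pos < len(index) and index[pos] == (a, b_target):
            results.append(index[pos])
            pos += 1
        # Skip ahead to the start of the next distinct-'a' "subindex".
        pos = bisect.bisect_left(index, (a + 1,), pos)
    return results

print(skip_scan_eq(index, 5))   # [(1, 5), (2, 5), (3, 5)]
```

Each outer iteration is one "descent": the scan touches only the keys it needs, which is why the approach wins when the number of distinct 'a' values is small relative to the index.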

This approach might be much less efficient than an index scan that can
use an index on 'b' alone, but skip scanning can still be orders of
magnitude faster than a sequential scan. The user may very well not
have a dedicated index on 'b' alone, for whatever reason.

Note that the patch doesn't just target these simpler "skip leading
index column omitted from the predicate" cases. It's far more general
than that -- skipping attributes (or what the patch refers to as skip
arrays) can be freely mixed with SAOPs/conventional arrays, in any
order you can think of. They can also be combined with inequalities to
form range skip arrays.

This patch is a direct follow-up to the Postgres 17 work that became
commit 5bf748b8. Making everything work well together is an important
design goal here. I'll talk more about that further down, and will
show a benchmark example query that'll give a good general sense of
the value of the patch with these more complicated cases.

A note on terminology
=====================

The terminology in this area has certain baggage. Many of us will
recall the patch that implemented loose index scan. That patch also
dubbed itself "skip scan", but that doesn't seem right to me (it's at
odds with how other RDBMSs describe features in this area). I would
like to address the issues with the terminology in this area now, to
avoid creating further confusion.

When I use the term "skip scan", I'm referring to a feature that's
comparable to the skip scan features from Oracle and MySQL 8.0+. This
*isn't* at all comparable to the feature that MySQL calls "loose index
scan" -- don't confuse the two features.

Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).

My patch is an additive thing that speeds up what we'd currently
refer to as full index scans (as well as range index scans that
currently do a "full scan" of a range/subset of an index). These index
paths/index scans can no longer really be called "full index scans",
of course, but they're still logically the same index paths as before.

MDAM and skip scan
==================

As I touched on already, the patch actually implements a couple of
related optimizations. "Skip scan" might be considered one out of the
several optimizations from the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1] -- the paper describes skip scan under
its "Missing Key Predicate" subsection. I collectively refer to the
optimizations from the paper as the "MDAM techniques".

Alternatively, you could define these MDAM techniques as each
implementing some particular flavor of skip scan, since they all do
rather similar things under the hood. In fact, that's how I've chosen
to describe things in my patch: it talks about skip scan, and about
range skip scan, which is considered a minor variant of skip scan.
(Note that the term "skip scan" is never used in the MDAM paper.)

MDAM is short for "multidimensional access method". In the context of
the paper, "dimension" refers to dimensions in a decision support
system. These dimensions are represented by low cardinality columns,
each of which appear in a large composite B-Tree index. The emphasis
in the paper (and for my patch) is DSS and data warehousing; OLTP apps
typically won't benefit as much.

Note: Loose index scan *isn't* described by the paper at all. I also
wouldn't classify loose index scan as one of the MDAM techniques. I
think of it as being in a totally different category, due to the way
that it applies semantic information. No MDAM technique will ever
apply high-level semantic information about what is truly required by
the plan tree, one level up. And so my patch simply teaches nbtree to
find the most efficient way of navigating through an index, based
solely on information that is readily available to the scan. The same
principles apply to all of the other MDAM techniques; they're
basically all just another flavor of skip scan (that do some kind of
clever preprocessing/transformation that enables reducing the scan to
a series of disjunctive accesses, and that could be implemented using
the new abstraction I'm calling skip arrays).

The paper more or less just applies one core idea, again and again.
It's surprising how far that one idea can take you. But it is still
just one core idea (don't overlook that point).

Range skip scan
---------------

To me, the most interesting MDAM technique is probably one that I
refer to as "range skip scan" in the patch. This is the technique that
the paper introduces first, in its "Intervening Range Predicates"
subsection. The best way of explaining it is through an example (you
could also just read the paper, which has an example of its own).

Imagine a table with just one index: a composite index on "(pdate,
customer_id)". Further suppose we have a query such as:

SELECT * FROM payments WHERE pdate BETWEEN '2024-01-01' AND
'2024-01-30' AND customer_id = 5; -- both index columns (pdate and
customer_id) appear in predicate

The patch effectively makes the nbtree code execute the index scan as
if the query had been written like this instead:

SELECT * FROM payments WHERE pdate = ANY ('2024-01-01', '2024-01-02',
..., '2024-01-30') AND customer_id = 5;

The use of a "range skip array" within nbtree allows the scan to skip
when that makes sense, locating the next date with customer_id = 5
each time (we might skip over many irrelevant leaf pages each time).
The scan must also *avoid* skipping when it *doesn't* make sense.

As always (since commit 5bf748b8 went in), whether and to what extent
we skip using array keys depends in large part on the physical
characteristics of the index at runtime. If the tuples that we need to
return are all clustered closely together, across only a handful of
leaf pages, then we shouldn't be skipping at all. When skipping makes
sense, we should skip constantly.
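As a toy illustration (mine, not the patch's code), a range skip array over this BETWEEN qual logically stands in for the following stream of values. The real patch never materializes the list up front; it derives each next value on demand, and only skips to it when the physical characteristics of the index make that worthwhile:

```python
from datetime import date, timedelta

def range_skip_values(lo, hi):
    """The values a "range skip array" logically stands in for: every
    date in [lo, hi], as if the qual were
    pdate = ANY('2024-01-01', '2024-01-02', ..., '2024-01-30')."""
    d = lo
    while d <= hi:
        yield d
        d += timedelta(days=1)

vals = list(range_skip_values(date(2024, 1, 1), date(2024, 1, 30)))
print(len(vals))   # 30 distinct dates, one disjunctive access per date
```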

I'll discuss the trade-offs in this area a little more below, under "Design".

Using multiple MDAM techniques within the same index scan (includes benchmark)
------------------------------------------------------------------------------

I recreated the data in the MDAM paper's "sales" table by making
inferences from the paper. It's very roughly the same data set as the
paper (close enough to get the general idea across). The table size is
about 51GB, and the index is about 25GB (most of the attributes from
the table are used as index columns). There is nothing special about
this data set -- I just thought it would be cool to "recreate" the
queries from the paper, as best I could. I thought that this approach
might make my points about the design easier to follow.

The index we'll be using for this can be created via: "create index
mdam_idx on sales_mdam_paper(dept, sdate, item_class, store)". Again,
this is per the paper. It's also the order that the columns appear in
every WHERE clause in every query from the paper.

(That said, the particular column order from the index definition
mostly doesn't matter. Every index column is a low cardinality column,
so unless the order used completely obviates the need to skip a column
that would otherwise need to be skipped, such as "dept", the effect on
query execution time from varying column order is in the noise.
Obviously that's very much not how users are used to thinking about
composite indexes.)

The MDAM paper has numerous example queries, each of which builds on
the last, adding one more complication each time -- each of which is
addressed by another MDAM technique. The query I'll focus on here is
an example query that's towards the end of the paper, and so combines
multiple techniques together -- it's the query that appears in the "IN
Lists" subsection:

select
dept,
sdate,
item_class,
store,
sum(total_sales)
from
sales_mdam_paper
where
-- omitted: leading "dept" column from composite index
sdate between '1995-06-01' and '1995-06-30'
and item_class in (20, 35, 50)
and store in (200, 250)
group by dept, sdate, item_class, store
order by dept, sdate, item_class, store;

On HEAD, when we run this query we either get a sequential scan (which
is very slow) or a full index scan (which is almost as slow). Whereas
with the patch, nbtree will execute the query as a succession of a few
thousand very selective primitive index scans (which usually only scan
one leaf page, though some may scan two neighboring leaf pages).

Results: The full index scan on HEAD takes about 32 seconds. With the
patch, the query takes just under 52ms to execute. That works out to
be about 630x faster with the patch.

See the attached SQL file for full details. It provides all you'll
need to recreate this test result with the patch.

Nobody would put up with such an inefficient full index scan in the
first place, so the behavior on HEAD is not really a sensible baseline
-- 630x isn't very meaningful. I could have come up with a case that
showed an even larger improvement if I'd felt like it, but that
wouldn't have proven anything.

The important point is that the patch makes a huge composite index
like the one I've built for this actually make sense, for the first
time. So we're not so much making something faster as enabling a whole
new approach to indexing -- particularly for data warehousing use
cases. The way that Postgres DBAs choose which indexes they'll need to
create is likely to be significantly changed by this optimization.

I'll break down how this is possible. This query makes use of 3
separate MDAM techniques:

1. A "simple" skip scan (on "dept").

2. A "range" skip scan (on "sdate").

3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing
new here, except that nbtree needs these regular SAOP arrays to roll
over the higher-order skip arrays to trigger moving on to the next
dept/date.)

Internally, we're just doing a series of several thousand distinct
non-overlapping accesses, in index key space order (so as to preserve
the appearance of one continuous index scan). These accesses start
out like this:

dept=INT_MIN, date='1995-06-01', item_class=20, store=200
(Here _bt_advance_array_keys discovers that the actual lowest dept
is 1, not INT_MIN)
dept=1, date='1995-06-01', item_class=20, store=200
dept=1, date='1995-06-01', item_class=20, store=250
dept=1, date='1995-06-01', item_class=35, store=200
dept=1, date='1995-06-01', item_class=35, store=250
...

(Side note: as I mentioned, the two "store" values usually appear
together on the same leaf page in practice. Arguably I should
have shown 2 lines/accesses here (for "dept=1"), rather than showing
4. The 4 "dept=1" lines shown required only 2 primitive index
scans/index descents/leaf page reads. Disjunctive accesses don't
necessarily map 1:1 with primitive/physical index scans.)

About another ten thousand similar accesses occur (omitted for
brevity). Execution of the scan within nbtree finally ends with these
primitive index scans/accesses:
...
dept=100, date='1995-06-30', item_class=50, store=250
dept=101, date='1995-06-01', item_class=20, store=200
STOP

There is no "dept=101" entry in the index (the highest department in
the index happens to be 100). The index scan therefore terminates at
this point, having run out of leaf pages to scan (we've reached the
rightmost point of the rightmost leaf page, as the scan attempts to
locate non-existent dept=101 tuples).
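The odometer-like rollover of the array keys can be sketched in a few lines of Python (an illustration only -- the dept values and counts here are hypothetical stand-ins, and the real scan decides dynamically how many of these logical accesses share one physical index descent):

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical stand-ins for the example query's array keys: a skip
# array on dept (distinct values discovered at runtime), a range skip
# array on sdate, plus the two ordinary SAOP/IN-list arrays.
depts = range(1, 101)                                    # dept 1..100
dates = [date(1995, 6, 1) + timedelta(days=i) for i in range(30)]
item_classes = (20, 35, 50)
stores = (200, 250)

# The accesses advance like an odometer: the lowest-order array
# (store) rolls over item_class, which rolls over sdate, which
# finally rolls over dept -- always in index key space order.
accesses = list(product(depts, dates, item_classes, stores))
print(accesses[0])     # (1, datetime.date(1995, 6, 1), 20, 200)
print(accesses[-1])    # (100, datetime.date(1995, 6, 30), 50, 250)
print(len(accesses))   # 18000 logical accesses; many share a leaf page
```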

Design
======

Since index scans with skip arrays work just like index scans with
regular arrays (as of Postgres 17), naturally, there are no special
restrictions. Associated optimizer index paths have path keys, and so
could (just for example) appear in a merge join, or feed into a group
aggregate, while avoiding a sort node. Index scans that skip could
also feed into a relocatable cursor.

As I mentioned already, the patch adds a skipping mechanism that is
purely an additive thing. I think that this will turn out to be an
important enabler of using the optimizations, even when there's much
uncertainty about how much they'll actually help at runtime.

Optimizer
---------

We make a broad assumption that skipping is always to our advantage
during nbtree preprocessing -- preprocessing generates as many skip
arrays as could possibly make sense based on static rules (rules that
don't apply any kind of information about data distribution). Of
course, skipping isn't necessarily the best approach in all cases, but
that's okay. We only actually skip when physical index characteristics
show that it makes sense. The real decisions about skipping are all
made dynamically.

That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.

What do you call an index scan where 90% of all index tuples are 1 of
only 3 distinct values, while the remaining 10% of index tuples are
all perfectly unique in respect of a leading column? Clearly the best
strategy when skipping using the leading column is to "use skip scan for
90% of the index, and use a conventional range scan for the remaining
10%". Skipping generally makes sense, but we legitimately need to vary
our strategy *during the same index scan*. It makes no sense to think
of skip scan as a discrete sort of index scan.
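To give a feel for the kind of runtime decision involved, here's a deliberately crude sketch (a toy heuristic of my own, not the actual _bt_advance_array_keys logic): keep reading the current leaf page while the next key the arrays need still falls within its key space, and only consider a fresh index descent (a skip) once the needed key lies beyond the page's high key.

```python
def continue_on_leaf_page(next_needed_key, page_high_key):
    """Toy version of the skip-vs-read-on choice.  Keys are tuples in
    index column order, so Python's tuple comparison mirrors index key
    space order.  If the next needed key is still bounded by the page
    high key, a new descent couldn't save anything; beyond it, a new
    primitive index scan (a skip) can possibly pay off."""
    return next_needed_key <= page_high_key

# In the 90%/10% index described above, the dense duplicate ranges
# keep satisfying this test (so the scan reads sequentially), while
# the unique-valued region keeps failing it (so the scan skips).
print(continue_on_leaf_page((1, 5), (1, 9)))   # True: stay on page
print(continue_on_leaf_page((2, 5), (1, 9)))   # False: skip ahead
```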

I have yet to prove that always having the option of skipping (even
when it's very unlikely to help) really does "come for free" -- for
now I'm just asserting that that's possible. I'll need proof. I expect
to hear some principled skepticism on this point. It's probably not
quite there in this v1 of the patch -- there'll be some regressions (I
haven't looked very carefully just yet). However, we seem to already
be quite close to avoiding regressions from excessive/useless
skipping.

Extensible infrastructure/support functions
-------------------------------------------

Currently, the patch only supports skip scan for a subset of all
opclasses -- those that have the required support function #6, or
"skip support" function. This provides the opclass with (among other
things) a way to increment the current skip array value (or to
decrement it, in the case of backward scans). In practice we only have
this for a handful of discrete integer (and integer-ish) types. Note
that the patch currently cannot skip for an index column that happens
to be text. Even so, this v1 supports skip scans that involve
unsupported types, provided that the opclass of each specific column
we'll actually need to skip has skip support.

The patch should be able to support every type/opclass as a target for
skipping, regardless of whether an opclass support function happened
to be available. That could work by teaching the nbtree code to have
explicit probes for the next skip array value in the index, only then
combining that new value with the qual from the input scan keys/query.
I've put that off for now because it seems less important -- it
doesn't really affect anything I've said about the core design, which
is what I'm focussing on for now.

It makes sense to use increment/decrement whenever feasible, even
though it isn't strictly necessary (or won't be, once the patch has
the required explicit probe support). The only reason to not apply
increment/decrement opclass skip support (that I can see) is because
it just isn't practical (this is generally the case for continuous
types). While it's slightly onerous to have to invent all this new
opclass infrastructure, it definitely makes sense.
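The shape of a skip support increment is simple enough to sketch in Python (illustrative only -- the names and calling convention here are mine, not the patch's C-level support function #6 interface):

```python
INT64_MAX = 2**63 - 1

def int8_skip_increment(value):
    """Sketch of what a skip support increment might do for a discrete
    integer opclass: step to the next possible indexable value, or
    signal that the skip array is exhausted.  A backward scan would
    use the mirror-image decrement."""
    if value >= INT64_MAX:
        return None              # no next value: skip array exhausted
    return value + 1

# Continuous types (text, numeric) have no usable "next value", so
# they must instead probe the index explicitly for the next distinct
# value -- correct, but typically costing an extra index descent.
print(int8_skip_increment(41))          # 42
print(int8_skip_increment(INT64_MAX))   # None
```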

There is a performance advantage to having skip arrays that can
increment through each distinct possible indexable value (this
increment/decrement stuff comes from the MDAM paper). The MDAM
techniques inherently work best when "skipping" columns of discrete
types like integer and date, which is why the paper has examples that
all look like that. If you look at my example query and its individual
accesses, you'll realize why this is so.

Thoughts?

[1]: https://vldb.org/conf/1995/P710.PDF
--
Peter Geoghegan

Attachments:

mdam_paper_in_query.sql (application/octet-stream)
v1-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream, +1882-166)
#2 Aleksander Alekseev
aleksander@timescale.com
In reply to: Peter Geoghegan (#1)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Hi Peter,

> Attached is a POC patch that adds skip scan to nbtree. The patch
> teaches nbtree index scans to efficiently use a composite index on
> '(a, b)' for queries with a predicate such as "WHERE b = 5". This is
> feasible in cases where the total number of distinct values in the
> column 'a' is reasonably small (think tens or hundreds, perhaps even
> thousands for very large composite indexes).
>
> [...]
>
> Thoughts?

Many thanks for working on this. I believe it is an important feature
and it would be great to deliver it during the PG18 cycle.

I experimented with the patch and here are the results I got so far.

Firstly, it was compiled on Intel MacOS and ARM Linux. All the tests
pass just fine.

Secondly, I tested the patch manually using a release build on my
Raspberry Pi 5 and the GUCs that can be seen in [1].

Test 1 - simple one.

```
CREATE TABLE test1(c char, n bigint);
CREATE INDEX test1_idx ON test1 USING btree(c,n);

INSERT INTO test1
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
```

Test 2 - a more complicated one.

```
CREATE TABLE test2(c1 char, c2 char, n bigint);
CREATE INDEX test2_idx ON test2 USING btree(c1,c2,n);

INSERT INTO test2
SELECT chr(ascii('a') + random(0,2)) AS c1,
chr(ascii('a') + random(0,2)) AS c2,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test2 WHERE n > 900_000_000;
```

Test 3 - to see how it works with covering indexes.

```
CREATE TABLE test3(c char, n bigint, s text DEFAULT 'text_value' || n);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);

INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT s FROM test3 WHERE n < 1000;
```

In all the cases the patch worked as expected.

I noticed that with the patch we choose Index Only Scans for Test 1
and without the patch - Parallel Seq Scan. However the Parallel Seq
Scan is 2.4 times faster. Before the patch the query takes 53 ms,
after the patch - 127 ms. I realize this could be just something
specific to my hardware and/or amount of data.

Do you think this is something that was expected or something worth
investigating further?

I haven't looked at the code yet.

[1]: https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh

--
Best regards,
Aleksander Alekseev

#3 Peter Geoghegan
pg@bowt.ie

In reply to: Aleksander Alekseev (#2)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Tue, Jul 2, 2024 at 8:53 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:

> CREATE TABLE test1(c char, n bigint);
> CREATE INDEX test1_idx ON test1 USING btree(c,n);

The type "char" (note the quotes) is different from char(1). It just
so happens that v1 has support for skipping attributes that use the
default opclass for "char", without support for char(1).

If you change your table definition to CREATE TABLE test1(c "char", n
bigint), then your example queries can use the optimization. This
makes a huge difference.

> EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;

For example, this first test query goes from needing a full index scan
that has 5056 buffer hits to a skip scan that requires only 12 buffer
hits.

> I noticed that with the patch we choose Index Only Scans for Test 1
> and without the patch - Parallel Seq Scan. However the Parallel Seq
> Scan is 2.4 times faster. Before the patch the query takes 53 ms,
> after the patch - 127 ms.

I'm guessing that it's actually much faster once you change the
leading column to the "char" type/default opclass.

> I realize this could be just something
> specific to my hardware and/or amount of data.

The selfuncs.c costing currently has a number of problems.

One problem is that it doesn't know that some opclasses/types don't
support skipping at all. That particular problem should be fixed on
the nbtree side; nbtree should support skipping regardless of the
opclass that the skipped attribute uses (while still retaining the new
opclass support functions for a subset of types where we expect it to
make skip scans somewhat faster).

--
Peter Geoghegan

#4 Peter Geoghegan
pg@bowt.ie

In reply to: Peter Geoghegan (#3)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote:

>> EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
>
> For example, this first test query goes from needing a full index scan
> that has 5056 buffer hits to a skip scan that requires only 12 buffer
> hits.

Actually, looks like that's an invalid result. The "char" opclass
support function appears to have bugs.

My testing totally focussed on types like integer, date, and UUID. The
"char" opclass was somewhat of an afterthought. Will fix "char" skip
support for v2.

--
Peter Geoghegan

#5 Peter Geoghegan
pg@bowt.ie

In reply to: Peter Geoghegan (#4)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Tue, Jul 2, 2024 at 9:40 AM Peter Geoghegan <pg@bowt.ie> wrote:

> On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
>>> EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
>>
>> For example, this first test query goes from needing a full index scan
>> that has 5056 buffer hits to a skip scan that requires only 12 buffer
>> hits.
>
> Actually, looks like that's an invalid result. The "char" opclass
> support function appears to have bugs.

Attached v2 fixes this bug. The problem was that the skip support
function used by the "char" opclass assumed signed char comparisons,
even though the authoritative B-Tree comparator (support function 1)
uses unsigned comparisons (via uint8 casting). A simple oversight. Your
test cases will work with this v2, provided you use "char" (instead of
unadorned char) in the create table statements.

Another small change in v2: I added a DEBUG2 message to nbtree
preprocessing, indicating the number of attributes that we're going to
skip. This provides an intuitive way to see whether the optimizations
are being applied in the first place. That should help to avoid
further confusion like this as the patch continues to evolve.

Support for char(1) doesn't seem feasible within the confines of a
skip support routine. Just like with text (which I touched on in the
introductory email), this will require teaching nbtree to perform
explicit next-key probes. An approach based on explicit probes is
somewhat less efficient in some cases, but it should always work. It's
impractical to write opclass support that (say) increments a char
value 'a' to 'b'. Making that approach work would require extensive
cooperation from the collation provider, and some knowledge of
encoding, which just doesn't make sense (if it's possible at all). I
don't have the problem with "char" because it isn't a collatable type
(it is essentially the same thing as a uint8 integer type, except
that it outputs printable ascii characters).

FWIW, your test cases don't seem like particularly good showcases for
the patch. The queries you came up with require a relatively large
amount of random I/O when accessing the heap, which skip scan will
never help with -- so skip scan is a small win (at least relative to
an unoptimized full index scan). Obviously, no skip scan can ever
avoid any required heap accesses compared to a naive full index scan
(loose index scan *does* have that capability, which is possible only
because it applies semantic information in a way that's very
different).

FWIW, a more sympathetic version of your test queries would have
involved something like "WHERE n = 900_500_000". That would allow the
implementation to perform a series of *selective* primitive index
scans (one primitive index scan per "c" column/char grouping). That
change has the effect of allowing the scan to skip over many
irrelevant leaf pages, which is of course the whole point of skip
scan. It also means the scan will require far fewer heap accesses, so
heap related costs no longer drown out the nbtree improvements.

--
Peter Geoghegan

Attachments:

v2-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream, +1899-177)

#6 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#5)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Tue, Jul 2, 2024 at 12:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Attached v2 fixes this bug. The problem was that the skip support
> function used by the "char" opclass assumed signed char comparisons,
> even though the authoritative B-Tree comparator (support function 1)
> uses unsigned comparisons (via uint8 casting). A simple oversight.

Although v2 gives correct answers to the queries, the scan itself
performs an excessive amount of leaf page accesses. In short, it
behaves just like a full index scan would, even though we should
expect it to skip over significant runs of the index. So that's
another bug.

It looks like the queries you posted have a kind of adversarial
quality to them, as if they were designed to confuse the
implementation. Was it intentional? Did you take them from an existing
test suite somewhere?

The custom instrumentation I use to debug these issues shows:

_bt_readpage: 🍀 1981 with 175 offsets/tuples (leftsib 4032, rightsib 3991) ➡️
_bt_readpage first: (c, n)=(b, 998982285), TID='(1236,173)',
0x7f1464fe9fc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 3991 with 175 offsets/tuples (leftsib 1981, rightsib 9) ➡️
_bt_readpage first: (c, n)=(b, 999474517), TID='(4210,9)',
0x7f1464febfc8, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 9 with 229 offsets/tuples (leftsib 3991, rightsib 3104) ➡️
_bt_readpage first: (c, n)=(c, 1606), TID='(882,68)', 0x7f1464fedfc0,
from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
_bt_readpage: 🍀 3104 with 258 offsets/tuples (leftsib 9, rightsib 1685) ➡️
_bt_readpage first: (c, n)=(c, 706836), TID='(3213,4)',
0x7f1464feffc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
*** SNIP, many more "nmatching: 0" pages appear after these two ***

The final _bt_advance_array_keys call for leaf page 3991 should be
scheduling a new primitive index scan (i.e. skipping), but that never
happens. Not entirely sure why that is, but it probably has something
to do with _bt_advance_array_keys failing to hit the
"has_required_opposite_direction_only" path for determining if another
primitive scan is required. You're using an inequality required in the
opposite-to-scan-direction here, so that path is likely to be
relevant.

--
Peter Geoghegan

#7 Peter Geoghegan
pg@bowt.ie

In reply to: Peter Geoghegan (#6)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Tue, Jul 2, 2024 at 12:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Although v2 gives correct answers to the queries, the scan itself
> performs an excessive amount of leaf page accesses. In short, it
> behaves just like a full index scan would, even though we should
> expect it to skip over significant runs of the index. So that's
> another bug.

Hit "send" too soon. I simply forgot to run "alter table test1 alter
column c type "char";" before running the query. So, I was mistaken
about there still being a bug in v2. The issue here is that we don't
have support for the underlying type, char(1) -- nothing more.

v2 of the patch with your query 1 (when changed to use the "char"
type/opclass instead of the currently unsupported char(1)
type/opclass) performs 395 index related buffer hits, and 5406 heap
block accesses. Whereas it's 3833 index buffer hits with master
(naturally, the same 5406 heap accesses are required with master). In
short, this query isn't particularly sympathetic to the patch. Nor is
it unsympathetic.

--
Peter Geoghegan

#8 Aleksander Alekseev
aleksander@timescale.com
In reply to: Peter Geoghegan (#7)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Hi Peter,

> It looks like the queries you posted have a kind of adversarial
> quality to them, as if they were designed to confuse the
> implementation. Was it intentional?

To some extent. I merely wrote several queries that I would expect to
benefit from skip scans. Since I didn't look at the queries you used,
there was a chance that I would hit something interesting.

> Attached v2 fixes this bug. The problem was that the skip support
> function used by the "char" opclass assumed signed char comparisons,
> even though the authoritative B-Tree comparator (support function 1)
> uses unsigned comparisons (via uint8 casting). A simple oversight. Your
> test cases will work with this v2, provided you use "char" (instead of
> unadorned char) in the create table statements.

Thanks for v2.

> If you change your table definition to CREATE TABLE test1(c "char", n
> bigint), then your example queries can use the optimization. This
> makes a huge difference.

You are right, it does.

Test1 takes 33.7 ms now (53 ms before the patch, x1.57)

Test3 I showed before contained an error in the table definition
(Postgres can't do `n bigint, s text DEFAULT 'text_value' || n`). Here
is the corrected test:

```
CREATE TABLE test3(c "char", n bigint, s text);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);

INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT s FROM test3 WHERE n < 10_000;
```

It runs fast (< 1 ms) and uses the index, as expected.

Test2 with "char" doesn't seem to benefit from the patch anymore
(pretty sure it did in v1). It always chooses Parallel Seq Scans even
if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
999_997_362`. Is this expected behavior?

I also tried Test4 and Test5.

In Test4 I was curious if skip scans work properly with functional indexes:

```
CREATE TABLE test4(d date, n bigint);
CREATE INDEX test4_idx ON test4 USING btree(extract(year from d),n);

INSERT INTO test4
SELECT ('2024-' || random(1,12) || '-' || random(1,28)) :: date AS d,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM test4 WHERE n > 900_000_000;
```

The query uses an Index Scan; however, the performance is worse than with
the Seq Scan chosen before the patch. It doesn't matter if I choose '>' or
'=' condition.

Test5 checks how skip scans work with partial indexes:

```
CREATE TABLE test5(c "char", n bigint);
CREATE INDEX test5_idx ON test5 USING btree(c, n) WHERE n > 900_000_000;

INSERT INTO test5
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM test5 WHERE n > 950_000_000;
```

It runs fast and chooses Index Only Scan. But then I discovered that
without the patch Postgres also uses Index Only Scan for this query. I
didn't know it could do this - what is the name of this technique? The
query takes 17.6 ms with the patch, 21 ms without the patch. Not a
huge win but still.

That's all I have for now.

--
Best regards,
Aleksander Alekseev

In reply to: Aleksander Alekseev (#8)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Fri, Jul 5, 2024 at 7:04 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:

Test2 with "char" doesn't seem to benefit from the patch anymore
(pretty sure it did in v1). It always chooses Parallel Seq Scans even
if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
999_997_362`. Is this expected behavior?

The "char" opclass's skip support routine was totally broken in v1, so
its performance isn't really relevant. In any case v2 didn't make any
changes to the costing, so I'd expect it to use exactly the same query
plan as v1.

The query uses an Index Scan; however, the performance is worse than with
the Seq Scan chosen before the patch. It doesn't matter if I choose '>' or
'=' condition.

That's because the index has a leading/skipped column of type
"numeric", which isn't a supported type just yet (a supported B-Tree
opclass, actually).

The optimization is effective if you create the expression index with
a cast to integer:

CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);

This performs much better. Now I see "DEBUG: skipping 1 index
attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
optimization has in fact been used as expected. There are far fewer
buffers hit with this version of your test4, which also indicates that
the optimization has been effective.

Note that the original numeric expression index test4 showed "DEBUG:
skipping 0 index attributes" when the test query ran, which indicated
that the optimization couldn't be used. I suggest that you look out
for that, by running "set client_min_messages to debug2;" from psql
when testing the patch.

It runs fast and chooses Index Only Scan. But then I discovered that
without the patch Postgres also uses Index Only Scan for this query. I
didn't know it could do this - what is the name of this technique?

It is a full index scan. These have been possible for many years now
(possibly well over 20 years).

Arguably, the numeric case that didn't use the optimization (your
test4) should have been costed as a full index scan, but it wasn't --
that's why you didn't get a faster sequential scan, which would have
made a little bit more sense. In general, the costing changes in the
patch are very rough.

That said, this particular problem (the test4 numeric issue) should be
fixed by inventing a way for nbtree to use skip scan with types that
lack skip support. It's not primarily a problem with the costing. At
least not in my mind.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#9)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Fri, Jul 5, 2024 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote:

CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);

This performs much better. Now I see "DEBUG: skipping 1 index
attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
optimization has in fact been used as expected. There are far fewer
buffers hit with this version of your test4, which also indicates that
the optimization has been effective.

Actually, with an index-only scan it is 281 buffer hits (including
some small number of VM buffer hits) with the patch, versus 2736
buffer hits on master. So a big change to the number of index page
accesses only.

If you use a plain index scan for this, then the cost of random heap
accesses totally dominates, so skip scan cannot possibly give much
benefit. Even a similar bitmap scan requires 4425 distinct heap page accesses,
which is significantly more than the total number of index pages in
the index. 4425 heap pages is almost the entire table; the table
consists of 4480 mainfork blocks.

This is a very nonselective query. It's not at all surprising that
this query (and others like it) hardly benefit at all, except when we
can use an index-only scan (so that the cost of heap accesses doesn't
totally dominate).

--
Peter Geoghegan

#11Masahiro.Ikeda@nttdata.com
Masahiro.Ikeda@nttdata.com
In reply to: Peter Geoghegan (#1)
RE: Adding skip scan (including MDAM style range skip scan) to nbtree

Hi,

Since I'd like to understand the skip scan to improve the EXPLAIN output
for multicolumn B-Tree Index[1], I began to try the skip scan with some
queries and look into the source code.

I have some feedback and comments.

(1)

At first, I was surprised when I looked at your benchmark results, because
skip scan can improve performance so much. I agree that many users will be
happy with this feature, especially for the OLAP use case. I hope to see it in v18.

(2)

I found the cost is estimated to be much higher if the number of skipped
attributes is more than two. Is this expected behavior?

# Test result. The attached file has the details of the tests.

-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seq scan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
Output: id1, id2, id3, value
Index Cond: (test.id3 = 101)
Buffers: shared hit=4402
Planning:
Buffers: shared hit=7
Planning Time: 0.234 ms
Execution Time: 15.711 ms
(8 rows)

-- Seq Scan
-- The actual time is high, but the cost is lower than that of the above Index Scan.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 rows=991 loops=1)
Output: id1, id2, id3, value
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6370
-> Parallel Seq Scan on public.test (cost=0.00..11578.33 rows=410 width=20) (actual time=0.061..102.016 rows=330 loops=3)
Output: id1, id2, id3, value
Filter: (test.id3 = 101)
Rows Removed by Filter: 333003
Buffers: shared hit=6370
Worker 0: actual time=0.099..98.014 rows=315 loops=1
Buffers: shared hit=2066
Worker 1: actual time=0.054..97.162 rows=299 loops=1
Buffers: shared hit=1858
Planning:
Buffers: shared hit=19
Planning Time: 0.194 ms
Execution Time: 114.129 ms
(18 rows)

I looked at btcostestimate() to find the reason and found that the bound quals
and cost.num_sa_scans are different from my expectations.

My assumption is
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
per skipped attribute)

But it's wrong. The above index scan result is
* bound quals is NULL
* cost.num_sa_scans = 1

I know you said the below, but I'd like to know whether the above is expected or not.

That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.

I couldn't fully understand the logic below.

btcostestimate()
(...omit...)
    if (indexcol != iclause->indexcol)
    {
        /* no quals at all for indexcol */
        found_skip = true;
        if (index->pages < 100)
            break;
        num_sa_scans += 10 * (indexcol - iclause->indexcol); // why add minus value?
        continue; // why skip to add bound quals?
    }

(3)

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance to use pg_stats.n_distinct?

[1]: Improve EXPLAIN output for multicolumn B-Tree Index /messages/by-id/TYWPR01MB1098260B694D27758FE2BA46FB1C92@TYWPR01MB10982.jpnprd01.prod.outlook.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

compare_cost.sql
compare_cost_with_v2_patch.out
In reply to: Masahiro.Ikeda@nttdata.com (#11)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote:

Since I'd like to understand the skip scan to improve the EXPLAIN output
for multicolumn B-Tree Index[1], I began to try the skip scan with some
queries and look into the source code.

Thanks for the review!

Attached is v3, which generalizes skip scan, allowing it to work with
opclasses/types that lack a skip support routine. In other words, v3
makes skip scan work for all types, including continuous types, where
it's impractical or infeasible to add skip support. So now important
types like text and numeric also get the skip scan optimization (it's
not just discrete types like integer and date, as in previous
versions).

I feel very strongly that everything should be implemented as part of
the new skip array abstraction; the patch should only add the concept
of skip arrays, which should work just like SAOP arrays. We should
avoid introducing any special cases. In short, _bt_advance_array_keys
should work in exactly the same way as it does as of Postgres 17
(except for a few representational differences for skip arrays). This
seems essential because _bt_advance_array_keys inherently needs to be
able to trigger moving on to the next skip array value when it reaches
the end of a SAOP array (and vice-versa). And so it just makes sense
to abstract-away the differences, hiding the difference in lower level
code.

I have described the new _bt_first behavior that is now available in
this new v3 of the patch as "adding explicit next key probes". While
v3 does make new changes to _bt_first, it's not really a special kind
of index probe. v3 invents new sentinel values instead.

The use of sentinels avoids inventing true special cases: the values
-inf, +inf, as well as variants of = that use a real datum value, but
match on the next key in the index. These new = variants can be
thought of as "+infinitesimal" values. So when _bt_advance_array_keys
has to "increment" the numeric value 5.0, it sets the scan key to the
value "5.0 +infinitesimal". There can never be any matching tuples in
the index (just like with -inf sentinel values), but that doesn't
matter. So the changes v3 makes to _bt_first don't change the basic
conceptual model. The added complexity is kept to a manageable level,
particularly within _bt_advance_array_keys, which is already very
complicated.

To help with testing and review, I've added another temporary testing
GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to
avoid using skip support, even where available. The GUC makes it easy
to measure how skip support routines can help performance (with
discrete types like integer and date).

I found the cost is estimated to be much higher if the number of skipped
attributes is more than two. Is this expected behavior?

Yes and no.

Honestly, the current costing is just placeholder code. It is totally
inadequate. I'm not surprised that you found problems with it. I just
didn't put much work into it, because I didn't really know what to do.

# Test result. The attached file has the details of the tests.

-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seq scan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
Output: id1, id2, id3, value
Index Cond: (test.id3 = 101)
Buffers: shared hit=4402
Planning:
Buffers: shared hit=7
Planning Time: 0.234 ms
Execution Time: 15.711 ms
(8 rows)

This is a useful example, because it shows the difficulty with the
costing. I ran this query using my own custom instrumentation of the
scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf
pages at a time, but we still come out ahead. As you pointed out, it's
~7.5x faster than the sequential scan, but not very different to the
equivalent full index scan. At least not very different in terms of
leaf page accesses. Why should we win by this much, for what seems
like a marginal case for skip scan?

Even cases where "skipping" doesn't manage to skip any leaf pages can
still benefit from skipping *index tuples* -- there is more than one
kind of skipping to consider. That is, the patch helps a lot with some
(though not all) cases where I didn't really expect that to happen:
the Postgres 17 SAOP tuple skipping code (the code in
_bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps
quite a bit in "marginal" skip scan cases, even though it wasn't
really designed for that purpose (it was added to avoid regressions in
SAOP array scans for the Postgres 17 work).

I find that some queries using my original example test case are about
twice as fast as an equivalent full index scan, even when only the
fourth and final index column is used in the query predicate. The scan
can't even skip a single leaf page at a time, and yet we still win by
a nice amount. We win, though it is almost by mistake!

This is mostly a good thing. Both for the obvious reason (fast is
better than slow), and because it justifies being so aggressive in
assuming that skip scan might work out during planning (being wrong
without really losing is nice). But there is also a downside: it makes
it even harder to model costs at runtime, from within the optimizer.

If I measure the actual costs other than runtime itself (e.g.,
buffer accesses), I'm not sure that the optimizer is wrong to think
that the parallel sequential scan is faster. It looks approximately
correct. It is only when we look at runtime that the optimizer's
choice looks wrong. Which is...awkward.

In general, I have very little idea about how to improve the costing
within btcostestimate. I am hoping that somebody has better ideas
about it. btcostestimate is definitely the area where the patch is
weakest right now.

I looked at btcostestimate() to find the reason and found that the bound quals
and cost.num_sa_scans are different from my expectations.

My assumption is
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
per skipped attribute)

But it's wrong. The above index scan result is
* bound quals is NULL
* cost.num_sa_scans = 1

The logic with cost.num_sa_scans was definitely not what I intended.
That's fixed in v3, at least. But the code in btcostestimate is still
essentially the same as in earlier versions -- it needs to be
completely redesigned (or, uh, designed for the first time).

I know you said the below, but I'd like to know whether the above is expected or not.

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance to use pg_stats.n_distinct?

It probably makes sense to use pg_stats.n_distinct here. But how?

If the problem is that we're too pessimistic, then I think that this
will usually (though not always) make us more pessimistic. Isn't that
the wrong direction to go in? (We're probably also too optimistic in
some cases, but being too pessimistic is a bigger problem in
practice.)

For example, your test case involved 11 distinct values in each
column. The current approach of hard-coding 10 (which is just a
temporary hack) should actually make the scan look a bit cheaper than
it would if we used the true ndistinct.

Another underlying problem is that the existing SAOP costing really
isn't very accurate, without skip scan -- that's a big source of the
pessimism with arrays/skipping. Why should we be able to get the true
number of primitive index scans just by multiplying together each
omitted prefix column's ndistinct? That approach is good for getting
the worst case, which is probably relevant -- but it's probably not a
very good assumption for the average case. (Though at least we can cap
the total number of primitive index scans to 1/3 of the total number
of pages in the index in btcostestimate, since we have guarantees
about the worst case as of Postgres 17.)

--
Peter Geoghegan

Attachments:

v3-0001-Add-skip-scan-to-nbtree.patch (+2293, -183)
In reply to: Peter Geoghegan (#12)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v3, which generalizes skip scan, allowing it to work with
opclasses/types that lack a skip support routine. In other words, v3
makes skip scan work for all types, including continuous types, where
it's impractical or infeasible to add skip support.

Attached is v4, which:

* Fixes a previous FIXME item affecting range skip scans/skip arrays
used in cross-type scenarios.

* Refactors and simplifies the handling of range inequalities
associated with skip arrays more generally. We now always use
inequality scan keys during array advancement (and when descending the
tree within _bt_first), rather than trying to use a datum taken from
the range inequality as an array element directly.

This gives us cleaner separation between scan keys/data types in
cross-type scenarios: skip arrays will now only ever contain
"elements" of opclass input type. Sentinel values such as -inf are
expanded to represent "the lowest possible value that comes after the
array's low_compare lower bound, if any". Opclasses that don't offer
skip support took roughly this same approach within v3, but in v4 all
opclasses do it the same way (so opclasses with skip support use the
SK_BT_NEG_INF sentinel marking in their scan keys, though never the
SK_BT_NEXTKEY sentinel marking).

This is really just a refactoring revision. Nothing particularly
exciting here compared to v3.

--
Peter Geoghegan

Attachments:

v4-0001-Add-skip-scan-to-nbtree.patch (+2237, -205)
#14Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Wed, Jun 26, 2024 at 03:16:07PM GMT, Peter Geoghegan wrote:

Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).

I see that the patch is not supposed to deal with aggregates in any special
way. But from what I understand after a quick review, skip scan is not getting
applied to them if there are no quals in the query (in that case
_bt_preprocess_keys returns before calling _bt_preprocess_array_keys). Yet such
queries could benefit from skipping; I assume they could still be handled by
the machinery introduced in this patch?

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance to use pg_stats.n_distinct?

It probably makes sense to use pg_stats.n_distinct here. But how?

If the problem is that we're too pessimistic, then I think that this
will usually (though not always) make us more pessimistic. Isn't that
the wrong direction to go in? (We're probably also too optimistic in
some cases, but being too pessimistic is a bigger problem in
practice.)

For example, your test case involved 11 distinct values in each
column. The current approach of hard-coding 10 (which is just a
temporary hack) should actually make the scan look a bit cheaper than
it would if we used the true ndistinct.

Another underlying problem is that the existing SAOP costing really
isn't very accurate, without skip scan -- that's a big source of the
pessimism with arrays/skipping. Why should we be able to get the true
number of primitive index scans just by multiplying together each
omitted prefix column's ndistinct? That approach is good for getting
the worst case, which is probably relevant -- but it's probably not a
very good assumption for the average case. (Though at least we can cap
the total number of primitive index scans to 1/3 of the total number
of pages in the index in btcostestimate, since we have guarantees
about the worst case as of Postgres 17.)

Do I understand correctly that the only way multiplying ndistincts could
produce too pessimistic results is when there is a correlation between distinct
values? Can one benefit from extended statistics here?

And while we're at it, I think it would be great if the implementation would
allow some level of visibility into the skip scan. From what I see, currently
it's by design impossible for users to tell whether something was skipped or
not. But when it comes to planning and estimates, maybe it's not a bad idea to
let explain analyze show something like "expected number of primitive scans /
actual number of primitive scans".

In reply to: Dmitry Dolgov (#14)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Sat, Aug 3, 2024 at 3:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I see that the patch is not supposed to deal with aggregates in any special
way.

Right.

But from what I understand after a quick review, skip scan is not getting
applied to them if there are no quals in the query (in that case
_bt_preprocess_keys returns before calling _bt_preprocess_array_keys).

Right.

Yet such queries could benefit from skipping; I assume they could still be handled by
the machinery introduced in this patch?

I'm not sure.

There are no real changes required inside _bt_advance_array_keys with
this patch -- skip arrays are dealt with in essentially the same way
as conventional arrays (as of Postgres 17). I suspect that loose index
scan would be best implemented using _bt_advance_array_keys. It could
also "plug in" to the existing _bt_advance_array_keys design, I
suppose.

As I touched on already, your loose index scan patch applies
high-level semantic information in a way that is very different to my
skip scan patch. This means that it makes revisions to the index AM
API (if memory serves it adds a callback called amskip to that API).
It also means that loose index scan can actually avoid heap accesses;
loose scans wholly avoid accessing logical rows (in both the index and
the heap) by reasoning that it just isn't necessary to do so at all.
Skipping happens in both data structures. Right?

Obviously, my new skip scan patch cannot possibly reduce the number of
heap page accesses required by a given index scan. Precisely the same
logical rows must be accessed as before. There is no two-way
conversation between the index AM and the table AM about which
rows/row groupings have at least one visible tuple. We're just
navigating through the index more efficiently, without changing any
contract outside of nbtree itself.

The "skip scan" name collision is regrettable. But the fact is that
Oracle, MySQL, and now SQLite all call this feature skip scan. That
feels like the right precedent to follow.

Do I understand correctly that the only way multiplying ndistincts could
produce too pessimistic results is when there is a correlation between distinct
values?

Yes, that's one problem with the costing. Not the only one, though.

The true number of primitive index scans depends on the cardinality of
the data. For example, a skip scan might be the cheapest plan by far
if (say) 90% of the index has the same leading column value and the
remaining 10% has totally unique values. We'd still do a bad job of
costing this query with an accurate ndistinct for the leading column.
We really only need to do one or two primitive index scans for "the
first 90% of the index", and one more primitive index scan for "the
remaining 10% of the index". For a query such as this, we "require a
full index scan for the remaining 10% of the index", which is
suboptimal, but doesn't fundamentally change anything (I guess that a
skip scan is always suboptimal, in the sense that you could always do
better by having more indexes).

Can one benefit from extended statistics here?

I really don't know. Certainly seems possible in cases with more than
one skipped leading column.

The main problem with the costing right now is that it's just not very
well thought through, in general. The performance at runtime depends
on the layout of values in the index itself, so the underlying way
that you'd model the costs doesn't have any great precedent in
costsize.c. We do have some idea of the number of leaf pages we'll
access in btcostestimate(), but that works in a way that isn't really
fit for purpose. It kind of works with one primitive index scan, but
works much less well with multiple primitive scans.

And while we're at it, I think it would be great if the implementation would
allow some level of visibility into the skip scan. From what I see, currently
it's by design impossible for users to tell whether something was skipped or
not. But when it comes to planning and estimates, maybe it's not a bad idea to
let explain analyze show something like "expected number of primitive scans /
actual number of primitive scans".

I agree. I think that that's pretty much mandatory for this patch. At
least the actual number of primitive scans should be exposed. Not
quite as sure about showing the estimated number, since that might be
embarrassingly wrong quite regularly, without it necessarily mattering
that much (I'd worry that it'd be distracting).

Displaying the number of primitive scans would already be useful for
index scans with SAOPs, even without this patch. The same general
concepts (estimated vs. actual primitive index scans) already exist,
as of Postgres 17. That's really nothing new.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#15)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Sat, Aug 3, 2024 at 6:14 PM Peter Geoghegan <pg@bowt.ie> wrote:

Displaying the number of primitive scans would already be useful for
index scans with SAOPs, even without this patch. The same general
concepts (estimated vs. actual primitive index scans) already exist,
as of Postgres 17. That's really nothing new.

We actually expose this via instrumentation, in a certain sense. This
is documented by a "Note":

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-INDEXES-VIEW

That is, we already say "Each internal primitive index scan increments
pg_stat_all_indexes.idx_scan, so it's possible for the count of index
scans to significantly exceed the total number of index scan executor
node executions". So, as I said in the last email, advertising the
difference between # of primitive index scans and # of index scan
executor node executions in EXPLAIN ANALYZE is already a good idea.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#13)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4

Attached is v5, which splits the code from v4 patch into 2 pieces --
it becomes 0002-* and 0003-*. Certain refactoring work now appears
under its own separate patch/commit -- see 0002-* (nothing new here,
except the commit message/patch structure). The patch that actually
adds skip scan (0003-* in this new version) has been further polished,
though not in a way that I think is interesting enough to go into
here.

The interesting and notable change for v5 is the addition of the code
in 0001-*. The new 0001-* patch is concerned with certain aspects of
how _bt_advance_array_keys decides whether to start another primitive
index scan (or to stick with the ongoing one for one more leaf page
instead). This is a behavioral change, albeit a subtle one. It's also
kinda independent of skip scan (more on why that is at the end).

It's easiest to explain why 0001-* matters by way of an example. My
example will show significantly more internal/root page accesses than
seen on master, though only when 0002-* and 0003-* are applied, and
0001-* is omitted. When all 3 v5 patches are applied together, the
total number of index pages accessed by the test query will match the
master branch. It's important that skip scan never loses by much to
the master branch, of course. Even when the details of the index/scan
are inconvenient to the implementation, in whatever way.

Setup:

create table demo (a int4, b numeric);
create index demo_idx on demo (a, b);
insert into demo select a, random() from generate_series(1, 10000) a,
generate_series(1,5) five_rows_per_a_val;
vacuum demo;

We now have a btree index "demo_idx", which has two levels (a root
page plus a leaf level). The root page contains several hundred pivot
tuples, all of which have their "b" value truncated away (or have the
value -inf, if you prefer), with just one prefix "a" column left in
place. Naturally, every leaf page has a high key with its own
separator key that matches one particular tuple that appears in the
root page (except for the rightmost leaf page). So our leaf level scan
will see lots of truncated leaf page high keys (all matching a
corresponding root page tuple).

Test query:

select a from demo where b > 0.99;

This is a query that really shouldn't be doing any skipping at all. We
nevertheless see a huge amount of skipping with this query once
0001-* is omitted. Prior to 0001-*, a new primitive index scan is
started whenever the scan reaches a "boundary" between adjoining leaf
pages. That is, whenever _bt_advance_array_keys stops on a high key
pstate.finaltup. So without the new 0001-* work, the number of page
accesses almost doubles (because we access the root page once per leaf
page accessed, instead of just accessing it once for the whole scan).
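The "almost doubles" arithmetic can be sketched with a toy model. This is purely illustrative (the function names and the exact counting are mine, not taken from the nbtree code), assuming a 2-level index with one root page and N leaf pages:

```python
# Toy model of total index page accesses for the "b > 0.99" query,
# assuming a 2-level index: 1 root page + N leaf pages.
# Illustrative only; not derived from the actual nbtree source.

def pages_full_scan(leaf_pages):
    # master / with 0001-*: descend through the root once, then walk
    # every leaf page via its right sibling link
    return 1 + leaf_pages

def pages_naive_skip(leaf_pages):
    # without 0001-*: a new primitive index scan at (almost) every leaf
    # page boundary means a fresh root page access per leaf page visited
    return leaf_pages + leaf_pages

print(pages_full_scan(270))   # 271
print(pages_naive_skip(270))  # 540 -- "almost doubles"
```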

What skip scan should have been doing all along (and will do now) is
to step forward to the next right sibling leaf page whenever it
reaches a boundary between leaf pages. This should happen again and
again, without our ever choosing to start a new primitive index scan
instead (it shouldn't happen even once with this query). In other
words, we ought to behave just like a full index scan would behave
with this query -- which is exactly what we get on master.

The scan will still nominally "use skip scan" even with this fix in
place, but in practice, for this particular query/index, the scan
won't ever actually decide to skip. So it at least "looks like" an
index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There
is a separate question of how many CPU cycles we use to do all this,
but for now my focus is on total pages accessed by the patch versus on
master, especially for adversarial cases such as this.

It should be noted that the skip scan patch never had any problems
with this very similar query (same table as before):

select a from demo where b < 0.01;

The fact that we did the wrong thing for the first query, but the
right thing for this second similar query, was solely due to certain
accidental implementation details -- it had nothing to do with the
fundamentals of the problem. You might even say that 0001-* makes the
original "b > 0.99" case behave in the same manner as this similar "b
< 0.01" case, which is justifiable on consistency grounds. Why
wouldn't these two cases behave similarly? It's only logical.

The underlying problem arguably has little to do with skip scan;
whether we use a real SAOP array on "a" or a consed up skip array is
incidental to the problem that my example highlights. As always, the
underlying "array type" (skip vs. SAOP) only matters to the lowest-
level code. And so technically, this is an existing issue on
HEAD/master. You can see that for yourself by making the problematic
query's qual "where a = any ('every possible a value') and b > 0.99"
-- same problem on Postgres 17, without involving skip scan.
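For reference, one way to write such a qual against the same "demo" table would be something along these lines (a sketch only -- the subquery-built array is just one convenient way to enumerate "every possible 'a' value"):

```sql
-- Hypothetical Postgres 17 reproduction, no skip scan involved:
-- an explicit SAOP array of every "a" value stands in for the skip array
select a from demo
where a = any (array(select distinct a from demo))
  and b > 0.99;
```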

To be sure, the underlying problem does become more practically
relevant with the invention of skip arrays for skip scan, but 0001-*
can still be treated as independent work. It can be committed well
ahead of the other stuff IMV. The same is likely also true of the
refactoring now done in 0002-* -- it does refactoring that makes
sense, even without skip scan. And so I don't expect it to take all
that long for it to be committable.

--
Peter Geoghegan

Attachments:

v5-0001-Normalize-nbtree-truncated-high-key-array-behavio.patch (+97-54)
v5-0002-Refactor-handling-of-nbtree-array-redundancies.patch (+83-85)
v5-0003-Add-skip-scan-to-nbtree.patch (+2175-135)
In reply to: Peter Geoghegan (#12)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote:

I found the cost is estimated to be much higher if the number of skipped
attributes is more than two. Is this expected behavior?

Yes and no.

Honestly, the current costing is just placeholder code. It is totally
inadequate. I'm not surprised that you found problems with it. I just
didn't put much work into it, because I didn't really know what to do.

Attached is v6, which finally does something sensible in btcostestimate.

v6 is also the first version that supports parallel index scans that
can skip. This works by extending the approach taken by scans with
regular SAOP arrays to work with skip arrays. We need to serialize and
deserialize the current array keys in shared memory, as datums -- we
cannot just use simple BTArrayKeyInfo.cur_elem offsets with skip
arrays.

v6 also includes the patch that shows "Index Searches" in EXPLAIN
ANALYZE output, just because it's convenient when testing the patch.
This has been independently submitted as
https://commitfest.postgresql.org/49/5183/, so probably doesn't need
review here.

v6 is the first version of the patch that is basically feature
complete. I only have one big open item left: I must still fix certain
regressions seen with queries that are very unfavorable for skip scan,
where the CPU cost (but not I/O cost) of maintaining skip arrays slows
things down. Overall, I'm making fast progress here.

Back to the topic of the btcostestimate/planner changes. The rest of
the email is a discussion of the cost model.

The planner changes probably still have some problems, but all of the
obvious problems have been fixed by v6. I found it useful to focus on
making the cost model not have any obvious problems instead of trying
to make it match a purely theoretical ideal. For example, your
(Ikeda-san's) complaint about the "Index Scan using idx_id1_id2_id3 on
public.test" test case having too high a cost (higher than the cost of
a slower sequential scan) has been fixed. It's now about 3x cheaper
than the sequential scan, since we're actually paying attention to
ndistinct in v6.

Just like when we cost SAOP arrays on HEAD, skip arrays are costed by
pessimistically multiplying together the estimated number of array
elements for all the scan's arrays, without trying to account for
correlation between index columns. Being pessimistic about
correlations like this is often wrong, but that still seems like the
best bias we could have, all things considered. Plus it's nothing new.
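The multiplication described above can be sketched in a few lines. This is an illustrative model (the function name is made up; it is not the btcostestimate code):

```python
# Sketch of the pessimistic "multiply the array sizes together" costing
# described above. Illustrative only; not the actual btcostestimate code.

def estimated_primitive_scans(array_elem_counts):
    # Pessimistically assume no correlation between index columns:
    # every combination of array elements may need its own index descent.
    n = 1
    for count in array_elem_counts:
        n *= count
    return n

# e.g. a skip array on "a" with ndistinct = 100, combined with a
# 3-element IN() list on "b":
print(estimated_primitive_scans([100, 3]))  # 300
```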

Range style skip arrays require a slightly more complicated approach
to estimating the number of array elements: costing applies a
selectivity estimate, taken from the associated index column's
inequality keys, and applies that estimate to ndistinct itself. That
way the cost of a range skip array is lower than that of an
otherwise-equivalent simple skip array (we prorate ndistinct for range
skip arrays). More importantly, the cost of more selective ranges is
lower than the cost of less selective ranges. There is also a bias
here: we don't account for skew in ndistinct. That's probably OK,
because at least it's a bias *against* skip scan.
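The proration described above amounts to something like the following sketch (names and the clamp-to-one detail are my own illustration, not the patch's code):

```python
# Sketch of the range skip array estimate: the inequality keys'
# selectivity prorates ndistinct, so narrower ranges get fewer estimated
# array elements, and thus a lower cost. Illustrative only.

def range_skip_array_elems(ndistinct, range_selectivity):
    # Clamp to at least one element: the scan always has some array key.
    return max(1, round(ndistinct * range_selectivity))

print(range_skip_array_elems(10000, 0.01))  # 100
print(range_skip_array_elems(10000, 0.50))  # 5000 -- less selective, costlier
```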

The new cost model does not specifically try to account for how scans
will behave when no skipping should be expected at all -- cases where
a so-called "skip scan" degenerates into a full index scan. In theory,
we should be costing these scans the same as before, since there has
been no change in runtime behavior. Overall, the cost of full index
scans with very many distinct prefix column values goes down by quite
a bit -- the cost is something like 1/3 lower in typical cases.

The problem with preserving the cost model from HEAD for these
unfavorable cases for skip scan is that I don't feel that I understand
the existing behavior. In practice the revised costing seems to be a
somewhat more accurate predictor of the actual runtime of queries.
Another problem is that I can't see a good way to make the behavior
continuous when ndistinct starts small and grows so large that we
should expect a true full index scan. (As I mentioned at the start of
this email, there are unfixed regressions for these unfavorable cases,
so I'm basing this analysis on the "set skipscan_prefix_cols = 0"
behavior rather than the current default patch behavior to correct for
that. This behavior matches HEAD with a full index scan, and should
match the default behavior in a future version of the skip scan
patch.)

--
Peter Geoghegan

Attachments:

v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch (+198-48)
v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch (+95-54)
v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch (+82-85)
v6-0004-Add-skip-scan-to-nbtree.patch (+2630-301)
#19 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#18)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Hi,

I started looking at this patch today. The first thing I usually do for
new patches is a stress test, so I did a simple script that generates
random table and runs a random query with IN() clause with various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quick.

I've seen a bunch of those cases, so it's not a particularly unlikely
issue. The backtraces look pretty much the same in all cases - the
processes are stuck either waiting on the conditional variable in
_bt_parallel_seize, or trying to send data in shm_mq_send_bytes.

Attached is the script I use for stress testing (pretty dumb, just a
bunch of loops generating tables + queries), and backtraces for two
lockups (one is EXPLAIN ANALYZE, but otherwise exactly the same).

I haven't investigated why this is happening, but I wonder if this might
be similar to the parallel hashjoin issues, with trying to send data,
but the receiver being unable to proceed and effectively waiting on
the sender. But that's just a wild guess.

regards

--
Tomas Vondra

Attachments:

lockup2.log
lockup.log
run.sh
In reply to: Tomas Vondra (#19)
Re: Adding skip scan (including MDAM style range skip scan) to nbtree

On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas@vondra.me> wrote:

I started looking at this patch today.

Thanks for taking a look!

The first thing I usually do for
new patches is a stress test, so I did a simple script that generates
random table and runs a random query with IN() clause with various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quick.

I can reproduce this locally, without too much difficulty.
Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
in my commit 5bf748b8.

The timing of this is slightly unfortunate. There's only a few weeks
until the release of 17, plus I have to travel for work over the next
week. I won't be back until the 16th, and will have limited
availability between then and now. I think that I'll have ample time
to debug and fix the issue ahead of the release of 17, though.

Looks like the problem is a parallel index scan with SAOP array keys
can find itself in a state where every parallel worker waits for the
leader to finish off a scheduled primitive index scan, while the
leader itself waits for the scan's tuple queue to return more tuples.
Obviously, the query will effectively go to sleep indefinitely when
that happens (unless and until the DBA cancels the query). This is
only possible with just the right/wrong combination of array keys and
index cardinality.

I cannot recreate the problem with parallel_leader_participation=off,
which strongly suggests that leader participation is a factor. I'll
find time to study this in detail as soon as I can.

Further background: I was always aware of the leader's tendency to go
away forever shortly after the scan begins. That was supposed to be
safe, since we account for it by serializing the scan's current array
keys in shared memory, at the point a primitive index scan is
scheduled -- any backend should be able to pick up where any other
backend left off, no matter how primitive scans are scheduled. That
now doesn't seem to be completely robust, likely due to restrictions
on when and how other backends can pick up the scheduled work from
within _bt_first, at the point that it calls _bt_parallel_seize.

In short, one or two details of how backends call _bt_parallel_seize
to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.

--
Peter Geoghegan
