Use of additional index columns in rows filtering

Started by Maxim Ivanov · about 3 years ago · 82 messages · pgsql-hackers
#1Maxim Ivanov
hi@yamlcoder.me

Hi All,

I'd like to report what seems to be a missing optimization opportunity or understand why it is not possible to achieve.

TLDR; additional index column B specified in CREATE INDEX .. (A) INCLUDE(B) is not used to filter rows in queries like WHERE B = $1 ORDER BY A during IndexScan. https://dbfiddle.uk/iehtq44L

Take following database:

CREATE TABLE t(
a integer NOT NULL,
b integer NOT NULL,
d integer NOT NULL
);

CREATE UNIQUE INDEX t_a_include_b ON t (a) INCLUDE (b);
-- I'd expect index above to behave as index below for the purpose
-- of this query
-- CREATE UNIQUE INDEX ON t(a,b);

INSERT INTO t
SELECT random() * 100000000 as a,
       random() * 3 as b,
       generate_series as d
FROM generate_series(1,200000)
ON CONFLICT DO NOTHING;

If we filter on the `a` and `b` columns while scanning the index created as `(a) INCLUDE (b)`, it seems to fetch tuples from the heap to check the condition `b = 4`, despite both columns being available in the index:

SELECT * FROM t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;

Here is the plan (notice high "shared hit"):

Limit (cost=0.42..10955.01 rows=1 width=12) (actual time=84.283..84.284 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=198307
   -> Index Scan using t_a_include_b on public.t (cost=0.42..10955.01 rows=1 width=12) (actual time=84.280..84.281 rows=0 loops=1)
         Output: a, b, d
         Index Cond: (t.a > 1000000)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197805
         Buffers: shared hit=198307
Planning:
   Buffers: shared hit=30
Planning Time: 0.201 ms
Execution Time: 84.303 ms

And here is the plan with index on (a,b).

Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=613
   -> Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
         Output: a, b, d
         Index Cond: ((t.a > 1000000) AND (t.b = 4))
         Buffers: shared hit=613
Planning:
   Buffers: shared hit=41
Planning Time: 0.314 ms
Execution Time: 6.910 ms

Because the query doesn't sort on `b` but only filters on it while sorting on `a`, I'd expect the indexes `(a) INCLUDE (b)` and `(a,b)` to behave exactly the same for this particular query.

Interestingly, IndexOnlyScan is capable of using additional columns to filter rows without fetching them from the heap, but only for visible tuples:

VACUUM FREEZE t;
SELECT a,b FROM t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;

Limit (cost=0.42..6619.76 rows=1 width=8) (actual time=18.479..18.480 rows=0 loops=1)
   Output: a, b
   Buffers: shared hit=662
   -> Index Only Scan using t_a_include_b on public.t (cost=0.42..6619.76 rows=1 width=8) (actual time=18.477..18.477 rows=0 loops=1)
         Output: a, b
         Index Cond: (t.a > 1000000)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197771
         Heap Fetches: 0
         Buffers: shared hit=662

Removing the VACUUM makes it behave like the IndexScan above and fetch candidate tuples from the heap, all while returning zero rows in the result.

To make the query plans comparable I had to force an index scan in both cases with:

SET enable_bitmapscan to off;
SET enable_seqscan to off;
SET max_parallel_workers_per_gather = 0;

A self-contained, fully reproducible example is at https://dbfiddle.uk/iehtq44L

Regards,
Maxim

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Maxim Ivanov (#1)
Re: Use of additional index columns in rows filtering

On 2/15/23 09:57, Maxim Ivanov wrote:

Hi All,

I'd like to report what seems to be a missing optimization
opportunity or understand why it is not possible to achieve.

TLDR; additional index column B specified in CREATE INDEX .. (A)
INCLUDE(B) is not used to filter rows in queries like WHERE B = $1
ORDER BY A during IndexScan. https://dbfiddle.uk/iehtq44L

...

Here is the plan (notice high "shared hit"):

Limit (cost=0.42..10955.01 rows=1 width=12) (actual time=84.283..84.284 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=198307
   -> Index Scan using t_a_include_b on public.t (cost=0.42..10955.01 rows=1 width=12) (actual time=84.280..84.281 rows=0 loops=1)
         Output: a, b, d
         Index Cond: (t.a > 1000000)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197805
         Buffers: shared hit=198307
Planning:
   Buffers: shared hit=30
Planning Time: 0.201 ms
Execution Time: 84.303 ms

Yeah. The reason for this behavior is pretty simple:

1) When matching clauses to indexes in match_clause_to_index(), we only
look at key columns (nkeycolumns). We'd need to check all columns
(ncolumns) and remember whether the clause matched a key column or an
included one.

2) index_getnext_slot would need to get "candidate" TIDs using
conditions on keys, and then check the clauses on included
columns.

Seems doable, unless I'm missing some fatal issue.
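A rough sketch of change 1), with hypothetical names rather than actual planner code, just to make the split concrete:

```python
# Sketch (hypothetical names, not PostgreSQL source): walk all index
# columns, not just the first nkeycolumns, and remember whether each
# clause matched a key column (usable as an index scan key) or an
# INCLUDE column (only checkable as a filter on the index tuple).
def classify_clauses(clauses, index_columns, nkeycolumns):
    index_quals, index_filters = [], []
    for clause in clauses:
        for i, col in enumerate(index_columns):
            if clause["column"] == col:
                (index_quals if i < nkeycolumns else index_filters).append(clause)
                break
    return index_quals, index_filters
```

For the index `(a) INCLUDE (b)` the columns list is `["a", "b"]` with `nkeycolumns = 1`, so a clause on `a` lands in the scan keys and a clause on `b` becomes an index-tuple filter.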

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#2)
Re: Use of additional index columns in rows filtering

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 2/15/23 09:57, Maxim Ivanov wrote:

TLDR; additional index column B specified in CREATE INDEX .. (A)
INCLUDE(B) is not used to filter rows in queries like WHERE B = $1
ORDER BY A during IndexScan. https://dbfiddle.uk/iehtq44L

Seems doable, unless I'm missing some fatal issue.

Partly this is lack of round tuits, but there's another significant
issue: there very likely are index entries corresponding to dead heap
rows. Applying random user-defined quals to values found in such rows
could produce semantic anomalies; for example, divide-by-zero failures
even though you deleted all the rows having a zero in that column.
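The hazard can be modeled in miniature (plain Python, not PostgreSQL code; the qual and data are made up):

```python
# Toy model of the hazard: the index may still hold entries for deleted
# heap rows, so evaluating an arbitrary qual such as 100 / b > 0 before
# the heap liveness check can raise an error on data the user deleted.
index_entries = [
    {"b": 0, "live": False},  # deleted heap row, entry still in the index
    {"b": 5, "live": True},
]

def scan_qual_first(entries):
    # unsafe order: the qual runs on every index entry, dead or not,
    # so the dead b = 0 entry raises ZeroDivisionError
    return [e for e in entries if 100 / e["b"] > 0]

def scan_liveness_first(entries):
    # current behavior: visibility is checked before the qual runs
    return [e for e in entries if e["live"] and 100 / e["b"] > 0]
```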

This isn't a problem for operators found in operator families, because
we trust those to not have undesirable side effects like raising
data-dependent errors. But it'd be an issue if we started to apply
quals that aren't index quals directly to index rows before doing
the heap liveness check. (And, of course, once you've fetched the
heap row there's no point in having a special path for columns
available from the index.)

regards, tom lane

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: Use of additional index columns in rows filtering

On 2/15/23 16:18, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 2/15/23 09:57, Maxim Ivanov wrote:

TLDR; additional index column B specified in CREATE INDEX .. (A)
INCLUDE(B) is not used to filter rows in queries like WHERE B = $1
ORDER BY A during IndexScan. https://dbfiddle.uk/iehtq44L

Seems doable, unless I'm missing some fatal issue.

Partly this is lack of round tuits, but there's another significant
issue: there very likely are index entries corresponding to dead heap
rows. Applying random user-defined quals to values found in such rows
could produce semantic anomalies; for example, divide-by-zero failures
even though you deleted all the rows having a zero in that column.

This isn't a problem for operators found in operator families, because
we trust those to not have undesirable side effects like raising
data-dependent errors. But it'd be an issue if we started to apply
quals that aren't index quals directly to index rows before doing
the heap liveness check. (And, of course, once you've fetched the
heap row there's no point in having a special path for columns
available from the index.)

Sure, but we can do the same VM check as index-only scan, right?

That would save some of the I/O to fetch the heap tuple, as long as the
page is all-visible and the filter eliminates the tuples. It makes the
costing a bit trickier, because it needs to consider both how many pages
are all-visible and selectivity of the condition.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Maxim Ivanov
hi+postgresql@yamlcoder.me
In reply to: Tom Lane (#3)
Re: Use of additional index columns in rows filtering

This isn't a problem for operators found in operator families, because
we trust those to not have undesirable side effects like raising
data-dependent errors. But it'd be an issue if we started to apply
quals that aren't index quals directly to index rows before doing
the heap liveness check. (And, of course, once you've fetched the
heap row there's no point in having a special path for columns
available from the index.)

Assuming operators are pure and don't have global side effects, is it possible to ignore any error during that check? If the tuple is not visible, it shouldn't matter; if it is visible, the error will be reported by the same routine that does the filtering now (ExecQual?).

If not, then limiting this optimization to builtin ops is something I can live with :)

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#4)
Re: Use of additional index columns in rows filtering

Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

1) scan the index using the scan keys (as before)

2) if the heap page is all-visible, we check the new filters that can
be evaluated on the index tuple

3) fetch the heap tuple and evaluate the filters

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.
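The per-tuple flow can be modeled roughly like this (a simplified sketch with assumed names, not the patch itself):

```python
def index_next(tids, index_tuples, vm_all_visible, heap, index_filters, filters):
    """Simplified model (assumed names, not the actual patch) of the
    modified IndexNext loop: check the visibility map and evaluate the
    index-only filters on the index tuple before paying for a heap fetch."""
    results = []
    for tid in tids:                        # 1) scan the index via scan keys
        if vm_all_visible(tid):             # 2) page all-visible: the index
            itup = index_tuples[tid]        #    tuple is trustworthy, so the
            if not index_filters(itup):     #    index filters can run on it
                continue                    #    -> heap fetch avoided
        htup = heap.get(tid)                # 3) fetch the heap tuple...
        if htup is not None and filters(htup):
            results.append(htup)            #    ...and evaluate the filters
    return results
```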

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.

With the patch applied, the query plan changes from:

QUERY PLAN
-------------------------------------------------------------------
Limit (cost=0.42..10929.89 rows=1 width=12)
(actual time=94.649..94.653 rows=0 loops=1)
Buffers: shared hit=197575 read=661
-> Index Scan using t_a_include_b on t
(cost=0.42..10929.89 rows=1 width=12)
(actual time=94.646..94.647 rows=0 loops=1)
Index Cond: (a > 1000000)
Filter: (b = 4)
Rows Removed by Filter: 197780
Buffers: shared hit=197575 read=661
Planning Time: 0.091 ms
Execution Time: 94.674 ms
(9 rows)

to

QUERY PLAN
-------------------------------------------------------------------
Limit (cost=0.42..3662.15 rows=1 width=12)
(actual time=13.663..13.667 rows=0 loops=1)
Buffers: shared hit=544
-> Index Scan using t_a_include_b on t
(cost=0.42..3662.15 rows=1 width=12)
(actual time=13.659..13.660 rows=0 loops=1)
Index Cond: (a > 1000000)
Index Filter: (b = 4)
Rows Removed by Index Recheck: 197780
Filter: (b = 4)
Buffers: shared hit=544
Planning Time: 0.105 ms
Execution Time: 13.690 ms
(10 rows)

which is much closer to the "best" case:

QUERY PLAN
-------------------------------------------------------------------
Limit (cost=0.42..4155.90 rows=1 width=12)
(actual time=10.152..10.156 rows=0 loops=1)
Buffers: shared read=543
-> Index Scan using t_a_b_idx on t
(cost=0.42..4155.90 rows=1 width=12)
(actual time=10.148..10.150 rows=0 loops=1)
Index Cond: ((a > 1000000) AND (b = 4))
Buffers: shared read=543
Planning Time: 0.089 ms
Execution Time: 10.176 ms
(7 rows)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-evaluate-filters-on-the-index-tuple-when-po-20230608.patch (text/x-patch; charset=UTF-8) +547 −26
#7James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#6)
Re: Use of additional index columns in rows filtering

Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be certain this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

1) scan the index using the scan keys (as before)

2) if the heap page is all-visible, we check the new filters that can
be evaluated on the index tuple

3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuples were filtered by each set of filters.

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternative caching on IndexOptInfo), but
match_filter_to_index() calling check_index_filter() results in
constructing a bitmapset of index columns for every possible filter,
which seems wasteful (I recognize this is a bit of a proof-of-concept
level v1).

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

With the patch applied, the query plan changes from:

...

to

QUERY PLAN
-------------------------------------------------------------------
Limit (cost=0.42..3662.15 rows=1 width=12)
(actual time=13.663..13.667 rows=0 loops=1)
Buffers: shared hit=544
-> Index Scan using t_a_include_b on t
(cost=0.42..3662.15 rows=1 width=12)
(actual time=13.659..13.660 rows=0 loops=1)
Index Cond: (a > 1000000)
Index Filter: (b = 4)
Rows Removed by Index Recheck: 197780
Filter: (b = 4)
Buffers: shared hit=544
Planning Time: 0.105 ms
Execution Time: 13.690 ms
(10 rows)

...

I did also confirm that this properly identifies cases Jeff had
mentioned to me like "Index Filter: (((a * 2) > 500000) AND ((b % 10)
= 4))".

I noticed also you still had questions/TODOs about handling index
scans for join clauses.

Regards,
James Coleman

1: /messages/by-id/20230609000600.syqy447e6metnvyj@awork3.anarazel.de

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#7)
Re: Use of additional index columns in rows filtering

On 6/21/23 14:45, James Coleman wrote:

Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be certain this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

I agree it seems a bit weird to have IOS as a separate node. In a way, I
think there are two dimensions for "index-only" scans - which pages can
be scanned like that, and which clauses can be evaluated with only the
index tuple. The current approach focuses on page visibility, but
ignores the other aspect entirely. Or more precisely, it disables IOS
entirely as soon as there's a single condition requiring the heap tuple.

I agree it's probably better to see this as a single node with various
optimizations that can be applied when possible / efficient (based on
planner and/or dynamically).

I'm not sure I see a direct link to the prefetching patch, but it's true
that needs to deal with tids (instead of slots), just like IOS. So if
the node worked with tids, maybe the prefetching could be done at that
level (which I now realize may be what Andres meant by doing prefetching
in the executor).

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

1) scan the index using the scan keys (as before)

2) if the heap page is all-visible, we check the new filters that can
be evaluated on the index tuple

3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuples were filtered by each set of filters.

Yeah, separate instrumentation counters would be useful. What I was
talking about was more about the conditions itself, because right now we
re-evaluate the index-only clauses on the heap tuple.

Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
the patch splits this into two lists:

index-only clauses: (a=1)
clauses: (a=1) AND (b=2)

So we evaluate (a=1) first, and then we fetch the heap tuple and check
"clauses" again, which however includes (a=1) again. For cheap clauses
(or when (a=1) eliminated a lot of tuples using just the index) that's
not a big deal, but for expensive clauses it might hurt.

It's fixable, we'd just need to keep two versions of the "clauses" list,
one for IOS mode (when index-only clauses were checked) and a complete
one when we need to check all clauses.
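A minimal sketch of that two-list idea (illustrative only, not the patch):

```python
# Sketch of the fix described above: keep two clause lists so the
# index-only clauses already checked on the index tuple are not
# re-evaluated on the heap tuple.
def split_filter_lists(index_only_clauses, all_clauses):
    # list used when the index filters already passed (IOS-style path):
    remaining = [c for c in all_clauses if c not in index_only_clauses]
    # complete list, used when the page wasn't all-visible and the
    # index filters could not be evaluated at all:
    return remaining, list(all_clauses)
```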

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternative caching on IndexOptInfo), but
match_filter_to_index() calling check_index_filter() results in
constructing a bitmapset of index columns for every possible filter,
which seems wasteful (I recognize this is a bit of a proof-of-concept
level v1).

Probably, I'm sure there's a couple other places where the current API
was a bit cumbersome and we could optimize.

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

Actually, I realized maybe we should switch this to index_getnext_tid()
because of the prefetching patch. That would allow us to introduce a
"buffer" of TIDs, populated by the index_getnext_tid(), and then do
prefetching based on that. It's similar to what bitmap scans do, except
that instead of the tbm iterator we get items from index_getnext_tid().

I haven't tried implementing this yet, but I kinda like the idea as it
works no matter what exactly the AM does (i.e. it'd work even for cases
like GiST with distance searches).
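A minimal sketch of such a TID buffer (illustrative only; `index_getnext_tid` and the `prefetch` callback are stand-ins for whatever the AM and patch actually provide):

```python
from collections import deque

def tid_stream(index_getnext_tid, prefetch, distance=8):
    """Illustrative sketch (not the prefetching patch itself): buffer
    TIDs coming from index_getnext_tid() and issue prefetch hints up to
    `distance` ahead of consumption, similar to how bitmap scans drive
    their tbm iterator."""
    buf = deque()
    exhausted = False
    while True:
        # top up the buffer, prefetching each heap page as it is queued
        while not exhausted and len(buf) < distance:
            tid = index_getnext_tid()
            if tid is None:
                exhausted = True
                break
            prefetch(tid)      # hint the heap page before it is needed
            buf.append(tid)
        if not buf:
            return
        yield buf.popleft()    # hand the TID to the scan for the heap fetch
```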

With the patch applied, the query plan changes from:

...

to

QUERY PLAN
-------------------------------------------------------------------
Limit (cost=0.42..3662.15 rows=1 width=12)
(actual time=13.663..13.667 rows=0 loops=1)
Buffers: shared hit=544
-> Index Scan using t_a_include_b on t
(cost=0.42..3662.15 rows=1 width=12)
(actual time=13.659..13.660 rows=0 loops=1)
Index Cond: (a > 1000000)
Index Filter: (b = 4)
Rows Removed by Index Recheck: 197780
Filter: (b = 4)
Buffers: shared hit=544
Planning Time: 0.105 ms
Execution Time: 13.690 ms
(10 rows)

...

I did also confirm that this properly identifies cases Jeff had
mentioned to me like "Index Filter: (((a * 2) > 500000) AND ((b % 10)
= 4))".

Good!

I noticed also you still had questions/TODOs about handling index
scans for join clauses.

Not sure which questions/TODOs you refer to, but I don't recall any
issues with join clauses. But maybe I just forgot.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#8)
Re: Use of additional index columns in rows filtering

On Wed, Jun 21, 2023 at 11:28 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/21/23 14:45, James Coleman wrote:

Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be certain this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

I agree it seems a bit weird to have IOS as a separate node. In a way, I
think there are two dimensions for "index-only" scans - which pages can
be scanned like that, and which clauses can be evaluated with only the
index tuple. The current approach focuses on page visibility, but
ignores the other aspect entirely. Or more precisely, it disables IOS
entirely as soon as there's a single condition requiring the heap tuple.

I agree it's probably better to see this as a single node with various
optimizations that can be applied when possible / efficient (based on
planner and/or dynamically).

I'm not sure I see a direct link to the prefetching patch, but it's true
that needs to deal with tids (instead of slots), just like IOS. So if
the node worked with tids, maybe the prefetching could be done at that
level (which I now realize may be what Andres meant by doing prefetching
in the executor).

The link to prefetching is that IOS (as a separate node) won't benefit
from prefetching (I think) with your current prefetching patch (in the
case where the VM doesn't allow us to just use the index tuple),
whereas if the nodes were combined that would more naturally be
composable.

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

1) scan the index using the scan keys (as before)

2) if the heap page is all-visible, we check the new filters that can
be evaluated on the index tuple

3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuples were filtered by each set of filters.

Yeah, separate instrumentation counters would be useful. What I was
talking about was more about the conditions itself, because right now we
re-evaluate the index-only clauses on the heap tuple.

Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
the patch splits this into two lists:

index-only clauses: (a=1)
clauses: (a=1) AND (b=2)

So we evaluate (a=1) first, and then we fetch the heap tuple and check
"clauses" again, which however includes (a=1) again. For cheap clauses
(or when (a=1) eliminated a lot of tuples using just the index) that's
not a big deal, but for expensive clauses it might hurt.

It's fixable, we'd just need to keep two versions of the "clauses" list,
one for IOS mode (when index-only clauses were checked) and a complete
one when we need to check all clauses.

In some cases (where the VM doesn't allow us to use just the index
tuple) we'd have to execute both lists against the heap tuple, right?

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternative caching on IndexOptInfo), but
match_filter_to_index() calling check_index_filter() results in
constructing a bitmapset of index columns for every possible filter,
which seems wasteful (I recognize this is a bit of a proof-of-concept
level v1).

Probably, I'm sure there's a couple other places where the current API
was a bit cumbersome and we could optimize.

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

Actually, I realized maybe we should switch this to index_getnext_tid()
because of the prefetching patch. That would allow us to introduce a
"buffer" of TIDs, populated by the index_getnext_tid(), and then do
prefetching based on that. It's similar to what bitmap scans do, except
that instead of the tbm iterator we get items from index_getnext_tid().

I haven't tried implementing this yet, but I kinda like the idea as it
works no matter what exactly the AM does (i.e. it'd work even for cases
like GiST with distance searches).

Oh, interesting, I'll let you keep chewing on that then.

With the patch applied, the query plan changes from:

...

to

                            QUERY PLAN
-------------------------------------------------------------------
 Limit  (cost=0.42..3662.15 rows=1 width=12)
        (actual time=13.663..13.667 rows=0 loops=1)
   Buffers: shared hit=544
   ->  Index Scan using t_a_include_b on t
           (cost=0.42..3662.15 rows=1 width=12)
           (actual time=13.659..13.660 rows=0 loops=1)
         Index Cond: (a > 1000000)
         Index Filter: (b = 4)
         Rows Removed by Index Recheck: 197780
         Filter: (b = 4)
         Buffers: shared hit=544
 Planning Time: 0.105 ms
 Execution Time: 13.690 ms
(10 rows)

...

I did also confirm that this properly identifies cases Jeff had
mentioned to me like "Index Filter: (((a * 2) > 500000) AND ((b % 10)
= 4))".

Good!

I noticed also you still had questions/TODOs about handling index
scans for join clauses.

Not sure which questions/TODOs you refer to, but I don't recall any
issues with join clauses. But maybe I just forgot.

I was referring to the comment:

+ * FIXME Maybe this should fill the filterset too?

above match_eclass_clauses_to_index()'s definition.

Regards,
James

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#9)
Re: Use of additional index columns in rows filtering

On 6/21/23 18:17, James Coleman wrote:

On Wed, Jun 21, 2023 at 11:28 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/21/23 14:45, James Coleman wrote:

Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be sure, this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

I agree it seems a bit weird to have IOS as a separate node. In a way, I
think there are two dimensions for "index-only" scans - which pages can
be scanned like that, and which clauses can be evaluated with only the
index tuple. The current approach focuses on page visibility, but
ignores the other aspect entirely. Or more precisely, it disables IOS
entirely as soon as there's a single condition requiring heap tuple.

I agree it's probably better to see this as a single node with various
optimizations that can be applied when possible / efficient (based on
planner and/or dynamically).

I'm not sure I see a direct link to the prefetching patch, but it's true
that it needs to deal with TIDs (instead of slots), just like IOS. So if
the node worked with tids, maybe the prefetching could be done at that
level (which I now realize may be what Andres meant by doing prefetching
in the executor).

The link to prefetching is that IOS (as a separate node) won't benefit
from prefetching (I think) with your current prefetching patch (in the
case where the VM doesn't allow us to just use the index tuple),
whereas if the nodes were combined that would more naturally be
composable.

Yeah, mostly. Although just unifying "regular" indexscans and IOS would
not allow prefetching for IOS.

The reason why the current design does not allow doing prefetching for
IOS is that the prefetching happens deep in indexam.c, and down there we
don't know which TIDs are not from all-visible pages and would need
prefetching. Which is another good reason to do the prefetching at the
executor level, I believe.

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

1) scan the index using the scan keys (as before)

2) if the heap page is all-visible, we check the new filters that can
be evaluated on the index tuple

3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuple were filtered by each set of filters.

Yeah, separate instrumentation counters would be useful. What I was
talking about was more about the conditions themselves, because right
now we re-evaluate the index-only clauses on the heap tuple.

Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
the patch splits this into two lists:

index-only clauses: (a=1)
clauses: (a=1) AND (b=2)

So we evaluate (a=1) first, and then we fetch the heap tuple and check
"clauses" again, which however includes (a=1) again. For cheap clauses
(or when (a=1) eliminated a lot of tuples using just the index) that's
not a big deal, but for expensive clauses it might hurt.

It's fixable, we'd just need to keep two versions of the "clauses" list,
one for IOS mode (when index-only clauses were checked) and a complete
one when we need to check all clauses.

In some cases (where the VM doesn't allow us to use just the index
tuple) we'd have to execute both lists against the heap tuple, right?

Not quite. I suspect you imagine we'd have two lists

L1: (a=1)
L2: (b=2)

and that for all-visible pages we'd check L1 on index, and then maybe L2
on heap. And for non-all-visible pages we'd check both on heap.

But that doesn't work, because L1 has references to attnums in the index
tuple, while L2 has attnums to heap.

So we'd need

L1: (a=1) -> against index tuple
L2: (b=2) -> against heap tuple
L3: (a=1) AND (b=2) -> against heap tuple

And for non-all-visible pages we'd only use L3. (I wonder if we could
check if the tuple is visible and then go back and check L1 on index
tuple, but I doubt it'd be really more efficient.)
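To make the three-list scheme concrete, here is a minimal sketch of the intended control flow (Python pseudologic with invented names like `scan` and `ctid` keys; this is not the actual patch code, just a model of the evaluation order described above):

```python
# Illustrative model of the three clause lists:
#   l1: index-only clauses, evaluated on the index tuple
#   l2: remaining clauses, evaluated on the heap tuple
#   l3: all clauses, evaluated on the heap tuple (non-all-visible pages)

def scan(tuples, all_visible, l1, l2, l3):
    """Each tuple is a dict carrying both index and heap attributes;
    all_visible maps a tuple's ctid to whether its heap page is all-visible."""
    results = []
    heap_fetches = 0
    for t in tuples:
        if all_visible[t["ctid"]]:
            # Check the index-only clauses first ...
            if not l1(t):
                continue            # rejected without any heap fetch
            heap_fetches += 1       # ... then fetch heap, check the rest
            if l2(t):
                results.append(t)
        else:
            heap_fetches += 1       # must fetch heap (visibility unknown)
            if l3(t):
                results.append(t)
    return results, heap_fetches

# WHERE (a = 1) AND (b = 2), index on t(a)
l1 = lambda t: t["a"] == 1
l2 = lambda t: t["b"] == 2
l3 = lambda t: t["a"] == 1 and t["b"] == 2
```

With all pages all-visible, every tuple failing (a=1) is rejected straight off the index tuple, costing no heap fetch; with no all-visible pages the scan degrades to evaluating L3 on the heap, exactly as a plain index scan would.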

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternatively caching on IndexOptInfo), because
match_filter_to_index() calling check_index_filter() constructs a
bitmapset of index columns for every possible filter, which seems
wasteful (I recognize this is a bit of a proof-of-concept
level v1).

Also, I'm sure there are a couple of other places where the current API
is a bit cumbersome and could be optimized.

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

Actually, I realized maybe we should switch this to index_getnext_tid()
because of the prefetching patch. That would allow us to introduce a
"buffer" of TIDs, populated by the index_getnext_tid(), and then do
prefetching based on that. It's similar to what bitmap scans do, except
that instead of the tbm iterator we get items from index_getnext_tid().

I haven't tried implementing this yet, but I kinda like the idea as it
works no matter what exactly the AM does (i.e. it'd work even for cases
like GiST with distance searches).
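The TID-buffer idea might look roughly like this (an illustrative sketch with invented names such as `tid_stream_with_prefetch`, not executor code): keep a small queue topped up from index_getnext_tid(), issue a prefetch for each TID as it enters the queue, and hand TIDs out one at a time:

```python
from collections import deque

def tid_stream_with_prefetch(getnext_tid, prefetch, distance=4):
    """Yield TIDs from getnext_tid() (a callable returning None at end of
    scan), keeping up to `distance` TIDs buffered and prefetched ahead of
    the TID currently being consumed."""
    buf = deque()
    while True:
        # Top up the buffer, prefetching each newly buffered TID
        # (in PostgreSQL terms: something like PrefetchBuffer() on
        # the TID's heap block).
        while len(buf) < distance:
            tid = getnext_tid()
            if tid is None:
                break
            prefetch(tid)
            buf.append(tid)
        if not buf:
            return
        yield buf.popleft()
```

The point of the shape is that it only consumes the AM's TID stream, so it works regardless of how the AM produces TIDs (ordered btree scans, GiST distance searches, etc.), much like how bitmap scans consume the tbm iterator.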

Oh, interesting, I'll let you keep chewing on that then.

Cool!

With the patch applied, the query plan changes from:

...

to

                            QUERY PLAN
-------------------------------------------------------------------
 Limit  (cost=0.42..3662.15 rows=1 width=12)
        (actual time=13.663..13.667 rows=0 loops=1)
   Buffers: shared hit=544
   ->  Index Scan using t_a_include_b on t
           (cost=0.42..3662.15 rows=1 width=12)
           (actual time=13.659..13.660 rows=0 loops=1)
         Index Cond: (a > 1000000)
         Index Filter: (b = 4)
         Rows Removed by Index Recheck: 197780
         Filter: (b = 4)
         Buffers: shared hit=544
 Planning Time: 0.105 ms
 Execution Time: 13.690 ms
(10 rows)

...

I did also confirm that this properly identifies cases Jeff had
mentioned to me like "Index Filter: (((a * 2) > 500000) AND ((b % 10)
= 4))".

Good!

I noticed also you still had questions/TODOs about handling index
scans for join clauses.

Not sure which questions/TODOs you refer to, but I don't recall any
issues with join clauses. But maybe I just forgot.

I was referring to the comment:

+ * FIXME Maybe this should fill the filterset too?

above match_eclass_clauses_to_index()'s definition.

Ah, yeah. I'm sure there are some loose ends in the matching code.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#10)
Re: Use of additional index columns in rows filtering

Hi,

here's a minor update of the patch, rebased to a current master and
addressing a couple issues reported by cfbot. Most are minor tweaks, but
the last one (4) is a somewhat more serious issue.

1) "tid" might not have been initialized in the IndexNext loop

2) add enable_indexonlyfilter GUC to postgresql.conf.sample (which is
checked by one regression test)

3) accepts a couple plan changes, either switching to index scan (thanks
to the costing changes) or showing the extra index-only filters in the
explain output. The plan changes seem reasonable.

4) problems with opcintype != opckeytype (name_ops)

While running the tests, I ran into an issue with name_ops, causing
failures for \dT and other catalog queries. The root cause is that
name_ops has opcintype = name, but opckeytype = cstring. The index-only
clauses are copied from the table, with Vars mutated to reference the
INDEX_VAR. But the type is not, so when we get to evaluating the
expressions, CheckVarSlotCompatibility() fails because the Var has name,
but the iss_IndexSlot (created with index tuple descriptor) has cstring.

The rebased patch fixes this by explicitly adjusting types of the
descriptor in ExecInitIndexScan().

However, maybe this indicates the very idea of evaluating expressions
using slot with index tuple descriptor is misguided. This made me look
at regular index-only scan (nodeIndexonlyscan.c), and that uses a slot
with the "table" structure, and instead of evaluating the expression on
the index tuple it expands the index tuple into the table slot.
Which is what StoreIndexTuple() does.

So maybe this should do what IOS does - expand the index tuple into
"table slot" and evaluate the expression on that. That'd also make the
INDEX_VAR tweak in createplan.c unnecessary - in fact, that seemed a bit
strange anyway, so ditching fix_indexfilter_mutator would be good.

However, I wonder if the stuff StoreIndexTuple() is doing is actually
safe. I mean, it's essentially copying values from the index tuple into
the slot, ignoring the type difference. What if opcintype and opckeytype
are not binary compatible? Is it possible to define an opclass with such
opckeytype? I haven't noticed any check enforcing such compatibility ...

Also, it's a bit confusing that the SGML docs say opckeytype is not
supported for btree, but name_ops clearly does that. Later I found it's
actually mentioned in pg_opclass.dat as a hack, to save space in catalogs.

But then btree also has amstorage=false ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-evaluate-filters-on-the-index-tuple-when-po-20230715.patch (text/x-patch, +624 -40)
#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#11)
Re: Use of additional index columns in rows filtering

On 7/15/23 16:20, Tomas Vondra wrote:

...

4) problems with opcintype != opckeytype (name_ops)

While running the tests, I ran into an issue with name_ops, causing
failures for \dT and other catalog queries. The root cause is that
name_ops has opcintype = name, but opckeytype = cstring. The index-only
clauses are copied from the table, with Vars mutated to reference the
INDEX_VAR. But the type is not, so when we get to evaluating the
expressions, CheckVarSlotCompatibility() fails because the Var has name,
but the iss_IndexSlot (created with index tuple descriptor) has cstring.

The rebased patch fixes this by explicitly adjusting types of the
descriptor in ExecInitIndexScan().

However, maybe this indicates the very idea of evaluating expressions
using slot with index tuple descriptor is misguided. This made me look
at regular index-only scan (nodeIndexonlyscan.c), and that uses a slot
with the "table" structure, and instead of evaluating the expression on
the index tuple it expands the index tuple into the table slot.
Which is what StoreIndexTuple() does.

So maybe this should do what IOS does - expand the index tuple into
"table slot" and evaluate the expression on that. That'd also make the
INDEX_VAR tweak in createplan.c unnecessary - in fact, that seemed a bit
strange anyway, so ditching fix_indexfilter_mutator would be good.

This kept bothering me, so I looked at it today, and reworked it to use
the IOS approach. It's a bit more complicated because for IOS both slots
have the same overall structure, except for the data types. But for
regular index scans that's not the case - the code has to "expand" the
index tuple into the larger "table slot". This works, and in general I
think the result is much cleaner - in particular, it means we don't need
to switch the Var nodes to reference the INDEX_VAR.
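The "expand into the table slot" approach (what StoreIndexTuple() does for IOS) amounts to scattering the index tuple's values into the wider heap-row layout, with non-indexed attributes left NULL. A minimal sketch (invented names, not the actual C code):

```python
def store_index_tuple(index_values, index_to_table_attno, table_natts):
    """Place each index column's value at its table attribute position
    (0-based attnos here for simplicity); table attributes not present
    in the index remain NULL (None)."""
    row = [None] * table_natts
    for idx_col, value in enumerate(index_values):
        row[index_to_table_attno[idx_col]] = value
    return row
```

For example, an index on (a, d) over a 4-column table maps index column 0 to table attno 0 and index column 1 to table attno 2, so the filter expressions can keep referencing plain table Vars with no INDEX_VAR rewriting.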

While experimenting with this I realized again that we're not matching
expressions to IOS. So if you have an expression index on (a+b), that
can't be used even if the query only uses this particular expression.
The same limitation applies to index-only filters, of course. It's not
the fault of this patch, but perhaps it'd be an interesting improvement.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-evaluate-filters-on-the-index-tuple-when-po-20230716.patch (text/x-patch, +580 -40)
#13Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#12)
Re: Use of additional index columns in rows filtering

Hi,

On Sun, 2023-07-16 at 22:36 +0200, Tomas Vondra wrote:

This kept bothering me, so I looked at it today, and reworked it to
use
the IOS approach.

Initial comments on patch 20230716:

* check_index_filter() already looks at "canreturn", which should mean
that you don't need to later check for opcintype<>opckeytype. But
there's a comment in IndexNext() indicating that's a problem -- under
what conditions is it a problem?

* (may be a matter of taste) Recomputing the bitmapset from the
canreturn array in check_index_filter() for each call seems awkward. I
would just iterate through the bitmapset and check that all are set
true in the amcanreturn array.

* There are some tiny functions that don't seem to add much value or
have slightly weird APIs. For instance, match_filter_to_index() could
probably just return a boolean, and maybe doesn't even need to exist
because it's such a thin wrapper over check_index_filter(). Similarly
for fix_indexfilter_clause(). I'm OK with tiny functions even if the
only value is a comment, but I didn't find these particularly helpful.

* fix_indexfilter_references() could use a better comment. Perhaps
refactor so that you can share code with fix_indexqual_references()?

* it looks like index filters are duplicated with ordinary filters, is
there a reason for that?

* I'm confused about the relationship of an IOS to an index filter. It
seems like the index filter only works for an ordinary index scan? Why
is that?

Regards,
Jeff Davis

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#13)
Re: Use of additional index columns in rows filtering

On 7/18/23 22:21, Jeff Davis wrote:

Hi,

On Sun, 2023-07-16 at 22:36 +0200, Tomas Vondra wrote:

This kept bothering me, so I looked at it today, and reworked it to
use
the IOS approach.

Initial comments on patch 20230716:

* check_index_filter() already looks at "canreturn", which should mean
that you don't need to later check for opcintype<>opckeytype. But
there's a comment in IndexNext() indicating that's a problem -- under
what conditions is it a problem?

The comment in IndexNext() is a bit obsolete. There was an issue when
using a slot matching the index, because then StoreIndexTuple might fail
because of type mismatch (as explained in [1]). But that's no longer an
issue, thanks to switching to the table slot in the last patch version.

* (may be a matter of taste) Recomputing the bitmapset from the
canreturn array in check_index_filter() for each call seems awkward. I
would just iterate through the bitmapset and check that all are set
true in the amcanreturn array.

check_index_filter() is a simplified version of check_index_only(), and
that calculates the bitmap this way.
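Jeff's suggested alternative (iterate the attribute set and test canreturn directly, instead of building a bitmapset per call) could be sketched like this, with invented names and plain Python containers standing in for Bitmapset and the amcanreturn array:

```python
def attrs_returnable(attrs_used, canreturn):
    """attrs_used: set of 0-based attribute numbers a filter references;
    canreturn: per-index-column booleans (as from amcanreturn).
    True iff every referenced attribute can be returned from the index,
    checked without materializing a second bitmapset."""
    return all(attno < len(canreturn) and canreturn[attno]
               for attno in attrs_used)
```

The observable behavior is the same as intersecting two bitmapsets; the difference is purely that nothing is allocated per filter clause.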

* There are some tiny functions that don't seem to add much value or
have slightly weird APIs. For instance, match_filter_to_index() could
probably just return a boolean, and maybe doesn't even need to exist
because it's such a thin wrapper over check_index_filter(). Similarly
for fix_indexfilter_clause(). I'm OK with tiny functions even if the
only value is a comment, but I didn't find these particularly helpful.

Yes, I agree some of this could be simplified. I only did the bare
minimum to get this bit working.

* fix_indexfilter_references() could use a better comment. Perhaps
refactor so that you can share code with fix_indexqual_references()?

I don't think this can share code with fix_indexqual_references(),
because that changes the Var nodes to point to the index (because it
then gets translated to scan keys). The filters don't need that.

* it looks like index filters are duplicated with ordinary filters, is
there a reason for that?

Good point. That used to be necessary, because the index-only filters
can be evaluated only on all-visible pages, and the filters had Vars
referencing the index tuple. We'd have had to maintain another list of
clauses, which didn't seem worth it.

But now that the filters reference the heap tuple, we could simply not
include them in the second list.

* I'm confused about the relationship of an IOS to an index filter. It
seems like the index filter only works for an ordinary index scan? Why
is that?

What would it do for IOS? IOS evaluates all filters on the index tuple,
and it does not need the heap tuple at all (assuming allvisible=true).

Index-only filters try to do something like that even for regular index
scans, by evaluating as many expression on the index tuple, but may
still require fetching the heap tuple in the end.

regards

[1]: /messages/by-id/97985ef2-ef9b-e62e-6fd4-e00a573d4ead@enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#14)
Re: Use of additional index columns in rows filtering

On Wed, 2023-07-19 at 00:36 +0200, Tomas Vondra wrote:

* I'm confused about the relationship of an IOS to an index filter.
It
seems like the index filter only works for an ordinary index scan?
Why
is that?

What would it do for IOS?

The way it's presented is slightly confusing. If you have table x with
an index on column i, then:

EXPLAIN (ANALYZE, BUFFERS)
SELECT i, j FROM x WHERE i = 7 and (i % 1000 = 7);

 Index Scan using x_idx on x  (cost=0.42..8.45 rows=1 width=8)
     (actual time=0.094..0.098 rows=1 loops=1)
   Index Cond: (i = 7)
   Index Filter: ((i % 1000) = 7)

But if you remove "j" from the target list, you get:

EXPLAIN (ANALYZE, BUFFERS)
SELECT i FROM x WHERE i = 7 and (i % 1000 = 7);

 Index Only Scan using x_idx on x  (cost=0.42..4.45 rows=1 width=4)
     (actual time=0.085..0.088 rows=1 loops=1)
   Index Cond: (i = 7)
   Filter: ((i % 1000) = 7)

This confused me at first because the "Filter" in the second plan means
the same thing as the "Index Filter" in the first plan. Should we call
the filter in an IOS an "Index Filter" too? Or is that redundant?

Regards,
Jeff Davis

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#15)
Re: Use of additional index columns in rows filtering

On 7/19/23 01:22, Jeff Davis wrote:

On Wed, 2023-07-19 at 00:36 +0200, Tomas Vondra wrote:

* I'm confused about the relationship of an IOS to an index filter.
It
seems like the index filter only works for an ordinary index scan?
Why
is that?

What would it do for IOS?

The way it's presented is slightly confusing. If you have table x with
an index on column i, then:

EXPLAIN (ANALYZE, BUFFERS)
SELECT i, j FROM x WHERE i = 7 and (i % 1000 = 7);

 Index Scan using x_idx on x  (cost=0.42..8.45 rows=1 width=8)
     (actual time=0.094..0.098 rows=1 loops=1)
   Index Cond: (i = 7)
   Index Filter: ((i % 1000) = 7)

But if you remove "j" from the target list, you get:

EXPLAIN (ANALYZE, BUFFERS)
SELECT i FROM x WHERE i = 7 and (i % 1000 = 7);

 Index Only Scan using x_idx on x  (cost=0.42..4.45 rows=1 width=4)
     (actual time=0.085..0.088 rows=1 loops=1)
   Index Cond: (i = 7)
   Filter: ((i % 1000) = 7)

This confused me at first because the "Filter" in the second plan means
the same thing as the "Index Filter" in the first plan. Should we call
the filter in an IOS an "Index Filter" too? Or is that redundant?

I agree the naming in explain is a bit confusing.

I wonder if Andres was right (in the index prefetch thread) that
splitting regular index scans and index-only scans may not be ideal. In
a way, this patch moves those nodes closer, both in capability and code
(because now both use index_getnext_tid etc.).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#16)
Re: Use of additional index columns in rows filtering

On Wed, 2023-07-19 at 11:16 +0200, Tomas Vondra wrote:

I wonder if Andres was right (in the index prefetch thread) that
splitting regular index scans and index-only scans may not be ideal.
In
a way, this patch moves those nodes closer, both in capability and
code
(because now both use index_getnext_tid etc.).

Yeah. I could also imagine decomposing the index scan node into more
pieces, but I don't think it would work out to be a clean data flow.
Either way, probably out of scope for this patch.

For this patch I think we should just tweak the EXPLAIN output so that
it's a little more clear what parts are index-only (at least if VM bit
is set) and what parts need to go to the heap.

Regards,
Jeff Davis

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#17)
Re: Use of additional index columns in rows filtering

On 7/19/23 19:17, Jeff Davis wrote:

On Wed, 2023-07-19 at 11:16 +0200, Tomas Vondra wrote:

I wonder if Andres was right (in the index prefetch thread) that
splitting regular index scans and index-only scans may not be ideal.
In
a way, this patch moves those nodes closer, both in capability and
code
(because now both use index_getnext_tid etc.).

Yeah. I could also imagine decomposing the index scan node into more
pieces, but I don't think it would work out to be a clean data flow.
Either way, probably out of scope for this patch.

OK

For this patch I think we should just tweak the EXPLAIN output so that
it's a little more clear what parts are index-only (at least if VM bit
is set) and what parts need to go to the heap.

Makes sense, I also need to think about maybe not having duplicate
clauses in the two lists. What annoys me about that is that it partially
prevents the cost-based reordering done by order_qual_clauses(). So
maybe we should have three lists ... Also, some of the expressions could
be fairly expensive.

BTW could you double-check how I expanded the index_getnext_slot()? I
recall I wasn't entirely confident the result is correct, and I wanted
to try getting rid of the "while (true)" loop.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#18)
Re: Use of additional index columns in rows filtering

On Wed, 2023-07-19 at 20:03 +0200, Tomas Vondra wrote:

Makes sense, I also need to think about maybe not having duplicate
clauses in the two lists. What annoys me about that is that it partially
prevents the cost-based reordering done by order_qual_clauses(). So
maybe we should have three lists ... Also, some of the expressions could
be fairly expensive.

Can we just calculate the costs of the pushdown and do it when it's a
win? If the random_page_cost savings exceed the costs from evaluating
the clause earlier, then push down.
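The break-even condition described here could, in a toy model, be a simple comparison; the sketch below uses invented names and deliberately simplified terms (the real decision would go through the planner's cost machinery, cpu_operator_cost, etc.):

```python
def pushdown_wins(ntuples, frac_all_visible, filter_selectivity,
                  random_page_cost, per_tuple_eval_cost):
    """Toy break-even test for pushing a filter down to the index tuple.

    Savings: heap fetches avoided for rows on all-visible pages that
    the filter rejects. Price: with the current patch the qual is
    evaluated twice for surviving rows on all-visible pages (once on
    the index tuple, again on the heap tuple)."""
    saved = (ntuples * frac_all_visible
             * (1.0 - filter_selectivity) * random_page_cost)
    extra = (ntuples * frac_all_visible
             * filter_selectivity * per_tuple_eval_cost)
    return saved > extra
```

Even this crude model shows the shape of the trade-off: a selective filter over mostly all-visible pages wins easily, while with no all-visible pages (or a filter that rejects nothing) there is nothing to save.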

BTW could you double-check how I expanded the index_getnext_slot()? I
recall I wasn't entirely confident the result is correct, and I
wanted
to try getting rid of the "while (true)" loop.

I suggest refactoring slightly to have the two loops in different
functions (rather than nested loops in the same function) to make
control flow a bit more clear. I'm not sure if the new function for the
inner loop should be defined in nodeIndexscan.c or indexam.c; I suppose
it depends on how clean the signature looks.

Also please expand the tests a bit to show more EXPLAIN plans that
illustrate the different cases.

Regards,
Jeff Davis

#20Peter Geoghegan
In reply to: Jeff Davis (#19)
Re: Use of additional index columns in rows filtering

On Wed, Jul 19, 2023 at 1:46 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2023-07-19 at 20:03 +0200, Tomas Vondra wrote:

Makes sense, I also need to think about maybe not having duplicate
clauses in the two lists. What annoys me about that is that it partially
prevents the cost-based reordering done by order_qual_clauses(). So
maybe we should have three lists ... Also, some of the expressions could
be fairly expensive.

Can we just calculate the costs of the pushdown and do it when it's a
win? If the random_page_cost savings exceed the costs from evaluating
the clause earlier, then push down.

My patch that teaches nbtree to execute ScalarArrayOps intelligently
(by dynamically choosing to not re-descend the btree to perform
another "primitive index scan" when the data we need is located on the
same leaf page as the current ScalarArrayOps arrays) took an
interesting turn recently -- one that seems related.

I found that certain kinds of queries are dramatically faster once you
teach the optimizer to accept that multi-column ScalarArrayOps can be
trusted to return tuples in logical/index order, at least under some
circumstances. For example:

pg@regression:5555 [583930]=# create index order_by_saop on
tenk1(two,four,twenty);
CREATE INDEX

pg@regression:5555 [583930]=# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;

This shows "Buffers: shared hit=1377" on HEAD, versus "Buffers: shared
hit=13" with my patch. All because we can safely terminate the scan
early now. The vast majority of the buffer hits the patch will avoid
are against heap pages, even though I started out with the intention
of eliminating unnecessary repeat index page accesses.

Note that build_index_paths() currently refuses to allow SAOP clauses
to be recognized as ordered with a multi-column index and a query with
a clause for more than the leading column -- that is something that
the patch needs to address (to get this particular improvement, at
least). Allowing such an index path to have useful pathkeys is
typically safe (in the sense that it won't lead to wrong answers to
queries), but we still make a conservative assumption that they can
lead to wrong answers. There are comments about "equality constraints"
that describe the restrictions right now.

But it's not just the question of basic correctness -- the optimizer
is very hesitant to use multi-column SAOPs, even in cases that don't
care about ordering. So it's also (I think, implicitly) a question of
*risk*. The risk of getting very inefficient SAOP execution in nbtree,
when it turns out that a huge number of "primitive index scans" are
needed. But, if nbtree is taught to "coalesce together" primitive
index scans at runtime, that risk goes way down.

Anyway, this seems related to what you're talking about because the
relationship between selectivity and ordering seems particularly
important in this context. And because it suggests that there is at
least some scope for adding "run time insurance" to the executor,
which is valuable in the optimizer if it bounds the potential
downside. If you can be practically certain that there is essentially
zero risk of serious problems when the costing miscalculates (for a
limited subset of cases), then life might be a lot easier -- clearly
we should be biased in one particular direction with a case that has
that kind of general profile.

My current understanding of the optimizer side of things --
particularly things like costing for "filter quals/pqquals" versus
costing for "true index quals" -- is rather basic. That will have to
change. Curious to hear your thoughts (if any) on how what you're
discussing may relate to what I need to do with my patch. Right now my
patch assumes that making SAOP clauses into proper index quals (that
usually preserve index ordering) is an unalloyed good (when safe!).
This assumption is approximately true on average, as far as I can
tell. But it's probably quite untrue in various specific cases, that
somebody is bound to care about.

--
Peter Geoghegan
