Index Skip Scan (new UniqueKeys)
Hi,
Here is a new version of the index skip scan patch, based on the v8 patch for the UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation; hopefully I didn't forget anyone who actively participated in the discussion.
To simplify reviewing I've split it into several parts:
* The first two are taken from [1], just for reference and to make cfbot happy.
* Extend-UniqueKeys contains the changes that need to be made to
UniqueKeys for use in skip scan. Essentially this is a reduced
version of the previous implementation from Jesper & David, based on the
new UniqueKeys infrastructure, as it does the very same thing.
* Index-Skip-Scan contains the non-AM-specific code and the overall
infrastructure, including configuration elements and so on.
* Btree-implementation contains the btree-specific code implementing amskip,
introduced in the previous patch.
* The last one contains just documentation bits.
Interestingly enough, with the new UniqueKeys implementation skipping is
unexpectedly applied in some tests. For those I've disabled
indexskipscan to avoid confusion.
[1]: /messages/by-id/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com
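For readers following along, the core idea can be sketched in a few lines of Python (a toy model of a btree as a sorted list of tuples, not the patch's actual C implementation): instead of reading every tuple, the scan binary-searches past each distinct prefix, much as the amskip machinery does with _bt_binsrch.

```python
from bisect import bisect_right

def skip_scan_distinct(index, prefix_len):
    """Return the first tuple for each distinct prefix of a sorted
    'index' (a toy btree). Each iteration jumps, via binary search,
    past every tuple sharing the current prefix."""
    out, pos = [], 0
    while pos < len(index):
        tup = index[pos]
        out.append(tup)
        # smallest key strictly greater than any tuple with this prefix
        upper = tup[:prefix_len] + (float("inf"),) * (len(tup) - prefix_len)
        pos = bisect_right(index, upper, pos)
    return out

index = sorted((a, b) for a in range(3) for b in range(1000))
print(skip_scan_distinct(index, 1))  # [(0, 0), (1, 0), (2, 0)]
```

Here a scan of 3000 tuples is answered with three probes, which is where the benefit for DISTINCT queries over a low-cardinality prefix comes from.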
Attachments:
* v8-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* v8-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch
* v35-0001-Extend-UniqueKeys.patch
* v35-0002-Index-skip-scan.patch
* v35-0003-Btree-implementation-of-skipping.patch
* v35-0004-Index-skip-scan-documentation.patch
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Hi,
Here is a new version of the index skip scan patch, based on the v8 patch for the UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation; hopefully I didn't forget anyone who actively participated in the discussion.
To simplify reviewing I've split it into several parts:
* The first two are taken from [1], just for reference and to make cfbot happy.
* Extend-UniqueKeys contains the changes that need to be made to UniqueKeys for use in skip scan. Essentially this is a reduced version of the previous implementation from Jesper & David, based on the new UniqueKeys infrastructure, as it does the very same thing.
* Index-Skip-Scan contains the non-AM-specific code and the overall infrastructure, including configuration elements and so on.
* Btree-implementation contains the btree-specific code implementing amskip, introduced in the previous patch.
* The last one contains just documentation bits.
Interestingly enough, with the new UniqueKeys implementation skipping is unexpectedly applied in some tests. For those I've disabled indexskipscan to avoid confusion.
[1]:
/messages/by-id/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com
Thanks for the patch.
I just got a rough idea of the patch; it looks like we have to narrow down the
cases where we can use this method. Consider the example below:
create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1,
10000000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 1 | 1
(1 row)
demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 0 | 0
1 | 1 | 1 | 1
(2 rows)
drop index j1_im5_i_idx;
create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
1 | 1 | 1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
(1 row)
--
Best Regards
Andy Fan
On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
I just got a rough idea of the patch; it looks like we have to narrow
down the cases where we can use this method. Consider the below example:
Hi
Not exactly narrow it down, but rather get rid of the wrong usage of skipping
for index scans. Since skipping for them was added later than for index
only scans, I can imagine there are still blind spots, so it's good that you've
looked. In this particular case, when the index expressions do not fully
cover the expressions the result needs to be distinct on, skipping just
doesn't have enough information and should not be used. I'll add it to
the next version, thanks!
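The verification promised here might conceptually look like this (a hypothetical Python sketch, not the patch's code): skipping is only safe when the index's key columns cover every expression the result must be distinct on.

```python
def skip_scan_is_safe(index_cols, distinct_cols):
    """Hypothetical planner-side check: a skip scan may only be used
    when the index columns cover every DISTINCT expression; otherwise
    rows differing only in uncovered columns (im100, im1000 in the
    example above) would be wrongly collapsed."""
    return set(distinct_cols) <= set(index_cols)

# index on j1(im5, i) does not cover DISTINCT * (all four columns):
print(skip_scan_is_safe({"im5", "i"}, {"i", "im5", "im100", "im1000"}))  # False
print(skip_scan_is_safe({"im5", "i"}, {"im5", "i"}))                     # True
```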
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
* Btree-implementation contains btree specific code to implement amskip,
introduced in the previous patch.
The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
But I wonder more generally what the idea here is. The following
comments that immediately follow provide some hints:
+ /*
+  * To check if we returned the same tuple, try to find a
+  * startItup on the current page. For that we need to update
+  * scankey to match the whole tuple and set nextkey to return
+  * an exact tuple, not the next one. If the nextOffset is the
+  * same as before, it means we are in the loop, return offnum
+  * to the original position and jump further
+  */
Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:
+ startOffset = _bt_binsrch(scan->indexRelation,
+                           so->skipScanKey,
+                           so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ if (nextOffset <= startOffset)
+ {
Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.
--
Peter Geoghegan
Hi Dmitry,
Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions.
create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);
postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
QUERY PLAN
-------------------------------------------------------------------------------
Sort (cost=2.92..2.94 rows=10 width=20)
Sort Key: a, b DESC, c
-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20)
Skip scan: true
(4 rows)
With 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With enable_indexskipscan on, it always gives 1 though. It's not correct for the planner to choose a skip scan followed by a sort in this case.
-Floris
On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
* Btree-implementation contains btree specific code to implement amskip,
introduced in the previous patch.
The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?
+ /*
+  * To check if we returned the same tuple, try to find a
+  * startItup on the current page. For that we need to update
+  * scankey to match the whole tuple and set nextkey to return
+  * an exact tuple, not the next one. If the nextOffset is the
+  * same as before, it means we are in the loop, return offnum
+  * to the original position and jump further
+  */
Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:
+ startOffset = _bt_binsrch(scan->indexRelation,
+                           so->skipScanKey,
+                           so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ if (nextOffset <= startOffset)
+ {
Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test whether they're
the same (in which case, if I'm not mistaken, we can probably compare item
pointers). I got this question when I was about to post a new version
with changes to address feedback from Andy; now I'll combine them and
send a cumulative patch.
On Fri, Jul 10, 2020 at 05:03:37PM +0000, Floris Van Nee wrote:
Also took another look at the patch now, and found a case of incorrect
data. It looks related to the new way of creating the paths, as I
can't recall seeing this in earlier versions.
create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);
postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
QUERY PLAN
-------------------------------------------------------------------------------
Sort (cost=2.92..2.94 rows=10 width=20)
Sort Key: a, b DESC, c
-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20)
Skip scan: true
(4 rows)
Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occurred to
me that some of those paths would still rely on the index scan returning the
full data set. I'll look into the details and add verification to prevent putting
something like this on top of a skip scan in the next version.
Good point, thanks for looking at this. With the latest planner version there
are indeed more possibilities to use skipping. It never occurred to me that
some of those paths would still rely on the index scan returning the full data set. I'll look
into the details and add verification to prevent putting something like this on top of
a skip scan in the next version.
I believe the required changes are something like in the attached patch. There were a few things I changed:
- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
- the uniquekeys that is built still contains some redundant keys that are normally eliminated from the path keys lists.
- the distinct_pathkeys may be NULL even though there's a possibility for skipping, but it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could, for example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; since a is constant, it's eliminated from the regular pathkeys.
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks whether the query_pathkeys match the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes; they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
- creating the skip scan path did not actually fill the Path's unique keys. It should just set this to query_uniquekeys.
I've attached the first two unique-keys patches (v9: 0001, 0002), your patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007) that applies my suggested changes on top of them.
-Floris
Attachments:
* 0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* 0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch
* 0003-Extend-UniqueKeys.patch
* 0004-Index-skip-scan.patch
* 0005-Btree-implementation-of-skipping.patch
* 0006-Index-skip-scan-documentation.patch
* 0007-planner-fixes.patch
I've attached the first two unique-keys patches (v9: 0001, 0002), your
patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007)
that applies my suggested changes on top of them.
I just realized there's another thing that looks a bit strange too. From reading the thread, I thought it should be the case that in create_distinct_paths, it is checked whether the uniqueness in the unique_pathlist matches the uniqueness that is needed by the query.
This means that I think what it should be comparing is this:
- The generated index path should have some path-level unique keys set
- The path-level unique keys must be at least as strict as the query-level unique keys. E.g. if the path level is unique on (a), then the query level must be (a), or possibly (a,b).
I've changed the patch to compare the path-level keys (set in create index path) with the query-level keys in create_distinct_path. Currently, I don't think the previous implementation was an actual issue leading to incorrect queries, but it would cause problems if we tried to extend the uniqueness for distinct to join rels etc.
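That comparison can be sketched as follows (hypothetical helper name; the real check would live in query_has_uniquekeys_for): a path that is unique on a subset of the query's required columns gives a stronger guarantee, so it satisfies the query.

```python
def path_satisfies_query_uniqueness(path_unique_cols, query_unique_cols):
    """A path unique on (a) is also unique on (a, b): fewer columns in
    the path-level unique key means a stricter guarantee, so the path
    satisfies the query when its key is a subset of the query's key."""
    return set(path_unique_cols) <= set(query_unique_cols)

assert path_satisfies_query_uniqueness(["a"], ["a", "b"])      # stricter: ok
assert not path_satisfies_query_uniqueness(["a", "b"], ["a"])  # weaker: not ok
```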
One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?
-Floris
Attachments:
* 0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* 0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch
* 0003-Extend-UniqueKeys.patch
* 0004-Index-skip-scan.patch
* 0005-Btree-implementation-of-skipping.patch
* 0006-Index-skip-scan-documentation.patch
* 0007-planner-fixes.patch
On Sun, Jul 12, 2020 at 12:48:47PM +0000, Floris Van Nee wrote:
Good point, thanks for looking at this. With the latest planner version there
are indeed more possibilities to use skipping. It never occurred to me that
some of those paths would still rely on the index scan returning the full data set. I'll look
into the details and add verification to prevent putting something like this on top of
a skip scan in the next version.
I believe the required changes are something like in the attached patch. There were a few things I changed:
- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
Yes, I've also noticed that while preparing the fix for an index scan not
covered by the index, and included it.
- the uniquekeys that is built, still contains some redundant keys, that are normally eliminated from the path keys lists.
I guess you're talking about:
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
Can you add some test cases to your changes to show the effect of it? It
seems to me redundant keys are already eliminated at this point, by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
be I've missed something.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
- EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
- exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like,
generally speaking, there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to
preserve in unique keys?
- the distinct_pathkeys may be NULL, even though there's a possibility for skipping. But it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could. For example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks if the query_pathkeys matches the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes - they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ BackwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but looks like
query_pathkeys should also match in this case.
Will also look at the follow up questions in the next email.
- the uniquekeys that is built, still contains some redundant keys, that are
normally eliminated from the path keys lists.
I guess you're talking about:
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
Can you add some test cases to your changes to show the effect of it? It
seems to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
be I've missed something.
The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like:
SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c
All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c).
The unique keys would end up with (a,b) though. In the previous patch it compared, in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan.
Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now even without the EC_MUST_BE_REDUNDANT check, so it's difficult to give an actual failing test case now. However, since all other code filters out constant keys, I think uniqueness should do it too; otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys rather than build_uniquekeys. It's probably good to move it there.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
-     EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-     exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like,
generally speaking, there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to preserve it
in unique keys?
Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.
That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions).
- the distinct_pathkeys may be NULL, even though there's a possibility for
skipping. But it wouldn't create the uniquekeys in this case. This makes the
planner not choose skip scans even though it could. For example in queries
that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.
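The difference can be sketched with a toy model (Python standing in for the btree; hypothetical helper name): the skip scan locates the first tuple with the constant prefix and stops, where a plain scan would read every tuple sharing that prefix.

```python
from bisect import bisect_left

def distinct_on_first(index, a_const):
    """Sketch of DISTINCT ON (a) ... WHERE a = const under a skip scan:
    one binary-search probe finds the first tuple with the constant
    prefix; the scan then skips past the prefix and terminates."""
    pos = bisect_left(index, (a_const,))
    if pos < len(index) and index[pos][0] == a_const:
        return index[pos]
    return None

index = sorted((1, b) for b in range(1000)) + [(2, 0)]
print(distinct_on_first(index, 1))  # (1, 0): one probe instead of 1000 tuples
```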
- to combat the issues mentioned earlier, there's now a check in
build_index_paths that checks if the query_pathkeys matches the
useful_pathkeys. Note that we have to use the path keys here rather than
any of the unique keys. The unique keys are only Expr nodes - they do not
contain the necessary information about ordering. Due to elimination of
some constant path keys, we have to search the attributes of the index to
find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+                                        index,
+                                        BackwardScanDirection,
+                                        llast_node(PathKey,
+                                                   root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by building
those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
Not sure about unordered indexes, but looks like query_pathkeys should
also match in this case.
Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?
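That prefix lookup could be sketched like this (a hypothetical helper in Python, standing in for the planner code): walk the physical index columns until the deduplicated pathkey columns are all covered, counting redundant repeats toward the prefix length.

```python
def skip_prefix_length(index_cols, pathkey_cols):
    """Shortest physical index prefix whose columns cover all the
    (deduplicated) pathkey columns; redundant repeats in the index
    still count toward the prefix length."""
    needed = set(pathkey_cols)
    seen = set()
    for n, col in enumerate(index_cols, start=1):
        seen.add(col)
        if needed <= seen:
            return n
    return None  # index cannot support skipping on these pathkeys

# index (a,a,b,a,b) with pathkeys (a,b): the skip prefix must be 3
print(skip_prefix_length(["a", "a", "b", "a", "b"], ["a", "b"]))  # 3
```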
-Floris
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?
Yeah. There is a TID array at the end of the index tuple when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non high-key tuples AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.
This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recorded heap TIDs
locally and the time _bt_killitems() tried to match on those heap
TIDs to kill some index tuples via kill_prior_tuple -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.
Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)
Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.
Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).
--
Peter Geoghegan
On Tue, Jul 14, 2020 at 06:18:50PM +0000, Floris Van Nee wrote:
Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there.
That would be my suggestion as well.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
-     EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-     exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like
generally speaking there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to preserve it
in unique keys?
Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.
One UniqueKey can have multiple corresponding expressions, which also gives
us the possibility of having one unique key with (t1.a, t2.a), and it then
looks similar to an EquivalenceClass.
- the distinct_pathkeys may be NULL, even though there's a possibility for
skipping. But it wouldn't create the uniquekeys in this case. This makes the
planner not choose skip scans even though it could. For example in queries
that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.
The idea behind this query sounds questionable to me; it would be more
transparent to do this without distinct, since skipping will actually do exactly
the same thing, just under another name. But if allowing skipping on
constants does not bring significant changes in the code, it's probably
fine.
- to combat the issues mentioned earlier, there's now a check in
build_index_paths that checks if the query_pathkeys matches the
useful_pathkeys. Note that we have to use the path keys here rather than
any of the unique keys. The unique keys are only Expr nodes - they do not
contain the necessary information about ordering. Due to elimination of
some constant path keys, we have to search the attributes of the index to
find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+                                        index,
+                                        BackwardScanDirection,
+                                        llast_node(PathKey,
+                                                   root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by building
those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
Not sure about unordered indexes, but looks like query_pathkeys should
also match in this case.
Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?
I don't think there is a direct way, but why not modify
build_index_paths to also provide this information, or compare
index_pathkeys expressions with indextlist without actually creating
those pathkeys again?
And a couple of words about this thread [1]. It looks to me like a
strange way of interacting with the community. Are you going to
duplicate everything there, or what are your plans? At the very least
you could try to include everyone involved in the recipients list, not
exclude some of the authors.
[1]: /messages/by-id/e4b623692a1447d4a13ac80fa271c8e6@opammb0561.comp.optiver.com
One UniqueKey can have multiple corresponding expressions, which also gives us
the possibility of having one unique key with (t1.a, t2.a), and it now looks
similar to EquivalenceClass.
I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).
The idea behind this query sounds questionable to me; it would be more
transparent to do this without DISTINCT, since skipping will actually do
exactly the same stuff just under another name. But if allowing skipping on
constants does not bring significant changes to the code, it's probably fine.
Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write as a regular SELECT with LIMIT 1 of course. However, it's more about being consistent and predictable to the user: if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And supporting it also makes the implementation more consistent, with few code changes.
Yeah, there's definitely some double work there, but the actual impact may
be limited - it doesn't actually allocate a new path key, but it looks it up in
root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain
PathKey the actual location in the index of that column. The constructed path
keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes
path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But
how to get from PathKey=b to that number 3, I didn't find a solid way except
doing this. Maybe there is though?
I don't think there is a direct way, but why not modify build_index_paths to
also provide this information, or compare index_pathkeys expressions with
indextlist without actually creating those pathkeys again?
I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later.
And a couple of words about this thread [1]. It looks to me like a strange way
of interacting with the community. Are you going to duplicate everything
there, or what are your plans? At the very least you could try to include
everyone involved in the recipients list, not exclude some of the authors.
When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone.
In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches.
[1]: /messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost
On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.
This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recorded heap TIDs
locally and the time _bt_killitems() tried to match on those heap
TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.
Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)
I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <florisvannee@optiver.com> wrote:
One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?
I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.
Thinking about it a bit harder, if we did use Exprs then it would mean
a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.
CREATE TABLE xy (x int primary key, y int not null);
SELECT DISTINCT y FROM xy WHERE x=y;
whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.
Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:
CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;
it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check. That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.
My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2]. After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}. I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL. This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.
So, it seems the patch dependency chain for skip scans just got a bit longer :-(
David
[1]: /messages/by-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA@mail.gmail.com
[2]: /messages/by-id/20190920051857.2fhnvhvx4qdddviz@alap3.anarazel.de
Hi David:
Thanks for looking into this.
On Fri, Jul 31, 2020 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <florisvannee@optiver.com> wrote:
One question about the unique keys - probably for Andy or David: I've
looked in the archives to find arguments for/against using Expr nodes or
EquivalenceClasses in the Unique Keys patch. However, I couldn't really
find a clear answer about why the current patch uses Expr rather than
EquivalenceClasses. At some point David mentioned "that probably Expr nodes
were needed rather than EquivalenceClasses", but it's not really clear to
me why. What were the thoughts behind this?
I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.
Thinking about it a bit harder, if we did use Exprs then it would mean
a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.
CREATE TABLE xy (x int primary key, y int not null);
SELECT DISTINCT y FROM xy WHERE x=y;
whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.
Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:
CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;
it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check. That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.
My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2]. After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}. I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL. This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.
So, it seems the patch dependency chain for skip scans just got a bit
longer :-(
I admit that EquivalenceClasses have better expressive power. There are
two more cases we can handle better with EquivalenceClasses:
SELECT DISTINCT a, b, c FROM t WHERE a = b;
Currently the UniqueKey is (a, b, c), but it would be better as (a, c)
and (b, c). The other case happens similarly with GROUP BY.
After realizing this, I am still hesitant to do that, due to the
complexity. If we do that, we may have to maintain EquivalenceClasses in
one more place or make the existing EquivalenceClasses list longer, for
example: SELECT pk FROM t; The current infrastructure doesn't create any
EquivalenceClass for pk, so we would have to create a new one in this
case and reuse some existing ones in other cases. Finally, since
EquivalenceClasses are not so transparent to the upper user, we would
have to depend on the infrastructure change to look up an
EquivalenceClass quickly from an Expr.
I rethought the case you provided above. IIUC, there is such an issue
for joinrels; then we can just add an EC check in
populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we
should build the UniqueKeys from root->distinct_pathkeys and
root->group_pathkeys, where the EquivalenceClasses are already there.
I am still not insisting on either Expr or EquivalenceClasses right now;
if we need to change to EquivalenceClasses, I'd see whether we need to
take care of more places before doing that.
--
Best Regards
Andy Fan
On Mon, Jul 27, 2020 at 12:24:31PM +0200, Dmitry Dolgov wrote:
I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.
Here is a new version that hopefully address most of the concerns
mentioned in this thread so far. As before, first two patches are taken
from UniqueKeys thread and attached only for the reference. List of
changes includes:
* fix for index scan not being fully covered
* rebase on the latest UniqueKey patch
* taking into account posting tuples (although I must say I couldn't
produce a test that will hit this part, so I would appreciate if
someone can take a look)
* fixes suggested by Floris with adjustments as discussed in the thread
There are no changes related to EquivalenceClasses vs expressions, which
would probably be my next target. Having this in mind I must admit I'm
not super excited about the possibility of including another patch as a
dependency without clear prospects and plans for it.
Thanks for the feedback folks!
Attachments:
v36-0001-Introduce-RelOptInfo-notnullattrs-attribute.patchtext/x-diff; charset=us-asciiDownload+53-1
v36-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-str.patchtext/x-diff; charset=us-asciiDownload+1502-24
v36-0003-Extend-UniqueKeys.patchtext/x-diff; charset=us-asciiDownload+195-11
v36-0004-Index-skip-scan.patchtext/x-diff; charset=us-asciiDownload+546-18
v36-0005-Btree-implementation-of-skipping.patchtext/x-diff; charset=us-asciiDownload+1493-3
v36-0006-Index-skip-scan-documentation.patchtext/x-diff; charset=us-asciiDownload+100-1
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Here is a new version that hopefully address most of the concerns
mentioned in this thread so far. As before, first two patches are taken
from UniqueKeys thread and attached only for the reference. List of
changes includes:
Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):
* I see the following compiler warning:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:
==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468== at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468== by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468== by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468== by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468== by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468== by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468== by 0x5AB426: query_planner (planmain.c:269)
==1083468== by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468== by 0x5AD202: subquery_planner (planner.c:1015)
==1083468== by 0x5ABABF: standard_planner (planner.c:405)
==1083468== by 0x5AB7F8: planner (planner.c:275)
==1083468== by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468== by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468== by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468== by 0x514AF2: main (main.c:209)
==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size
8,192 alloc'd
==1083468== at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468== by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468== by 0x8CEA52: palloc (mcxt.c:964)
==1083468== by 0x267F25: systable_beginscan (genam.c:373)
==1083468== by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468== by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468== by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468== by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468== by 0x8BD482: check_enable_rls (rls.c:66)
==1083468== by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468== by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468== by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468== by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468== by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468== by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END
(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)
* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only
scans.
* Do we really need the AM-level boolean flag/argument named
"scanstart"? Why not just follow the example of btgettuple(), which
determines whether or not the scan has been initialized based on the
current scan position?
Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even
if you can't take exactly the same approach, I would still think that
the scan's opaque B-Tree state should remember if it's the first call
to _bt_skip() (rather than some subsequent call) in some other way
(e.g. carrying a "scanstart" bool flag directly).
A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of
indentation...which makes it even more difficult to follow.
* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data item on the page, not the high key.
It is highly confusing to refer to a tuple as the page high key if it
isn't the tuple from the P_HIKEY offset number on a non-rightmost
page, which is a pivot tuple even on the leaf level (as indicated by
BTreeTupleIsPivot()).
* Why does _bt_scankey_within_page() have an unused "ScanDirection
dir" argument?
* Why is it okay to do anything important based on the
_bt_scankey_within_page() return value?
If the page is empty, then how can we know that it's okay to go to the
next value? I'm concerned that there could be subtle bugs in this
area. VACUUM will usually just delete the empty page. But it won't
always do so, for a variety of reasons that aren't worth going into
now. This could mask bugs in this area. I'm concerned about patterns
like this one from _bt_skip():
while (!nextFound)
{
....
if (_bt_scankey_within_page(scan, so->skipScanKey,
so->currPos.buf, dir))
{
...
}
else
/*
* If startItup could be not found within the current page,
* assume we found something new
*/
nextFound = true;
....
}
Why would you assume that "we found something new" here? In general I
just don't understand the design of _bt_skip(). I get the basic idea
of what you're trying to do, but it could really use better comments.
* The "jump one more time if it's the same as at the beginning" thing
seems scary to me. Maybe you should be doing something with the actual
high key here.
* Tip: You might find cases involving "empty but not yet deleted"
pages a bit easier to test by temporarily disabling page deletion. You
can modify nbtree.c to look like this:
index a1ad22f785..db977a0300 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1416,6 +1416,7 @@ backtrack:
Assert(!attempt_pagedel || nhtidslive == 0);
}
+ attempt_pagedel = false;
if (attempt_pagedel)
{
MemoryContext oldcontext;
That's all I have for now.
--
Peter Geoghegan