Making Row Comparison NULL row member handling more robust during skip scans

Started by Peter Geoghegan · 11 months ago · 7 messages · pgsql-hackers

There's one aspect of my recent bugfix commit 5f4d98d4 that still
doesn't sit well with me: the way that it taught _bt_set_startikey to
completely back out of applying any of its optimization in the
presence of a RowCompare. Without this "row compare back out"
behavior, there are obscure cases where resetting the scan's array
keys within _bt_readpage leads to an endless cycle of ending the
current primitive index scan, then starting another, only to end the
new primitive index scan on exactly the same leaf page/page offset as
last time. This could only happen in the presence of a NULL row
compare member, mostly due to the weird rules about those within
_bt_check_rowcompare (and how they interact with
_bt_advance_array_keys when we reset the scan's array keys in
_bt_readpage, another thing introduced by commit 5f4d98d4).

For example, a query like the following could get into an endless
cycle, if I remove the "don't set forcenonrequired" code from
_bt_set_startikey:

select *
from fuzz_skip_scan
where (b, c) >= (6, null) and (b, c) <= (7, null)
order by a desc, b desc, c desc, d desc;

Notice that we'll have a skip array on "a" here, since it is the
leading column, but has been omitted. I'm assuming the use of skip
scan/multiple primitive scans, and some use of forcenonrequired.
Obviously, this kind of case is ultra contrived/adversarial.
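For context, SQL's row-wise comparison semantics (which is what _bt_check_rowcompare must ultimately implement, on top of its index-positioning rules) compare members left to right, stopping at the first unequal or NULL pair; a NULL pair makes the whole comparison UNKNOWN. A toy Python model of just the three-valued comparison rule (illustrative only, not PostgreSQL code; None stands in for SQL NULL):

```python
# Toy model of SQL row-wise comparison under three-valued logic.
# Not PostgreSQL code: the function name and structure are illustrative only.
# None stands in for SQL NULL; the result is True, False, or None (UNKNOWN).

def row_ge(left, right):
    """Lexicographic (l1, l2, ...) >= (r1, r2, ...) with NULL -> UNKNOWN."""
    for l, r in zip(left, right):
        if l is None or r is None:
            return None          # NULL pair: whole comparison is UNKNOWN
        if l > r:
            return True          # decided by a higher-order member
        if l < r:
            return False
        # members equal: fall through to the next, lower-order member
    return True                  # all members equal

# (b, c) >= (6, NULL): decided by b alone unless b = 6
assert row_ge((7, 123), (6, None)) is True    # b > 6: TRUE, c never examined
assert row_ge((5, 123), (6, None)) is False   # b < 6: FALSE
assert row_ge((6, 123), (6, None)) is None    # b = 6: UNKNOWN, row fails the qual
```

So in the query above, "(b, c) >= (6, null)" effectively reduces to "b > 6" once three-valued logic filters out the UNKNOWN rows, and "(b, c) <= (7, null)" reduces to "b < 7" -- part of why NULL row members interact so oddly with the scan's positioning rules.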

My confusion about how best to deal with the problem was on clear
display in commit 54c6ea8c, which removed forcenonrequired=true
support from _bt_check_rowcompare. I had to revert that commit shortly
thereafter (in commit dd2ce379) after it became clear that
_bt_check_rowcompare does in fact need to be called during
forcenonrequired=true scans. You see, the logic I added to
_bt_set_startikey in commit 5f4d98d4 only prevents application of
forcenonrequired mode (as well as setting pstate.ikey to a key past
the first key) with row compares that happen to have a first row
compare member whose index column happens to be the first index column
whose tuple values change within the page in question (the page
examined by _bt_set_startikey). I have no reason to believe that
there's any bug here, but it is certainly needlessly complicated.

Thinking about the problem some more today, I realized that the issue
wasn't really with _bt_set_startikey, or with _bt_advance_array_keys.
The problem was with the weird rules in _bt_check_rowcompare around
NULL row compare members. And the lack of joined-up thinking between
_bt_check_rowcompare and corresponding _bt_first code that deals with
building an insertion scan key given a row compare. These two places
should really have symmetrical rules, including in their handling of
NULL row compare members, but right now they don't. Disabling
forcenonrequired mode in _bt_set_startikey now seems to me to be a
case of the tail wagging the dog.

It would make much more sense to teach row comparisons to behave more
like standard scalar inequalities in their handling of NULLs/NULL
semantics/_bt_first NULL positioning constraints. That way,
_bt_set_startikey can safely treat row compares (with or without NULL
members) just like any other kind of inequality -- it'd *always* be
safe to use forcenonrequired=true mode when that made sense, no matter
what.

Attached patch shows how this could work. It refines the rules around
NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
The fundamental idea here is to make _bt_check_rowcompare *consistent*
with _bt_first. That way _bt_set_startikey doesn't have to know
anything about row compares.

I would like to commit this to Postgres 18, treating it as a bugfix.
It seems like a much more solid fix than what I came up with in bugfix
commit 5f4d98d4. Does anybody else have an opinion on that question?

Note that this patch effectively *adds a new optimization* to
_bt_first around row comparison keys with NULL row members. I don't
actually care about the performance of such row comparisons. Again,
what I actually care about is making _bt_first *consistent* with
_bt_check_rowcompare in its handling of NULLs. This approach happens
to be the best way to make them consistent (see the patch's draft
commit message if you want to know about an alternative approach
involving *removing* an old optimization from _bt_check_rowcompare).

In summary:

In general, when we end a primitive index scan, the code that sets
continuescan=false (any such code, not just _bt_check_rowcompare NULL
row member code) has to make sure that starting the next primitive
index scan will actually allow the top-level scan to make forward
progress -- the new primscan needs to land on some later leaf page.
Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
consider truly robust. We need that guarantee to avoid these
cycles/infinite looping. AFAIK there are no bugs of that general
nature on Postgres 18, but all of that depends on _bt_set_startikey
knowing about _bt_advance_array_keys and _bt_check_rowcompare
behaviors from a great distance, which is pretty grotty. Seems worth
fixing sooner rather than later.

--
Peter Geoghegan

Attachments:

v1-0001-Improve-_bt_first-row-compare-NULL-member-handlin.patch (application/octet-stream, +87 -53)
In reply to: Peter Geoghegan (#1)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch shows how this could work. It refines the rules around
NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
The fundamental idea here is to make _bt_check_rowcompare *consistent*
with _bt_first. That way _bt_set_startikey doesn't have to know
anything about row compares.

I was dissatisfied with the logic in _bt_check_rowcompare, where in v1
I check whether the previous subkey/row compare member is on the
column immediately before the current subkey/row compare member's
column (the current member being marked ISNULL, making it
unsatisfiable).

Why was it ever necessary to look at the previous subkey like this?
Why can't we simply check if an unsatisfiable ISNULL subkey/row
compare is itself marked required in the current scan direction --
just like in every other case where we set continuescan=false? One
possible answer is this: we'll never mark a row compare member after
the first row compare member as required, no matter the details.
But...why not?

Attached is v2, which avoids the v1 business with checking the prior
row member's attribute number. In fact, it doesn't even access the
prior row member at all. All that matters is whether or not the
unsatisfiable ISNULL subkey/row compare is itself marked required in
the current scan direction (there's no more of that "--subkey"
business that you see on HEAD). This makes _bt_check_rowcompare
clearer, and even more in line with what code in places like
_bt_advance_array_keys expects.

This also has significant performance advantages (see later test
case). I now find myself in the awkward position of proposing a useful
performance enhancement well after feature freeze for Postgres 18. I
think that that is actually the best option, on balance, since it
makes row comparisons so much easier to understand from a great
distance -- without the v1 "check the attribute number for a not
required but kinda still required row member key" kludge.

Relevant background information concerning _bt_mark_scankey_required,
the function that deals with marking row compare members required (why
it's okay to change the rules in _bt_mark_scankey_required like this):

There was a 10 year period (from 2006 - 2016) during which it was
possible for multiple/all row members to be marked required (in one
scan direction). This changed in bugfix commit a298a1e0, which dealt
with incorrect handling of NULL index entries in indexed ROW()
comparison. That bugfix simply taught _bt_mark_scankey_required to not
mark any row member beyond the first one as required. But that seems
like overkill to me. My v2 goes back to the old pre-2016
_bt_mark_scankey_required behavior, and fixes the same 2016 problem by
following a more principled approach -- that's what makes it safe to
use required-ness marking, instead of using the v1 "check the
attribute number for a not required but kinda still required row
member key" kludge.

It seems to me that the actual problem that 2016 commit a298a1e0 fixed
is best understood as a problem with the basic rules for stopping the
scan when we encounter a NULL index tuple value in the context of row
compares; once we do that, we should then be able to mark multiple row
members required, based on the old pre-a298a1e0 rules. In other words,
I think that it would have made more sense to blame the original 2016
bug report problem on "the fix for bug #6278" (which happened in
commit 6980f817 and follow-up commit 882368e8, both from 2011).

If I revert 2016 a298a1e0 (i.e. make _bt_mark_scankey_required mark
multiple row compare members required per the old rules), which is a
part of what I'm doing in this v2, I can easily reproduce the bug that
Tom was concerned about back in 2016. However, I find that it's
possible to fix that same bug in a more principled way: by further
refining the basic rules for stopping the scan when we encounter a
NULL (more or less a fixed version of the logic added in 2011 by those
2 commits). So v2 does that as well.

We only need to be slightly more careful on the second-or-subsequent
row member marked required when we encounter a NULL now: it's only
okay to terminate the scan when the row compare is marked required in
the current scan direction. It is not good enough for it to be marked
required in the opposite direction -- unless we're dealing with the
first row compare member (so there's no change in requiredness-marking
behavior when dealing with the first row member, compared to HEAD).
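The rule described above can be restated as a tiny decision function (illustrative Python only, not the patch's C code; every name here is made up):

```python
# Toy restatement of the NULL-termination rule described above.
# Illustrative only: none of these names exist in PostgreSQL.
REQ_FWD, REQ_BKWD = "fwd", "bkwd"

def may_terminate_on_null(member_index, required_in, scan_dir):
    """May the scan end on a NULL tuple value for this row compare member?

    member_index: 0 for the first row compare member, >0 for lower-order ones.
    required_in:  set of directions the member is marked required in.
    scan_dir:     the current scan direction.
    """
    if scan_dir in required_in:
        return True              # required in the current direction: always OK
    # Only the first row member may rely on a required-in-the-opposite-
    # direction marking to terminate the scan on a NULL
    return member_index == 0 and len(required_in) > 0

# First member: a marking in either direction suffices
assert may_terminate_on_null(0, {REQ_BKWD}, REQ_FWD) is True
# Lower-order member: must be required in the *current* direction
assert may_terminate_on_null(2, {REQ_BKWD}, REQ_FWD) is False
assert may_terminate_on_null(2, {REQ_FWD}, REQ_FWD) is True
```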

If you think about why it's *ever* okay to terminate the scan using a
key not marked required in the *current* scan direction (when dealing
with NULLs), you'll understand why it cannot safely be applied to
lower-order row compare members. That in itself doesn't make it
generally unsafe to treat lower-order row compare members as required
to continue the scan. After all, there weren't any problems with it
for many years -- until the rules with terminating the scan on NULL
tuple values changed/were optimized in 2011 or so.

Comments in v2 try to explain this as clearly as possible.

I would like to commit this to Postgres 18, treating it as a bugfix.
It seems like a much more solid fix than what I came up with in bugfix
commit 5f4d98d4. Does anybody else have an opinion on that question?

Note that this patch effectively *adds a new optimization* to
_bt_first around row comparison keys with NULL row members. I don't
actually care about the performance of such row comparisons.

v2 actually does add an optimization that people would probably find
valuable -- which is a logical consequence of where I've taken things
in v2. Improving performance still isn't really the goal, but I want
to be clear about the fact that this patch could reasonably also be
understood as an optimization (that was much less true with v1, since
that only changed things with ISNULL-marked scan keys, not with tuples
that contain NULLs).

Currently, on master/18, the following query scans an index with
plenty of NULLs in all columns:

select a, b, c
from fuzz_skip_scan
where (a, b, c) >= (11, 1, 71) and (a, b, c) <= (12, 1, 1)
order by a desc, b desc, c desc, d desc;

We get an index-only scan backwards, with 37 buffer hits. However, the
equivalent forwards scan does a lot worse:

select a, b, c
from fuzz_skip_scan
where (a, b, c) >= (11, 1, 71) and (a, b, c) <= (12, 1, 1)
order by a, b, c, d;

That gets an index-only scan forwards, with 69 buffer hits.

There's no need for this inconsistency. With v2 applied, we get it
down to 35 buffer hits for both variants. If it's safe to only read 37
or 35 pages in one direction, then why wouldn't it also be safe to get
about that same number of buffer hits when scanning in the opposite
direction?

This improved performance profile is a consequence of v2 restoring
_bt_mark_scankey_required to its pre-2016 state, adding back logic
that is symmetric with existing logic in _bt_first (logic that wasn't
removed in 2016). One could argue that v2 is just fixing a regression
in 2016 commit a298a1e0.

--
Peter Geoghegan

Attachments:

v2-0001-Simplify-and-improve-row-compare-NULL-handling.patch (application/octet-stream, +126 -58)
In reply to: Peter Geoghegan (#1)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

In general, when we end a primitive index scan, the code that sets
continuescan=false (any such code, not just _bt_check_rowcompare NULL
row member code) has to make sure that starting the next primitive
index scan will actually allow the top-level scan to make forward
progress -- the new primscan needs to land on some later leaf page.
Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
consider truly robust. We need that guarantee to avoid these
cycles/infinite looping. AFAIK there are no bugs of that general
nature on Postgres 18

Correction: there *is* a live bug like this on Postgres 18/HEAD, which
involves simple scalar inequalities with an incomplete opfamily.
Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's
0002-* patch previously appeared as 0001-* in v2 and v1).

The bug in question involves redundant keys that could not be
eliminated by preprocessing due to a lack of available cross-type
support. Quals with certain combinations of redundant scalar keys can
cycle through the index being scanned ceaselessly. The scan fails to
make any real progress, again and again. Unlike the row compare case,
we lack even kludgey handling that avoids the problem within
_bt_set_startikey.

There's a live bug here, so I have to act. It seems to make sense to
apply everything on HEAD/Postgres 18, and be done with any question of
getting stuck like this/infinite cycling from resetting the scan's
arrays to get a clean slate. But I would certainly welcome other
opinions on this.

Problem statement
=================

I have confirmed the existence of the live bug using fuzz-testing,
combined with hard-coding that makes _bt_compare_scankey_args return
false unconditionally at the point where we're supposed to compare
each sk_argument from a pair of scan keys against the same attribute
(that's the easiest way to test problems in this area). With the right
combination of contradictory runtime keys, and with the right index
scan/index, we can get stuck.

When we get stuck, it's because we apply forcenonrequired mode in
_bt_readpage, reset the arrays at the end of the same _bt_readpage,
start another primitive index scan when pstate.finaltup is considered,
and then arrive at the same leaf page we ended on last time. We'll
then start the cycle anew, without any hope of ever making useful
progress. Not good.

Fundamentally, the problem here is that _bt_advance_array_keys expects
certain things from _bt_preprocess_keys and _bt_first that they won't
always honor. My approach in 0001-* more or less makes things always
work in the way that _bt_advance_array_keys already expects, rather
than changing _bt_advance_array_keys (or related code such as
_bt_set_startikey). More on that below, under "bugfix".

Background info on _bt_advance_array_keys' expectations
=======================================================

Notably, _bt_advance_array_keys would like to be able to determine if
_bt_first will be able to perform another primitive index scan based
on lower-order required inequalities marked required in the
opposite-to-scan direction only (i.e. an > or >= key when scanning
forwards, a < or <= key when scanning backwards). That detail is
important to my repro of the bug. _bt_advance_array_keys expects to be
able to apply required-in-opposite-direction inequalities to reason
about how _bt_first will behave the next time it is called.

During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5",
_bt_advance_array_keys needs to know how "b > 5" will affect _bt_first
should it choose to schedule another primscan. It needs to decide
whether or not to actually schedule such a primscan. It might make
sense for _bt_advance_array_keys to stick to the leaf level for the
time being upon reaching the threshold between "a = 1" and "a = 2"
tuples -- the first "a = 2 AND b > 5" might be on the same leaf page.
OTOH, scheduling a new primscan is far preferable when it'll allow
_bt_first to land the scan on a much later leaf page (we'll skip over
many irrelevant leaf pages).

In general, the first leaf page/position that the first tuple "a = 2
AND b > 5" appears on might be many leaf pages after the first leaf
page that the first tuple "a = 2" appears on. It all depends.
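The trade-off _bt_advance_array_keys faces can be sketched with a sorted key list and a binary search standing in for a _bt_first descent (plain Python, purely illustrative; the data and page size are invented):

```python
from bisect import bisect_left

# Purely illustrative: a btree leaf level flattened into a sorted list of
# (a, b) keys, 100 keys per "leaf page". None of this is PostgreSQL code.
keys = [(a, b) for a in (1, 2, 3) for b in range(1000)]
PAGE_SIZE = 100

def page_of(pos):
    return pos // PAGE_SIZE

# Scanning forwards through "a = 1" tuples, we reach the a = 1 / a = 2
# boundary.  With "b > 5", the first tuple of interest is (2, 6): here it
# lands on the same leaf page, so sticking to the leaf level is cheaper
# than descending the tree again.
boundary = bisect_left(keys, (2, 0))
target = bisect_left(keys, (2, 6))
assert page_of(boundary) == page_of(target)

# OTOH, had the qual been "b > 500", the first match would sit five pages
# further on; a new primitive scan skips those pages entirely.
far_target = bisect_left(keys, (2, 501))
assert page_of(far_target) - page_of(boundary) == 5
```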

This scheme works well if there are no redundant > keys (which is very
much the common case, since incomplete opfamilies are rare). It
tacitly assumes that _bt_first shares the same definition of "required
in the opposite-to-scan direction key". But that isn't quite true
right now.

Bugfix
======

My 0001-* bugfix makes nbtree preprocessing deal with a qual like
"WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" with an
incomplete opfamily on "b" by leaving so->keyData[] in a state that
reflects the redundancy. _bt_preprocess_keys will now make sure that
keys left in so->keyData[] still appear in the most useful order. This
avoids creating confusion elsewhere.

Preprocessing will arbitrarily decide that only the first "b >"
condition gets to be marked required, while *unmarking* the
requiredness marking on the second "b >" condition. That way _bt_first
doesn't get to make its own independent choice about initial
positioning keys, based on ever-so-slightly different rules. Rather,
_bt_first has to agree with everybody else about which > or >= key
should be used, and about which < or <= key should be used (and even
about which "=" key should be used).

_bt_first does what preprocessing tells it to do. Just like
_bt_advance_array_keys will. No ifs, no buts.

Placing nonrequired keys at the end of so->keyData[]
----------------------------------------------------

Preprocessing also reorders so->keyData[] such that the "unlucky" keys
that _bt_preprocess_keys *doesn't* pick go to the end of the array --
we want to keep them out of the way, at least until they're needed.
Note that this makes it possible for keys to appear out of index
attribute order, but that's okay. Nonrequired keys don't need to be in
any particular order.

Reordering so->keyData[] in this way eliminates any risk that some
early nonrequired key on "b" will stop us from getting to some later
required key of interest on "c". Not in _bt_first, not in
_bt_checkkeys or _bt_advance_array_keys. Nowhere.

In general, the rule going forward is that all nbtree code (barring
preprocessing code) can assume that redundant keys don't really exist,
as long as they're marked required (which virtually all keys are now,
thanks to the skip scan/skip array work). There can be at most one
inequality key marked SK_BT_REQFWD and one inequality key marked
SK_BT_REQBKWD per index attribute -- no matter what.
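The two steps described above -- keep at most one required inequality per attribute and operator, unmark any later duplicates, and push the unmarked keys to the end of the array -- can be modeled in miniature (illustrative Python; the real code operates on ScanKey structs in so->keyData[], and the tuple encoding here is made up):

```python
# Miniature model of the preprocessing step described above.  Illustrative
# only: keys are (attno, op, required) tuples, not real ScanKey structs.
def unmark_extra_keys(keys):
    """Keep the first required key per (attno, op); unmark later duplicates,
    then move all nonrequired keys to the end of the array."""
    seen = set()                     # (attno, op) pairs already required
    kept, unmarked = [], []
    for attno, op, required in keys:
        if required and (attno, op) in seen:
            required = False         # redundant duplicate: unmark it
        if required:
            seen.add((attno, op))
            kept.append((attno, op, required))
        else:
            unmarked.append((attno, op, required))
    return kept + unmarked           # nonrequired keys go to the end

# Two "b >" keys that preprocessing couldn't prove redundant (incomplete
# opfamily): only the first stays required, and the loser moves to the end,
# out of the way of the required "c >" key.
qual = [(1, "=", True), (2, ">", True), (2, ">", True), (3, ">", True)]
result = unmark_extra_keys(qual)
assert result == [(1, "=", True), (2, ">", True), (3, ">", True), (2, ">", False)]
```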

nbtree is blasé about keeping around redundant scan keys right now.
That seems like it has the potential to cause more bugs in the future.
It's easy to forget about redundant required keys, and they're rare
(except perhaps when row compares are used, since preprocessing has
hardly any smarts about redundancy that happens to involve row
compares right now). For a recent example of a bug caused by such an
oversight, see bugfix commit 7e6fb5da.

New _bt_first behavior with row compares
----------------------------------------

Currently, on HEAD, _bt_first is prepared to use a seemingly random
mix of the first row compare member followed by some later
redundant-ish scalar on some later column (not the key on the same
column that appears in the row compare). This v3's row compare patch
now avoids that behavior. That happens automatically in v3, by virtue
of the fact that there can't be a row compare that's marked required
in the opposite-to-scan scan direction followed by some other key
that's also marked required (skip arrays don't accept row compare
inequalities, so it's just not possible).

In short, if _bt_first gets to use a row compare to build its
insertion scan key at all, then it *must* use all of the row members;
it cannot use any additional lower-order keys. (Actually, _bt_first
can only do that with individual row members marked required).

Again, there needs to be symmetry here (perhaps it isn't strictly
necessary to go exactly this far, but the row compare change to
_bt_first *is* necessary). Again, the rules that decide when we'll
start an index scan when scanning in one particular scan direction
need to be symmetric with the rules that decide when we'll end an index
scan with the same qual moving in the opposite scan direction.

Impact on Postgres 17
=====================

On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2,
3) AND b > 5 AND c > 10_0000::bigint" shouldn't get infinite cycling
with an incomplete opfamily on "b". That can only happen on HEAD
because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys
call to _bt_readpage, to reset the scan's arrays after we used
forcenonrequired mode to read all the tuples on the page.

Postgres 17 *does* reset the array keys through a call to
_bt_start_array_keys, to get a clean slate. But that only happens in
the context of mark/restore for merge joins, which seems robust
against infinite cycling of the kind I'm seeing on HEAD.

--
Peter Geoghegan

Attachments:

v3-0001-nbtree-Make-redundant-scan-keys-non-required.patch (application/octet-stream, +331 -191)
v3-0002-Simplify-and-improve-row-compare-NULL-handling.patch (application/octet-stream, +201 -144)
In reply to: Peter Geoghegan (#3)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On Wed, Jun 25, 2025 at 5:44 PM Peter Geoghegan <pg@bowt.ie> wrote:

Correction: there *is* a live bug like this on Postgres 18/HEAD, which
involves simple scalar inequalities with an incomplete opfamily.
Attached v3 fixes the bug in its new 0001-* patch.

Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.

Current plan is to commit these two some time next week. It'd be nice
to get some code review before then, or any kind of input on my
general direction.

Note again that the approach I've taken to fixing the bug *adds a
performance optimization*. It will fix the row compare performance
problem described on a thread from February of this year:

/messages/by-id/BLAPR09MB64993673A034EE93C6C6F4B6CCF62@BLAPR09MB6499.namprd09.prod.outlook.com

It will *also* fix the row compare performance problem that Tom
complained about last year, which included its own test case:

/messages/by-id/66001.1715455158@sss.pgh.pa.us

I am aware that this all sounds quite odd. Yes, I'm proposing a bug
fix that improves performance in what must seem like fairly tangential
cases. As I went into already, it really did just work out that way.

I had forgotten about the two threads I'm referencing here until I saw
references to each of them in my skip scan project notes. I'm
mentioning them on the thread now in case they're useful breadcrumbs
for somebody who revisits this thread in the future.

--
Peter Geoghegan

Attachments:

v4-0002-Make-row-compares-robust-during-scans-with-arrays.patch (application/octet-stream, +259 -156)
v4-0001-Make-handling-of-redundant-nbtree-keys-more-robus.patch (application/octet-stream, +371 -200)
In reply to: Peter Geoghegan (#4)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.

Heikki said he'd be able to give this patch set at least a quick
review, so here's a new revision, v5.

This isn't really different to v4. It has more comment cleanup, and
better commit messages.

--
Peter Geoghegan

Attachments:

v5-0001-Make-handling-of-redundant-nbtree-keys-more-robus.patch (application/octet-stream, +453 -269)
v5-0002-Make-row-compares-robust-during-scans-with-arrays.patch (application/octet-stream, +406 -214)
#6 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
In reply to: Peter Geoghegan (#5)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On 01/07/2025 19:37, Peter Geoghegan wrote:

On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.

Heikki said he'd be able to give this patch set at least a quick
review, so here's a new revision, v4.

This isn't really different to v4. It has more comment cleanup, and
better commit messages.

I like how this makes row comparisons less special. To be honest I don't
quite understand what the bug was; I'd have to dig much deeper into this
to swap enough context into my memory for that. But just reading the
comments in these patches, I get the impression that they make this all
less fragile.

On bt_unmark_extra_keys() function: The function comment explains well
what it does; that's good. A few very minor stylistic remarks on its
implementation:

+	/* Set things up for first key's attr */
+	origkey = so->keyData;
+	curattr = origkey->sk_attno;
+	firsti = 0;
+	haveReqEquals = false;
+	haveReqForward = false;
+	haveReqBackward = false;
+	for (int i = 0; i < so->numberOfKeys; origkey++, i++)
+	{

It took me a moment to understand that origkey is always pointing to
&so->keyData[i]. I'd suggest making it more clear with:

/* Set things up for first key's attr */
curattr = so->keyData[0].sk_attno;
firsti = 0;
haveReqEquals = false;
haveReqForward = false;
haveReqBackward = false;
for (int i = 0; i < so->numberOfKeys; i++)
{
ScanKey origkey = &so->keyData[i];

In the second loop in the function you're actually doing this:

+ origkey = so->keyData + i;

which is yet another way to say the same thing. IMHO "&so->keyData[i]"
is more readable.

Would it be worth adding a comment that we assume keyData to be in
sk_attno order? Or is that a widely-known assumption? I don't remember.

+	/*
+	 * Next, allocate temp arrays: one set for unchanged keys, another for
+	 * keys that will be unmarked/made non-required
+	 */
+	unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+	keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));

Doesn't matter much but I think this could be:

unmarkKeys = palloc(nunmark * sizeof(ScanKeyData));
keepKeys = palloc((so->numberOfKeys - nunmark) * sizeof(ScanKeyData));

Or use a single array, putting the unmarked entries to the end of the array.

PS. Not a new thing, but the "Add coverage..." comments in the tests
seem redundant. All tests add coverage for something.

- Heikki

In reply to: Heikki Linnakangas (#6)
Re: Making Row Comparison NULL row member handling more robust during skip scans

On Tue, Jul 1, 2025 at 2:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I like how this makes row comparisons less special. To be honest I don't
quite understand what the bug was, I'd have to dig much deeper into this
to swap enough context into my memory for that.

If you're interested, I can provide you with all you'll need to
reproduce the infinite cycling behavior, where _bt_first is called
again and again, without the scan ever making useful progress. The
test data is fairly small, but it's a bit too big to just post to the
list.

My testing *simulates* cases where preprocessing cannot eliminate
redundant keys, without actually creating an incomplete opfamily, and
without really using cross-type operators in the queries. This
shouldn't matter -- it's just much easier to work with.

But just reading the
comments in these patches, I get the impression that they make this all
less fragile.

Technically the second patch (the row compare patch) isn't strictly
necessary to make my tests stop showing infinite cycling behavior. But
it's very easy to show infinite cycling behavior if I remove the
"refuse to forcenonrequired on row compare" kludge that I added to
_bt_set_startikey in bugfix commit 5f4d98d4 -- that's why I added it
in the first place. As you said, the goal is to make row compare quals
less special. And, in particular, to make it safe to remove that hack.

IMO the second patch is essential, since the hack I have in place
right now seems very fragile. The first patch (which adds the new
preprocessing step) is unambiguously needed, to fix a live bug
(without it, the infinite cycling behavior *is* still possible with
redundant keys that preprocessing can't eliminate).

On bt_unmark_extra_keys() function: The function comment explains well
what it does; that's good. A few very minor stylistic remarks on its
implementation:

I agree with all of your feedback; will make all those changes.

Would it be worth adding a comment that we assume keyData to be in
sk_attno order? Or is that a widely-known assumption? I don't remember.

I'll add a comment like that, just to be on the safe side.

There is an old comment at the top of _bt_preprocess_keys that says
that scan->keyData[] is in attribute order. And I add a new one in the
patch, which points out that that *isn't* always true anymore when the
so->keyData[] output array has nonrequired keys. There's no comment
that says what the new preprocessing function/pass expects about the
order of the so->keyData[] array (which is both its input array and its
output array).

PS. Not a new thing, but the "Add coverage..." comments in the tests
seem redundant. All tests add coverage for something.

I'll adjust them along those lines.

Unless there are any objections, and assuming you're done giving
feedback, I'll commit both patches tomorrow.

Thanks!
--
Peter Geoghegan