Making Row Comparison NULL row member handling more robust during skip scans
There's one aspect of my recent bugfix commit 5f4d98d4 that still
doesn't sit well with me: the way that it taught _bt_set_startikey to
completely back out of applying any of its optimization in the
presence of a RowCompare. Without this "row compare back out"
behavior, there are obscure cases where resetting the scan's array
keys within _bt_readpage leads to an endless cycle of ending the
current primitive index scan, then starting another, only to end the
new primitive index scan on exactly the same leaf page/page offset as
last time. This could only happen in the presence of a NULL row
compare member, mostly due to the weird rules about those within
_bt_check_rowcompare (and how they interact with
_bt_advance_array_keys when we reset the scan's array keys in
_bt_readpage, another thing introduced by commit 5f4d98d4).
For example, a query like the following could get into an endless
cycle, if I remove the "don't set forcenonrequired" code from
_bt_set_startikey:
select *
from fuzz_skip_scan
where (b, c) >= (6, null) and (b, c) <= (7, null)
order by a desc, b desc, c desc, d desc;
Notice that we'll have a skip array on "a" here, since it is the
leading index column, yet it has been omitted from the query's qual.
I'm assuming the use of skip scan/multiple primitive scans, and some
use of forcenonrequired.
Obviously, this kind of case is ultra contrived/adversarial.
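The exact definition of fuzz_skip_scan isn't important here; any
four-column table with a composite index on (a, b, c, d) and plenty of
NULLs in "b" and "c" should give the same general shape of scan. A
hypothetical stand-in, purely for illustration (the definitions below
are assumptions, not the real test setup):

create table fuzz_skip_scan (a int, b int, c int, d int);
create index fuzz_skip_scan_idx on fuzz_skip_scan (a, b, c, d);
-- spread values (including NULLs) over many leaf pages
insert into fuzz_skip_scan
select i % 50, nullif(i % 10, 0), nullif(i % 25, 0), i
from generate_series(1, 100000) i;
vacuum analyze fuzz_skip_scan;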
My confusion about how best to deal with the problem was on clear
display in commit 54c6ea8c, which removed forcenonrequired=true
support from _bt_check_rowcompare. I had to revert that commit shortly
thereafter (in commit dd2ce379) after it became clear that
_bt_check_rowcompare does in fact need to be called during
forcenonrequired=true scans. You see, the logic I added to
_bt_set_startikey in commit 5f4d98d4 only prevents application of
forcenonrequired mode (as well as setting pstate.ikey to a key past
the first key) when it reaches a row compare whose first member is on
the first index column whose tuple values change within the page in
question (the page examined by _bt_set_startikey). I have no reason
to believe that
there's any bug here, but it is certainly needlessly complicated.
Thinking about the problem some more today, I realized that the issue
wasn't really with _bt_set_startikey, or with _bt_advance_array_keys.
The problem was with the weird rules in _bt_check_rowcompare around
NULL row compare members. And the lack of joined-up thinking between
_bt_check_rowcompare and corresponding _bt_first code that deals with
building an insertion scan key given a row compare. These two places
should really have symmetrical rules, including in their handling of
NULL row compare members, but right now they don't. Disabling
forcenonrequired mode in _bt_set_startikey now seems to me to be a
case of the tail wagging the dog.
It would make much more sense to teach row comparisons to behave more
like standard scalar inequalities in their handling of NULLs/NULL
semantics/_bt_first NULL positioning constraints. That way,
_bt_set_startikey can safely treat row compares (with or without NULL
members) just like any other kind of inequality -- it'd *always* be
safe to use forcenonrequired=true mode when that made sense, no matter
what.
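As a refresher on the underlying SQL semantics (this is just standard
row constructor comparison behavior, nothing nbtree-specific): once
the leading column ties, the comparison falls through to the NULL row
member, the whole row comparison evaluates to NULL, and the row cannot
be returned. So a qual like "(b, c) >= (6, null)" can only ever return
rows where "b > 6" -- which is what makes it safe to treat the row
compare like a simple scalar inequality here. For example:

select row(7, 1) >= row(6, null::int);  -- true, decided by "7 vs 6" alone
select row(6, 1) >= row(6, null::int);  -- null, so such a row is never returned
select row(5, 1) >= row(6, null::int);  -- false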
Attached patch shows how this could work. It refines the rules around
NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
The fundamental idea here is to make _bt_check_rowcompare *consistent*
with _bt_first. That way _bt_set_startikey doesn't have to know
anything about row compares.
I would like to commit this to Postgres 18, treating it as a bugfix.
It seems like a much more solid fix than what I came up with in bugfix
commit 5f4d98d4. Does anybody else have an opinion on that question?
Note that this patch effectively *adds a new optimization* to
_bt_first around row comparison keys with NULL row members. I don't
actually care about the performance of such row comparisons. Again,
what I actually care about is making _bt_first *consistent* with
_bt_check_rowcompare in its handling of NULLs. This approach happens
to be the best way to make them consistent (see the patch's draft
commit message if you want to know about an alternative approach
involving *removing* an old optimization from _bt_check_rowcompare).
In summary:
In general, when we end a primitive index scan, the code that sets
continuescan=false (any such code, not just _bt_check_rowcompare NULL
row member code) has to make sure that starting the next primitive
index scan will actually allow the top-level scan to make forward
progress -- the new primscan needs to land on some later leaf page.
Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
consider truly robust. We need that guarantee to avoid these
cycles/infinite looping. AFAIK there are no bugs of that general
nature on Postgres 18, but all of that depends on _bt_set_startikey
knowing about _bt_advance_array_keys and _bt_check_rowcompare
behaviors from a great distance, which is pretty grotty. Seems worth
fixing sooner rather than later.
--
Peter Geoghegan
Attachments:
v1-0001-Improve-_bt_first-row-compare-NULL-member-handlin.patch
From 234cf6f13f43bebdba2da7f2933278d6c2d2de68 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Jun 2025 19:32:57 -0400
Subject: [PATCH v1] Improve _bt_first row compare NULL member handling.
Improve _bt_first's handling of NULL row comparison members in such a
way as to make it consistent with _bt_check_rowcompare's approach. In
other words, make sure that code that starts primitive index scans that
involve row compares fully agrees with code that ends primitive index
scans -- even in the presence of a NULL row compare member. This makes
row comparisons more similar to scalar inequality keys, obviating the
need for _bt_set_startikey to consider row comparisons directly.
_bt_first will now avoid uselessly scanning earlier index tuples that
cannot possibly be matches, owing to the scan's use of a NULL row
compare member. The goal isn't to improve performance; the goal is to
make _bt_first agree with _bt_check_rowcompare about where primitive
index scans should start and end.
It would have been possible to make these two functions agree in the
same way by removing (rather than adding) an optimization: we could have
prevented _bt_check_rowcompare from terminating the scan at the point
that it compares an unsatisfiable NULL row compare member.  However, that
approach would likely lead to complaints about performance regressions.
Follow-up to bugfix commit 5f4d98d4, which taught _bt_set_startikey to
avoid the use of forcenonrequired mode iff it reached a row compare key.
Now _bt_set_startikey never needs to avoid the use of nonrequired mode,
which is simpler, more consistent, and more robust.
---
src/backend/access/nbtree/nbtsearch.c | 41 +++++++++--
src/backend/access/nbtree/nbtutils.c | 98 ++++++++++++++-------------
2 files changed, 87 insertions(+), 52 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd..80d727bb1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1288,14 +1288,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* even if the row comparison is of ">" or "<" type, because the
* condition applied to all but the last row member is effectively
* ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
+ * scheme. But, by the same token, if a "column gap" appears
+ * between members, then the part of the row comparison that we
* did use has to be treated as just a ">=" or "<=" condition, and
* so we'd better adjust strat_total accordingly.
+ *
+ * We're able to use a _more_ restrictive strategy when we
+ * encounter a NULL row compare member (which is unsatisfiable).
+ * For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
+ * insertion scan key "a > 1". All rows where "a = 1" have to
+ * perform a NULL row member comparison (or would, if we didn't
+ * use the more restrictive ">" strategy), which is guaranteed to
+ * return false/return NULL.
*/
if (i == keysz - 1)
{
- bool used_all_subkeys = false;
+ bool keep_strat_total = false;
+ bool tighten_strat_total = false;
Assert(!(subkey->sk_flags & SK_ROW_END));
for (;;)
@@ -1307,19 +1316,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (subkey->sk_strategy != cur->sk_strategy)
break; /* wrong direction, can't use it */
if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
+ {
+ /* Unsatisfiable NULL row member */
+ keep_strat_total = true;
+ tighten_strat_total = true;
+ break;
+ }
Assert(keysz < INDEX_MAX_KEYS);
memcpy(inskey.scankeys + keysz, subkey,
sizeof(ScanKeyData));
keysz++;
if (subkey->sk_flags & SK_ROW_END)
{
- used_all_subkeys = true;
+ keep_strat_total = true;
break;
}
}
- if (!used_all_subkeys)
+ if (!keep_strat_total)
{
+ /* Use less restrictive strategy (and fewer keys) */
+ Assert(!tighten_strat_total);
switch (strat_total)
{
case BTLessStrategyNumber:
@@ -1330,6 +1346,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
}
}
+ if (tighten_strat_total)
+ {
+ /* Use more restrictive strategy (and fewer keys) */
+ switch (strat_total)
+ {
+ case BTLessEqualStrategyNumber:
+ strat_total = BTLessStrategyNumber;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ strat_total = BTGreaterStrategyNumber;
+ break;
+ }
+ }
break; /* done with outer loop */
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2..457786885 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2565,23 +2565,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
* whether or not every tuple on the page satisfies a RowCompare
* key based only on firsttup and lasttup -- so we just give up.
*/
- if (!start_past_saop_eq && !so->skipScan)
- break; /* unsafe to go further */
-
- /*
- * We have to be even more careful with RowCompares that come
- * after an array: we assume it's unsafe to even bypass the array.
- * Calling _bt_start_array_keys to recover the scan's arrays
- * following use of forcenonrequired mode isn't compatible with
- * _bt_check_rowcompare's continuescan=false behavior with NULL
- * row compare members. _bt_advance_array_keys must not make a
- * decision on the basis of a key not being satisfied in the
- * opposite-to-scan direction until the scan reaches a leaf page
- * where the same key begins to be satisfied in scan direction.
- * The _bt_first !used_all_subkeys behavior makes this limitation
- * hard to work around some other way.
- */
- return; /* completely unsafe to set pstate.startikey */
+ break; /* unsafe */
}
if (key->sk_strategy != BTEqualStrategyNumber)
{
@@ -3078,6 +3062,56 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ /*
+ * Unlike the simple-scankey case, NULL row members aren't disallowed
+ * (except when it's the first row element that has the NULL arg,
+ * where preprocessing recognizes the scan's qual as unsatisfiable).
+ * But it can never match any rows.
+ *
+ * If the first row member's column is marked required, and this row
+ * comparison is on the very next column, we can stop the scan; there
+ * can't be another tuple that will succeed.
+ */
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ AttrNumber isnull_attno = subkey->sk_attno;
+
+ /* can't be the first row member (preprocessing catches this) */
+ Assert(subkey != (ScanKey) DatumGetPointer(skey->sk_argument));
+
+ subkey--;
+ if (forcenonrequired)
+ {
+ /* treating scan's keys as non-required */
+ }
+ else if (subkey->sk_attno != isnull_attno - 1)
+ {
+ /*
+ * There's an index column gap between the NULL row member,
+ * and the next most significant row member. Cannot end scan.
+ *
+ * The presence of this gap doesn't actually imply that there
+ * really can be another tuple that will succeed; there can't.
+ * This isn't really about the _current_ primitive index scan.
+ *
+ * When we set continuescan=false, it had better be safe for a
+ * scan with a more-significant-than-rowcompare array key to
+ * reposition itself to some later leaf page, in the usual way
+ * (by starting the next primitive index scan). But there'd
+ * be no way for _bt_first to actually locate some later leaf
+ * page if we were to set continuescan=false here -- we'd run
+ * the risk of getting stuck in an unending cycle.
+ */
+ }
+ else if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
if (subkey->sk_attno > tupnatts)
{
/*
@@ -3087,11 +3121,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
+ return true;
}
datum = index_getattr(tuple,
@@ -3148,30 +3178,6 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case
- * (except when it's the first row element that has the NULL arg).
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- Assert(subkey != (ScanKey) DatumGetPointer(skey->sk_argument));
- subkey--;
- if (forcenonrequired)
- {
- /* treating scan's keys as non-required */
- }
- else if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,
--
2.49.0
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch shows how this could work. It refines the rules around
> NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
> The fundamental idea here is to make _bt_check_rowcompare *consistent*
> with _bt_first. That way _bt_set_startikey doesn't have to know
> anything about row compares.
I was dissatisfied with the approach taken in _bt_check_rowcompare,
where in v1 I check whether the previous subkey/row compare member has
a column that's immediately before the current subkey/row compare
member's column (the current subkey/row compare member being marked
ISNULL, making it unsatisfiable).
Why was it ever necessary to look at the previous subkey like this?
Why can't we simply check if an unsatisfiable ISNULL subkey/row
compare is itself marked required in the current scan direction --
just like in every other case where we set continuescan=false? One
possible answer is this: we'll never mark a row compare member after
the first row compare member as required, no matter the details.
But...why not?
Attached is v2, which avoids the v1 business with checking the prior
row member's attribute number. In fact, it doesn't even access the
prior row member at all. All that matters is whether or not the
unsatisfiable ISNULL subkey/row compare is itself marked required in
the current scan direction (there's no more of that "--subkey"
business that you see on HEAD). This makes _bt_check_rowcompare
clearer, and even more in line with what code in places like
_bt_advance_array_keys expects.
This also has significant performance advantages (see later test
case). I now find myself in the awkward position of proposing a useful
performance enhancement well after feature freeze for Postgres 18. I
think that that is actually the best option, on balance, since it
makes row comparisons so much easier to understand from a great
distance -- without the v1 "check the attribute number for a not
required but kinda still required row member key" kludge.
Relevant background information concerning _bt_mark_scankey_required,
the function that deals with marking row compare members required (why
it's okay to change the rules in _bt_mark_scankey_required like this):
There was a 10 year period (from 2006 - 2016) during which it was
possible for multiple/all row members to be marked required (in one
scan direction). This changed in bugfix commit a298a1e0, which dealt
with incorrect handling of NULL index entries in indexed ROW()
comparison. That bugfix simply taught _bt_mark_scankey_required to not
mark any row member beyond the first one as required. But that seems
like overkill to me. My v2 goes back to the old pre-2016
_bt_mark_scankey_required behavior, and fixes the same 2016 problem by
following a more principled approach -- that's what makes it safe to
use required-ness marking, instead of using the v1 "check the
attribute number for a not required but kinda still required row
member key" kludge.
It seems to me that the actual problem that 2016 commit a298a1e0 fixed
is best understood as a problem with the basic rules for stopping the
scan when we encounter a NULL index tuple value in the context of row
compares; once those rules are fixed, we should be able to mark multiple row
members required, based on the old pre-a298a1e0 rules. In other words,
I think that it would have made more sense to blame the original 2016
bug report problem on "the fix for bug #6278" (which happened in
commit 6980f817 and follow-up commit 882368e8, both from 2011).
If I revert 2016 a298a1e0 (i.e. make _bt_mark_scankey_required mark
multiple row compare members required per the old rules), which is a
part of what I'm doing in this v2, I can easily reproduce the bug that
Tom was concerned about back in 2016. However, I find that it's
possible to fix that same bug in a more principled way: by further
refining the basic rules for stopping the scan when we encounter a
NULL (more or less a fixed version of the logic added in 2011 by those
2 commits). So v2 does that as well.
We only need to be slightly more careful on the second-or-subsequent
row member marked required when we encounter a NULL now: it's only
okay to terminate the scan when the row compare is marked required in
the current scan direction. It is not good enough for it to be marked
required in the opposite direction -- unless we're dealing with the
first row compare member (so there's no change in requiredness-marking
behavior when dealing with the first row member, compared to HEAD).
If you think about why it's *ever* okay to terminate the scan using a
key not marked required in the *current* scan direction (when dealing
with NULLs), you'll understand why it cannot safely be applied to
lower-order row compare members. That in itself doesn't make it
generally unsafe to treat lower-order row compare members as required
to continue the scan. After all, there weren't any problems with it
for many years -- until the rules for terminating the scan on NULL
tuple values changed/were optimized in 2011 or so.
Comments in v2 try to explain this as clearly as possible.
> I would like to commit this to Postgres 18, treating it as a bugfix.
> It seems like a much more solid fix than what I came up with in bugfix
> commit 5f4d98d4. Does anybody else have an opinion on that question?
>
> Note that this patch effectively *adds a new optimization* to
> _bt_first around row comparison keys with NULL row members. I don't
> actually care about the performance of such row comparisons.
v2 actually does add an optimization that people would probably find
valuable -- which is a logical consequence of where I've taken things
in v2. Improving performance still isn't really the goal, but I want
to be clear about the fact that this patch could reasonably also be
understood as an optimization (that was much less true with v1, since
that only changed things with ISNULL-marked scan keys, not with tuples
that contain NULLs).
Currently, on master/18, the following query scans an index with
plenty of NULLs in all columns:
select a, b, c
from fuzz_skip_scan
where (a, b, c) >= (11, 1, 71) and (a, b, c) <= (12, 1, 1)
order by a desc, b desc, c desc, d desc;
We get an index-only scan backwards, with 37 buffer hits. However, the
equivalent forwards scan does a lot worse:
select a, b, c
from fuzz_skip_scan
where (a, b, c) >= (11, 1, 71) and (a, b, c) <= (12, 1, 1)
order by a, b, c, d;
That gets an index-only scan forwards, with 69 buffer hits.
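For anyone reproducing this: buffer hit counts like these are just
what EXPLAIN's BUFFERS option reports. Something along these lines
shows the relevant "Buffers: shared hit=N" line (assuming a similar
table and index; exact counts will naturally vary with the data):

explain (analyze, buffers, costs off, timing off)
select a, b, c
from fuzz_skip_scan
where (a, b, c) >= (11, 1, 71) and (a, b, c) <= (12, 1, 1)
order by a, b, c, d;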
There's no need for this inconsistency. With v2 applied, we get it
down to 35 buffer hits for both variants. If it's safe to only read 37
or 35 pages in one direction, then why wouldn't it also be safe to get
about that same number of buffer hits when scanning in the opposite
direction?
This improved performance profile is a consequence of v2 restoring
_bt_mark_scankey_required to its pre-2016 state, adding back logic
that is symmetric with existing logic in _bt_first (logic that wasn't
removed in 2016). One could argue that v2 is just fixing a regression
in 2016 commit a298a1e0.
--
Peter Geoghegan
Attachments:
v2-0001-Simplify-and-improve-row-compare-NULL-handling.patch
From 7e86fe5aaff4ef18abc79bdc3bd0d0f1a23f2fc8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Jun 2025 19:32:57 -0400
Subject: [PATCH v2] Simplify and improve row compare NULL handling.
Improve _bt_first's handling of NULL row comparison members in such a
way as to make it consistent with _bt_check_rowcompare's approach. In
other words, make sure that code that starts primitive index scans that
involve row compares fully agrees with code that ends primitive index
scans -- even in the presence of a NULL row compare member. This makes
row comparisons more similar to scalar inequality keys, obviating the
need for _bt_set_startikey to consider row comparisons directly.
_bt_first will now avoid uselessly scanning earlier index tuples that
cannot possibly be matches, owing to the scan's use of a NULL row
compare member. The goal isn't to improve performance; the goal is to
make _bt_first agree with _bt_check_rowcompare about where primitive
index scans should start and end.
Also go back to marking row compare members after the first as required
when safe. This is more or less a straight revert of 2016 bugfix commit
a298a1e0. We can address the complaint that led to that commit by being
more careful about the conditions where NULLs can end the scan in the
row compare path (for row members beyond the first one, which can once
again be marked required when safe).
Follow-up to bugfix commit 5f4d98d4, which taught _bt_set_startikey to
avoid the use of forcenonrequired mode iff it reached a row compare key.
Now _bt_set_startikey never needs to avoid the use of nonrequired mode,
which is simpler, more consistent, and more robust.
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 18 ++-
src/backend/access/nbtree/nbtsearch.c | 43 +++++-
src/backend/access/nbtree/nbtutils.c | 122 +++++++++++-------
3 files changed, 126 insertions(+), 57 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..44edb2e48 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -786,12 +786,26 @@ _bt_mark_scankey_required(ScanKey skey)
if (skey->sk_flags & SK_ROW_HEADER)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ AttrNumber attno = skey->sk_attno;
/* First subkey should be same column/operator as the header */
Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(subkey->sk_attno == skey->sk_attno);
+ Assert(subkey->sk_attno == attno);
Assert(subkey->sk_strategy == skey->sk_strategy);
- subkey->sk_flags |= addflags;
+
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != attno)
+ break; /* non-adjacent key, so not required */
+ if (subkey->sk_strategy != skey->sk_strategy)
+ break; /* wrong direction, so not required */
+ subkey->sk_flags |= addflags;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ attno++;
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd..ff453ec4e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1288,14 +1288,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* even if the row comparison is of ">" or "<" type, because the
* condition applied to all but the last row member is effectively
* ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
+ * scheme. But, by the same token, if a "column gap" appears
+ * between members, then the part of the row comparison that we
* did use has to be treated as just a ">=" or "<=" condition, and
* so we'd better adjust strat_total accordingly.
+ *
+ * We're able to use a _more_ restrictive strategy when we
+ * encounter a NULL row compare member (which is unsatisfiable).
+ * For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
+ * insertion scan key "a > 1". All rows where "a = 1" have to
+ * perform a NULL row member comparison (or would, if we didn't
+ * use the more restrictive ">" strategy), which is guaranteed to
+ * return false/return NULL.
*/
if (i == keysz - 1)
{
- bool used_all_subkeys = false;
+ bool keep_strat_total = false;
+ bool tighten_strat_total = false;
Assert(!(subkey->sk_flags & SK_ROW_END));
for (;;)
@@ -1307,19 +1316,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (subkey->sk_strategy != cur->sk_strategy)
break; /* wrong direction, can't use it */
if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
+ {
+ /* Unsatisfiable NULL row member */
+ keep_strat_total = true;
+ tighten_strat_total = true;
+ break;
+ }
Assert(keysz < INDEX_MAX_KEYS);
memcpy(inskey.scankeys + keysz, subkey,
sizeof(ScanKeyData));
keysz++;
if (subkey->sk_flags & SK_ROW_END)
{
- used_all_subkeys = true;
+ keep_strat_total = true;
break;
}
}
- if (!used_all_subkeys)
+ if (!keep_strat_total)
{
+ /* Use less restrictive strategy (and fewer keys) */
+ Assert(!tighten_strat_total);
switch (strat_total)
{
case BTLessStrategyNumber:
@@ -1330,7 +1346,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
}
}
- break; /* done with outer loop */
+ if (tighten_strat_total)
+ {
+ /* Use more restrictive strategy (and fewer keys) */
+ switch (strat_total)
+ {
+ case BTLessEqualStrategyNumber:
+ strat_total = BTLessStrategyNumber;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ strat_total = BTGreaterStrategyNumber;
+ break;
+ }
+ }
+ break; /* done with outermost loop */
}
}
else
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2..ea6fab02f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2565,23 +2565,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
* whether or not every tuple on the page satisfies a RowCompare
* key based only on firsttup and lasttup -- so we just give up.
*/
- if (!start_past_saop_eq && !so->skipScan)
- break; /* unsafe to go further */
-
- /*
- * We have to be even more careful with RowCompares that come
- * after an array: we assume it's unsafe to even bypass the array.
- * Calling _bt_start_array_keys to recover the scan's arrays
- * following use of forcenonrequired mode isn't compatible with
- * _bt_check_rowcompare's continuescan=false behavior with NULL
- * row compare members. _bt_advance_array_keys must not make a
- * decision on the basis of a key not being satisfied in the
- * opposite-to-scan direction until the scan reaches a leaf page
- * where the same key begins to be satisfied in scan direction.
- * The _bt_first !used_all_subkeys behavior makes this limitation
- * hard to work around some other way.
- */
- return; /* completely unsafe to set pstate.startikey */
+ break; /* unsafe */
}
if (key->sk_strategy != BTEqualStrategyNumber)
{
@@ -3078,6 +3062,34 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ /*
+ * Unlike the simple-scankey case, NULL row members aren't disallowed
+ * (except when it's the first row element that has the NULL arg,
+ * where preprocessing recognizes the scan's qual as unsatisfiable).
+ * But it can never match any rows.
+ *
+ * If this row comparison member is marked required in the current
+ * scan direction, we can stop the scan; there can't be another tuple
+ * that will succeed.
+ */
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /* can't be the first row member (preprocessing catches this) */
+ Assert(subkey != (ScanKey) DatumGetPointer(skey->sk_argument));
+
+ if (forcenonrequired)
+ {
+ /* treating scan's keys as non-required */
+ }
+ else if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
if (subkey->sk_attno > tupnatts)
{
/*
@@ -3087,11 +3099,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
+ return true;
}
datum = index_getattr(tuple,
@@ -3107,6 +3115,8 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
}
else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
{
+ int reqflags = SK_BT_REQBKWD;
+
/*
* Since NULLs are sorted before non-NULLs, we know we have
* reached the lower limit of the range of values for this
@@ -3118,13 +3128,34 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* have initially positioned to the start of the index.
* (_bt_advance_array_keys also relies on this behavior during
* forward scans.)
+ *
+ * For example, quals like "WHERE (a, b, c) < (2, 42, 333)"
+ * can terminate a backwards scan upon reaching the rightmost
+ * tuple whose "a" column has a NULL, since the entire index
+ * cannot contain further matches to the left of that tuple.
+ * NULL is "<" 2, but it nevertheless indicates the end of all
+ * matching tuples to our < row compare qual.
+ *
+ * Note, however, that it is _not_ safe to end the scan on a
+ * row member that's marked required in the opposite-to-scan
+ * direction for row compare members beyond the first one.
+ * We'll only terminate the scan on a second or subsequent row
+ * member when they're marked required in the scan direction.
+ * For example, quals like "WHERE (a, b, c) > (2, 42, 333)"
+ * can terminate a backwards scan upon reaching the index's
+ * rightmost "a = 2" tuple whose "b" column contains a NULL.
+ * NULL is treated like just another value that's "<" 42.
*/
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ if (subkey->sk_attno == skey->sk_attno)
+ reqflags |= SK_BT_REQFWD; /* for first member only */
+ if ((subkey->sk_flags & reqflags) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
}
else
{
+ int reqflags = SK_BT_REQFWD;
+
/*
* Since NULLs are sorted after non-NULLs, we know we have
* reached the upper limit of the range of values for this
@@ -3136,8 +3167,27 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* may have initially positioned to the end of the index.
* (_bt_advance_array_keys also relies on this behavior during
* backward scans.)
+ *
+ * For example, quals like "WHERE (a, b, c) > (2, 42, 333)"
+ * can terminate a forwards scan upon reaching the leftmost
+ * tuple whose "a" column has a NULL, since the entire index
+ * cannot contain further matches to the right of that tuple.
+ * NULL is ">" 2, but it nevertheless indicates the end of all
+ * matching tuples to our > row compare qual.
+ *
+ * Note, however, that it is _not_ safe to end the scan on a
+ * row member that's marked required in the opposite-to-scan
+ * direction for row compare members beyond the first one.
+ * We'll only terminate the scan on a second or subsequent row
+ * member when they're marked required in the scan direction.
+ * For example, quals like "WHERE (a, b, c) < (2, 42, 333)"
+ * can terminate a forwards scan upon reaching the index's
+ * leftmost "a = 2" tuple whose "b" column contains a NULL.
+ * NULL is treated like just another value that's ">" 42.
*/
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ if (subkey->sk_attno == skey->sk_attno)
+ reqflags |= SK_BT_REQBKWD; /* for first member only */
+ if ((subkey->sk_flags & reqflags) &&
ScanDirectionIsForward(dir))
*continuescan = false;
}
@@ -3148,30 +3198,6 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case
- * (except when it's the first row element that has the NULL arg).
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- Assert(subkey != (ScanKey) DatumGetPointer(skey->sk_argument));
- subkey--;
- if (forcenonrequired)
- {
- /* treating scan's keys as non-required */
- }
- else if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,
--
2.49.0
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In general, when we end a primitive index scan, the code that sets
> continuescan=false (any such code, not just _bt_check_rowcompare NULL
> row member code) has to make sure that starting the next primitive
> index scan will actually allow the top-level scan to make forward
> progress -- the new primscan needs to land on some later leaf page.
> Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
> consider truly robust. We need that guarantee to avoid these
> cycles/infinite looping. AFAIK there are no bugs of that general
> nature on Postgres 18
Correction: there *is* a live bug like this on Postgres 18/HEAD, which
involves simple scalar inequalities with an incomplete opfamily.
Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's
0002-* patch previously appeared as 0001-* in v2 and v1).
The bug in question involves redundant keys that could not be
eliminated by preprocessing due to a lack of available cross-type
support. Quals with certain combinations of redundant scalar keys can
cycle through the index being scanned ceaselessly. The scan fails to
make any real progress, again and again. Unlike the row compare case,
we lack even kludgey handling that avoids the problem within
_bt_set_startikey.
There's a live bug here, so I have to act. It seems to make sense to
apply everything on HEAD/Postgres 18, and be done with any question of
getting stuck like this/infinite cycling from resetting the scan's
arrays to get a clean slate. But I would certainly welcome other
opinions on this.
Problem statement
=================
I have confirmed the existence of the live bug using fuzz-testing,
combined with hard-coding that makes _bt_compare_scankey_args return
false unconditionally at the point where we're supposed to compare
each sk_argument from a pair of scan keys against the same attribute
(that's the easiest way to test problems in this area). With the right
combination of contradictory runtime keys, and with the right index
scan/index, we can get stuck.
When we get stuck, it's because we apply forcenonrequired mode in
_bt_readpage, reset the arrays at the end of the same _bt_readpage,
start another primitive index scan when pstate.finaltup is considered,
and then arrive at the same leaf page we ended on last time. We'll
then start the cycle anew, without any hope of ever making useful
progress. Not good.
Fundamentally, the problem here is that _bt_advance_array_keys expects
certain things from _bt_preprocess_keys and _bt_first that they won't
always honor. My approach in 0001-* more or less makes things always
work in the way that _bt_advance_array_keys already expects, rather
than changing _bt_advance_array_keys (or related code such as
_bt_set_startikey). More on that below, under "bugfix".
Background info on _bt_advance_array_keys' expectations
=======================================================
Notably, _bt_advance_array_keys would like to be able to determine if
_bt_first will be able to perform another primitive index scan based
on lower-order inequalities that are marked required in the
opposite-to-scan direction only (i.e. an > or >= key when scanning
forwards, a < or <= key when scanning backwards). That detail is
important to my repro of the bug. _bt_advance_array_keys expects to be
able to apply required-in-opposite-direction inequalities to reason
about how _bt_first will behave the next time it is called.
During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5",
_bt_advance_array_keys needs to know how "b > 5" will affect _bt_first
should it choose to schedule another primscan. It needs to decide
whether or not to actually schedule such a primscan. It might make
sense for _bt_advance_array_keys to stick to the leaf level for the
time being upon reaching the threshold between "a = 1" and "a = 2"
tuples -- the first "a = 2 AND b > 5" might be on the same leaf page.
OTOH, scheduling a new primscan is far preferable when it'll allow
_bt_first to land the scan on a much later leaf page (we'll skip over
many irrelevant leaf pages).
In general, the first leaf page/position that the first "a = 2 AND
b > 5" tuple appears on might be many leaf pages after the first leaf
page that the first "a = 2" tuple appears on. It all depends.
This scheme works well if there are no redundant > keys (which is very
much the common case, since incomplete opfamilies are rare). It
tacitly assumes that _bt_first applies the same definition of a
"required in the opposite-to-scan direction" key as
_bt_advance_array_keys does. But that isn't quite true right now.
Bugfix
======
My 0001-* bugfix makes nbtree preprocessing deal with a qual like
"WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" with an
incomplete opfamily on "b" by leaving so->keyData[] in a state that
reflects the redundancy. _bt_preprocess_keys will now make sure that
keys left in so->keyData[] still appear in the most useful order. This
avoids creating confusion elsewhere.
Preprocessing will arbitrarily decide that only the first "b >"
condition gets to be marked required, while *unmarking* the
requiredness marking on the second "b >" condition. That way _bt_first
doesn't get to make its own independent choice about initial
positioning keys, based on ever-so-slightly different rules. Rather,
_bt_first has to agree with everybody else about which > or >= key
should be used, and about which < or <= key should be used (and even
about which "=" key should be used).
_bt_first does what preprocessing tells it to do. Just like
_bt_advance_array_keys will. No ifs, no buts.
Placing nonrequired keys at the end of so->keyData[]
----------------------------------------------------
Preprocessing also reorders so->keyData[] such that the "unlucky" keys
that _bt_preprocess_keys *doesn't* pick go to the end of the array --
we want to keep them out of the way, at least until they're needed.
Note that this makes it possible for keys to appear out of index
attribute order, but that's okay. Nonrequired keys don't need to be in
any particular order.
Reordering so->keyData[] in this way eliminates any risk that some
early nonrequired key on "b" will stop us from getting to some later
required key of interest on "c". Not in _bt_first, not in
_bt_checkkeys or _bt_advance_array_keys. Nowhere.
In general, the rule going forward is that all nbtree code (barring
preprocessing code) can assume that redundant keys don't really exist,
as long as they're marked required (which virtually all keys are now,
thanks to the skip scan/skip array work). There can be at most one
inequality key marked SK_BT_REQFWD and one inequality key marked
SK_BT_REQBKWD per index attribute -- no matter what.
nbtree is blasé about keeping around redundant scan keys right now.
That seems like it has the potential to cause more bugs in the future.
It's easy to forget about redundant required keys, and they're rare
(except perhaps when row compares are used, since preprocessing has
hardly any smarts about redundancy that happens to involve row
compares right now). For a recent example of a bug caused by such an
oversight, see bugfix commit 7e6fb5da.
New _bt_first behavior with row compares
----------------------------------------
Currently, on HEAD, _bt_first is prepared to use a seemingly random
mix of the first row compare member followed by some later
redundant-ish scalar key on some later column (not the key on the same
column that appears in the row compare). V3's row compare patch
now avoids that behavior. That happens automatically in v3, by virtue
of the fact that there can't be a row compare that's marked required
in the opposite-to-scan direction followed by some other key
that's also marked required (skip arrays don't accept row compare
inequalities, so it's just not possible).
In short, if _bt_first gets to use a row compare to build its
insertion scan key at all, then it *must* use all of the row members;
it cannot use any additional lower-order keys. (Actually, _bt_first
can only do that with individual row members marked required).
Again, there needs to be symmetry here (perhaps it isn't strictly
necessary to go exactly this far, but the row compare change to
_bt_first *is* necessary). Again, the rules that decide when we'll
start an index scan when scanning in one particular scan direction
need to be symmetric with the rules that decide when we'll end an index
scan with the same qual moving in the opposite scan direction.
Impact on Postgres 17
=====================
On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2,
3) AND b > 5 AND c > 10_0000::bigint" shouldn't get infinite cycling
with an incomplete opfamily on "b". That can only happen on HEAD
because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys
call to _bt_readpage, to reset the scan's arrays after we used
forcenonrequired mode to read all the tuples on the page.
Postgres 17 *does* reset the array keys through a call to
_bt_start_array_keys, to get a clean slate. But that only happens in
the context of mark/restore for merge joins, which seems robust
against infinite cycling of the kind I'm seeing on HEAD.
--
Peter Geoghegan
Attachments:
v3-0001-nbtree-Make-redundant-scan-keys-non-required.patch
From dbfa7684a0554c0e2d33b9667ef445425f67ad94 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 23 Jun 2025 22:46:05 -0400
Subject: [PATCH v3 1/2] nbtree: Make redundant scan keys non-required.
Preprocessing will now only mark one key out of a pair of known
contradictory/redundant keys as required to continue the scan in corner
cases where it cannot just eliminate one particular key (or end the scan
by recognizing that the scan's qual is contradictory) due to a lack of
suitable cross-type support. This simplifies things for all downstream
users of the preprocessed scan keys. Now the scan key that is used for
initial positioning purposes when scanning forward must also be the only
key that is capable of ending the scan when scanning backwards.
This is needed so that scans with array keys that reset their arrays in
the forcenonrequired=true path (added by bugfix commit 5f4d98d4) will
consistently avoid infinite cycles. It also makes scans with array keys
reliably avoid duplicative leaf page accesses, which seems like a good
thing to insist upon going forward, on general robustness grounds.
Scan keys that are no longer marked required due to being redundant are
now relocated to the end of the so->keyData[] array. That way they'll
be evaluated last. This makes things more in line with what code like
_bt_checkkeys and _bt_advance_array_keys generally expects. The new
scheme also obviates the need for _bt_advance_array_keys to make sure
that any non-required SAOP arrays have their elements reset to the first
element in the current scan direction, lest the array elements be used
for initial positioning purposes within _bt_first later on.
Follow-up to bugfix commit 5f4d98d4, which taught _bt_set_startikey to
avoid the use of forcenonrequired mode iff it reached a row compare key.
Now _bt_set_startikey will be robust against infinite cycles, even with
arbitrary combinations of redundant/contradictory keys that couldn't be
fully eliminated by preprocessing.
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 349 +++++++++++++++---
src/backend/access/nbtree/nbtsearch.c | 36 +-
src/backend/access/nbtree/nbtutils.c | 136 +------
3 files changed, 331 insertions(+), 190 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..855b32e83 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/nbtree.h"
+#include "common/int.h"
#include "lib/qunique.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
@@ -56,6 +57,8 @@ static void _bt_skiparray_strat_decrement(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
static void _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
+static void _bt_unmark_extra_keys(IndexScanDesc scan, int *keyDataMap);
+static int _bt_reorder_array_cmp(const void *a, const void *b);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
static int _bt_num_array_keys(IndexScanDesc scan, Oid *skip_eq_ops_out,
@@ -96,7 +99,7 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* incomplete sets of cross-type operators, we may fail to detect redundant
* or contradictory keys, but we can survive that.)
*
- * The output keys must be sorted by index attribute. Presently we expect
+ * Required output keys are sorted by index attribute. Presently we expect
* (but verify) that the input keys are already so sorted --- this is done
* by match_clauses_to_index() in indxpath.c. Some reordering of the keys
* within each attribute may be done as a byproduct of the processing here.
@@ -134,22 +137,17 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* cannot compare two keys for lack of a suitable cross-type operator,
* we cannot eliminate either. If there are two such keys of the same
* operator strategy, the second one is just pushed into the output array
- * without further processing here. We may also emit both >/>= or both
- * </<= keys if we can't compare them. The logic about required keys still
- * works if we don't eliminate redundant keys.
+ * without further processing here. We may also emit both >/>= or both </<=
+ * keys if we can't compare them (though only one will be marked required).
*
- * Note that one reason we need direction-sensitive required-key flags is
- * precisely that we may not be able to eliminate redundant keys. Suppose
- * we have "x > 4::int AND x > 10::bigint", and we are unable to determine
- * which key is more restrictive for lack of a suitable cross-type operator.
- * _bt_first will arbitrarily pick one of the keys to do the initial
- * positioning with. If it picks x > 4, then the x > 10 condition will fail
- * until we reach index entries > 10; but we can't stop the scan just because
- * x > 10 is failing. On the other hand, if we are scanning backwards, then
- * failure of either key is indeed enough to stop the scan. (In general, when
- * inequality keys are present, the initial-positioning code only promises to
- * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * We may not be able to eliminate redundant keys. Suppose we have a qual
+ * like "x > 4::int AND x > 10::bigint", and we are unable to determine which
+ * key is more restrictive for lack of a suitable cross-type operator. We'll
+ * handle this by arbitrarily picking one of the keys to be marked as required.
+ * If we pick x > 4, then the x > 10 condition will fail until we reach index
+ * entries > 10. That won't end the scan, because x > 10 won't be marked
+ * required. Note that in addition to not marking these redundant keys as
+ * required, we also relocate them to the end of the array.
*
* As a byproduct of this work, we can detect contradictory quals such
* as "x = 1 AND x > 2". If we see that, we return so->qual_ok = false,
@@ -193,6 +191,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
ScanKey arrayKeyData;
int *keyDataMap = NULL;
int arrayidx = 0;
+ bool comparisonfailed = false;
if (so->numberOfKeys > 0)
{
@@ -388,7 +387,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = NULL;
xform[j].inkeyi = -1;
}
- /* else, cannot determine redundancy, keep both keys */
+ else
+ comparisonfailed = true;
}
/* track number of attrs for which we have "=" keys */
numberOfEqualCols++;
@@ -409,6 +409,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTLessStrategyNumber - 1].inkey = NULL;
}
+ else
+ comparisonfailed = true;
}
/* try to keep only one of >, >= */
@@ -426,6 +428,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTGreaterStrategyNumber - 1].inkey = NULL;
}
+ else
+ comparisonfailed = true;
}
/*
@@ -466,25 +470,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* check strategy this key's operator corresponds to */
j = inkey->sk_strategy - 1;
- /* if row comparison, push it directly to the output array */
- if (inkey->sk_flags & SK_ROW_HEADER)
- {
- ScanKey outkey = &so->keyData[new_numberOfKeys++];
-
- memcpy(outkey, inkey, sizeof(ScanKeyData));
- if (arrayKeyData)
- keyDataMap[new_numberOfKeys - 1] = i;
- if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
-
- /*
- * We don't support RowCompare using equality; such a qual would
- * mess up the numberOfEqualCols tracking.
- */
- Assert(j != (BTEqualStrategyNumber - 1));
- continue;
- }
-
if (inkey->sk_strategy == BTEqualStrategyNumber &&
(inkey->sk_flags & SK_SEARCHARRAY))
{
@@ -591,11 +576,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
* We can't determine which key is more restrictive. Push
* xform[j] directly to the output array, then set xform[j] to
* the new scan key.
- *
- * Note: We do things this way around so that our arrays are
- * always in the same order as their corresponding scan keys,
- * even with incomplete opfamilies. _bt_advance_array_keys
- * depends on this.
*/
ScanKey outkey = &so->keyData[new_numberOfKeys++];
@@ -607,6 +587,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = inkey;
xform[j].inkeyi = i;
xform[j].arrayidx = arrayidx;
+ comparisonfailed = true;
}
}
}
@@ -622,6 +603,16 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (arrayKeyData)
_bt_preprocess_array_keys_final(scan, keyDataMap);
+ /*
+ * If all prior preprocessing steps failed to eliminate any "extra" keys
+ * due to a lack of suitable cross-type support, unset their required
+ * markings now. This leaves so->keyData[] with only one required > or >=
+ * key, and only one required < or <= key. It'll always prefer to keep a
+ * required = key on an attr that also has a redundant inequality.
+ */
+ if (unlikely(comparisonfailed && so->qual_ok))
+ _bt_unmark_extra_keys(scan, keyDataMap);
+
/* Could pfree arrayKeyData/keyDataMap now, but not worth the cycles */
}
@@ -847,9 +838,6 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
cmp_op;
StrategyNumber strat;
- Assert(!((leftarg->sk_flags | rightarg->sk_flags) &
- (SK_ROW_HEADER | SK_ROW_MEMBER)));
-
/*
* First, deal with cases where one or both args are NULL. This should
* only happen when the scankeys represent IS NULL/NOT NULL conditions.
@@ -924,6 +912,13 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
return true;
}
+ /*
+ * We don't yet know how to determine redundancy when it involves a row
+ * compare key (barring simple cases involving IS NULL/IS NOT NULL)
+ */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_HEADER)
+ return false;
+
/*
* If either leftarg or rightarg are equality-type array scankeys, we need
* specialized handling (since by now we know that IS NULL wasn't used)
@@ -1467,6 +1462,272 @@ _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
}
}
+/*
+ * _bt_unmark_extra_keys() -- make "extra" so->keyData[] keys nonrequired.
+ *
+ * When _bt_preprocess_keys failed to eliminate one or more contradictory
+ * keys, it calls here to make sure that there is no more than one > or >= key
+ * marked required, and no more than one < or <= key marked required. That
+ * way _bt_first and _bt_checkkeys will reliably have exactly the same ideas
+ * about which keys to use to start and end the scan. This is important
+ * during scans with array keys, where _bt_checkkeys/_bt_advance_array_keys
+ * relies on _bt_first to reposition the scan when scheduling a new primscan.
+ *
+ * We also relocate "extra" keys that were unmarked required to the end of the
+ * so->keyData[] array. This is for the benefit of _bt_advance_array_keys,
+ * which expects to be able to test if the scan's keys would end the scan were
+ * the scan direction to change. This is used to test whether _bt_first is
+ * able to relocate the scan to a later leaf page; if an inequality would end
+ * the scan if the scan direction were reversed, then _bt_first will be able
+ * to apply the same inequality to relocate the scan to a later leaf page
+ * (even when the current leaf page satisfies the scan's current = keys,
+ * including those used by one or more = arrays).
+ *
+ * Only call here when _bt_compare_scankey_args returned false at least once
+ * (otherwise, calling here will just waste cycles). We always unmark at
+ * least one key per call to _bt_compare_scankey_args that reported failure.
+ */
+static void
+_bt_unmark_extra_keys(IndexScanDesc scan, int *keyDataMap)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ AttrNumber curattr;
+ ScanKey cur;
+ bool *unmarkikey;
+ int nunmark,
+ nunmarked,
+ nkept,
+ firsti;
+ ScanKey keepKeys,
+ unmarkKeys;
+ FmgrInfo *keepOrderProcs = NULL,
+ *unmarkOrderProcs = NULL;
+ bool haveEquals,
+ haveReqForward,
+ haveReqBackward;
+
+ /*
+ * Do an initial pass over so->keyData[] that determines which keys to
+ * keep as required, and which to mark non-required.
+ *
+ * When both equality and inequality keys remain on a single attribute, we
+ * *must* leave only the equality key as required to continue the scan.
+ * That way _bt_first will use the key for initial positioning purposes.
+ *
+ * This isn't optional. _bt_checkkeys() will stop the scan as soon as an
+ * equality qual fails. For example, if _bt_first were allowed to start
+ * at x=4 given a qual "x >= 4 AND x = 10", it would fail and stop before
+ * reaching x=10. If multiple equality quals on the same attribute have
+ * survived all prior preprocessing steps, we'll arbitrarily allow only
+ * one to remain marked required.
+ */
+ unmarkikey = palloc0(so->numberOfKeys * sizeof(bool));
+ nunmark = 0;
+
+ /* Set things up for first key's attr */
+ cur = so->keyData;
+ curattr = cur->sk_attno;
+ firsti = 0;
+ haveEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ for (int i = 0; i < so->numberOfKeys; cur++, i++)
+ {
+ if (cur->sk_attno != curattr)
+ {
+ /* Reset for next attr */
+ curattr = cur->sk_attno;
+ firsti = i;
+
+ haveEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ }
+
+ /* First consider equalities */
+ if (haveEquals)
+ {
+ /*
+ * We've already found the first "=" key for attr. We already
+ * decided that every other key on the same attr is to be unmarked
+ */
+ Assert(!(cur->sk_flags & SK_SEARCHNULL));
+ unmarkikey[i] = true;
+ nunmark++;
+ continue;
+ }
+ else if ((cur->sk_flags & SK_BT_REQFWD) &&
+ (cur->sk_flags & SK_BT_REQBKWD))
+ {
+ /*
+ * Found the first "=" key for attr. All other keys on the same
+ * attr will be unmarked.
+ */
+ haveEquals = true;
+ for (int j = firsti; j < i; j++)
+ {
+ /* We'll unmark prior keys for the same attr after all */
+ if (!unmarkikey[j])
+ {
+ unmarkikey[j] = true;
+ nunmark++;
+ }
+ }
+ continue;
+ }
+
+ /* Next consider inequalities */
+ if ((cur->sk_flags & SK_BT_REQFWD) && !haveReqForward)
+ {
+ haveReqForward = true;
+ continue;
+ }
+ else if ((cur->sk_flags & SK_BT_REQBKWD) && !haveReqBackward)
+ {
+ haveReqBackward = true;
+ continue;
+ }
+ unmarkikey[i] = true;
+ nunmark++;
+ }
+
+ /*
+ * We're only supposed to be called when _bt_compare_scankey_args reported
+ * failure when called from the main _bt_preprocess_keys loop
+ */
+ Assert(nunmark > 0);
+
+ /*
+ * Next, allocate temp arrays: one set for unchanged keys, another for
+ * keys that will be unmarked/made non-required
+ */
+ unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ nunmarked = 0;
+ nkept = 0;
+ if (so->numArrayKeys)
+ {
+ unmarkOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ keepOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ }
+
+ /*
+ * Next, copy the contents of so->keyData[] into the appropriate temp
+ * array.
+ *
+ * Note: scans with array keys need us to maintain a mapping of the old
+ * so->keyData[] positions to the new ones. This is similar to what
+ * already just happened in _bt_preprocess_array_keys_final.
+ */
+ for (int origikey = 0; origikey < so->numberOfKeys; origikey++)
+ {
+ ScanKey origkey = so->keyData + origikey;
+ ScanKey unmark;
+
+ if (!unmarkikey[origikey])
+ {
+ /* non-redundant key stays at the start of so->keyData[] */
+ memcpy(keepKeys + nkept, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[origikey] = nkept;
+ memcpy(keepOrderProcs + nkept, &so->orderProcs[origikey],
+ sizeof(FmgrInfo));
+ }
+
+ nkept++;
+ continue;
+ }
+
+ /* key will be unmarked */
+ unmark = unmarkKeys + nunmarked;
+ memcpy(unmark, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[origikey] = (so->numberOfKeys - nunmark) + nunmarked;
+ memcpy(&unmarkOrderProcs[nunmarked], &so->orderProcs[origikey],
+ sizeof(FmgrInfo));
+ }
+
+ nunmarked++;
+
+ /* clear requiredness flags on redundant key (and its subkeys) */
+ unmark->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (unmark->sk_flags & SK_ROW_HEADER)
+ {
+ ScanKey subkey = (ScanKey) DatumGetPointer(origkey->sk_argument);
+
+ Assert(subkey->sk_strategy == origkey->sk_strategy);
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ subkey->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+ }
+ }
+
+ /*
+ * Copy temp arrays back into so->keyData[]
+ */
+ Assert(nkept == so->numberOfKeys - nunmark);
+ Assert(nunmarked == nunmark);
+ memcpy(so->keyData, keepKeys, sizeof(ScanKeyData) * nkept);
+ memcpy(so->keyData + nkept, unmarkKeys, sizeof(ScanKeyData) * nunmarked);
+
+ /* Done with temp arrays */
+ pfree(unmarkikey);
+ pfree(keepKeys);
+ pfree(unmarkKeys);
+
+ /*
+ * Now copy temp entries needed by scans with = array keys back into
+ * so->orderProcs[]. The order needs to continue to match the new order
+ * used by so->keyData[].
+ */
+ if (so->numArrayKeys)
+ {
+ memcpy(so->orderProcs, keepOrderProcs, sizeof(FmgrInfo) * nkept);
+ memcpy(so->orderProcs + nkept, unmarkOrderProcs,
+ sizeof(FmgrInfo) * nunmarked);
+
+ /* Also fix-up array->scan_key references */
+ for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[arridx];
+
+ array->scan_key = keyDataMap[array->scan_key];
+ }
+
+ /*
+ * Also make sure that the scan's arrays appear in the expected order.
+ * This must match corresponding scan key/so->orderProcs[] entries.
+ */
+ qsort(so->arrayKeys, so->numArrayKeys, sizeof(BTArrayKeyInfo),
+ _bt_reorder_array_cmp);
+
+ /* Done with temp arrays */
+ pfree(unmarkOrderProcs);
+ pfree(keepOrderProcs);
+ }
+}
+
+/*
+ * qsort comparator for reordering so->arrayKeys[] BTArrayKeyInfo entries
+ */
+static int
+_bt_reorder_array_cmp(const void *a, const void *b)
+{
+ BTArrayKeyInfo *arraya = (BTArrayKeyInfo *) a;
+ BTArrayKeyInfo *arrayb = (BTArrayKeyInfo *) b;
+
+ return pg_cmp_s32(arraya->scan_key, arrayb->scan_key);
+}
+
/*
* _bt_preprocess_array_keys() -- Preprocess SK_SEARCHARRAY scan keys
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd..a40b0cf81 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -971,22 +971,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* of what initial positioning strategy to use.
*
* When the scan keys include cross-type operators, _bt_preprocess_keys
- * may not be able to eliminate redundant keys; in such cases we will
- * arbitrarily pick a usable one for each attribute. This is correct
- * but possibly not optimal behavior. (For example, with keys like
- * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
- * x=5 would be more efficient.) Since the situation only arises given
- * a poorly-worded query plus an incomplete opfamily, live with it.
- *
- * When both equality and inequality keys appear for a single attribute
- * (again, only possible when cross-type operators appear), we *must*
- * select one of the equality keys for the starting point, because
- * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
- * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
- * start at x=4, we will fail and stop before reaching x=10. If multiple
- * equality quals survive preprocessing, however, it doesn't matter which
- * one we use --- by definition, they are either redundant or
- * contradictory.
+ * may not be able to eliminate redundant keys; in such cases it will
+ * arbitrarily pick a usable one for each attribute, making sure that the
+ * other key isn't marked required to continue the scan. We only use keys
+ * that were marked required (perhaps _only_ marked required in the scan
+ * direction opposite our own) here. There's no risk that we'll get
+ * confused about which key to use, since _bt_preprocess_keys will also
+ * relocate the other key to the end of so->keyData[].
*
* In practice we rarely see any "attribute boundary key gaps" here.
* Preprocessing can usually backfill skip array keys for any attributes
@@ -996,10 +987,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* This happens with range skip arrays, which store inequality keys in the
* array's low_compare/high_compare fields (used to find the first/last
* set of matches, when = key will lack a usable sk_argument value).
- * These are always preferred over any redundant "standard" inequality
- * keys on the same column (per the usual rule about preferring = keys).
- * Note also that any column with an = skip array key can never have an
- * additional, contradictory = key.
*
* All keys (with the exception of SK_SEARCHNULL keys and SK_BT_SKIP
* array keys whose array is "null_elem=true") imply a NOT NULL qualifier.
@@ -1012,7 +999,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*
* In this loop, row-comparison keys are treated the same as keys on their
* first (leftmost) columns. We'll add on lower-order columns of the row
- * comparison below, if possible.
+ * comparison below.
*
* The selected scan keys (at most one per index column) are remembered by
* storing their addresses into the local startKeys[] array.
@@ -1196,6 +1183,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
impliesNN = NULL;
}
+ /*
+ * Keys that are not marked required in either scan direction
+ * aren't eligible to be an initial positioning key
+ */
+ if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) == 0)
+ continue;
+
/*
* Can we use this key as a starting boundary for this attr?
*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2..eb6dbfda3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -44,7 +44,6 @@ static bool _bt_array_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *arra
static bool _bt_array_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
bool *skip_array_set);
-static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
bool readpagetup, int sktrig, bool *scanBehind);
@@ -52,7 +51,6 @@ static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
int sktrig, bool sktrig_required);
#ifdef USE_ASSERT_CHECKING
-static bool _bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir);
static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
#endif
static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
@@ -1034,73 +1032,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
return false;
}
-/*
- * _bt_rewind_nonrequired_arrays() -- Rewind SAOP arrays not marked required
- *
- * Called when _bt_advance_array_keys decides to start a new primitive index
- * scan on the basis of the current scan position being before the position
- * that _bt_first is capable of repositioning the scan to by applying an
- * inequality operator required in the opposite-to-scan direction only.
- *
- * Although equality strategy scan keys (for both arrays and non-arrays alike)
- * are either marked required in both directions or in neither direction,
- * there is a sense in which non-required arrays behave like required arrays.
- * With a qual such as "WHERE a IN (100, 200) AND b >= 3 AND c IN (5, 6, 7)",
- * the scan key on "c" is non-required, but nevertheless enables positioning
- * the scan at the first tuple >= "(100, 3, 5)" on the leaf level during the
- * first descent of the tree by _bt_first. Later on, there could also be a
- * second descent, that places the scan right before tuples >= "(200, 3, 5)".
- * _bt_first must never be allowed to build an insertion scan key whose "c"
- * entry is set to a value other than 5, the "c" array's first element/value.
- * (Actually, it's the first in the current scan direction. This example uses
- * a forward scan.)
- *
- * Calling here resets the array scan key elements for the scan's non-required
- * arrays. This is strictly necessary for correctness in a subset of cases
- * involving "required in opposite direction"-triggered primitive index scans.
- * Not all callers are at risk of _bt_first using a non-required array like
- * this, but advancement always resets the arrays when another primitive scan
- * is scheduled, just to keep things simple. Array advancement even makes
- * sure to reset non-required arrays during scans that have no inequalities.
- * (Advancement still won't call here when there are no inequalities, though
- * that's just because it's all handled indirectly instead.)
- *
- * Note: _bt_verify_arrays_bt_first is called by an assertion to enforce that
- * everybody got this right.
- *
- * Note: In practice almost all SAOP arrays are marked required during
- * preprocessing (if necessary by generating skip arrays). It is hardly ever
- * truly necessary to call here, but consistently doing so is simpler.
- */
-static void
-_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
-{
- Relation rel = scan->indexRelation;
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
- Assert(array->scan_key == ikey);
-
- if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
- continue;
-
- Assert(array->num_elems != -1); /* No non-required skip arrays */
-
- _bt_array_set_low_or_high(rel, cur, array,
- ScanDirectionIsForward(dir));
- }
-}
-
/*
* _bt_tuple_before_array_skeys() -- too early to advance required arrays?
*
@@ -1380,8 +1311,6 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
*/
if (so->needPrimScan)
{
- Assert(_bt_verify_arrays_bt_first(scan, dir));
-
/*
* Flag was set -- must call _bt_first again, which will reset the
* scan's needPrimScan flag
@@ -2007,14 +1936,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
else if (has_required_opposite_direction_only && pstate->finaltup &&
unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
- {
- /*
- * Make sure that any SAOP arrays that were not marked required by
- * preprocessing are reset to their first element for this direction
- */
- _bt_rewind_nonrequired_arrays(scan, dir);
goto new_prim_scan;
- }
continue_scan:
@@ -2045,8 +1967,6 @@ continue_scan:
*/
so->oppositeDirCheck = has_required_opposite_direction_only;
- _bt_rewind_nonrequired_arrays(scan, dir);
-
/*
* skip by setting "look ahead" mechanism's offnum for forwards scans
* (backwards scans check scanBehind flag directly instead)
@@ -2142,48 +2062,6 @@ end_toplevel_scan:
}
#ifdef USE_ASSERT_CHECKING
-/*
- * Verify that the scan's qual state matches what we expect at the point that
- * _bt_start_prim_scan is about to start a just-scheduled new primitive scan.
- *
- * We enforce a rule against non-required array scan keys: they must start out
- * with whatever element is the first for the scan's current scan direction.
- * See _bt_rewind_nonrequired_arrays comments for an explanation.
- */
-static bool
-_bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
-
- if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
- ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
- continue;
-
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array->cur_elem != first_elem_dir)
- return false;
- }
-
- return _bt_verify_keys_with_arraykeys(scan);
-}
-
/*
* Verify that the scan's "so->keyData[]" scan keys are in agreement with
* its array key state
@@ -2194,6 +2072,7 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int last_sk_attno = InvalidAttrNumber,
arrayidx = 0;
+ bool nonrequiredseen = false;
if (!so->qual_ok)
return false;
@@ -2217,8 +2096,16 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->num_elems != -1 &&
cur->sk_argument != array->elem_values[array->cur_elem])
return false;
- if (last_sk_attno > cur->sk_attno)
- return false;
+ if (cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))
+ {
+ if (last_sk_attno > cur->sk_attno)
+ return false;
+ if (nonrequiredseen)
+ return false;
+ }
+ else
+ nonrequiredseen = true;
+
last_sk_attno = cur->sk_attno;
}
@@ -2551,7 +2438,6 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
if (!(key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
{
/* Scan key isn't marked required (corner case) */
- Assert(!(key->sk_flags & SK_ROW_HEADER));
break; /* unsafe */
}
if (key->sk_flags & SK_ROW_HEADER)
--
2.50.0
Attachment: v3-0002-Simplify-and-improve-row-compare-NULL-handling.patch
From 9cef26a6ea695544d439f27be1c61f2af5b47530 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Jun 2025 19:32:57 -0400
Subject: [PATCH v3 2/2] Simplify and improve row compare NULL handling.
Improve _bt_first's handling of NULL row comparison members in such a
way as to make it consistent with _bt_check_rowcompare's approach. In
other words, make sure that code that starts primitive index scans that
involve row compares fully agrees with code that ends primitive index
scans -- even in the presence of a NULL row compare member. This makes
row comparisons more similar to scalar inequality keys, obviating the
need for _bt_set_startikey to consider row comparisons directly.
_bt_first will now avoid uselessly scanning earlier index tuples, that
cannot possibly contain matching tuples due to the use of a NULL row
compare member. The goal isn't to improve performance; the goal is to
make _bt_first agree with _bt_check_rowcompare about where primitive
index scans should start and end.
Also go back to marking row compare members after the first as required
when safe. This is more or less a straight revert of 2016 bugfix commit
a298a1e0. We can address the complaint that led to that commit by being
more careful about the conditions where NULLs can end the scan in the
row compare path (for row members beyond the first one, which can once
again be marked required when safe).
Follow-up to bugfix commit 5f4d98d4, which taught _bt_set_startikey to
avoid the use of forcenonrequired mode iff it reached a row compare key.
Now _bt_set_startikey never needs to avoid the use of nonrequired mode,
which is simpler, more consistent, and more robust.
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 19 +-
src/backend/access/nbtree/nbtsearch.c | 115 ++++++----
src/backend/access/nbtree/nbtutils.c | 210 ++++++++++--------
3 files changed, 201 insertions(+), 143 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index 855b32e83..58c735c0b 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -777,12 +777,25 @@ _bt_mark_scankey_required(ScanKey skey)
if (skey->sk_flags & SK_ROW_HEADER)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ AttrNumber attno = skey->sk_attno;
/* First subkey should be same column/operator as the header */
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(subkey->sk_attno == skey->sk_attno);
+ Assert(subkey->sk_attno == attno);
Assert(subkey->sk_strategy == skey->sk_strategy);
- subkey->sk_flags |= addflags;
+
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != attno)
+ break; /* non-adjacent key, so not required */
+ if (subkey->sk_strategy != skey->sk_strategy)
+ break; /* wrong direction, so not required */
+ subkey->sk_flags |= addflags;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ attno++;
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index a40b0cf81..7d8417425 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1257,6 +1257,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Row comparison header: look to the first row member instead
*/
ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+ bool keep_strat_total = false;
+ bool tighten_strat_total = false;
/*
* Cannot be a NULL in the first row member: _bt_preprocess_keys
@@ -1267,6 +1269,14 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(subkey->sk_attno == cur->sk_attno);
Assert(!(subkey->sk_flags & SK_ISNULL));
+ /*
+ * A row comparison key can only become an initial position key
+ * when it was marked required by preprocessing, which means that
+ * it must be the final key we picked as a positioning key
+ */
+ Assert(i == keysz - 1);
+ Assert(cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD));
+
/*
* The member scankeys are already in insertion format (ie, they
* have sk_func = 3-way-comparison function)
@@ -1274,58 +1284,73 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
/*
- * If the row comparison is the last positioning key we accepted,
- * try to add additional keys from the lower-order row members.
- * (If we accepted independent conditions on additional index
- * columns, we use those instead --- doesn't seem worth trying to
- * determine which is more restrictive.) Note that this is OK
- * even if the row comparison is of ">" or "<" type, because the
- * condition applied to all but the last row member is effectively
- * ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
- * did use has to be treated as just a ">=" or "<=" condition, and
- * so we'd better adjust strat_total accordingly.
+ * If a "column gap" appears between row compare members, then the
+ * part of the row comparison that we use has to be treated as
+ * just a ">=" or "<=" condition, and so we'd better adjust
+ * strat_total accordingly.
+ *
+ * We're able to use a _more_ restrictive strategy when we
+ * encounter a NULL row compare member (which is unsatisfiable).
+ * For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
+ * insertion scan key "a > 1". All rows where "a = 1" have to
+ * perform a NULL row member comparison (or would, if we didn't
+ * use the more restrictive ">" strategy), which is guaranteed to
+ * return false/return NULL.
*/
- if (i == keysz - 1)
+ Assert(!(subkey->sk_flags & SK_ROW_END));
+ for (;;)
{
- bool used_all_subkeys = false;
-
- Assert(!(subkey->sk_flags & SK_ROW_END));
- for (;;)
+ subkey++;
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != keysz + 1)
+ break; /* out-of-sequence, can't use it */
+ if (subkey->sk_strategy != cur->sk_strategy)
+ break; /* wrong direction, can't use it */
+ if (subkey->sk_flags & SK_ISNULL)
{
- subkey++;
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno != keysz + 1)
- break; /* out-of-sequence, can't use it */
- if (subkey->sk_strategy != cur->sk_strategy)
- break; /* wrong direction, can't use it */
- if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
- Assert(keysz < INDEX_MAX_KEYS);
- memcpy(inskey.scankeys + keysz, subkey,
- sizeof(ScanKeyData));
- keysz++;
- if (subkey->sk_flags & SK_ROW_END)
- {
- used_all_subkeys = true;
- break;
- }
+ /* Unsatisfiable NULL row member */
+ keep_strat_total = true;
+ tighten_strat_total = true;
+ break;
}
- if (!used_all_subkeys)
+ Assert(keysz < INDEX_MAX_KEYS);
+ memcpy(inskey.scankeys + keysz, subkey,
+ sizeof(ScanKeyData));
+ keysz++;
+ if (subkey->sk_flags & SK_ROW_END)
{
- switch (strat_total)
- {
- case BTLessStrategyNumber:
- strat_total = BTLessEqualStrategyNumber;
- break;
- case BTGreaterStrategyNumber:
- strat_total = BTGreaterEqualStrategyNumber;
- break;
- }
+ keep_strat_total = true;
+ break;
}
- break; /* done with outer loop */
}
+ if (!keep_strat_total)
+ {
+ /* Use less restrictive strategy (and fewer keys) */
+ Assert(!tighten_strat_total);
+ switch (strat_total)
+ {
+ case BTLessStrategyNumber:
+ strat_total = BTLessEqualStrategyNumber;
+ break;
+ case BTGreaterStrategyNumber:
+ strat_total = BTGreaterEqualStrategyNumber;
+ break;
+ }
+ }
+ if (tighten_strat_total)
+ {
+ /* Use more restrictive strategy (and fewer keys) */
+ switch (strat_total)
+ {
+ case BTLessEqualStrategyNumber:
+ strat_total = BTLessStrategyNumber;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ strat_total = BTGreaterStrategyNumber;
+ break;
+ }
+ }
+ break; /* done building insertion scan key */
}
else
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index eb6dbfda3..a5ff5f708 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2445,29 +2445,11 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
/*
* RowCompare inequality.
*
- * Only the first subkey from a RowCompare can ever be marked
- * required (that happens when the row header is marked required).
* There is no simple, general way for us to transitively deduce
* whether or not every tuple on the page satisfies a RowCompare
* key based only on firsttup and lasttup -- so we just give up.
*/
- if (!start_past_saop_eq && !so->skipScan)
- break; /* unsafe to go further */
-
- /*
- * We have to be even more careful with RowCompares that come
- * after an array: we assume it's unsafe to even bypass the array.
- * Calling _bt_start_array_keys to recover the scan's arrays
- * following use of forcenonrequired mode isn't compatible with
- * _bt_check_rowcompare's continuescan=false behavior with NULL
- * row compare members. _bt_advance_array_keys must not make a
- * decision on the basis of a key not being satisfied in the
- * opposite-to-scan direction until the scan reaches a leaf page
- * where the same key begins to be satisfied in scan direction.
- * The _bt_first !used_all_subkeys behavior makes this limitation
- * hard to work around some other way.
- */
- return; /* completely unsafe to set pstate.startikey */
+ break; /* unsafe */
}
if (key->sk_strategy != BTEqualStrategyNumber)
{
@@ -2964,87 +2946,17 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (forcenonrequired)
- {
- /* treating scan's keys as non-required */
- }
- else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * forward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * backward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
+ /* When a NULL row member is compared, the row never matches */
if (subkey->sk_flags & SK_ISNULL)
{
/*
- * Unlike the simple-scankey case, this isn't a disallowed case
- * (except when it's the first row element that has the NULL arg).
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
+ * Cannot be a NULL in the first row member: _bt_preprocess_keys
+ * would've marked the qual as unsatisfiable, ending the scan
+ * before any tuples needed to be read (just like it will with
+ * simple scalar inequalities with a NULL arg)
*/
Assert(subkey != (ScanKey) DatumGetPointer(skey->sk_argument));
- subkey--;
+
if (forcenonrequired)
{
/* treating scan's keys as non-required */
@@ -3058,6 +2970,114 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(BTreeTupleIsPivot(tuple));
+ return true;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ int reqflags;
+
+ if (forcenonrequired)
+ {
+ /* treating scan's keys as non-required */
+ }
+ else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. However, on a forwards
+ * scan, we must keep going, because we may have initially
+ * positioned to the start of the index.
+ *
+ * All required NULLS FIRST > row members can use NULL tuple
+ * values to end backwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" can terminate a
+ * backwards scan upon reaching the index's rightmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS FIRST, we can treat its NULLs as "<" 42.
+ */
+ reqflags = SK_BT_REQBKWD;
+
+ /*
+ * When a most significant required NULLS FIRST < row compare
+ * member sees NULL tuple values during a backwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" will terminate a
+ * backwards scan upon reaching the rightmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is "<" 9, and yet
+ * our < row compare will still end the scan. (This isn't
+ * safe with later/lower-order row members. Notice that it
+ * can only happen with an "a" NULL some time after the scan
+ * completely stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQFWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. However, on a backward
+ * scan, we must keep going, because we may have initially
+ * positioned to the end of the index.
+ *
+ * All required NULLS LAST < row members can use NULL tuple
+ * values to end forwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" can terminate a
+ * forwards scan upon reaching the index's leftmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS LAST, we can treat its NULLs as ">" 42.
+ */
+ reqflags = SK_BT_REQFWD;
+
+ /*
+ * When a most significant required NULLS LAST > row compare
+ * member sees NULL tuple values during a forwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" will terminate a
+ * forwards scan upon reaching the leftmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is ">" 9, and yet
+ * our > row compare will end the scan. (This isn't safe with
+ * later/lower-order row members. Notice that it can only
+ * happen with an "a" NULL some time after the scan completely
+ * stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQBKWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,
--
2.50.0
On Wed, Jun 25, 2025 at 5:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Correction: there *is* a live bug like this on Postgres 18/HEAD, which
> involves simple scalar inequalities with an incomplete opfamily.
> Attached v3 fixes the bug in its new 0001-* patch.
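(To be concrete about that: the 0001 problem needs a qual roughly like the
sketch below, with an = array plus two inequalities that preprocessing cannot
compare against each other. Everything in it is made up for illustration:
the table, the casts, and above all the incomplete opfamily, since the core
integer opfamily is complete and would simply discard the weaker key.)
select *  -- hypothetical table with an index on (x, y)
from crosstype_test
where x = any(array[1, 5, 9])
and y >= 4::int8
and y >= 5::int2
order by x, y;
With 0001, both inequalities still survive preprocessing, but only one of
them stays marked required; the other is relocated to the end of
so->keyData[], so _bt_first and _bt_checkkeys can no longer disagree about
which key starts and ends each primitive scan.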
Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.
Current plan is to commit these two some time next week. It'd be nice
to get some code review before then, or any kind of input on my
general direction.
Note again that the approach I've taken to fixing the bug *adds a
performance optimization*. It will fix the row compare performance
problem described on a thread from February of this year:
/messages/by-id/BLAPR09MB64993673A034EE93C6C6F4B6CCF62@BLAPR09MB6499.namprd09.prod.outlook.com
It will *also* fix the row compare performance problem that Tom
complained about last year, which included its own test case:
/messages/by-id/66001.1715455158@sss.pgh.pa.us
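To give a flavor of what that looks like, here's a sketch (table and column
names invented; the behavior matches the "(a, b, c) >= (1, NULL, 77)" example
in 0002's comments):
select *  -- hypothetical table with an index on (a, b)
from rowcompare_test
where (a, b) >= (1, null)
order by a, b;
Today, _bt_first positions the scan at the first "a = 1" tuple (the ">="
strategy), and every "a = 1" tuple is then read and rejected, since no
"a = 1" row can satisfy the NULL row member. With 0002, _bt_first descends
using the tightened ">" strategy instead, skipping the "a = 1" tuples
entirely.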
I am aware that this all sounds quite odd. Yes, I'm proposing a bug
fix that improves performance in what must seem like fairly tangential
cases. As I went into already, it really did just work out that way.
I had forgotten about the two threads I'm referencing here until I saw
references to each of them in my skip scan project notes. I'm
mentioning them on the thread now in case they're useful breadcrumbs
for somebody that revisits this thread in the future.
--
Peter Geoghegan
Attachments:
v4-0002-Make-row-compares-robust-during-scans-with-arrays.patch
From b4f81c2b330dd0ceecae7e596109896462b99e25 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Jun 2025 19:32:57 -0400
Subject: [PATCH v4 2/2] Make row compares robust during scans with arrays.
Bring nbtree scans that use row compare keys in line with the new
charter established for all required keys by recent bugfix XXXXX:
harmonize how _bt_check_rowcompare determines if it can end the scan
(when scanning in one direction) with how code in _bt_first determines
where the scan should be initially positioned (when scanning in the
other direction). They need to agree on all details, or else there's a
risk of infinite cycling of the kind addressed by commit XXXXX during
scans that happen to mix row compares with = arrays.
While recent bugfix commit 5f4d98d4 directly accounted for infinite
cycling behavior (by refusing to use forcenonrequired in certain cases
involving row compares), it's not clear that that's enough to make
everything truly robust. Testing has shown that a scan with a row
compare can still read the same leaf page twice (without the scan
direction changing), which isn't supposed to be possible following
Postgres 17 commit 5bf748b8. That might lead to problems with cursors
whose direction changes back and forth.
There were two notable points of disagreement between _bt_first and
_bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after the first row compare
member, but _bt_first lacked symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes. Sometimes _bt_first would use more
keys at the start of a scan than _bt_check_rowcompare would use to end a
similar scan (same qual, with the scan direction flipped). At other
times _bt_first used fewer keys. It was even possible for _bt_first to
use the first row compare member for the index's first column, and some
other random scalar key (not the second row compare member) for its
second column.
Fix the first issue by adding handling of ISNULL-marked row compare
members to _bt_first. That way when _bt_advance_array_keys determines
that it should start another primitive index scan (following a call to
_bt_check_rowcompare setting continuescan=false for the opposite-to-scan
scan direction), _bt_first will reliably land on a later leaf page.
There won't be any risk of infinite cycling when this happens, so we can
get rid of the kludge added by recent bugfix commit 5f4d98d4.
Fix the second issue by arranging for _bt_first to build its initial
positioning/insertion scan key based on the same requiredness markings
used within _bt_check_rowcompare (just like it now does with every other
type of key, following recent bugfix commit XXXXXX). _bt_first can only
use a row compare to build its insertion scan key when all row members
that are safe to include are included (adding a random unrelated scalar
key after the first row compare member key is no longer allowed).
Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares are marked required during preprocessing:
we never marked lower-order members as required following 2016 bugfix
commit a298a1e0. The approach taken by that bugfix commit was overly
conservative. The bug in question was actually an oversight in how
_bt_check_rowcompare dealt with tuple NULL values that failed to satisfy
a scan key marked required in the opposite scan direction (i.e. it was
actually a bug in 2011 commits 6980f817 and 882368e8, not in 2006 commit
3a0a16cb). Go back to marking row compare members as required based on
the original 2006 rules, and fix the 2016 bug in a more principled way:
by limiting use of the "continuescan=false set by a key required in the
opposite scan direction on NULL tuple value" behavior to the first/most
significant row member. It isn't safe to do this with a lower-order row
compare member, since it can only indicate the absolute end matching
tuples for the whole scan when it is required in the scan's direction.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 19 +-
src/backend/access/nbtree/nbtsearch.c | 128 +++++++----
src/backend/access/nbtree/nbtutils.c | 205 ++++++++++--------
src/test/regress/expected/btree_index.out | 42 +++-
src/test/regress/sql/btree_index.sql | 22 +-
5 files changed, 259 insertions(+), 157 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index e4cb48359..2af4653f0 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -778,12 +778,25 @@ _bt_mark_scankey_required(ScanKey skey)
if (skey->sk_flags & SK_ROW_HEADER)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ AttrNumber attno = skey->sk_attno;
/* First subkey should be same column/operator as the header */
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(subkey->sk_attno == skey->sk_attno);
+ Assert(subkey->sk_attno == attno);
Assert(subkey->sk_strategy == skey->sk_strategy);
- subkey->sk_flags |= addflags;
+
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != attno)
+ break; /* non-adjacent key, so not required */
+ if (subkey->sk_strategy != skey->sk_strategy)
+ break; /* wrong direction, so not required */
+ subkey->sk_flags |= addflags;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ attno++;
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0b997282e..d6730f901 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1004,8 +1004,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* traversing a lot of null entries at the start of the scan.
*
* In this loop, row-comparison keys are treated the same as keys on their
- * first (leftmost) columns. We'll add on lower-order columns of the row
- * comparison below, if possible.
+ * first (leftmost) columns. We'll add all lower-order columns of the row
+ * comparison that were marked required during preprocessing below.
*
* _bt_checkkeys/_bt_advance_array_keys decide whether and when to start
* the next primitive index scan (for scans with array keys) based in part
@@ -1257,6 +1257,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Row comparison header: look to the first row member instead
*/
ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+ bool keep_strat_total = false;
+ bool tighten_strat_total = false;
/*
* Cannot be a NULL in the first row member: _bt_preprocess_keys
@@ -1267,6 +1269,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(subkey->sk_attno == cur->sk_attno);
Assert(!(subkey->sk_flags & SK_ISNULL));
+ /*
+ * A row comparison key can only become an initial positioning key
+ * when it was marked required by preprocessing, which means that
+ * it must be the final key we picked as a positioning key
+ */
+ Assert(i == keysz - 1);
+
/*
* The member scankeys are already in insertion format (ie, they
* have sk_func = 3-way-comparison function)
@@ -1274,58 +1283,85 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
/*
- * If the row comparison is the last positioning key we accepted,
- * try to add additional keys from the lower-order row members.
- * (If we accepted independent conditions on additional index
- * columns, we use those instead --- doesn't seem worth trying to
- * determine which is more restrictive.) Note that this is OK
- * even if the row comparison is of ">" or "<" type, because the
- * condition applied to all but the last row member is effectively
- * ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
- * did use has to be treated as just a ">=" or "<=" condition, and
- * so we'd better adjust strat_total accordingly.
+ * If a "column gap" appears between row compare members, then the
+ * part of the row comparison that we use has to be treated as
+ * just a ">=" or "<=" condition. We have to adjust strat_total.
+ * For example, a qual "(a, c) > (1, 42)" with an intervening
+ * index attribute "b" will use an insertion scan key "a >= 1".
+ * Even the first "a = 1" tuple might satisfy the scan's qual.
+ *
+ * We're able to use a _more_ restrictive strategy when we
+ * encounter a NULL row compare member (which is unsatisfiable).
+ * For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
+ * insertion scan key "a > 1". All rows where "a = 1" have to
+ * perform a NULL row member comparison (or would, if we didn't
+ * use the more restrictive ">" strategy), which is guaranteed to
+ * return false/return NULL.
*/
- if (i == keysz - 1)
+ Assert(!(subkey->sk_flags & SK_ROW_END));
+ for (;;)
{
- bool used_all_subkeys = false;
+ subkey++;
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(!(subkey->sk_flags & SK_ROW_END));
- for (;;)
+ if (subkey->sk_flags & SK_ISNULL)
{
- subkey++;
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno != keysz + 1)
- break; /* out-of-sequence, can't use it */
- if (subkey->sk_strategy != cur->sk_strategy)
- break; /* wrong direction, can't use it */
- if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
- Assert(keysz < INDEX_MAX_KEYS);
- memcpy(inskey.scankeys + keysz, subkey,
- sizeof(ScanKeyData));
- keysz++;
- if (subkey->sk_flags & SK_ROW_END)
- {
- used_all_subkeys = true;
- break;
- }
+ /*
+ * Unsatisfiable NULL row member.
+ *
+ * We deliberately avoid checking if this row member is
+ * marked required. All earlier members are, which is all
+ * we need. See _bt_check_rowcompare for an explanation.
+ */
+ keep_strat_total = true;
+ tighten_strat_total = true;
+ break;
}
- if (!used_all_subkeys)
+
+ if (!(subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
+ break; /* nonrequired, can't use it */
+
+ Assert(subkey->sk_attno == keysz + 1);
+ Assert(subkey->sk_strategy == cur->sk_strategy);
+ Assert(keysz < INDEX_MAX_KEYS);
+
+ memcpy(inskey.scankeys + keysz, subkey,
+ sizeof(ScanKeyData));
+ keysz++;
+ if (subkey->sk_flags & SK_ROW_END)
{
- switch (strat_total)
- {
- case BTLessStrategyNumber:
- strat_total = BTLessEqualStrategyNumber;
- break;
- case BTGreaterStrategyNumber:
- strat_total = BTGreaterEqualStrategyNumber;
- break;
- }
+ keep_strat_total = true;
+ break;
}
- break; /* done with outer loop */
}
+ if (!keep_strat_total)
+ {
+ /* Use less restrictive strategy (and fewer member keys) */
+ Assert(!tighten_strat_total);
+ switch (strat_total)
+ {
+ case BTLessStrategyNumber:
+ strat_total = BTLessEqualStrategyNumber;
+ break;
+ case BTGreaterStrategyNumber:
+ strat_total = BTGreaterEqualStrategyNumber;
+ break;
+ }
+ }
+ if (tighten_strat_total)
+ {
+ /* Use more restrictive strategy (and fewer member keys) */
+ switch (strat_total)
+ {
+ case BTLessEqualStrategyNumber:
+ strat_total = BTLessStrategyNumber;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ strat_total = BTGreaterStrategyNumber;
+ break;
+ }
+ }
+ break; /* done building insertion scan key */
}
else
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index eb6dbfda3..57d76a8d4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2443,31 +2443,9 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
if (key->sk_flags & SK_ROW_HEADER)
{
/*
- * RowCompare inequality.
- *
- * Only the first subkey from a RowCompare can ever be marked
- * required (that happens when the row header is marked required).
- * There is no simple, general way for us to transitively deduce
- * whether or not every tuple on the page satisfies a RowCompare
- * key based only on firsttup and lasttup -- so we just give up.
+ * RowCompare inequality. Currently, we just punt on these.
*/
- if (!start_past_saop_eq && !so->skipScan)
- break; /* unsafe to go further */
-
- /*
- * We have to be even more careful with RowCompares that come
- * after an array: we assume it's unsafe to even bypass the array.
- * Calling _bt_start_array_keys to recover the scan's arrays
- * following use of forcenonrequired mode isn't compatible with
- * _bt_check_rowcompare's continuescan=false behavior with NULL
- * row compare members. _bt_advance_array_keys must not make a
- * decision on the basis of a key not being satisfied in the
- * opposite-to-scan direction until the scan reaches a leaf page
- * where the same key begins to be satisfied in scan direction.
- * The _bt_first !used_all_subkeys behavior makes this limitation
- * hard to work around some other way.
- */
- return; /* completely unsafe to set pstate.startikey */
+ break; /* "unsafe" */
}
if (key->sk_strategy != BTEqualStrategyNumber)
{
@@ -2964,76 +2942,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (forcenonrequired)
- {
- /* treating scan's keys as non-required */
- }
- else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * forward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * backward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
+ /* When a NULL row member is compared, the row never matches */
if (subkey->sk_flags & SK_ISNULL)
{
/*
@@ -3058,6 +2967,114 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(BTreeTupleIsPivot(tuple));
+ return true;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ int reqflags;
+
+ if (forcenonrequired)
+ {
+ /* treating scan's keys as non-required */
+ }
+ else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. However, on a forwards
+ * scan, we must keep going, because we may have initially
+ * positioned to the start of the index.
+ *
+ * All required NULLS FIRST > row members can use NULL tuple
+ * values to end backwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" can terminate a
+ * backwards scan upon reaching the index's rightmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS FIRST, we can treat its NULLs as "<" 42.
+ */
+ reqflags = SK_BT_REQBKWD;
+
+ /*
+ * When a most significant required NULLS FIRST < row compare
+ * member sees NULL tuple values during a backwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" will terminate a
+ * backwards scan upon reaching the rightmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is "<" 9, and yet
+ * our < row compare will still end the scan. (This isn't
+ * safe with later/lower-order row members. Notice that it
+ * can only happen with an "a" NULL some time after the scan
+ * completely stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQFWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. However, on a backward
+ * scan, we must keep going, because we may have initially
+ * positioned to the end of the index.
+ *
+ * All required NULLS LAST < row members can use NULL tuple
+ * values to end forwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" can terminate a
+ * forwards scan upon reaching the index's leftmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS LAST, we can treat its NULLs as ">" 42.
+ */
+ reqflags = SK_BT_REQFWD;
+
+ /*
+ * When a most significant required NULLS LAST > row compare
+ * member sees NULL tuple values during a forwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" will terminate a
+ * forwards scan upon reaching the leftmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is ">" 9, and yet
+ * our > row compare will end the scan. (This isn't safe with
+ * later/lower-order row members. Notice that it can only
+ * happen with an "a" NULL some time after the scan completely
+ * stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQBKWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index bfb1a286e..dfb49ed60 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -200,17 +200,17 @@ ORDER BY proname DESC, proargtypes DESC, pronamespace DESC LIMIT 1;
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE proname > 'abbrz' AND (proname, proargtypes) < ('abs', NULL)
ORDER BY proname, proargtypes, pronamespace;
- QUERY PLAN
--------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------
Index Only Scan using pg_proc_proname_args_nsp_index on pg_proc
- Index Cond: ((ROW(proname, proargtypes) < ROW('abs'::name, NULL::oidvector)) AND (proname = 'abs'::name))
+ Index Cond: ((proname > 'abbrz'::name) AND (ROW(proname, proargtypes) < ROW('abs'::name, NULL::oidvector)))
(2 rows)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE proname > 'abbrz' AND (proname, proargtypes) < ('abs', NULL)
ORDER BY proname, proargtypes, pronamespace;
proname | proargtypes | pronamespace
---------+-------------+--------------
@@ -223,17 +223,39 @@ ORDER BY proname, proargtypes, pronamespace;
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname < 'abbrz' AND (proname, proargtypes) > ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
- QUERY PLAN
--------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------
Index Only Scan Backward using pg_proc_proname_args_nsp_index on pg_proc
- Index Cond: ((ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)) AND (proname = 'abs'::name))
+ Index Cond: ((proname < 'abbrz'::name) AND (ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)))
(2 rows)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname < 'abbrz' AND (proname, proargtypes) > ('abs', NULL)
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ proname | proargtypes | pronamespace
+---------+-------------+--------------
+(0 rows)
+
+-- Add coverage for B-Tree preprocessing path that deals with making redundant
+-- keys nonrequired (relies on the lack of support for detecting which
+-- row compare is really redundant)
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
+ Index Only Scan Backward using pg_proc_proname_args_nsp_index on pg_proc
+ Index Cond: ((ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)) AND (proname = 'zzzzzz'::name))
+(2 rows)
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
proname | proargtypes | pronamespace
---------+-------------+--------------
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 68c61dbc7..80021e706 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -148,12 +148,12 @@ ORDER BY proname DESC, proargtypes DESC, pronamespace DESC LIMIT 1;
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE proname > 'abbrz' AND (proname, proargtypes) < ('abs', NULL)
ORDER BY proname, proargtypes, pronamespace;
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE proname > 'abbrz' AND (proname, proargtypes) < ('abs', NULL)
ORDER BY proname, proargtypes, pronamespace;
--
@@ -163,12 +163,26 @@ ORDER BY proname, proargtypes, pronamespace;
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname < 'abbrz' AND (proname, proargtypes) > ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname < 'abbrz' AND (proname, proargtypes) > ('abs', NULL)
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+
+-- Add coverage for B-Tree preprocessing path that deals with making redundant
+-- keys nonrequired (relies on preprocessing being unable to detect whether a
+-- row compare key is really redundant)
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
--
--
2.50.0
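
To restate what the new 'zzzzzz' test above is exercising, here is a minimal
sketch of the same query shape against a made-up table (the table and column
names are mine, not from the patch, and whether the planner actually picks the
index depends on the table's contents and statistics -- the pg_proc tests above
lean on a real catalog index precisely so an index-only scan is guaranteed).
The = key and the row compare sit on the same leading column, and
_bt_compare_scankey_args cannot prove which of the two is redundant (with this
patch it gives up whenever a row compare key is involved, barring the IS
NULL/IS NOT NULL cases), so preprocessing keeps both keys. Per my reading of
the patch, only the = key stays marked required, while the row compare is
relocated to the end of so->keyData[]:

create table rowcmp_demo (a text, b int);
create index rowcmp_demo_a_b_idx on rowcmp_demo (a, b);

-- schematic equivalent of the new pg_proc test query
select *
from rowcmp_demo
where a = 'zzz' and (a, b) > ('abs', null)
order by a desc, b desc;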
Attachment: v4-0001-Make-handling-of-redundant-nbtree-keys-more-robus.patch
From be9c70fa814a6eaa44f84c0c32b5375107aa19d2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 23 Jun 2025 22:46:05 -0400
Subject: [PATCH v4 1/2] Make handling of redundant nbtree keys more robust.
nbtree preprocessing's handling of redundant (and contradictory) keys
created problems for scans with = arrays. It was just about possible
for a scan with an = array and one or more redundant keys (keys that
preprocessing could not eliminate due to an incomplete opfamily and a
cross-type key) to get stuck. Testing has shown that infinite cycling
where the scan never manages to make forward progress was possible.
This could happen when the scan's arrays were reset in _bt_readpage's
forcenonrequired=true path (added by bugfix commit 5f4d98d4) when the
arrays weren't at least advanced up to the same point that they were in
at the start of the _bt_readpage call. Earlier redundant keys prevented
the finaltup call to _bt_advance_array_keys from reaching lower-order
keys that needed to be used to sufficiently advance the scan's arrays.
It seems best to fix the bug by making preprocessing leave the scan's
keys in a state that is at least as close as possible to how it usually
leaves them (in the event of being unable to determine which keys are
truly redundant). The design of _bt_advance_array_keys tacitly assumes
that there can be little interference from such redundant keys, so this
approach makes sense on general robustness grounds.
Now nbtree preprocessing reliably leaves behind at most one required
>/>= key per index column, and at most one required </<= key per index
column. Columns that have one or more = keys that are eligible to be
marked required (based on the traditional rules) will now leave behind
only one of the = keys as the index column's only required key.
Standard preprocessing steps continue to mark the keys required by
following the traditional rules. A new preprocessing step (used only
when preprocessing fails to handle a redundancy comprehensively) can now
take place after that. This new step unmarks a subset of the keys, in
such a way as to allow _bt_advance_array_keys (and related code) to not
have to deal with redundant required keys at all.
Keys that are not marked required (whether due to the new preprocessing
step running or for some other reason) are relocated to the end of the
so->keyData[] array as needed. That way they'll always be evaluated
after the scan's required keys, and so cannot prevent code in places
like _bt_advance_array_keys from reaching any required key.
Also teach _bt_first to decide which initial positioning keys to use
based on the same requiredness markings that have long been used by
_bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for
reliably avoiding infinite cycling. _bt_advance_array_keys expects to
be able to reason about what'll happen in the next _bt_first call should
it start another primitive index scan, by evaluating inequality keys
that were marked required in the opposite-to-scan direction only.
Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys)
will always agree on which exact key will be used on each index column
to start and/or end the scan (except when row compare keys are involved,
which have similar problems not addressed by this commit).
An upcoming commit will finish off the work started by this commit by
harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys
apply row compare keys to start and end scans.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 380 ++++++++++++++++--
src/backend/access/nbtree/nbtsearch.c | 54 ++-
src/backend/access/nbtree/nbtutils.c | 136 +------
3 files changed, 371 insertions(+), 199 deletions(-)
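
As a purely schematic illustration of the behavior the commit message above
describes: assume a qual with two > keys on the same column that preprocessing
cannot compare. Note that this takes an opfamily that's genuinely missing the
relevant cross-type comparison; the core integer_ops btree opfamily does
provide the needed cross-type operators, so the int/bigint spelling below is
only a stand-in:

-- schematic only: assumes an incomplete opfamily, so that neither > key
-- can be proven redundant during preprocessing
select *
from t
where x > 4::int and x > 10::bigint
order by x;

With the patch, whichever > key preprocessing happens to pick is the only one
marked required (and therefore the only one _bt_first will consider for initial
positioning); the other > key is moved to the end of so->keyData[], after all
required keys, where it can only filter tuples.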
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..e4cb48359 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/nbtree.h"
+#include "common/int.h"
#include "lib/qunique.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
@@ -56,6 +57,8 @@ static void _bt_skiparray_strat_decrement(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
static void _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
+static void _bt_unmark_extra_keys(IndexScanDesc scan, int *keyDataMap);
+static int _bt_reorder_array_cmp(const void *a, const void *b);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
static int _bt_num_array_keys(IndexScanDesc scan, Oid *skip_eq_ops_out,
@@ -96,7 +99,7 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* incomplete sets of cross-type operators, we may fail to detect redundant
* or contradictory keys, but we can survive that.)
*
- * The output keys must be sorted by index attribute. Presently we expect
+ * Required output keys are sorted by index attribute. Presently we expect
* (but verify) that the input keys are already so sorted --- this is done
* by match_clauses_to_index() in indxpath.c. Some reordering of the keys
* within each attribute may be done as a byproduct of the processing here.
@@ -127,29 +130,25 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* This has the potential to be much more efficient than a full index scan
* (though it behaves like a full scan when there's many distinct "x" values).
*
- * If possible, redundant keys are eliminated: we keep only the tightest
+ * Typically, redundant keys are eliminated: we keep only the tightest
* >/>= bound and the tightest </<= bound, and if there's an = key then
* that's the only one returned. (So, we return either a single = key,
* or one or two boundary-condition keys for each attr.) However, if we
* cannot compare two keys for lack of a suitable cross-type operator,
- * we cannot eliminate either. If there are two such keys of the same
- * operator strategy, the second one is just pushed into the output array
- * without further processing here. We may also emit both >/>= or both
- * </<= keys if we can't compare them. The logic about required keys still
- * works if we don't eliminate redundant keys.
+ * we cannot eliminate either key.
*
- * Note that one reason we need direction-sensitive required-key flags is
- * precisely that we may not be able to eliminate redundant keys. Suppose
+ * When some redundant keys could not be eliminated, we'll still output a
+ * so->keyData[] array that's as close as possible to what we'd output in the
+ * typical case where all redundant keys could be fully eliminated. Suppose
* we have "x > 4::int AND x > 10::bigint", and we are unable to determine
* which key is more restrictive for lack of a suitable cross-type operator.
- * _bt_first will arbitrarily pick one of the keys to do the initial
- * positioning with. If it picks x > 4, then the x > 10 condition will fail
- * until we reach index entries > 10; but we can't stop the scan just because
- * x > 10 is failing. On the other hand, if we are scanning backwards, then
- * failure of either key is indeed enough to stop the scan. (In general, when
- * inequality keys are present, the initial-positioning code only promises to
- * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * We'll arbitrarily pick one of the keys to mark required/do the initial
+ * positioning with. If x > 4 is chosen then the x > 10 condition will fail
+ * until we reach index entries > 10. x > 10 will be output at the end of
+ * so->keyData[], after the x > 4 key (i.e. after all required keys). Note
+ * that this means that so->keyData[] won't always have keys in index
+ * attribute order (only the initial group of required keys are reliably
+ * output in attribute order).
*
* As a byproduct of this work, we can detect contradictory quals such
* as "x = 1 AND x > 2". If we see that, we return so->qual_ok = false,
@@ -193,6 +192,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
ScanKey arrayKeyData;
int *keyDataMap = NULL;
int arrayidx = 0;
+ bool comparisonfailed = false;
if (so->numberOfKeys > 0)
{
@@ -388,7 +388,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = NULL;
xform[j].inkeyi = -1;
}
- /* else, cannot determine redundancy, keep both keys */
+ else
+ comparisonfailed = true;
}
/* track number of attrs for which we have "=" keys */
numberOfEqualCols++;
@@ -409,6 +410,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTLessStrategyNumber - 1].inkey = NULL;
}
+ else
+ comparisonfailed = true;
}
/* try to keep only one of >, >= */
@@ -426,6 +429,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTGreaterStrategyNumber - 1].inkey = NULL;
}
+ else
+ comparisonfailed = true;
}
/*
@@ -466,25 +471,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* check strategy this key's operator corresponds to */
j = inkey->sk_strategy - 1;
- /* if row comparison, push it directly to the output array */
- if (inkey->sk_flags & SK_ROW_HEADER)
- {
- ScanKey outkey = &so->keyData[new_numberOfKeys++];
-
- memcpy(outkey, inkey, sizeof(ScanKeyData));
- if (arrayKeyData)
- keyDataMap[new_numberOfKeys - 1] = i;
- if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
-
- /*
- * We don't support RowCompare using equality; such a qual would
- * mess up the numberOfEqualCols tracking.
- */
- Assert(j != (BTEqualStrategyNumber - 1));
- continue;
- }
-
if (inkey->sk_strategy == BTEqualStrategyNumber &&
(inkey->sk_flags & SK_SEARCHARRAY))
{
@@ -591,11 +577,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
* We can't determine which key is more restrictive. Push
* xform[j] directly to the output array, then set xform[j] to
* the new scan key.
- *
- * Note: We do things this way around so that our arrays are
- * always in the same order as their corresponding scan keys,
- * even with incomplete opfamilies. _bt_advance_array_keys
- * depends on this.
*/
ScanKey outkey = &so->keyData[new_numberOfKeys++];
@@ -607,6 +588,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = inkey;
xform[j].inkeyi = i;
xform[j].arrayidx = arrayidx;
+ comparisonfailed = true;
}
}
}
@@ -622,6 +604,16 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (arrayKeyData)
_bt_preprocess_array_keys_final(scan, keyDataMap);
+ /*
+ * If all prior preprocessing steps failed to eliminate any keys due to a
+ * lack of suitable cross-type support (and a cross-type key), unset their
+ * required markings now. This ensures that there's never ambiguity about
+ * which key must be used to determine where the scan starts and/or ends
+ * for a given attribute (even when it's an _implicit_ NOT NULL key).
+ */
+ if (unlikely(comparisonfailed && so->qual_ok))
+ _bt_unmark_extra_keys(scan, keyDataMap);
+
/* Could pfree arrayKeyData/keyDataMap now, but not worth the cycles */
}
@@ -847,8 +839,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
cmp_op;
StrategyNumber strat;
- Assert(!((leftarg->sk_flags | rightarg->sk_flags) &
- (SK_ROW_HEADER | SK_ROW_MEMBER)));
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_MEMBER));
/*
* First, deal with cases where one or both args are NULL. This should
@@ -924,6 +915,16 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
return true;
}
+ /*
+ * We don't yet know how to determine redundancy when it involves a row
+ * compare key (barring simple cases involving IS NULL/IS NOT NULL)
+ */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_HEADER)
+ {
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
+ return false;
+ }
+
/*
* If either leftarg or rightarg are equality-type array scankeys, we need
* specialized handling (since by now we know that IS NULL wasn't used)
@@ -1467,6 +1468,297 @@ _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
}
}
+/*
+ * _bt_unmark_extra_keys() -- make "extra" so->keyData[] keys nonrequired.
+ *
+ * When _bt_preprocess_keys failed to eliminate one or more redundant keys, it
+ * calls here to make sure that no index attribute has more than one > or >=
+ * key marked required, and no more than one required < or <= key. Attributes
+ * with = keys will always get one = key as their required key. All other
+ * keys that were initially marked required get "unmarked" here. This ensures
+ * that _bt_first and _bt_checkkeys reliably have exactly the same ideas about
+ * which keys to use to start and end the scan. Scans with array keys rely on
+ * this guarantee. They expect to be able to unambiguously determine whether
+ * starting another primscan (given the current array key elements, and all
+ * other keys) will land the scan on some later leaf page.
+ *
+ * We also relocate keys that become/started out nonrequired to the end of
+ * so->keyData[]. That way, _bt_first/_bt_checkkeys/_bt_advance_array_keys
+ * cannot fail to reach a required key due to some earlier nonrequired key.
+ *
+ * so->keyData[] won't be in strict attribute order when we're done with it,
+ * but that's okay; the group of remaining required scan keys that'll appear
+ * first in so->keyData[] retain their original order, which is good enough.
+ * (The group of nonrequired keys that'll appear at the end of so->keyData[]
+ * also retain their original order, though any order works for these keys.)
+ *
+ * Only call here when _bt_compare_scankey_args returned false at least once
+ * (otherwise, calling here will just waste cycles). We always unmark at
+ * least one key per call to _bt_compare_scankey_args that reported failure.
+ */
+static void
+_bt_unmark_extra_keys(IndexScanDesc scan, int *keyDataMap)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ AttrNumber curattr;
+ ScanKey origkey;
+ bool *unmarkikey;
+ int nunmark,
+ nunmarked,
+ nkept,
+ firsti;
+ ScanKey keepKeys,
+ unmarkKeys;
+ FmgrInfo *keepOrderProcs = NULL,
+ *unmarkOrderProcs = NULL;
+ bool haveReqEquals,
+ haveReqForward,
+ haveReqBackward;
+
+ /*
+ * Do an initial pass over so->keyData[] that determines which keys to
+ * keep as required, and which to mark non-required.
+ *
+ * When both equality and inequality keys remain on a single attribute, we
+ * *must* leave only the equality key as required to continue the scan.
+ * Keys can only be considered required when they're on the first index
+ * column, or when all earlier index columns have a required = key.
+ */
+ unmarkikey = palloc0(so->numberOfKeys * sizeof(bool));
+ nunmark = 0;
+
+ /* Set things up for first key's attr */
+ origkey = so->keyData;
+ curattr = origkey->sk_attno;
+ firsti = 0;
+ haveReqEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ for (int i = 0; i < so->numberOfKeys; origkey++, i++)
+ {
+ if (origkey->sk_attno != curattr)
+ {
+ /* Reset for next attr */
+ curattr = origkey->sk_attno;
+ firsti = i;
+
+ haveReqEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ }
+
+ /* First consider equalities */
+ if (haveReqEquals)
+ {
+ /*
+ * We've already found the first "=" key for attr. We already
+ * decided that every other key on the same attr is to be unmarked
+ */
+ Assert(!(origkey->sk_flags & SK_SEARCHNULL));
+ unmarkikey[i] = true;
+ nunmark++;
+ continue;
+ }
+ else if ((origkey->sk_flags & SK_BT_REQFWD) &&
+ (origkey->sk_flags & SK_BT_REQBKWD))
+ {
+ /*
+ * Found the first "=" key for attr. All other keys on the same
+ * attr will be unmarked.
+ */
+ Assert(origkey->sk_strategy == BTEqualStrategyNumber);
+
+ haveReqEquals = true;
+ for (int j = firsti; j < i; j++)
+ {
+ /* We'll unmark prior keys for the same attr after all */
+ if (!unmarkikey[j])
+ {
+ unmarkikey[j] = true;
+ nunmark++;
+ }
+ }
+ continue;
+ }
+
+ /* Next consider inequalities */
+ if ((origkey->sk_flags & SK_BT_REQFWD) && !haveReqForward)
+ {
+ haveReqForward = true;
+ continue;
+ }
+ else if ((origkey->sk_flags & SK_BT_REQBKWD) && !haveReqBackward)
+ {
+ haveReqBackward = true;
+ continue;
+ }
+ unmarkikey[i] = true;
+ nunmark++;
+ }
+
+ /*
+ * We're only supposed to be called when _bt_compare_scankey_args reported
+ * failure when called from the main _bt_preprocess_keys loop
+ */
+ Assert(nunmark > 0);
+
+ /*
+ * Next, allocate temp arrays: one set for unchanged keys, another for
+ * keys that will be unmarked/made non-required
+ */
+ unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ nunmarked = 0;
+ nkept = 0;
+ if (so->numArrayKeys)
+ {
+ unmarkOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ keepOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ }
+
+ /*
+ * Next, copy the contents of so->keyData[] into the appropriate temp
+ * array.
+ *
+ * Scans with = array keys need us to maintain invariants around the order
+ * of so->orderProcs[] and so->arrayKeys[] relative to so->keyData[]. See
+ * _bt_preprocess_array_keys_final for a full explanation.
+ */
+ for (int i = 0; i < so->numberOfKeys; i++)
+ {
+ ScanKey unmark;
+
+ origkey = so->keyData + i;
+
+ if (!unmarkikey[i])
+ {
+ /*
+ * Key was chosen above all other keys on this key's attr (among
+ * those keys with the same SK_BT_REQFWD/SK_BT_REQBKWD marking).
+ *
+ * Key will stay in its original position, unless we've unmarked
+ * some earlier key (in which case this key gets placed earlier).
+ */
+ memcpy(keepKeys + nkept, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[i] = nkept;
+ memcpy(keepOrderProcs + nkept, &so->orderProcs[i],
+ sizeof(FmgrInfo));
+ }
+
+ nkept++;
+ continue;
+ }
+
+ /*
+ * Key was _not_ chosen above all other keys on this key's attr (among
+ * those keys with the same SK_BT_REQFWD/SK_BT_REQBKWD marking).
+ *
+ * Key will be unmarked, and moved to the end of the array, next to
+ * other keys that we'll unmark (or that were always nonrequired).
+ */
+ unmark = unmarkKeys + nunmarked;
+ memcpy(unmark, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[i] = (so->numberOfKeys - nunmark) + nunmarked;
+ memcpy(&unmarkOrderProcs[nunmarked], &so->orderProcs[i],
+ sizeof(FmgrInfo));
+ }
+
+ /*
+ * Preprocessing only generates skip arrays when it knows that they're
+ * the only = key on the attr. We favor = strategy keys over keys
+ * with other strategies, so we should never unmark a skip array here.
+ */
+ Assert(!(unmark->sk_flags & SK_BT_SKIP));
+
+ /*
+ * Also shouldn't have to unmark an IS NULL or an IS NOT NULL key.
+ * They aren't cross-type, so an incomplete opfamily can't matter.
+ */
+ Assert(!(unmark->sk_flags & SK_ISNULL) ||
+ !(unmark->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)));
+
+ /* Clear requiredness flags on redundant key (and on any subkeys) */
+ unmark->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (unmark->sk_flags & SK_ROW_HEADER)
+ {
+ ScanKey subkey = (ScanKey) DatumGetPointer(origkey->sk_argument);
+
+ Assert(subkey->sk_strategy == origkey->sk_strategy);
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ subkey->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+ }
+
+ nunmarked++;
+ }
+
+ /*
+ * Copy temp arrays back into so->keyData[]
+ */
+ Assert(nkept == so->numberOfKeys - nunmark);
+ Assert(nunmarked == nunmark);
+ memcpy(so->keyData, keepKeys, sizeof(ScanKeyData) * nkept);
+ memcpy(so->keyData + nkept, unmarkKeys, sizeof(ScanKeyData) * nunmarked);
+
+ /* Done with temp arrays */
+ pfree(unmarkikey);
+ pfree(keepKeys);
+ pfree(unmarkKeys);
+
+ /*
+ * Now copy so->orderProcs[] temp entries needed by scans with = array
+ * keys back (just like with the so->keyData[] temp arrays)
+ */
+ if (so->numArrayKeys)
+ {
+ memcpy(so->orderProcs, keepOrderProcs, sizeof(FmgrInfo) * nkept);
+ memcpy(so->orderProcs + nkept, unmarkOrderProcs,
+ sizeof(FmgrInfo) * nunmarked);
+
+ /* Also fix-up array->scan_key references */
+ for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[arridx];
+
+ array->scan_key = keyDataMap[array->scan_key];
+ }
+
+ /*
+ * Also make sure that the scan's arrays appear in the expected order.
+ * This must match corresponding scan key/so->orderProcs[] entries.
+ */
+ qsort(so->arrayKeys, so->numArrayKeys, sizeof(BTArrayKeyInfo),
+ _bt_reorder_array_cmp);
+
+ /* Done with temp arrays */
+ pfree(unmarkOrderProcs);
+ pfree(keepOrderProcs);
+ }
+}
+
+/*
+ * qsort comparator for reordering so->arrayKeys[] BTArrayKeyInfo entries
+ */
+static int
+_bt_reorder_array_cmp(const void *a, const void *b)
+{
+ BTArrayKeyInfo *arraya = (BTArrayKeyInfo *) a;
+ BTArrayKeyInfo *arrayb = (BTArrayKeyInfo *) b;
+
+ return pg_cmp_s32(arraya->scan_key, arrayb->scan_key);
+}
+
/*
* _bt_preprocess_array_keys() -- Preprocess SK_SEARCHARRAY scan keys
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd..0b997282e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -960,6 +960,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*----------
* Examine the scan keys to discover where we need to start the scan.
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the local startKeys[] array.
*
* We want to identify the keys that can be used as starting boundaries;
* these are =, >, or >= keys for a forward scan or =, <, <= keys for
@@ -968,25 +970,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* a > or < boundary or find an attribute with no boundary (which can be
* thought of as the same as "> -infinity"), we can't use keys for any
* attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
+ * of what initial positioning strategy to use. We can only pick keys
+ * that were marked required (in the direction opposite our own scan's)
+ * during preprocessing. In general, keys that we use to find an initial
+ * position when scanning forwards are the same keys that'll end the scan
+ * when scanning backwards (and vice-versa).
*
* When the scan keys include cross-type operators, _bt_preprocess_keys
- * may not be able to eliminate redundant keys; in such cases we will
- * arbitrarily pick a usable one for each attribute. This is correct
- * but possibly not optimal behavior. (For example, with keys like
- * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
- * x=5 would be more efficient.) Since the situation only arises given
- * a poorly-worded query plus an incomplete opfamily, live with it.
- *
- * When both equality and inequality keys appear for a single attribute
- * (again, only possible when cross-type operators appear), we *must*
- * select one of the equality keys for the starting point, because
- * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
- * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
- * start at x=4, we will fail and stop before reaching x=10. If multiple
- * equality quals survive preprocessing, however, it doesn't matter which
- * one we use --- by definition, they are either redundant or
- * contradictory.
+ * may not be able to eliminate redundant keys; in such cases it will
+ * arbitrarily pick a usable one for each attribute, making sure that the
+ * other key isn't marked required to continue the scan. It'll still be
+ * clear which key we should use here, since _bt_preprocess_keys outputs
+ * the non-required key at the end of so->keyData[], after all of the
+ * scan's required keys; we stop considering further keys once we reach
+ * the first nonrequired key (which won't be in index attribute order!).
*
* In practice we rarely see any "attribute boundary key gaps" here.
* Preprocessing can usually backfill skip array keys for any attributes
@@ -996,10 +993,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* This happens with range skip arrays, which store inequality keys in the
* array's low_compare/high_compare fields (used to find the first/last
* set of matches, when = key will lack a usable sk_argument value).
- * These are always preferred over any redundant "standard" inequality
- * keys on the same column (per the usual rule about preferring = keys).
- * Note also that any column with an = skip array key can never have an
- * additional, contradictory = key.
*
* All keys (with the exception of SK_SEARCHNULL keys and SK_BT_SKIP
* array keys whose array is "null_elem=true") imply a NOT NULL qualifier.
@@ -1014,9 +1007,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* first (leftmost) columns. We'll add on lower-order columns of the row
* comparison below, if possible.
*
- * The selected scan keys (at most one per index column) are remembered by
- * storing their addresses into the local startKeys[] array.
- *
* _bt_checkkeys/_bt_advance_array_keys decide whether and when to start
* the next primitive index scan (for scans with array keys) based in part
* on an understanding of how it'll enable us to reposition the scan.
@@ -1058,6 +1048,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ Assert(chosen == NULL ||
+ (chosen->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)));
+
/*
* Done looking at keys for curattr.
*
@@ -1180,17 +1173,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Done if that was the last scan key output by preprocessing.
- * Also done if there is a gap index attribute that lacks a
- * usable key (only possible when preprocessing was unable to
- * generate a skip array key to "fill in the gap").
+ * Also done if we've now examined all keys marked required.
*/
if (i >= so->numberOfKeys ||
- cur->sk_attno != curattr + 1)
+ !(cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
break;
/*
* Reset for next attr.
*/
+ Assert(cur->sk_attno == curattr + 1);
curattr = cur->sk_attno;
chosen = NULL;
impliesNN = NULL;
@@ -1216,8 +1208,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
break;
case BTEqualStrategyNumber:
- /* override any non-equality choice */
- chosen = cur;
+ /* first equality key always wins */
+ if (chosen == NULL)
+ chosen = cur;
+ Assert(chosen->sk_strategy == BTEqualStrategyNumber);
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2..eb6dbfda3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -44,7 +44,6 @@ static bool _bt_array_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *arra
static bool _bt_array_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
bool *skip_array_set);
-static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
bool readpagetup, int sktrig, bool *scanBehind);
@@ -52,7 +51,6 @@ static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
int sktrig, bool sktrig_required);
#ifdef USE_ASSERT_CHECKING
-static bool _bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir);
static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
#endif
static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
@@ -1034,73 +1032,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
return false;
}
-/*
- * _bt_rewind_nonrequired_arrays() -- Rewind SAOP arrays not marked required
- *
- * Called when _bt_advance_array_keys decides to start a new primitive index
- * scan on the basis of the current scan position being before the position
- * that _bt_first is capable of repositioning the scan to by applying an
- * inequality operator required in the opposite-to-scan direction only.
- *
- * Although equality strategy scan keys (for both arrays and non-arrays alike)
- * are either marked required in both directions or in neither direction,
- * there is a sense in which non-required arrays behave like required arrays.
- * With a qual such as "WHERE a IN (100, 200) AND b >= 3 AND c IN (5, 6, 7)",
- * the scan key on "c" is non-required, but nevertheless enables positioning
- * the scan at the first tuple >= "(100, 3, 5)" on the leaf level during the
- * first descent of the tree by _bt_first. Later on, there could also be a
- * second descent, that places the scan right before tuples >= "(200, 3, 5)".
- * _bt_first must never be allowed to build an insertion scan key whose "c"
- * entry is set to a value other than 5, the "c" array's first element/value.
- * (Actually, it's the first in the current scan direction. This example uses
- * a forward scan.)
- *
- * Calling here resets the array scan key elements for the scan's non-required
- * arrays. This is strictly necessary for correctness in a subset of cases
- * involving "required in opposite direction"-triggered primitive index scans.
- * Not all callers are at risk of _bt_first using a non-required array like
- * this, but advancement always resets the arrays when another primitive scan
- * is scheduled, just to keep things simple. Array advancement even makes
- * sure to reset non-required arrays during scans that have no inequalities.
- * (Advancement still won't call here when there are no inequalities, though
- * that's just because it's all handled indirectly instead.)
- *
- * Note: _bt_verify_arrays_bt_first is called by an assertion to enforce that
- * everybody got this right.
- *
- * Note: In practice almost all SAOP arrays are marked required during
- * preprocessing (if necessary by generating skip arrays). It is hardly ever
- * truly necessary to call here, but consistently doing so is simpler.
- */
-static void
-_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
-{
- Relation rel = scan->indexRelation;
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
- Assert(array->scan_key == ikey);
-
- if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
- continue;
-
- Assert(array->num_elems != -1); /* No non-required skip arrays */
-
- _bt_array_set_low_or_high(rel, cur, array,
- ScanDirectionIsForward(dir));
- }
-}
-
/*
* _bt_tuple_before_array_skeys() -- too early to advance required arrays?
*
@@ -1380,8 +1311,6 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
*/
if (so->needPrimScan)
{
- Assert(_bt_verify_arrays_bt_first(scan, dir));
-
/*
* Flag was set -- must call _bt_first again, which will reset the
* scan's needPrimScan flag
@@ -2007,14 +1936,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
else if (has_required_opposite_direction_only && pstate->finaltup &&
unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
- {
- /*
- * Make sure that any SAOP arrays that were not marked required by
- * preprocessing are reset to their first element for this direction
- */
- _bt_rewind_nonrequired_arrays(scan, dir);
goto new_prim_scan;
- }
continue_scan:
@@ -2045,8 +1967,6 @@ continue_scan:
*/
so->oppositeDirCheck = has_required_opposite_direction_only;
- _bt_rewind_nonrequired_arrays(scan, dir);
-
/*
* skip by setting "look ahead" mechanism's offnum for forwards scans
* (backwards scans check scanBehind flag directly instead)
@@ -2142,48 +2062,6 @@ end_toplevel_scan:
}
#ifdef USE_ASSERT_CHECKING
-/*
- * Verify that the scan's qual state matches what we expect at the point that
- * _bt_start_prim_scan is about to start a just-scheduled new primitive scan.
- *
- * We enforce a rule against non-required array scan keys: they must start out
- * with whatever element is the first for the scan's current scan direction.
- * See _bt_rewind_nonrequired_arrays comments for an explanation.
- */
-static bool
-_bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
-
- if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
- ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
- continue;
-
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array->cur_elem != first_elem_dir)
- return false;
- }
-
- return _bt_verify_keys_with_arraykeys(scan);
-}
-
/*
* Verify that the scan's "so->keyData[]" scan keys are in agreement with
* its array key state
@@ -2194,6 +2072,7 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int last_sk_attno = InvalidAttrNumber,
arrayidx = 0;
+ bool nonrequiredseen = false;
if (!so->qual_ok)
return false;
@@ -2217,8 +2096,16 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->num_elems != -1 &&
cur->sk_argument != array->elem_values[array->cur_elem])
return false;
- if (last_sk_attno > cur->sk_attno)
- return false;
+ if (cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))
+ {
+ if (last_sk_attno > cur->sk_attno)
+ return false;
+ if (nonrequiredseen)
+ return false;
+ }
+ else
+ nonrequiredseen = true;
+
last_sk_attno = cur->sk_attno;
}
@@ -2551,7 +2438,6 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
if (!(key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
{
/* Scan key isn't marked required (corner case) */
- Assert(!(key->sk_flags & SK_ROW_HEADER));
break; /* unsafe */
}
if (key->sk_flags & SK_ROW_HEADER)
--
2.50.0
On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v4, which is largely just a polished version of v3. It has
> improved comments and more worked out commit messages. Plus the second
> patch (the row compare patch) now teaches _bt_first to fully rely on
> scan key requiredness markings, just like with other scan keys.

Heikki said he'd be able to give this patch set at least a quick
review, so here's a new revision, v5.
This isn't really different to v4. It has more comment cleanup, and
better commit messages.
--
Peter Geoghegan
Attachments:
v5-0001-Make-handling-of-redundant-nbtree-keys-more-robus.patch
From 26a345980771397771c2703ccf461040480a382c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 23 Jun 2025 22:46:05 -0400
Subject: [PATCH v5 1/2] Make handling of redundant nbtree keys more robust.
nbtree preprocessing's handling of redundant (and contradictory) keys
created problems for scans with = arrays. It was just about possible
for a scan with an = array and one or more redundant keys (keys that
preprocessing could not eliminate due to an incomplete opfamily and a
cross-type key) to get stuck. Testing has shown that infinite cycling
where the scan never manages to make forward progress was possible.
This could happen when the scan's arrays were reset in _bt_readpage's
forcenonrequired=true path (added by bugfix commit 5f4d98d4) when the
arrays weren't at least advanced up to the same point that they were in
at the start of the _bt_readpage call. Earlier redundant keys prevented
the finaltup call to _bt_advance_array_keys from reaching lower-order
keys that needed to be used to sufficiently advance the scan's arrays.
To fix, make preprocessing leave the scan's keys in a state that is as
close as possible to how it'll usually leave them (in the common case
where there's no redundant keys that preprocessing failed to eliminate).
Now nbtree preprocessing _reliably_ leaves behind at most one required
>/>= key per index column, and at most one required </<= key per index
column. Columns that have one or more = keys that are eligible to be
marked required (based on the traditional rules) prioritize the = keys
over redundant inequality keys; they'll _reliably_ be left with only one
of the = keys as the index column's only required key.
Keys that are not marked required (whether due to the new preprocessing
step running or for some other reason) are relocated to the end of the
so->keyData[] array as needed. That way they'll always be evaluated
after the scan's required keys, and so cannot prevent code in places
like _bt_advance_array_keys and _bt_first from reaching a required key.
Also teach _bt_first to decide which initial positioning keys to use
based on the same requiredness markings that have long been used by
_bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for
reliably avoiding infinite cycling. _bt_advance_array_keys expects to
be able to reason about what'll happen in the next _bt_first call should
it start another primitive index scan, by evaluating inequality keys
that were marked required in the opposite-to-scan direction only.
Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys)
will always agree on which exact key will be used on each index column
to start and/or end the scan (except when row compare keys are involved,
which have similar problems not addressed by this commit).
An upcoming commit will finish off the work started by this commit by
harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys
apply row compare keys to start and end scans.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 381 +++++++++++++++---
src/backend/access/nbtree/nbtsearch.c | 204 +++++-----
src/backend/access/nbtree/nbtutils.c | 136 +------
3 files changed, 453 insertions(+), 268 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..ca2a9050f 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/nbtree.h"
+#include "common/int.h"
#include "lib/qunique.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
@@ -56,6 +57,8 @@ static void _bt_skiparray_strat_decrement(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
static void _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
BTArrayKeyInfo *array);
+static void _bt_unmark_keys(IndexScanDesc scan, int *keyDataMap);
+static int _bt_reorder_array_cmp(const void *a, const void *b);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
static int _bt_num_array_keys(IndexScanDesc scan, Oid *skip_eq_ops_out,
@@ -96,7 +99,7 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* incomplete sets of cross-type operators, we may fail to detect redundant
* or contradictory keys, but we can survive that.)
*
- * The output keys must be sorted by index attribute. Presently we expect
+ * Required output keys are sorted by index attribute. Presently we expect
* (but verify) that the input keys are already so sorted --- this is done
* by match_clauses_to_index() in indxpath.c. Some reordering of the keys
* within each attribute may be done as a byproduct of the processing here.
@@ -127,29 +130,36 @@ static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
* This has the potential to be much more efficient than a full index scan
* (though it behaves like a full scan when there's many distinct "x" values).
*
- * If possible, redundant keys are eliminated: we keep only the tightest
+ * Typically, redundant keys are eliminated: we keep only the tightest
* >/>= bound and the tightest </<= bound, and if there's an = key then
* that's the only one returned. (So, we return either a single = key,
* or one or two boundary-condition keys for each attr.) However, if we
* cannot compare two keys for lack of a suitable cross-type operator,
- * we cannot eliminate either. If there are two such keys of the same
- * operator strategy, the second one is just pushed into the output array
- * without further processing here. We may also emit both >/>= or both
- * </<= keys if we can't compare them. The logic about required keys still
- * works if we don't eliminate redundant keys.
+ * we cannot eliminate either key.
*
- * Note that one reason we need direction-sensitive required-key flags is
- * precisely that we may not be able to eliminate redundant keys. Suppose
- * we have "x > 4::int AND x > 10::bigint", and we are unable to determine
- * which key is more restrictive for lack of a suitable cross-type operator.
- * _bt_first will arbitrarily pick one of the keys to do the initial
- * positioning with. If it picks x > 4, then the x > 10 condition will fail
- * until we reach index entries > 10; but we can't stop the scan just because
- * x > 10 is failing. On the other hand, if we are scanning backwards, then
- * failure of either key is indeed enough to stop the scan. (In general, when
- * inequality keys are present, the initial-positioning code only promises to
- * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * When some redundant keys could not be eliminated, we'll output a key array
+ * that can more or less be treated as if it had no redundant keys. Suppose
+ * we have "x > 4::int AND x > 10::bigint AND x < 70", and we are unable to
+ * determine which > key is more restrictive for lack of a suitable cross-type
+ * operator. We'll arbitrarily pick one of the > keys; the other > key won't
+ * be marked required. Obviously, the scan will be less efficient if we
+ * choose x > 4 over x > 10 -- but it can still largely proceed as if there
+ * was only a single > condition. "x > 10" will be placed at the end of the
+ * so->keyData[] output array. It'll always be evaluated last, after the keys
+ * that could be marked required in the usual way (after "x > 4 AND x < 70").
+ * This can sometimes result in so->keyData[] keys that aren't even in index
+ * attribute order (if the qual has multiple index attributes). The scan's
+ * required keys will still be in attribute order, though, so it can't matter.
+ *
+ * This scheme ensures that _bt_first always uses the same set of keys at the
+ * start of a forwards scan as those _bt_checkkeys uses to determine when to
+ * end a similar backwards scan (and vice-versa). _bt_advance_array_keys
+ * depends on this: it expects to be able to reliably predict what the next
+ * _bt_first call will do by testing whether _bt_checkkeys' routines report
+ * that the final tuple on the page is past the end of matches for the scan's
+ * keys with the scan direction flipped. If it is (if continuescan=false),
+ * then it follows that calling _bt_first will, at a minimum, relocate the
+ * scan to the very next leaf page (in the current scan direction).
*
* As a byproduct of this work, we can detect contradictory quals such
* as "x = 1 AND x > 2". If we see that, we return so->qual_ok = false,
@@ -188,7 +198,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
int numberOfEqualCols;
ScanKey inkeys;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
- bool test_result;
+ bool test_result,
+ redundant_key_kept = false;
AttrNumber attno;
ScanKey arrayKeyData;
int *keyDataMap = NULL;
@@ -388,7 +399,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = NULL;
xform[j].inkeyi = -1;
}
- /* else, cannot determine redundancy, keep both keys */
+ else
+ redundant_key_kept = true;
}
/* track number of attrs for which we have "=" keys */
numberOfEqualCols++;
@@ -409,6 +421,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTLessStrategyNumber - 1].inkey = NULL;
}
+ else
+ redundant_key_kept = true;
}
/* try to keep only one of >, >= */
@@ -426,6 +440,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
else
xform[BTGreaterStrategyNumber - 1].inkey = NULL;
}
+ else
+ redundant_key_kept = true;
}
/*
@@ -466,25 +482,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* check strategy this key's operator corresponds to */
j = inkey->sk_strategy - 1;
- /* if row comparison, push it directly to the output array */
- if (inkey->sk_flags & SK_ROW_HEADER)
- {
- ScanKey outkey = &so->keyData[new_numberOfKeys++];
-
- memcpy(outkey, inkey, sizeof(ScanKeyData));
- if (arrayKeyData)
- keyDataMap[new_numberOfKeys - 1] = i;
- if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
-
- /*
- * We don't support RowCompare using equality; such a qual would
- * mess up the numberOfEqualCols tracking.
- */
- Assert(j != (BTEqualStrategyNumber - 1));
- continue;
- }
-
if (inkey->sk_strategy == BTEqualStrategyNumber &&
(inkey->sk_flags & SK_SEARCHARRAY))
{
@@ -593,9 +590,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* the new scan key.
*
* Note: We do things this way around so that our arrays are
- * always in the same order as their corresponding scan keys,
- * even with incomplete opfamilies. _bt_advance_array_keys
- * depends on this.
+ * always in the same order as their corresponding scan keys.
+ * _bt_preprocess_array_keys_final expects this.
*/
ScanKey outkey = &so->keyData[new_numberOfKeys++];
@@ -607,6 +603,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
xform[j].inkey = inkey;
xform[j].inkeyi = i;
xform[j].arrayidx = arrayidx;
+ redundant_key_kept = true;
}
}
}
@@ -622,6 +619,15 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (arrayKeyData)
_bt_preprocess_array_keys_final(scan, keyDataMap);
+ /*
+ * If there are remaining redundant inequality keys, we must make sure
+ * that each index attribute has no more than one required >/>= key, and
+ * no more than one required </<= key. Attributes that have one or more
+ * required = keys now must keep only one required key (the first = key).
+ */
+ if (unlikely(redundant_key_kept) && so->qual_ok)
+ _bt_unmark_keys(scan, keyDataMap);
+
/* Could pfree arrayKeyData/keyDataMap now, but not worth the cycles */
}
@@ -847,8 +853,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
cmp_op;
StrategyNumber strat;
- Assert(!((leftarg->sk_flags | rightarg->sk_flags) &
- (SK_ROW_HEADER | SK_ROW_MEMBER)));
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_MEMBER));
/*
* First, deal with cases where one or both args are NULL. This should
@@ -924,6 +929,16 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
return true;
}
+ /*
+ * We don't yet know how to determine redundancy when it involves a row
+ * compare key (barring simple cases involving IS NULL/IS NOT NULL)
+ */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_HEADER)
+ {
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
+ return false;
+ }
+
/*
* If either leftarg or rightarg are equality-type array scankeys, we need
* specialized handling (since by now we know that IS NULL wasn't used)
@@ -1467,6 +1482,282 @@ _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
}
}
+/*
+ * _bt_unmark_keys() -- make superfluous required keys nonrequired after all
+ *
+ * When _bt_preprocess_keys failed to eliminate one or more redundant keys, it
+ * calls here to make sure that no index attribute has more than one > or >=
+ * key marked required, and no more than one required < or <= key. Attributes
+ * with = keys will always get one = key as their required key. All other
+ * keys that were initially marked required get "unmarked" here. This ensures
+ * that _bt_first and _bt_checkkeys reliably have exactly the same ideas about
+ * which keys to use to start and end the scan.
+ *
+ * We also relocate keys that become/started out nonrequired to the end of
+ * so->keyData[]. That way, _bt_first/_bt_checkkeys/_bt_advance_array_keys
+ * cannot fail to reach a required key due to some earlier nonrequired key.
+ *
+ * Only call here when _bt_compare_scankey_args returned false at least once
+ * (otherwise, calling here will just waste cycles).
+ */
+static void
+_bt_unmark_keys(IndexScanDesc scan, int *keyDataMap)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ AttrNumber attno;
+ ScanKey origkey;
+ bool *unmarkikey;
+ int nunmark,
+ nunmarked,
+ nkept,
+ firsti;
+ ScanKey keepKeys,
+ unmarkKeys;
+ FmgrInfo *keepOrderProcs = NULL,
+ *unmarkOrderProcs = NULL;
+ bool haveReqEquals,
+ haveReqForward,
+ haveReqBackward;
+
+ /*
+ * Do an initial pass over so->keyData[] that determines which keys to
+ * keep as required.
+ *
+ * When both equality and inequality keys remain on a single attribute, we
+ * *must* make sure that exactly one of the equalities remains required.
+ * Any requiredness markings that we might leave on later keys/attributes
+ * are predicated on there being required = keys on all prior columns.
+ */
+ unmarkikey = palloc0(so->numberOfKeys * sizeof(bool));
+ nunmark = 0;
+
+ /* Set things up for first key's attr */
+ origkey = so->keyData;
+ attno = origkey->sk_attno;
+ firsti = 0;
+ haveReqEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ for (int i = 0; i < so->numberOfKeys; origkey++, i++)
+ {
+ if (origkey->sk_attno != attno)
+ {
+ /* Reset for next attr */
+ attno = origkey->sk_attno;
+ firsti = i;
+
+ haveReqEquals = false;
+ haveReqForward = false;
+ haveReqBackward = false;
+ }
+
+ /* Equalities get priority over inequalities */
+ if (haveReqEquals)
+ {
+ /*
+ * We've already found the first "=" key for attr. We already
+ * decided that every other key on attno has to be unmarked.
+ */
+ Assert(!(origkey->sk_flags & SK_SEARCHNULL));
+ unmarkikey[i] = true;
+ nunmark++;
+ continue;
+ }
+ else if ((origkey->sk_flags & SK_BT_REQFWD) &&
+ (origkey->sk_flags & SK_BT_REQBKWD))
+ {
+ /*
+ * Found the first "=" key for attno. All other attno keys will
+ * be unmarked.
+ */
+ Assert(origkey->sk_strategy == BTEqualStrategyNumber);
+
+ haveReqEquals = true;
+ for (int j = firsti; j < i; j++)
+ {
+ /* Unmark any prior inequality keys on attno after all */
+ if (!unmarkikey[j])
+ {
+ unmarkikey[j] = true;
+ nunmark++;
+ }
+ }
+ continue;
+ }
+
+ /* Deal with inequalities next */
+ if ((origkey->sk_flags & SK_BT_REQFWD) && !haveReqForward)
+ {
+ haveReqForward = true;
+ continue;
+ }
+ else if ((origkey->sk_flags & SK_BT_REQBKWD) && !haveReqBackward)
+ {
+ haveReqBackward = true;
+ continue;
+ }
+
+ /*
+ * We have either a redundant inequality key that will be unmarked, or
+ * we have a key that wasn't marked required in the first place
+ */
+ unmarkikey[i] = true;
+ nunmark++;
+ }
+
+ /* Should only be called when _bt_compare_scankey_args reported failure */
+ Assert(nunmark > 0);
+
+ /*
+ * Next, allocate temp arrays: one for required keys that'll remain
+ * required, the other for all remaining keys
+ */
+ unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+ nunmarked = 0;
+ nkept = 0;
+ if (so->numArrayKeys)
+ {
+ unmarkOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ keepOrderProcs = palloc(so->numberOfKeys * sizeof(FmgrInfo));
+ }
+
+ /*
+ * Next, copy the contents of so->keyData[] into the appropriate temp
+ * array.
+ *
+ * Scans with = array keys need us to maintain invariants around the order
+ * of so->orderProcs[] and so->arrayKeys[] relative to so->keyData[]. See
+ * _bt_preprocess_array_keys_final for a full explanation.
+ */
+ for (int i = 0; i < so->numberOfKeys; i++)
+ {
+ ScanKey unmark;
+
+ origkey = so->keyData + i;
+
+ if (!unmarkikey[i])
+ {
+ /*
+ * Key gets to keep its original requiredness markings.
+ *
+ * Key will stay in its original position, unless we're going to
+ * unmark an earlier key (in which case this key gets moved back).
+ */
+ memcpy(keepKeys + nkept, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[i] = nkept;
+ memcpy(keepOrderProcs + nkept, &so->orderProcs[i],
+ sizeof(FmgrInfo));
+ }
+
+ nkept++;
+ continue;
+ }
+
+ /*
+ * Key will be unmarked as needed, and moved to the end of the array,
+ * next to other keys that will become (or always were) nonrequired
+ */
+ unmark = unmarkKeys + nunmarked;
+ memcpy(unmark, origkey, sizeof(ScanKeyData));
+
+ if (so->numArrayKeys)
+ {
+ keyDataMap[i] = (so->numberOfKeys - nunmark) + nunmarked;
+ memcpy(&unmarkOrderProcs[nunmarked], &so->orderProcs[i],
+ sizeof(FmgrInfo));
+ }
+
+ /*
+ * Preprocessing only generates skip arrays when it knows that they'll
+ * be the only required = key on the attr. We'll never unmark them.
+ */
+ Assert(!(unmark->sk_flags & SK_BT_SKIP));
+
+ /*
+ * Also shouldn't have to unmark an IS NULL or an IS NOT NULL key.
+ * They aren't cross-type, so an incomplete opfamily can't matter.
+ */
+ Assert(!(unmark->sk_flags & SK_ISNULL) ||
+ !(unmark->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)));
+
+ /* Clear requiredness flags on redundant key (and on any subkeys) */
+ unmark->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (unmark->sk_flags & SK_ROW_HEADER)
+ {
+ ScanKey subkey = (ScanKey) DatumGetPointer(origkey->sk_argument);
+
+ Assert(subkey->sk_strategy == origkey->sk_strategy);
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ subkey->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+ }
+
+ nunmarked++;
+ }
+
+ /* Copy both temp arrays back into so->keyData[] to reorder */
+ Assert(nkept == so->numberOfKeys - nunmark);
+ Assert(nunmarked == nunmark);
+ memcpy(so->keyData, keepKeys, sizeof(ScanKeyData) * nkept);
+ memcpy(so->keyData + nkept, unmarkKeys, sizeof(ScanKeyData) * nunmarked);
+
+ /* Done with temp arrays */
+ pfree(unmarkikey);
+ pfree(keepKeys);
+ pfree(unmarkKeys);
+
+ /*
+ * Now copy so->orderProcs[] temp entries needed by scans with = array
+ * keys back (just like with the so->keyData[] temp arrays)
+ */
+ if (so->numArrayKeys)
+ {
+ memcpy(so->orderProcs, keepOrderProcs, sizeof(FmgrInfo) * nkept);
+ memcpy(so->orderProcs + nkept, unmarkOrderProcs,
+ sizeof(FmgrInfo) * nunmarked);
+
+ /* Also fix-up array->scan_key references */
+ for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[arridx];
+
+ array->scan_key = keyDataMap[array->scan_key];
+ }
+
+ /*
+ * Sort so->arrayKeys[] based on its new BTArrayKeyInfo.scan_key
+ * offsets, so that its order matches so->keyData[] order as expected
+ */
+ qsort(so->arrayKeys, so->numArrayKeys, sizeof(BTArrayKeyInfo),
+ _bt_reorder_array_cmp);
+
+ /* Done with temp arrays */
+ pfree(unmarkOrderProcs);
+ pfree(keepOrderProcs);
+ }
+}
+
+/*
+ * qsort comparator for reordering so->arrayKeys[] BTArrayKeyInfo entries
+ */
+static int
+_bt_reorder_array_cmp(const void *a, const void *b)
+{
+ BTArrayKeyInfo *arraya = (BTArrayKeyInfo *) a;
+ BTArrayKeyInfo *arrayb = (BTArrayKeyInfo *) b;
+
+ return pg_cmp_s32(arraya->scan_key, arrayb->scan_key);
+}
+
/*
* _bt_preprocess_array_keys() -- Preprocess SK_SEARCHARRAY scan keys
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd..9846ef6db 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -960,46 +960,51 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*----------
* Examine the scan keys to discover where we need to start the scan.
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the local startKeys[] array. The final
+ * startKeys[] entry's strategy is set in strat_total. (Actually, there
+ * are a couple of cases where we force a less/more restrictive strategy.)
*
- * We want to identify the keys that can be used as starting boundaries;
- * these are =, >, or >= keys for a forward scan or =, <, <= keys for
- * a backwards scan. We can use keys for multiple attributes so long as
- * the prior attributes had only =, >= (resp. =, <=) keys. Once we accept
- * a > or < boundary or find an attribute with no boundary (which can be
- * thought of as the same as "> -infinity"), we can't use keys for any
- * attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
+ * We must use the key that was marked required (in the direction opposite
+ * our own scan's) during preprocessing. Each index attribute can only
+ * have one such required key. In general, the keys that we use to find
+ * an initial position when scanning forwards are the same keys that end
+ * the scan on the leaf level when scanning backwards (and vice-versa).
*
* When the scan keys include cross-type operators, _bt_preprocess_keys
- * may not be able to eliminate redundant keys; in such cases we will
- * arbitrarily pick a usable one for each attribute. This is correct
- * but possibly not optimal behavior. (For example, with keys like
- * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
- * x=5 would be more efficient.) Since the situation only arises given
- * a poorly-worded query plus an incomplete opfamily, live with it.
+ * may not be able to eliminate redundant keys; in such cases it will
+ * arbitrarily pick a usable key for each attribute (and scan direction),
+ * ensuring that there is no more than one key required in each direction.
+ * We stop considering further keys once we reach the first nonrequired
+ * key (which must come after all required keys), so this can't affect us.
*
- * When both equality and inequality keys appear for a single attribute
- * (again, only possible when cross-type operators appear), we *must*
- * select one of the equality keys for the starting point, because
- * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
- * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
- * start at x=4, we will fail and stop before reaching x=10. If multiple
- * equality quals survive preprocessing, however, it doesn't matter which
- * one we use --- by definition, they are either redundant or
- * contradictory.
+ * The required keys that we use as starting boundaries have to be =, >,
+ * or >= keys for a forward scan or =, <, <= keys for a backwards scan.
+ * We can use keys for multiple attributes so long as the prior attributes
+ * had only =, >= (resp. =, <=) keys. These rules are very similar to the
+ * rules that preprocessing used to determine which keys to mark required.
+ * We cannot always use every required key as a positioning key, though.
+ * Skip arrays necessitate independently applying our own rules here.
+ * Skip arrays are always generally considered = array keys, but we'll
+ * nevertheless treat them as inequalities at certain points of the scan.
+ * When that happens, it _might_ have implications for the number of
+ * required keys that we can safely use for initial positioning purposes.
*
- * In practice we rarely see any "attribute boundary key gaps" here.
- * Preprocessing can usually backfill skip array keys for any attributes
- * that were omitted from the original scan->keyData[] input keys. All
- * array keys are always considered = keys, but we'll sometimes need to
- * treat the current key value as if we were using an inequality strategy.
- * This happens with range skip arrays, which store inequality keys in the
- * array's low_compare/high_compare fields (used to find the first/last
- * set of matches, when = key will lack a usable sk_argument value).
- * These are always preferred over any redundant "standard" inequality
- * keys on the same column (per the usual rule about preferring = keys).
- * Note also that any column with an = skip array key can never have an
- * additional, contradictory = key.
+ * For example, a forward scan with a skip array on its leading attribute
+ * (with no low_compare/high_compare) will have at least two required scan
+ * keys, but we won't use any of them as boundary keys during the scan's
+ * initial call here. Our positioning key during the first call here can
+ * be thought of as representing "> -infinity". Similarly, if such a skip
+ * array's low_compare is "a > 'foo'", then we position using "a > 'foo'"
+ * during the scan's initial call here; a lower-order key such as "b = 42"
+ * can't be used until the "a" array advances beyond MINVAL/low_compare.
+ *
+ * On the other hand, if such a skip array's low_compare was "a >= 'foo'",
+ * then we _can_ use "a >= 'foo' AND b = 42" during the initial call here.
+ * A subsequent call here might have us use "a = 'fop' AND b = 42". Note
+ * that we treat = and >= as equivalent when scanning forwards (just as we
+ * treat = and <= as equivalent when scanning backwards). We effectively
+ * do the same thing (though with a distinct "a" element/value) each time.
*
* All keys (with the exception of SK_SEARCHNULL keys and SK_BT_SKIP
* array keys whose array is "null_elem=true") imply a NOT NULL qualifier.
@@ -1014,18 +1019,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* first (leftmost) columns. We'll add on lower-order columns of the row
* comparison below, if possible.
*
- * The selected scan keys (at most one per index column) are remembered by
- * storing their addresses into the local startKeys[] array.
- *
- * _bt_checkkeys/_bt_advance_array_keys decide whether and when to start
- * the next primitive index scan (for scans with array keys) based in part
- * on an understanding of how it'll enable us to reposition the scan.
- * They're directly aware of how we'll sometimes cons up an explicit
- * SK_SEARCHNOTNULL key. They'll even end primitive scans by applying a
- * symmetric "deduce NOT NULL" rule of their own. This allows top-level
- * scans to skip large groups of NULLs through repeated deductions about
- * key strictness (for a required inequality key) and whether NULLs in the
- * key's index column are stored last or first (relative to non-NULLs).
+ * _bt_advance_array_keys needs to know exactly how we'll reposition the
+ * scan (should it opt to schedule another primitive index scan). It is
+ * critical that primscans only be scheduled when they'll definitely make
+ * some useful progress. _bt_advance_array_keys does this by calling
+ * _bt_checkkeys routines that report whether a tuple is past the end of
+ * matches for the scan's keys (given the scan's current array elements).
+ * If the page's final tuple is "after the end of matches" for a scan that
+ * uses the *opposite* scan direction, then it must follow that it's also
+ * "before the start of matches" for the actual current scan direction.
+ * It is therefore essential that all of our initial positioning rules are
+ * symmetric with _bt_checkkeys's corresponding continuescan=false rule.
* If you update anything here, _bt_checkkeys/_bt_advance_array_keys might
* need to be kept in sync.
*----------
@@ -1034,18 +1038,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (so->numberOfKeys > 0)
{
AttrNumber curattr;
- ScanKey chosen;
+ ScanKey bkey;
ScanKey impliesNN;
ScanKey cur;
/*
- * chosen is the so-far-chosen key for the current attribute, if any.
- * We don't cast the decision in stone until we reach keys for the
- * next attribute.
+ * bkey will be set to the key that preprocessing left behind as the
+ * boundary key for this attribute, in this scan direction (if any)
*/
cur = so->keyData;
curattr = 1;
- chosen = NULL;
+ bkey = NULL;
/* Also remember any scankey that implies a NOT NULL constraint */
impliesNN = NULL;
@@ -1058,23 +1061,29 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ /* Done looking for the curattr boundary key */
+ Assert(bkey == NULL ||
+ (bkey->sk_attno == curattr &&
+ (bkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))));
+ Assert(impliesNN == NULL ||
+ (impliesNN->sk_attno == curattr &&
+ (impliesNN->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))));
+
/*
- * Done looking at keys for curattr.
- *
* If this is a scan key for a skip array whose current
* element is MINVAL, choose low_compare (when scanning
* backwards it'll be MAXVAL, and we'll choose high_compare).
*
- * Note: if the array's low_compare key makes 'chosen' NULL,
+ * Note: if the array's low_compare key makes 'bkey' NULL,
* then we behave as if the array's first element is -inf,
* except when !array->null_elem implies a usable NOT NULL
* constraint.
*/
- if (chosen != NULL &&
- (chosen->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)))
+ if (bkey != NULL &&
+ (bkey->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)))
{
- int ikey = chosen - so->keyData;
- ScanKey skipequalitykey = chosen;
+ int ikey = bkey - so->keyData;
+ ScanKey skipequalitykey = bkey;
BTArrayKeyInfo *array = NULL;
for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
@@ -1087,35 +1096,35 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (ScanDirectionIsForward(dir))
{
Assert(!(skipequalitykey->sk_flags & SK_BT_MAXVAL));
- chosen = array->low_compare;
+ bkey = array->low_compare;
}
else
{
Assert(!(skipequalitykey->sk_flags & SK_BT_MINVAL));
- chosen = array->high_compare;
+ bkey = array->high_compare;
}
- Assert(chosen == NULL ||
- chosen->sk_attno == skipequalitykey->sk_attno);
+ Assert(bkey == NULL ||
+ bkey->sk_attno == skipequalitykey->sk_attno);
if (!array->null_elem)
impliesNN = skipequalitykey;
else
- Assert(chosen == NULL && impliesNN == NULL);
+ Assert(bkey == NULL && impliesNN == NULL);
}
/*
* If we didn't find a usable boundary key, see if we can
* deduce a NOT NULL key
*/
- if (chosen == NULL && impliesNN != NULL &&
+ if (bkey == NULL && impliesNN != NULL &&
((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
ScanDirectionIsForward(dir) :
ScanDirectionIsBackward(dir)))
{
/* Yes, so build the key in notnullkeys[keysz] */
- chosen = ¬nullkeys[keysz];
- ScanKeyEntryInitialize(chosen,
+ bkey = ¬nullkeys[keysz];
+ ScanKeyEntryInitialize(bkey,
(SK_SEARCHNOTNULL | SK_ISNULL |
(impliesNN->sk_flags &
(SK_BT_DESC | SK_BT_NULLS_FIRST))),
@@ -1130,12 +1139,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/*
- * If we still didn't find a usable boundary key, quit; else
- * save the boundary key pointer in startKeys.
+ * If preprocessing didn't leave a usable boundary key, quit;
+ * else save the boundary key pointer in startKeys[]
*/
- if (chosen == NULL)
+ if (bkey == NULL)
break;
- startKeys[keysz++] = chosen;
+ startKeys[keysz++] = bkey;
/*
* We can only consider adding more boundary keys when the one
@@ -1143,7 +1152,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* (during backwards scans we can only do so when the key that
* we just added to startKeys[] uses the = or <= strategy)
*/
- strat_total = chosen->sk_strategy;
+ strat_total = bkey->sk_strategy;
if (strat_total == BTGreaterStrategyNumber ||
strat_total == BTLessStrategyNumber)
break;
@@ -1154,19 +1163,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* make strat_total > or < (and stop adding boundary keys).
* This can only happen with opclasses that lack skip support.
*/
- if (chosen->sk_flags & (SK_BT_NEXT | SK_BT_PRIOR))
+ if (bkey->sk_flags & (SK_BT_NEXT | SK_BT_PRIOR))
{
- Assert(chosen->sk_flags & SK_BT_SKIP);
+ Assert(bkey->sk_flags & SK_BT_SKIP);
Assert(strat_total == BTEqualStrategyNumber);
if (ScanDirectionIsForward(dir))
{
- Assert(!(chosen->sk_flags & SK_BT_PRIOR));
+ Assert(!(bkey->sk_flags & SK_BT_PRIOR));
strat_total = BTGreaterStrategyNumber;
}
else
{
- Assert(!(chosen->sk_flags & SK_BT_NEXT));
+ Assert(!(bkey->sk_flags & SK_BT_NEXT));
strat_total = BTLessStrategyNumber;
}
@@ -1180,24 +1189,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Done if that was the last scan key output by preprocessing.
- * Also done if there is a gap index attribute that lacks a
- * usable key (only possible when preprocessing was unable to
- * generate a skip array key to "fill in the gap").
+ * Also done if we've now examined all keys marked required.
*/
if (i >= so->numberOfKeys ||
- cur->sk_attno != curattr + 1)
+ !(cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
break;
/*
* Reset for next attr.
*/
+ Assert(cur->sk_attno == curattr + 1);
curattr = cur->sk_attno;
- chosen = NULL;
+ bkey = NULL;
impliesNN = NULL;
}
/*
- * Can we use this key as a starting boundary for this attr?
+ * If we've located the starting boundary key for curattr, we have
+ * no interest in curattr's other required key
+ */
+ if (bkey != NULL)
+ continue;
+
+ /*
+ * Is this key the starting boundary key for curattr?
*
* If not, does it imply a NOT NULL constraint? (Because
* SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
@@ -1207,27 +1222,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
case BTLessStrategyNumber:
case BTLessEqualStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsBackward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
+ if (ScanDirectionIsBackward(dir))
+ bkey = cur;
+ else if (impliesNN == NULL)
+ impliesNN = cur;
break;
case BTEqualStrategyNumber:
- /* override any non-equality choice */
- chosen = cur;
+ bkey = cur;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsForward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
+ if (ScanDirectionIsForward(dir))
+ bkey = cur;
+ else if (impliesNN == NULL)
+ impliesNN = cur;
break;
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2..eb6dbfda3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -44,7 +44,6 @@ static bool _bt_array_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *arra
static bool _bt_array_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
bool *skip_array_set);
-static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
bool readpagetup, int sktrig, bool *scanBehind);
@@ -52,7 +51,6 @@ static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
int sktrig, bool sktrig_required);
#ifdef USE_ASSERT_CHECKING
-static bool _bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir);
static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
#endif
static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
@@ -1034,73 +1032,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
return false;
}
-/*
- * _bt_rewind_nonrequired_arrays() -- Rewind SAOP arrays not marked required
- *
- * Called when _bt_advance_array_keys decides to start a new primitive index
- * scan on the basis of the current scan position being before the position
- * that _bt_first is capable of repositioning the scan to by applying an
- * inequality operator required in the opposite-to-scan direction only.
- *
- * Although equality strategy scan keys (for both arrays and non-arrays alike)
- * are either marked required in both directions or in neither direction,
- * there is a sense in which non-required arrays behave like required arrays.
- * With a qual such as "WHERE a IN (100, 200) AND b >= 3 AND c IN (5, 6, 7)",
- * the scan key on "c" is non-required, but nevertheless enables positioning
- * the scan at the first tuple >= "(100, 3, 5)" on the leaf level during the
- * first descent of the tree by _bt_first. Later on, there could also be a
- * second descent, that places the scan right before tuples >= "(200, 3, 5)".
- * _bt_first must never be allowed to build an insertion scan key whose "c"
- * entry is set to a value other than 5, the "c" array's first element/value.
- * (Actually, it's the first in the current scan direction. This example uses
- * a forward scan.)
- *
- * Calling here resets the array scan key elements for the scan's non-required
- * arrays. This is strictly necessary for correctness in a subset of cases
- * involving "required in opposite direction"-triggered primitive index scans.
- * Not all callers are at risk of _bt_first using a non-required array like
- * this, but advancement always resets the arrays when another primitive scan
- * is scheduled, just to keep things simple. Array advancement even makes
- * sure to reset non-required arrays during scans that have no inequalities.
- * (Advancement still won't call here when there are no inequalities, though
- * that's just because it's all handled indirectly instead.)
- *
- * Note: _bt_verify_arrays_bt_first is called by an assertion to enforce that
- * everybody got this right.
- *
- * Note: In practice almost all SAOP arrays are marked required during
- * preprocessing (if necessary by generating skip arrays). It is hardly ever
- * truly necessary to call here, but consistently doing so is simpler.
- */
-static void
-_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
-{
- Relation rel = scan->indexRelation;
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
- Assert(array->scan_key == ikey);
-
- if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
- continue;
-
- Assert(array->num_elems != -1); /* No non-required skip arrays */
-
- _bt_array_set_low_or_high(rel, cur, array,
- ScanDirectionIsForward(dir));
- }
-}
-
/*
* _bt_tuple_before_array_skeys() -- too early to advance required arrays?
*
@@ -1380,8 +1311,6 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
*/
if (so->needPrimScan)
{
- Assert(_bt_verify_arrays_bt_first(scan, dir));
-
/*
* Flag was set -- must call _bt_first again, which will reset the
* scan's needPrimScan flag
@@ -2007,14 +1936,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
else if (has_required_opposite_direction_only && pstate->finaltup &&
unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
- {
- /*
- * Make sure that any SAOP arrays that were not marked required by
- * preprocessing are reset to their first element for this direction
- */
- _bt_rewind_nonrequired_arrays(scan, dir);
goto new_prim_scan;
- }
continue_scan:
@@ -2045,8 +1967,6 @@ continue_scan:
*/
so->oppositeDirCheck = has_required_opposite_direction_only;
- _bt_rewind_nonrequired_arrays(scan, dir);
-
/*
* skip by setting "look ahead" mechanism's offnum for forwards scans
* (backwards scans check scanBehind flag directly instead)
@@ -2142,48 +2062,6 @@ end_toplevel_scan:
}
#ifdef USE_ASSERT_CHECKING
-/*
- * Verify that the scan's qual state matches what we expect at the point that
- * _bt_start_prim_scan is about to start a just-scheduled new primitive scan.
- *
- * We enforce a rule against non-required array scan keys: they must start out
- * with whatever element is the first for the scan's current scan direction.
- * See _bt_rewind_nonrequired_arrays comments for an explanation.
- */
-static bool
-_bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int arrayidx = 0;
-
- for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
- {
- ScanKey cur = so->keyData + ikey;
- BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
-
- if (!(cur->sk_flags & SK_SEARCHARRAY) ||
- cur->sk_strategy != BTEqualStrategyNumber)
- continue;
-
- array = &so->arrayKeys[arrayidx++];
-
- if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
- ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
- continue;
-
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array->cur_elem != first_elem_dir)
- return false;
- }
-
- return _bt_verify_keys_with_arraykeys(scan);
-}
-
/*
* Verify that the scan's "so->keyData[]" scan keys are in agreement with
* its array key state
@@ -2194,6 +2072,7 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int last_sk_attno = InvalidAttrNumber,
arrayidx = 0;
+ bool nonrequiredseen = false;
if (!so->qual_ok)
return false;
@@ -2217,8 +2096,16 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->num_elems != -1 &&
cur->sk_argument != array->elem_values[array->cur_elem])
return false;
- if (last_sk_attno > cur->sk_attno)
- return false;
+ if (cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))
+ {
+ if (last_sk_attno > cur->sk_attno)
+ return false;
+ if (nonrequiredseen)
+ return false;
+ }
+ else
+ nonrequiredseen = true;
+
last_sk_attno = cur->sk_attno;
}
@@ -2551,7 +2438,6 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
if (!(key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
{
/* Scan key isn't marked required (corner case) */
- Assert(!(key->sk_flags & SK_ROW_HEADER));
break; /* unsafe */
}
if (key->sk_flags & SK_ROW_HEADER)
--
2.50.0
Attachment: v5-0002-Make-row-compares-robust-during-scans-with-arrays.patch
From a2c9ad3f8e87f9c2492511086fcc998e84797e48 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Jun 2025 19:32:57 -0400
Subject: [PATCH v5 2/2] Make row compares robust during scans with arrays.
Recent nbtree bugfix commit 5f4d98d4 accounted for infinite cycling
behavior with row compare quals by refusing to apply forcenonrequired in
certain cases involving row compares, but it's not clear that that's
truly robust. Testing has shown that a scan with a row compare can
still read the same leaf page twice (without the scan direction
changing), which isn't supposed to be possible following Postgres 17
commit 5bf748b8. Also, we still allowed a required row compare key to
be used in forcenonrequired mode when the key happened to be beyond the
pstate.ikey offset, which seems quite brittle (and hard to understand).
The forcenonrequired row compare special case kludge was only ever
needed because row compare quals have gratuitously inconsistent rules
around how scans start (which keys can be used for initial positioning
purposes) and how scans end (which keys can set continuescan=false).
Quals with redundant keys that could not be eliminated by preprocessing
had that same quality to them prior to recent bugfix XXXXX, which made
sure that only one key could be used to start and/or end scans. This
scheme avoided the same infinite cycling behavior (with redundant keys)
that we're concerned about here (with row compare keys).
Get rid of the forcenonrequired special case by making it unnecessary.
Do so by bringing row compare quals in line with the charter established
for all required keys by recent bugfix XXXXX: make _bt_first reliably
agree with _bt_check_rowcompare about when and where scans with row
compare quals start and end. That way, _bt_advance_array_keys can't get
the wrong idea about whether or not another call to _bt_first will at
least be able to move the scan to the next leaf page. No infinite
cycling is possible, even in the absence of the forcenonrequired hack,
since the next _bt_first will now definitely make useful progress.
This commit fixes two points of disagreement between _bt_first and
_bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after a required row compare
member. _bt_first now has symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes. It could use fewer or more keys than
_bt_check_rowcompare. _bt_first now uses the same requiredness markings
as _bt_check_rowcompare for this (the same approach that _bt_first takes
with every other kind of qual following recent bugfix commit XXXXX).
Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares are marked required during preprocessing:
we never marked lower-order members as required following 2016 bugfix
commit a298a1e0. The approach taken by that bugfix commit was overly
conservative. The bug in question was actually an oversight in how
_bt_check_rowcompare dealt with tuple NULL values that failed to satisfy
a scan key marked required in the opposite scan direction (it was a bug
in 2011 commits 6980f817 and 882368e8 -- not in 2006 commit 3a0a16cb).
Go back to marking row compare members as required based on the original
2006 rules, and fix the 2016 bug in a more principled way: by limiting
use of the "set continuescan=false with a key required in the opposite
scan direction upon encountering a NULL tuple value" optimization to the
first/most significant row member. It isn't safe to use an implied IS
NOT NULL qualifier to end the scan when it comes from the row compare's
lower-order member keys. NULLs can legitimately occur between groups of
tuples that satisfy the qual as a whole, but that in itself doesn't make
it unsafe for a lower-order member key to end the scan. It _is_ safe
whenever the key really does detect the absolute end of matching tuples.
That's only possible with a lower-order row compare member that's marked
required in the _current_ scan direction (not in the other direction).
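To restate that marking rule outside the diff (an illustration only --
the RowMember struct and function below are invented for this sketch,
and nothing like them appears in the patch itself):

    typedef struct RowMember
    {
        int     attno;      /* index attribute number */
        int     strategy;   /* same meaning as sk_strategy */
        bool    required;   /* output: gets marked required? */
    } RowMember;

    /*
     * Mark lower-order row compare members required for as long as they
     * stay on consecutive index attributes and keep the row header's
     * strategy/direction.  The first member that breaks either rule (and
     * every member after it) stays nonrequired.  Assumes nmembers > 0,
     * since a row compare always has at least one member.
     */
    static void
    mark_row_members(RowMember *members, int nmembers)
    {
        int     nextattno = members[0].attno;
        int     headerstrategy = members[0].strategy;

        for (int i = 0; i < nmembers; i++)
        {
            if (members[i].attno != nextattno)
                break;          /* attribute gap, not required */
            if (members[i].strategy != headerstrategy)
                break;          /* wrong direction, not required */
            members[i].required = true;
            nextattno++;        /* next member must be on the next attribute */
        }
    }

So, with a qual like "(a, c) > (1, 42)" on an index over (a, b, c), only
the "a" member is marked required, and _bt_first falls back on a ">="
positioning strategy for the part of the row compare that it can use.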
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
---
src/backend/access/nbtree/nbtpreprocesskeys.c | 19 +-
src/backend/access/nbtree/nbtsearch.c | 245 ++++++++++--------
src/backend/access/nbtree/nbtutils.c | 205 ++++++++-------
src/test/regress/expected/btree_index.out | 98 ++++++-
src/test/regress/sql/btree_index.sql | 58 ++++-
5 files changed, 406 insertions(+), 219 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index ca2a9050f..76cd1ee78 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -792,12 +792,25 @@ _bt_mark_scankey_required(ScanKey skey)
if (skey->sk_flags & SK_ROW_HEADER)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ AttrNumber attno = skey->sk_attno;
/* First subkey should be same column/operator as the header */
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(subkey->sk_attno == skey->sk_attno);
+ Assert(subkey->sk_attno == attno);
Assert(subkey->sk_strategy == skey->sk_strategy);
- subkey->sk_flags |= addflags;
+
+ for (;;)
+ {
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != attno)
+ break; /* non-adjacent key, so not required */
+ if (subkey->sk_strategy != skey->sk_strategy)
+ break; /* wrong direction, so not required */
+ subkey->sk_flags |= addflags;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ attno++;
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 9846ef6db..905f2a3b7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1016,8 +1016,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* traversing a lot of null entries at the start of the scan.
*
* In this loop, row-comparison keys are treated the same as keys on their
- * first (leftmost) columns. We'll add on lower-order columns of the row
- * comparison below, if possible.
+ * first (leftmost) columns. We'll add all lower-order columns of the row
+ * comparison that were marked required during preprocessing below.
*
* _bt_advance_array_keys needs to know exactly how we'll reposition the
* scan (should it opt to schedule another primitive index scan). It is
@@ -1261,16 +1261,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(keysz <= INDEX_MAX_KEYS);
for (int i = 0; i < keysz; i++)
{
- ScanKey cur = startKeys[i];
+ ScanKey bkey = startKeys[i];
- Assert(cur->sk_attno == i + 1);
+ Assert(bkey->sk_attno == i + 1);
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (bkey->sk_flags & SK_ROW_HEADER)
{
/*
* Row comparison header: look to the first row member instead
*/
- ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+ ScanKey subkey = (ScanKey) DatumGetPointer(bkey->sk_argument);
+ bool loosen_strat = false,
+ tighten_strat = false;
/*
* Cannot be a NULL in the first row member: _bt_preprocess_keys
@@ -1278,9 +1280,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* ever getting this far
*/
Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(subkey->sk_attno == cur->sk_attno);
+ Assert(subkey->sk_attno == bkey->sk_attno);
Assert(!(subkey->sk_flags & SK_ISNULL));
+ /*
+ * This is either a > or >= key (during backwards scans it is
+ * either < or <=) that was marked required during preprocessing.
+ * Later so->keyData[] keys can't have been marked required, so
+ * our row compare header key must be the final startKeys[] entry.
+ */
+ Assert(subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD));
+ Assert(i == keysz - 1);
+
/*
* The member scankeys are already in insertion format (ie, they
* have sk_func = 3-way-comparison function)
@@ -1288,112 +1299,141 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
/*
- * If the row comparison is the last positioning key we accepted,
- * try to add additional keys from the lower-order row members.
- * (If we accepted independent conditions on additional index
- * columns, we use those instead --- doesn't seem worth trying to
- * determine which is more restrictive.) Note that this is OK
- * even if the row comparison is of ">" or "<" type, because the
- * condition applied to all but the last row member is effectively
- * ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
- * did use has to be treated as just a ">=" or "<=" condition, and
- * so we'd better adjust strat_total accordingly.
+ * Now look to later row compare members.
+ *
+ * If there's an "index attribute gap" between two row compare
+ * members, the second member won't have been marked required, and
+ * so can't be used as starting boundary keys here. The part of
+ * the row comparison that we do still use has to be treated as a
+ * ">=" or "<=" condition. For example, a qual "(a, c) > (1, 42)"
+ * with an omitted intervening index attribute "b" will use an
+ * insertion scan key "a >= 1". Even the first "a = 1" tuple on
+ * the leaf level might satisfy the row compare qual.
+ *
+ * We're able to use a _more_ restrictive strategy when we reach a
+ * NULL row compare member, since they're always unsatisfiable.
+ * For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
+ * insertion scan key "a > 1". All tuples where "a = 1" cannot
+ * possibly satisfy the row compare qual, so this is safe.
*/
- if (i == keysz - 1)
+ Assert(!(subkey->sk_flags & SK_ROW_END));
+ for (;;)
{
- bool used_all_subkeys = false;
+ subkey++;
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
- Assert(!(subkey->sk_flags & SK_ROW_END));
- for (;;)
+ if (subkey->sk_flags & SK_ISNULL)
{
- subkey++;
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno != keysz + 1)
- break; /* out-of-sequence, can't use it */
- if (subkey->sk_strategy != cur->sk_strategy)
- break; /* wrong direction, can't use it */
- if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
- Assert(keysz < INDEX_MAX_KEYS);
- memcpy(inskey.scankeys + keysz, subkey,
- sizeof(ScanKeyData));
- keysz++;
- if (subkey->sk_flags & SK_ROW_END)
- {
- used_all_subkeys = true;
- break;
- }
+ /*
+ * NULL member key, can only use earlier keys.
+ *
+ * We deliberately avoid checking if this key is marked
+ * required. All earlier keys are required, and this key
+ * is unsatisfiable either way, so we can't miss anything.
+ */
+ tighten_strat = true;
+ break;
}
- if (!used_all_subkeys)
+
+ if (!(subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
{
- switch (strat_total)
- {
- case BTLessStrategyNumber:
- strat_total = BTLessEqualStrategyNumber;
- break;
- case BTGreaterStrategyNumber:
- strat_total = BTGreaterEqualStrategyNumber;
- break;
- }
+ /* nonrequired member key, can only use earlier keys */
+ loosen_strat = true;
+ break;
}
- break; /* done with outer loop */
+
+ Assert(subkey->sk_attno == keysz + 1);
+ Assert(subkey->sk_strategy == bkey->sk_strategy);
+ Assert(keysz < INDEX_MAX_KEYS);
+
+ memcpy(inskey.scankeys + keysz, subkey,
+ sizeof(ScanKeyData));
+ keysz++;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
}
+ Assert(!(loosen_strat && tighten_strat));
+ if (loosen_strat)
+ {
+ /* Use less restrictive strategy (and fewer member keys) */
+ switch (strat_total)
+ {
+ case BTLessStrategyNumber:
+ strat_total = BTLessEqualStrategyNumber;
+ break;
+ case BTGreaterStrategyNumber:
+ strat_total = BTGreaterEqualStrategyNumber;
+ break;
+ }
+ }
+ if (tighten_strat)
+ {
+ /* Use more restrictive strategy (and fewer member keys) */
+ switch (strat_total)
+ {
+ case BTLessEqualStrategyNumber:
+ strat_total = BTLessStrategyNumber;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ strat_total = BTGreaterStrategyNumber;
+ break;
+ }
+ }
+
+ /* done adding to inskey (row comparison keys always come last) */
+ break;
+ }
+
+ /*
+ * Ordinary comparison key/search-style key.
+ *
+ * Transform the search-style scan key to an insertion scan key by
+ * replacing the sk_func with the appropriate btree 3-way-comparison
+ * function.
+ *
+ * If scankey operator is not a cross-type comparison, we can use the
+ * cached comparison function; otherwise gotta look it up in the
+ * catalogs. (That can't lead to infinite recursion, since no
+ * indexscan initiated by syscache lookup will use cross-data-type
+ * operators.)
+ *
+ * We support the convention that sk_subtype == InvalidOid means the
+ * opclass input type; this hack simplifies life for ScanKeyInit().
+ */
+ if (bkey->sk_subtype == rel->rd_opcintype[i] ||
+ bkey->sk_subtype == InvalidOid)
+ {
+ FmgrInfo *procinfo;
+
+ procinfo = index_getprocinfo(rel, bkey->sk_attno, BTORDER_PROC);
+ ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
+ bkey->sk_flags,
+ bkey->sk_attno,
+ InvalidStrategy,
+ bkey->sk_subtype,
+ bkey->sk_collation,
+ procinfo,
+ bkey->sk_argument);
}
else
{
- /*
- * Ordinary comparison key. Transform the search-style scan key
- * to an insertion scan key by replacing the sk_func with the
- * appropriate btree comparison function.
- *
- * If scankey operator is not a cross-type comparison, we can use
- * the cached comparison function; otherwise gotta look it up in
- * the catalogs. (That can't lead to infinite recursion, since no
- * indexscan initiated by syscache lookup will use cross-data-type
- * operators.)
- *
- * We support the convention that sk_subtype == InvalidOid means
- * the opclass input type; this is a hack to simplify life for
- * ScanKeyInit().
- */
- if (cur->sk_subtype == rel->rd_opcintype[i] ||
- cur->sk_subtype == InvalidOid)
- {
- FmgrInfo *procinfo;
+ RegProcedure cmp_proc;
- procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
- ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- procinfo,
- cur->sk_argument);
- }
- else
- {
- RegProcedure cmp_proc;
-
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
- rel->rd_opcintype[i],
- cur->sk_subtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
- BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
- cur->sk_attno, RelationGetRelationName(rel));
- ScanKeyEntryInitialize(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- cmp_proc,
- cur->sk_argument);
- }
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
+ rel->rd_opcintype[i],
+ bkey->sk_subtype, BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
+ BTORDER_PROC, rel->rd_opcintype[i], bkey->sk_subtype,
+ bkey->sk_attno, RelationGetRelationName(rel));
+ ScanKeyEntryInitialize(inskey.scankeys + i,
+ bkey->sk_flags,
+ bkey->sk_attno,
+ InvalidStrategy,
+ bkey->sk_subtype,
+ bkey->sk_collation,
+ cmp_proc,
+ bkey->sk_argument);
}
}
@@ -1482,6 +1522,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (!BufferIsValid(so->currPos.buf))
{
+ Assert(!so->needPrimScan);
+
/*
* We only get here if the index is completely empty. Lock relation
* because nothing finer to lock exists. Without a buffer lock, it's
@@ -1500,7 +1542,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (!BufferIsValid(so->currPos.buf))
{
- Assert(!so->needPrimScan);
_bt_parallel_done(scan);
return false;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index eb6dbfda3..57d76a8d4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2443,31 +2443,9 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
if (key->sk_flags & SK_ROW_HEADER)
{
/*
- * RowCompare inequality.
- *
- * Only the first subkey from a RowCompare can ever be marked
- * required (that happens when the row header is marked required).
- * There is no simple, general way for us to transitively deduce
- * whether or not every tuple on the page satisfies a RowCompare
- * key based only on firsttup and lasttup -- so we just give up.
+ * RowCompare inequality. Currently, we just punt on these.
*/
- if (!start_past_saop_eq && !so->skipScan)
- break; /* unsafe to go further */
-
- /*
- * We have to be even more careful with RowCompares that come
- * after an array: we assume it's unsafe to even bypass the array.
- * Calling _bt_start_array_keys to recover the scan's arrays
- * following use of forcenonrequired mode isn't compatible with
- * _bt_check_rowcompare's continuescan=false behavior with NULL
- * row compare members. _bt_advance_array_keys must not make a
- * decision on the basis of a key not being satisfied in the
- * opposite-to-scan direction until the scan reaches a leaf page
- * where the same key begins to be satisfied in scan direction.
- * The _bt_first !used_all_subkeys behavior makes this limitation
- * hard to work around some other way.
- */
- return; /* completely unsafe to set pstate.startikey */
+ break; /* "unsafe" */
}
if (key->sk_strategy != BTEqualStrategyNumber)
{
@@ -2964,76 +2942,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (forcenonrequired)
- {
- /* treating scan's keys as non-required */
- }
- else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * forward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- * (_bt_advance_array_keys also relies on this behavior during
- * backward scans.)
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
+ /* When a NULL row member is compared, the row never matches */
if (subkey->sk_flags & SK_ISNULL)
{
/*
@@ -3058,6 +2967,114 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(BTreeTupleIsPivot(tuple));
+ return true;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ int reqflags;
+
+ if (forcenonrequired)
+ {
+ /* treating scan's keys as non-required */
+ }
+ else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. However, on a forwards
+ * scan, we must keep going, because we may have initially
+ * positioned to the start of the index.
+ *
+ * All required NULLS FIRST > row members can use NULL tuple
+ * values to end backwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" can terminate a
+ * backwards scan upon reaching the index's rightmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS FIRST, we can treat its NULLs as "<" 42.
+ */
+ reqflags = SK_BT_REQBKWD;
+
+ /*
+ * When a most significant required NULLS FIRST < row compare
+ * member sees NULL tuple values during a backwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" will terminate a
+ * backwards scan upon reaching the rightmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is "<" 9, and yet
+ * our < row compare will still end the scan. (This isn't
+ * safe with later/lower-order row members. Notice that it
+ * can only happen with an "a" NULL some time after the scan
+ * completely stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQFWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. However, on a backward
+ * scan, we must keep going, because we may have initially
+ * positioned to the end of the index.
+ *
+ * All required NULLS LAST < row members can use NULL tuple
+ * values to end forwards scans, just like with other values.
+ * A qual "WHERE (a, b, c) < (9, 42, 'foo')" can terminate a
+ * forwards scan upon reaching the index's leftmost "a = 9"
+ * tuple whose "b" column contains a NULL (if not sooner).
+ * Since "b" is NULLS LAST, we can treat its NULLs as ">" 42.
+ */
+ reqflags = SK_BT_REQFWD;
+
+ /*
+ * When a most significant required NULLS LAST > row compare
+ * member sees NULL tuple values during a forwards scan, it
+ * signals the end of matches for the whole row compare/scan.
+ * A qual "WHERE (a, b, c) > (9, 42, 'foo')" will terminate a
+ * forwards scan upon reaching the leftmost tuple whose "a"
+ * column has a NULL. The "a" NULL value is ">" 9, and yet
+ * our > row compare will end the scan. (This isn't safe with
+ * later/lower-order row members. Notice that it can only
+ * happen with an "a" NULL some time after the scan completely
+ * stops needing to use its "b" and "c" members.)
+ */
+ if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
+ reqflags |= SK_BT_REQBKWD; /* safe, first row member */
+
+ if ((subkey->sk_flags & reqflags) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index bfb1a286e..5aa538699 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -195,46 +195,116 @@ ORDER BY proname DESC, proargtypes DESC, pronamespace DESC LIMIT 1;
(1 row)
--
--- Add coverage for RowCompare quals whose rhs row has a NULL that ends scan
+-- Add coverage for forwards scan RowCompare quals whose row arg has a NULL
+-- that affects the initial positioning strategy
--
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE (proname, proargtypes) >= ('abs', NULL) AND proname <= 'abs'
ORDER BY proname, proargtypes, pronamespace;
- QUERY PLAN
--------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------
Index Only Scan using pg_proc_proname_args_nsp_index on pg_proc
- Index Cond: ((ROW(proname, proargtypes) < ROW('abs'::name, NULL::oidvector)) AND (proname = 'abs'::name))
+ Index Cond: ((ROW(proname, proargtypes) >= ROW('abs'::name, NULL::oidvector)) AND (proname <= 'abs'::name))
(2 rows)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE (proname, proargtypes) >= ('abs', NULL) AND proname <= 'abs'
ORDER BY proname, proargtypes, pronamespace;
proname | proargtypes | pronamespace
---------+-------------+--------------
(0 rows)
--
--- Add coverage for backwards scan RowCompare quals whose rhs row has a NULL
+-- Add coverage for forwards scan RowCompare quals whose row arg has a NULL
-- that ends scan
--
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
-ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
- QUERY PLAN
--------------------------------------------------------------------------------------------------------------
- Index Only Scan Backward using pg_proc_proname_args_nsp_index on pg_proc
- Index Cond: ((ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)) AND (proname = 'abs'::name))
+ WHERE proname >= 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ORDER BY proname, proargtypes, pronamespace;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------------------------
+ Index Only Scan using pg_proc_proname_args_nsp_index on pg_proc
+ Index Cond: ((proname >= 'abs'::name) AND (ROW(proname, proargtypes) < ROW('abs'::name, NULL::oidvector)))
(2 rows)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname >= 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ORDER BY proname, proargtypes, pronamespace;
+ proname | proargtypes | pronamespace
+---------+-------------+--------------
+(0 rows)
+
+--
+-- Add coverage for backwards scan RowCompare quals whose row arg has a NULL
+-- that affects the initial positioning strategy
+--
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname >= 'abs' AND (proname, proargtypes) <= ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------
+ Index Only Scan Backward using pg_proc_proname_args_nsp_index on pg_proc
+ Index Cond: ((proname >= 'abs'::name) AND (ROW(proname, proargtypes) <= ROW('abs'::name, NULL::oidvector)))
+(2 rows)
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname >= 'abs' AND (proname, proargtypes) <= ('abs', NULL)
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ proname | proargtypes | pronamespace
+---------+-------------+--------------
+(0 rows)
+
+--
+-- Add coverage for backwards scan RowCompare quals whose row arg has a NULL
+-- that ends scan
+--
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE (proname, proargtypes) > ('abs', NULL) AND proname <= 'abs'
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------------------------
+ Index Only Scan Backward using pg_proc_proname_args_nsp_index on pg_proc
+ Index Cond: ((ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)) AND (proname <= 'abs'::name))
+(2 rows)
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE (proname, proargtypes) > ('abs', NULL) AND proname <= 'abs'
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+ proname | proargtypes | pronamespace
+---------+-------------+--------------
+(0 rows)
+
+-- Add coverage for B-Tree preprocessing path that deals with making redundant
+-- keys nonrequired (relies on current row compare preprocessing limitations)
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ AND pronamespace IN (1, 2, 3) AND proargtypes IN ('26 23', '5077')
+ORDER BY proname, proargtypes, pronamespace;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Index Only Scan using pg_proc_proname_args_nsp_index on pg_proc
+ Index Cond: ((ROW(proname, proargtypes) > ROW('abs'::name, NULL::oidvector)) AND (proname = 'zzzzzz'::name) AND (proargtypes = ANY ('{"26 23",5077}'::oidvector[])) AND (pronamespace = ANY ('{1,2,3}'::oid[])))
+(2 rows)
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ AND pronamespace IN (1, 2, 3) AND proargtypes IN ('26 23', '5077')
+ORDER BY proname, proargtypes, pronamespace;
proname | proargtypes | pronamespace
---------+-------------+--------------
(0 rows)
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 68c61dbc7..0d84e205f 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -143,34 +143,80 @@ SELECT proname, proargtypes, pronamespace
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC LIMIT 1;
--
--- Add coverage for RowCompare quals whose rhs row has a NULL that ends scan
+-- Add coverage for forwards scan RowCompare quals whose row arg has a NULL
+-- that affects the initial positioning strategy
--
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE (proname, proargtypes) >= ('abs', NULL) AND proname <= 'abs'
ORDER BY proname, proargtypes, pronamespace;
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ WHERE (proname, proargtypes) >= ('abs', NULL) AND proname <= 'abs'
ORDER BY proname, proargtypes, pronamespace;
--
--- Add coverage for backwards scan RowCompare quals whose rhs row has a NULL
+-- Add coverage for forwards scan RowCompare quals whose row arg has a NULL
-- that ends scan
--
explain (costs off)
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname >= 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ORDER BY proname, proargtypes, pronamespace;
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname >= 'abs' AND (proname, proargtypes) < ('abs', NULL)
+ORDER BY proname, proargtypes, pronamespace;
+
+--
+-- Add coverage for backwards scan RowCompare quals whose row arg has a NULL
+-- that affects the initial positioning strategy
+--
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname >= 'abs' AND (proname, proargtypes) <= ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
SELECT proname, proargtypes, pronamespace
FROM pg_proc
- WHERE proname = 'abs' AND (proname, proargtypes) > ('abs', NULL)
+ WHERE proname >= 'abs' AND (proname, proargtypes) <= ('abs', NULL)
ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+--
+-- Add coverage for backwards scan RowCompare quals whose row arg has a NULL
+-- that ends scan
+--
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE (proname, proargtypes) > ('abs', NULL) AND proname <= 'abs'
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE (proname, proargtypes) > ('abs', NULL) AND proname <= 'abs'
+ORDER BY proname DESC, proargtypes DESC, pronamespace DESC;
+
+-- Add coverage for B-Tree preprocessing path that deals with making redundant
+-- keys nonrequired (relies on current row compare preprocessing limitations)
+explain (costs off)
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ AND pronamespace IN (1, 2, 3) AND proargtypes IN ('26 23', '5077')
+ORDER BY proname, proargtypes, pronamespace;
+
+SELECT proname, proargtypes, pronamespace
+ FROM pg_proc
+ WHERE proname = 'zzzzzz' AND (proname, proargtypes) > ('abs', NULL)
+ AND pronamespace IN (1, 2, 3) AND proargtypes IN ('26 23', '5077')
+ORDER BY proname, proargtypes, pronamespace;
+
--
-- Add coverage for recheck of > key following array advancement on previous
-- (left sibling) page that used a high key whose attribute value corresponding
--
2.50.0
On 01/07/2025 19:37, Peter Geoghegan wrote:
On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.
Heikki said he'd be able to give this patch set at least a quick
review, so here's a new revision, v5.
This isn't really different to v4. It has more comment cleanup, and
better commit messages.
I like how this makes row comparisons less special. To be honest, I don't
quite understand what the bug was; I'd have to dig much deeper into this
to swap enough context into my memory for that. But just reading the
comments in these patches, I get the impression that they make this all
less fragile.
On bt_unmark_extra_keys() function: The function comment explains well
what it does; that's good. A few very minor stylistic remarks on its
implementation:
+    /* Set things up for first key's attr */
+    origkey = so->keyData;
+    curattr = origkey->sk_attno;
+    firsti = 0;
+    haveReqEquals = false;
+    haveReqForward = false;
+    haveReqBackward = false;
+    for (int i = 0; i < so->numberOfKeys; origkey++, i++)
+    {
It took me a moment to understand that origkey is always pointing to
&so->keyData[i]. I'd suggest making it more clear with:
/* Set things up for first key's attr */
curattr = so->keyData[0].sk_attno;
firsti = 0;
haveReqEquals = false;
haveReqForward = false;
haveReqBackward = false;
for (int i = 0; i < so->numberOfKeys; i++)
{
ScanKey origkey = &so->keyData[i];
In the second loop in the function you're actually doing this:
+ origkey = so->keyData + i;
which is yet another way to say the same thing. IMHO "&so->keyData[i]"
is more readable.
Would it be worth adding a comment that we assume keyData to be in
sk_attno order? Or is that a widely-known assumption? I don't remember.
+    /*
+     * Next, allocate temp arrays: one set for unchanged keys, another for
+     * keys that will be unmarked/made non-required
+     */
+    unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
+    keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
Doesn't matter much but I think this could be:
unmarkKeys = palloc(nunmark * sizeof(ScanKeyData));
keepKeys = palloc((so->numberOfKeys - nunmark) * sizeof(ScanKeyData));
Or use a single array, putting the unmarked entries to the end of the array.
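Something along these lines is what I have in mind for the single-array
version (just a rough sketch under my own assumptions -- "should_unmark"
stands in for whatever condition the real loops use to decide that a key
gets unmarked):

    ScanKeyData *tmpKeys;
    int         nkeep = 0;
    int         nunmark = 0;

    /* one temp array: kept keys fill from the front, unmarked from the back */
    tmpKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
    for (int i = 0; i < so->numberOfKeys; i++)
    {
        ScanKey     origkey = &so->keyData[i];

        if (should_unmark(origkey))
            tmpKeys[so->numberOfKeys - (++nunmark)] = *origkey;
        else
            tmpKeys[nkeep++] = *origkey;
    }
    /* kept keys are tmpKeys[0..nkeep - 1], unmarked keys sit at the end */

(Filling from the back does reverse the relative order of the unmarked
keys, but since they all end up non-required I don't think their order
matters.)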
PS. Not a new thing, but the "Add coverage..." comments in the tests
seem redundant. All tests add coverage for something.
- Heikki
On Tue, Jul 1, 2025 at 2:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I like how this makes row comparisons less special. To be honest, I don't
quite understand what the bug was; I'd have to dig much deeper into this
to swap enough context into my memory for that.
If you're interested, I can provide you with all you'll need to
reproduce the infinite cycling behavior, where _bt_first is called
again and again, without the scan ever making useful progress. The
test data is fairly small, but it's a bit too big to just post to the
list.
My testing *simulates* cases where preprocessing cannot eliminate
redundant keys, without actually creating an incomplete opfamily, and
without really using cross-type operators in the queries. This
shouldn't matter -- it's just much easier to work with.
But just reading the
comments in these patches, I get the impression that they make this all
less fragile.
Technically the second patch (the row compare patch) isn't strictly
necessary to make my tests stop showing infinite cycling behavior. But
it's very easy to show infinite cycling behavior if I remove the
"refuse to forcenonrequired on row compare" kludge that I added to
_bt_set_startikey in bugfix commit 5f4d98d4 -- that's why I added it
in the first place. As you said, the goal is to make row compare quals
less special. And, in particular, to make it safe to remove that hack.
IMO the second patch is essential, since the hack I have in place
right now seems very fragile. The first patch (which adds the new
preprocessing step) is unambiguously needed, to fix a live bug
(without it, the infinite cycling behavior *is* still possible with
redundant keys that preprocessing can't eliminate).
On bt_unmark_extra_keys() function: The function comment explains well
what it does; that's good. A few very minor stylistic remarks on its
implementation:
I agree with all of your feedback; will make all those changes.
Would it be worth adding a comment that we assume keyData to be in
sk_attno order? Or is that a widely-known assumption? I don't remember.
I'll add a comment like that, just to be on the safe side.
There is an old comment at the top of _bt_preprocess_keys that says
that scan->keyData[] is in attribute order. And I add a new one in the
patch, which points out that that *isn't* always true anymore when the
so->keyData[] output array has nonrequired keys. There's no comment
that says what the new preprocessing function/pass expects about the
order of so->keyData[] (which is both its input array and its
output array).
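To give a sense of what I mean (just a sketch of the general idea, not
the wording I'll actually commit -- and it assumes the new pass only ever
runs while so->keyData[] is still in attribute order, which is exactly
what such a comment would have to spell out):

    /*
     * Sketch only: document the input contract, and have assert-enabled
     * builds verify that keys sharing an index attribute really are
     * grouped together, in nondecreasing sk_attno order.
     */
    for (int i = 1; i < so->numberOfKeys; i++)
        Assert(so->keyData[i].sk_attno >= so->keyData[i - 1].sk_attno);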
PS. Not a new thing, but the "Add coverage..." comments in the tests
seem redundant. All tests add coverage for something.
I'll adjust them along those lines.
Unless there are any objections, and assuming you're done giving
feedback, I'll commit both patches tomorrow.
Thanks!
--
Peter Geoghegan