Bug in nbtree SAOP scans with non-required arrays, truncated high key
Attached test case shows a wrong answer bug affecting Postgres 17 and
the master branch/18 devel. My nbtree SAOP commit (commit 5bf748b8) is
to blame. The conditions under which the bug can produce incorrect
behavior are rather narrow (more on those details below).
Attached fix addresses the issue by consistently resetting the scan's
so->scanBehind flag (which might still be set to true from the
previous page's high key) at the start of _bt_advance_array_keys. In
particular, this will now happen during sktrig_required=false calls to
_bt_advance_array_keys (and not just during sktrig_required=true
calls). The problem really does boil down to a failure to reset
so->scanBehind consistently -- it never made any sense to not always
do that at the top of _bt_advance_array_keys like this.
Here's the full, complicated explanation for why it is that not
resetting so->scanBehind consistently leads to wrong answers when this
test case is run against an unpatched server:
The index scan must visit two leaf pages (with or without the fix in place).
The first leaf page has all index tuples returned as expected. It also
has a truncated leaf page high key, with truncated attributes that
correspond to a required scan key (namely the scan key for
"inequal_one_ten_range <= 10"). We set so->scanBehind at this point,
which is needed to make it safe to continue onto the second leaf page
(the alternative is to start another primitive index scan, but that
would be inefficient).
Next we arrive on the second leaf page. We have so->scanBehind set
from before (from the high key for the first leaf page). The
non-required scan lower-order key (namely the
"nonrequired_equal_one_ten_range in (1, 2)" scan key) isn't satisfied
right away, and so _bt_check_compare must perform a
sktrig_required=false call to _bt_advance_array_keys. Unpatched
servers won't reset so->scanBehind (i.e. set it 'false') at the top of
_bt_advance_array_keys at this point, which leads to trouble later on
(later on during the same call to _bt_advance_array_keys).
Since _bt_advance_array_keys successfully finds a matching array
element for the non-required scankey, it'll need a recheck call to
_bt_check_compare. But, since so->scanBehind wasn't reset at the start
of our call, it'll be allowed to suppress the code path where
_bt_check_compare makes _bt_advance_array_keys return 'true'. It'll
return 'false' instead, a bit later on -- which is wrong (ultimately
_bt_checkkeys returns false to _bt_readpage as a result, which is
wrong).
The "suppress returning true" behavior was only ever intended to be
used when advancing the scan's arrays using a page high key -- but
this isn't the page high key (it's the second page's first non-pivot
tuple, not the first page's high key/finaltup). It doesn't matter if
we return false instead of true for the page high key in this way,
since it isn't really supposed to be returned to the scan anyway. As I
said, we need to forget what we saw when we looked at the last page's
high key (the second page is a whole other page).
--
Peter Geoghegan
Attachments:
v1-0001-Fix-failure-to-reset-nbtree-scanBehind-flag.patchapplication/octet-stream; name=v1-0001-Fix-failure-to-reset-nbtree-scanBehind-flag.patchDownload
From fe153095046571d90c35e3debd4c6547c64e34dd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 18 Dec 2024 14:19:45 -0500
Subject: [PATCH v1] Fix failure to reset nbtree scanBehind flag.
Consistently reset so->scanBehind during nbtree array advancement, even
when array advancement was triggered by a non-required array's scan key.
Otherwise it's possible for queries to return wrong answers: any later
recheck call to _bt_check_compare cannot return true unless the scan's
so->scanBehind flag is already set to 'false'.
The so->scanBehind flag is only supposed to suppress returning true from
_bt_advance_array_keys like this when the arrays are advanced using the
page high key -- it was never supposed to happen because so->scanBehind
is still set from back when we read the prior page's high key.
---
src/backend/access/nbtree/nbtutils.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb..7ce1df771 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1813,6 +1813,8 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
all_required_satisfied = true,
all_satisfied = true;
+ so->scanBehind = so->oppositeDirCheck = false; /* reset */
+
if (sktrig_required)
{
/*
@@ -1821,8 +1823,6 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
tupnatts, false, 0, NULL));
- so->scanBehind = so->oppositeDirCheck = false; /* reset */
-
/*
* Required scan key wasn't satisfied, so required arrays will have to
* advance. Invalidate page-level state that tracks whether the
--
2.45.2
On Wed, Dec 18, 2024 at 3:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached fix addresses the issue by consistently resetting the scan's
so->scanBehind flag (which might still be set to true from the
previous page's high key) at the start of _bt_advance_array_keys.
I pushed this fix just now.
I should point out (for the benefit of Tom, or whoever writes the next
set of release notes) that I think that this bug is very unlikely to
occur in practice. Getting wrong answers to queries could only happen
given an index with more than 3 columns, and only for scans with just
the right combination of scan key types. It could also only happen
when the scan advances its array keys using high keys with more than a
single truncated attribute.
Note also that assert-enabled builds would always have seen assertion
failures whenever something like my test case ran. Presumably we'd
have seen a report about such an assertion failure if the underlying
problem was at all common.
--
Peter Geoghegan
On Thu, 2024-12-19 at 11:09 -0500, Peter Geoghegan wrote:
I should point out (for the benefit of Tom, or whoever writes the next
set of release notes) that I think that this bug is very unlikely to
occur in practice. Getting wrong answers to queries could only happen
given an index with more than 3 columns, and only for scans with just
the right combination of scan key types. It could also only happen
when the scan advances its array keys using high keys with more than a
single truncated attribute.
Just for the record: a customer just ran into this bug. REINDEX did
not fix the bad query result, but update to 17.3 did.
Yours,
Laurenz Albe
--
*E-Mail Disclaimer*
Der Inhalt dieser E-Mail ist ausschliesslich fuer den
bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat
dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte,
dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder
Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich
in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen.
*CONFIDENTIALITY NOTICE & DISCLAIMER
*This message and any attachment are
confidential and may be privileged or otherwise protected from disclosure
and solely for the use of the person(s) or entity to whom it is intended.
If you have received this message in error and are not the intended
recipient, please notify the sender immediately and delete this message and
any attachment from your system. If you are not the intended recipient, be
advised that any use of this message is prohibited and may be unlawful, and
you must not copy this message or attachment or disclose the contents to
any other person.
On Tue, Feb 18, 2025 at 8:34 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Just for the record: a customer just ran into this bug. REINDEX did
not fix the bad query result, but update to 17.3 did.
That's surprising, since REINDEX definitely will "fix" my test case
when run on an unpatched server with the bug.
The issue that my repro highlighted hinged upon 2 low-order truncated
attributes. nbtsort.c (used by CREATE INDEX/REINDEX) doesn't care
about making suffix truncation as effective as reasonably possible,
unlike retail inserts + page splits, which use the nbtsplitloc.c logic
for this. This was also why index fillfactor was tuned by the test
case. Basically, the test case is very sensitive to various
parameters.
--
Peter Geoghegan