Wrong Results from SP-GiST with Collations

Started by Emre Hasegeliabout 8 years ago11 messagesbugs
Jump to latest
#1Emre Hasegeli
emre@hasegeli.com

Please see:

# create table t (a text collate "cs_CZ");
CREATE TABLE

# insert into t values ('c'), ('ch');
INSERT 0 2

# select * from t where a < 'd';
a
---
c
(1 row)

xx2_game=# create index on t using spgist (a);
CREATE INDEX

xx2_game=# set enable_seqscan = 0;
SET

xx2_game=# select * from t where a < 'd';
a
----
c
ch
(2 rows)

I assume we don't have any option than removing support for comparison
operators with collations just like btree.

The bug may apply to the other access methods.

In reply to: Emre Hasegeli (#1)
Re: Wrong Results from SP-GiST with Collations

On Tue, Apr 10, 2018 at 9:21 AM, Emre Hasegeli <emre@hasegeli.com> wrote:

I assume we don't have any option than removing support for comparison
operators with collations just like btree.

I can't imagine how this could ever work, since a non C-collation
string cannot be atomized into tokens like that. Considering each
token successively during a comparison does not always produce the
same answer as a straight comparison against the original string
would.

I believe that there are cases that won't work with *any* glibc or ICU
collation, not just "cs_CZ".

--
Peter Geoghegan

#3Emre Hasegeli
emre@hasegeli.com
In reply to: Peter Geoghegan (#2)
Re: Wrong Results from SP-GiST with Collations

I can't imagine how this could ever work, since a non C-collation
string cannot be atomized into tokens like that. Considering each
token successively during a comparison does not always produce the
same answer as a straight comparison against the original string
would.

I think the only fix is to remove support for those operators from the
operator class. The logic on match_special_index_operator() doesn't
seem to need any change. Patch attached. As far as I understand, we
cannot back-patch catalog changes. In this case, we can leave it
broken and apply the documentation part.

Attachments:

remove-wrong-spgist-text-operators-v00.patchapplication/octet-stream; name=remove-wrong-spgist-text-operators-v00.patchDownload+11-193
In reply to: Emre Hasegeli (#3)
Re: Wrong Results from SP-GiST with Collations

On Mon, Apr 16, 2018 at 8:16 AM, Emre Hasegeli <emre@hasegeli.com> wrote:

I think the only fix is to remove support for those operators from the
operator class. The logic on match_special_index_operator() doesn't
seem to need any change. Patch attached. As far as I understand, we
cannot back-patch catalog changes. In this case, we can leave it
broken and apply the documentation part.

Has the operator class really been completely broken since SP-GiST was
first introduced? I tend to doubt that. spg_text_inner_consistent()
has the following code, which appears to acknowledge the problem with
non-C collations:

for (j = 0; j < in->nkeys; j++)
{
StrategyNumber strategy = in->scankeys[j].sk_strategy;
text *inText;
int inSize;
int r;

/*
* If it's a collation-aware operator, but the collation is C, we
* can treat it as non-collation-aware. With non-C collation we
* need to traverse whole tree :-( so there's no point in making
* any check here. (Note also that our reconstructed value may
* well end with a partial multibyte character, so that applying
* any encoding-sensitive test to it would be risky anyhow.)
*/
if (SPG_IS_COLLATION_AWARE_STRATEGY(strategy))
{
if (collate_is_c)
strategy -= SPG_STRATEGY_ADDITION;
else
continue;
}

Maybe this is a simple, relatively benign bug. Maybe this problem is
due to commit 710d90da, which is only a couple of weeks old. Have you
reproduced the problem on v10 of Postgres?

--
Peter Geoghegan

#5Emre Hasegeli
emre@hasegeli.com
In reply to: Peter Geoghegan (#4)
Re: Wrong Results from SP-GiST with Collations

Maybe this is a simple, relatively benign bug. Maybe this problem is
due to commit 710d90da, which is only a couple of weeks old. Have you
reproduced the problem on v10 of Postgres?

Yes, the initial example I posted was from 9.5.12 on Debian.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#4)
Re: Wrong Results from SP-GiST with Collations

Peter Geoghegan <pg@bowt.ie> writes:

Has the operator class really been completely broken since SP-GiST was
first introduced? I tend to doubt that. spg_text_inner_consistent()
has the following code, which appears to acknowledge the problem with
non-C collations:

You're on to something, but I think the bug is in
spg_text_leaf_consistent, which thinks it can do collation-aware
comparisons like this:

r = varstr_cmp(fullValue, Min(queryLen, fullLen),
VARDATA_ANY(query), Min(queryLen, fullLen),
PG_GET_COLLATION());

That's got nothing to do with reality for non-C collations, and it seems
rather pointless anyway, Why isn't this just

r = varstr_cmp(fullValue, fullLen,
VARDATA_ANY(query), queryLen,
PG_GET_COLLATION());

and then the bit below about

if (r == 0)
{
if (queryLen > fullLen)
r = -1;
else if (queryLen < fullLen)
r = 1;
}

needs to move into the "non-collation-aware" branch.

regards, tom lane

In reply to: Tom Lane (#6)
Re: Wrong Results from SP-GiST with Collations

On Mon, Apr 16, 2018 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

You're on to something, but I think the bug is in
spg_text_leaf_consistent, which thinks it can do collation-aware
comparisons like this:

r = varstr_cmp(fullValue, Min(queryLen, fullLen),
VARDATA_ANY(query), Min(queryLen, fullLen),
PG_GET_COLLATION());

Ah. Those arguments make that code completely broken.

and then the bit below about

if (r == 0)
{
if (queryLen > fullLen)
r = -1;
else if (queryLen < fullLen)
r = 1;
}

needs to move into the "non-collation-aware" branch.

Right. Alternatively, you could actually call varstr_cmp() within the
"non-collation-aware" branch.

--
Peter Geoghegan

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#7)
Re: Wrong Results from SP-GiST with Collations

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Apr 16, 2018 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

and then the bit below about ...
needs to move into the "non-collation-aware" branch.

Right. Alternatively, you could actually call varstr_cmp() within the
"non-collation-aware" branch.

True. This way saves a few cycles, but maybe it's not worth the extra
code. I think the only case where you could really notice the difference
is for an equality search operator, which might end up doing a lot more
work in non-C collations (full-blown strcoll vs memcmp).

regards, tom lane

In reply to: Tom Lane (#8)
Re: Wrong Results from SP-GiST with Collations

On Mon, Apr 16, 2018 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Right. Alternatively, you could actually call varstr_cmp() within the
"non-collation-aware" branch.

True. This way saves a few cycles, but maybe it's not worth the extra
code. I think the only case where you could really notice the difference
is for an equality search operator, which might end up doing a lot more
work in non-C collations (full-blown strcoll vs memcmp).

I don't have a full understanding of this particular problem, but it
sounds to me that that would be a non-issue due to the equality
fast-path added to varstr_cmp() several years ago. I microbenchmarked
it quite extensively at the time, and concluded that it was all but
free in cases where it didn't work out.

--
Peter Geoghegan

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#9)
Re: Wrong Results from SP-GiST with Collations

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Apr 16, 2018 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

True. This way saves a few cycles, but maybe it's not worth the extra
code. I think the only case where you could really notice the difference
is for an equality search operator, which might end up doing a lot more
work in non-C collations (full-blown strcoll vs memcmp).

I don't have a full understanding of this particular problem, but it
sounds to me that that would be a non-issue due to the equality
fast-path added to varstr_cmp() several years ago. I microbenchmarked
it quite extensively at the time, and concluded that it was all but
free in cases where it didn't work out.

I'm not following --- varstr_cmp has no way to know that we only
care about equality vs inequality. Yes, it might give back an
answer quickly when the strings are equal, but when they aren't,
it has to decide which is greater. In this case we don't care,
so long as the search operator is "=".

regards, tom lane

In reply to: Tom Lane (#10)
Re: Wrong Results from SP-GiST with Collations

On Mon, Apr 16, 2018 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not following --- varstr_cmp has no way to know that we only
care about equality vs inequality. Yes, it might give back an
answer quickly when the strings are equal, but when they aren't,
it has to decide which is greater. In this case we don't care,
so long as the search operator is "=".

That was the subtlety I was missing. I didn't understand what you
meant, but thought that the fast-path might be relevant. Clearly not.

--
Peter Geoghegan