Wrong results with equality search using trigram index and non-deterministic collation

Started by Laurenz Albealmost 2 years ago7 messageshackers

Jump to latest

Laurenz Albe

laurenz.albe@cybertec.at

almost 2 years ago

Using a trigram index with an non-deterministic collation can
lead to wrong query results:

CREATE COLLATION faux_cn (PROVIDER = icu, LOCALE = 'und', DETERMINISTIC = FALSE, RULES = '&l = r');

CREATE TABLE boom (id integer PRIMARY KEY, t text COLLATE faux_cn);

INSERT INTO boom VALUES (1, 'right'), (2, 'light');

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
2 │ light
(2 rows)

CREATE INDEX ON boom USING gin (t gin_trgm_ops);

SET enable_seqscan = off;

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
(1 row)

I also see questionable results with the similarity operator (with and
without the index):

SELECT * FROM boom WHERE t % 'rigor';

id │ t
════╪═══════
1 │ right
(1 row)

But here you could argue that the operator ignores the collation, so
the result is correct. With equality, there is no such loophole.

I don't know what the correct fix would be. Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

Yours,
Laurenz Albe

Laurenz Albe

laurenz.albe@cybertec.at

almost 2 years ago

In reply to: Laurenz Albe (#1)

Re: Wrong results with equality search using trigram index and non-deterministic collation

On Tue, 2024-09-17 at 08:00 +0200, Laurenz Albe wrote:

Using a trigram index with an non-deterministic collation can
lead to wrong query results:
[...]

I don't know what the correct fix would be. Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

Looking into fixing that, how can you tell the optimizer to consider
a certain index only for certain collations?

Yours,
Laurenz Albe

David Geier

geidav.pg@gmail.com

2 months ago

In reply to: Laurenz Albe (#1)

Re: Wrong results with equality search using trigram index and non-deterministic collation

On 17.09.2024 08:00, Laurenz Albe wrote:

Using a trigram index with an non-deterministic collation can
lead to wrong query results:

CREATE COLLATION faux_cn (PROVIDER = icu, LOCALE = 'und', DETERMINISTIC = FALSE, RULES = '&l = r');

CREATE TABLE boom (id integer PRIMARY KEY, t text COLLATE faux_cn);

INSERT INTO boom VALUES (1, 'right'), (2, 'light');

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
2 │ light
(2 rows)

CREATE INDEX ON boom USING gin (t gin_trgm_ops);

SET enable_seqscan = off;

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
(1 row)

I also see questionable results with the similarity operator (with and
without the index):

SELECT * FROM boom WHERE t % 'rigor';

id │ t
════╪═══════
1 │ right
(1 row)

But here you could argue that the operator ignores the collation, so
the result is correct. With equality, there is no such loophole.

I think we should change that. It's very counter intuitive that a query
can change behavior when the planner flips from using e.g. a Seq Scan to
a Bitmap Index Scan or the other way around. There's already a patch for
that, see [1]/messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com.

I don't know what the correct fix would be. Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

If we merge [1]/messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com, then not only = but also LIKE would be incorrect. How
about disabling CREATE INDEX USING gin on columns with non-deterministic
collations?

Or is there maybe a way to make these cases work correctly for
non-deterministic collations by applying the collation when extracting
the search trigrams? I take a look into that.

[1]: /messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com
/messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com

--
David Geier

Laurenz Albe

laurenz.albe@cybertec.at

2 months ago

In reply to: David Geier (#3)

Re: Wrong results with equality search using trigram index and non-deterministic collation

On Wed, 2026-04-22 at 08:45 +0200, David Geier wrote:

On 17.09.2024 08:00, Laurenz Albe wrote:

Using a trigram index with an non-deterministic collation can
lead to wrong query results:

CREATE COLLATION faux_cn (PROVIDER = icu, LOCALE = 'und', DETERMINISTIC = FALSE, RULES = '&l = r');

CREATE TABLE boom (id integer PRIMARY KEY, t text COLLATE faux_cn);

INSERT INTO boom VALUES (1, 'right'), (2, 'light');

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
2 │ light
(2 rows)

CREATE INDEX ON boom USING gin (t gin_trgm_ops);

SET enable_seqscan = off;

SELECT * FROM boom WHERE t = 'right';

id │ t
════╪═══════
1 │ right
(1 row)

I also see questionable results with the similarity operator (with and
without the index):

SELECT * FROM boom WHERE t % 'rigor';

id │ t
════╪═══════
1 │ right
(1 row)

But here you could argue that the operator ignores the collation, so
the result is correct. With equality, there is no such loophole.

I think we should change that. It's very counter intuitive that a query
can change behavior when the planner flips from using e.g. a Seq Scan to
a Bitmap Index Scan or the other way around. There's already a patch for
that, see [1].

[1]
/messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com

That's not only unintuitive, it is a clear bug.
An index is not allowed to change the semantics of a query.

Does your patch fix the bug, that is, will the query with "WHERE t = 'right'"
return both results? That's the case that is mostly in need of fixing.
I am not sure if the behavior for the % operator should also be considered
a bug.

I don't know what the correct fix would be. Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

If we merge [1], then not only = but also LIKE would be incorrect. How
about disabling CREATE INDEX USING gin on columns with non-deterministic
collations?

Oh, I see. So your patch won't fix the bug.

I am not sure if refusing to *create* the index is the best solution.
Perhaps a warning will be better:

WARNING: GIN indexes won't be used columns with non-deterministic collations

Even if you refuse to create the index, you'd still have to deal with the
indexes that already exist. An upgrade cannot decide to remove an index.
At the very least, there would have to be a check in pg_upgrade.

Or is there maybe a way to make these cases work correctly for
non-deterministic collations by applying the collation when extracting
the search trigrams? I take a look into that.

Thank you. Making it work would of course be the best option.

Yours,
Laurenz Albe

David Geier

geidav.pg@gmail.com

about 2 months ago

In reply to: Laurenz Albe (#4)

Re: Wrong results with equality search using trigram index and non-deterministic collation

I think we should change that. It's very counter intuitive that a query
can change behavior when the planner flips from using e.g. a Seq Scan to
a Bitmap Index Scan or the other way around. There's already a patch for
that, see [1].

[1]
/messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com

That's not only unintuitive, it is a clear bug.
An index is not allowed to change the semantics of a query.

Does your patch fix the bug, that is, will the query with "WHERE t = 'right'"
return both results? That's the case that is mostly in need of fixing.
I am not sure if the behavior for the % operator should also be considered
a bug.

Not yet :).

I don't know what the correct fix would be. Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

If we merge [1], then not only = but also LIKE would be incorrect. How
about disabling CREATE INDEX USING gin on columns with non-deterministic
collations?

Oh, I see. So your patch won't fix the bug.

Indeed.

I am not sure if refusing to *create* the index is the best solution.
Perhaps a warning will be better:

WARNING: GIN indexes won't be used columns with non-deterministic collations

At this point we could more generally say

WARNING: pg_trgm does not use inferred column collations but always
uses the default database collation.

But I'm hoping the patch in [1] gets merged and we can also use the
inferred collation to fix the bug you found.

Or is there maybe a way to make these cases work correctly for
non-deterministic collations by applying the collation when extracting
the search trigrams? I take a look into that.

Attached patch makes your case work, including the % case. It builds on
top of the other patches from [1] that makes pg_trgm use the inferred
collation trigram extraction.

Instead of using btint4cmp() to compare trigrams, the patch uses a
collation-aware string comparison function.

This is just a PoC. I haven't given much thought to the details but e.g.
when three consecutive characters exceed 3 bytes then compact_trigram()
uses a truncated 32-bit hash value as trigram instead. Such trigrams
won't work in all cases. We could omit them from the query string but
for languages where the majority of trigrams are hashed or where the
query string consists of only a few trigrams, the look-up performance
would suffer.

I guess better would be using a collation-aware hash function that maps
different values that compare equal to the same hash value. hashtext()
does that already. The new comparison function would then have to
distinguish between plain text trigrams and hash trigrams.
Alternatively, we could store all trigrams as hashes but that would
break functions such as show_trgm().

--
David Geier

Laurenz Albe

laurenz.albe@cybertec.at

about 2 months ago

In reply to: David Geier (#5)

Re: Wrong results with equality search using trigram index and non-deterministic collation

On Mon, 2026-05-04 at 13:53 +0200, David Geier wrote:

Attached patch makes your case work, including the % case. It builds on
top of the other patches from [1] that makes pg_trgm use the inferred
collation trigram extraction.

Instead of using btint4cmp() to compare trigrams, the patch uses a
collation-aware string comparison function.

Thanks! I tried your patch, and it does indeed fix the bug I reported.

I looked at your patch, and it is pretty straightforward.
("git am" complained about an empty line at the end of
"pg_trgm--1.6--1.7.sql", but that's merely cosmetic.)

This is just a PoC. I haven't given much thought to the details but e.g.
when three consecutive characters exceed 3 bytes then compact_trigram()
uses a truncated 32-bit hash value as trigram instead. Such trigrams
won't work in all cases. We could omit them from the query string but
for languages where the majority of trigrams are hashed or where the
query string consists of only a few trigrams, the look-up performance
would suffer.

Does that mean that you could end up with wrong results (which would not
be acceptable), or that you could end up with false positives that
later get eliminated by the recheck (which would be fine)?

I am worried about collations that have digraphs - the letters would be
split when trigrams are formed, and that might cause trouble.

And indeed, I am able to break it with a "quadrigraph":

CREATE COLLATION crazy (
PROVIDER = icu,
LOCALE = 'da-DK',
DETERMINISTIC = FALSE,
RULES = '& a = zzzz'
);

CREATE TABLE boom2 (id integer PRIMARY KEY, t text COLLATE crazy);

INSERT INTO boom2 VALUES (1, 'myad'), (2, 'myzzzzd');

SELECT * FROM boom2 WHERE t = 'myad';

id │ t
════╪═════════
1 │ myad
2 │ myzzzzd
(2 rows)

CREATE INDEX trgm_idx2 ON boom2 USING gin (t gin_trgm_ops);

SET enable_seqscan = off;

SELECT * FROM boom2 WHERE t = 'myad';

id │ t
════╪══════
1 │ myad
(1 row)

I guess better would be using a collation-aware hash function that maps
different values that compare equal to the same hash value. hashtext()
does that already. The new comparison function would then have to
distinguish between plain text trigrams and hash trigrams.
Alternatively, we could store all trigrams as hashes but that would
break functions such as show_trgm().

But that would probably not fix the above problem, right?

My initial thought about this bug was to just not consider a trigram
index if a non-deterministic collation is involved, but I can't see
how that could be done in the planner.

Still, I think that the first two patches of your set do the right thing.

Yours,
Laurenz Albe

Zsolt Parragi

zsolt.parragi@percona.com

about 2 months ago

In reply to: Laurenz Albe (#6)

Re: Wrong results with equality search using trigram index and non-deterministic collation

Hello

Does that mean that you could end up with wrong results (which would not
be acceptable), or that you could end up with false positives that
later get eliminated by the recheck (which would be fine)?

+ /*
+ * For non-C collations, extract the three bytes from each trigram
+ * and compare them using the collation's comparison function.
+ */

...

+ /* Use collation-aware comparison */
+ result = pg_strncoll(str_a, 3, str_b, 3, locale);
+ PG_RETURN_INT32(result);

For non-C collations, isn't the trigram likely a hash rather than a
proper string, where pg_strncoll won't work properly?

Wrong results with equality search using trigram index and non-deterministic collation

Attachments: