BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit

Started by Greg Navisalmost 10 years ago2 messagesbugs
Jump to latest
#1Greg Navis
contact@gregnavis.com

The following bug has been logged on the website:

Bug reference: 14202
Logged by: Greg Navis
Email address: contact@gregnavis.com
PostgreSQL version: 9.6beta1
Operating system: Fedora 23 (Linux 4.5.6-200.fc23.x86_64)
Description:

# Summary

Due to a bug gtrgm_consistent, using a GiST or GIN trigram index can return
extraneous rows whose trigram-similarity was below set_limit().

# Version tested

PostgreSQL Git commit: 4d48adc

$ gcc --version
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

$ uname -a
Linux localhost.localdomain 4.5.6-200.fc23.x86_64 #1 SMP Wed Jun 1 21:28:20
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# How to reproduce

Save the below to trgm_bug.sql:

-- Clean up after previous run.
DROP TABLE IF EXISTS restaurants;
DROP EXTENSION IF EXISTS pg_trgm CASCADE;

-- Create the table.
CREATE EXTENSION pg_trgm;
CREATE TABLE restaurants(city VARCHAR(255) NOT NULL);
CREATE INDEX restaurants_on_city_idx ON restaurants USING gist(city
gist_trgm_ops);

-- Insert 10000 rows to trigger index use.
INSERT INTO restaurants SELECT 'Warsaw' FROM generate_series(1, 10000);
INSERT INTO restaurants SELECT 'Szczecin' FROM generate_series(1, 10000);

-- Similarity of the two names (for reference).
SELECT similarity('Szczecin', 'Warsaw');

-- Set the limit to 0.3. Only Warsaw is returned (as it should).
SELECT set_limit(0.3);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';

-- Raise the limit to 0.5. Now _both_ Warsaw and Szczecin are returned.
SELECT set_limit(0.5);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';

and run "psql -f trgm_bug.sql".

# Results

The script returns the right result when the threshold is set to 0.3:

city | similarity | show_limit
--------+------------+------------
Warsaw | 1 | 0.3
(1 row)

However, it returns _both cities after rising the threshold to 0.5_:

city | similarity | show_limit
----------+------------+------------
Szczecin | 0 | 0.5
Warsaw | 1 | 0.5

# Root cause

The root cause is this line in contrib/pg_trgm/trgm_gist.c:

/* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);

nlimit is of type double. tmpsml is of type float4. On my system
sizeof(float4) == 4 and sizeof(double) == 8. After adding:

printf("tmpsml = %f tmpsml:int = %i nlimit = %f nlimit:int = %i res = %i\n",
tmpsml, *(int *)&tmpsml, nlimit, *(int *)&nlimit, res);

after the offending line I can see the following output (after passing
through "sort | uniq"):

tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.300000 nlimit:int = 858993459
res = 0
tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.500000 nlimit:int = 0 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.300000 nlimit:int =
858993459 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.500000 nlimit:int = 0
res = 1

On line 2, we can see that nlimit = 0.5 was converted to 0 which triggered
the error.

# Possible fix

If we're okay with abandoning the bug work-around for FreeBSD 5.2.1 / gcc
3.3.3 then I suggest replacing

res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);

with

res = tmpsml >= nlimit;

This resolves the issue.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Navis (#1)
Re: BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit

contact@gregnavis.com writes:

Due to a bug gtrgm_consistent, using a GiST or GIN trigram index can return
extraneous rows whose trigram-similarity was below set_limit().
...
The root cause is this line in contrib/pg_trgm/trgm_gist.c:

/* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);

nlimit is of type double. tmpsml is of type float4.

Ugh, yeah, that's completely broken now isn't it.

I suspect that changing tmpsml to double would be enough to get around the
alleged gcc bug and let us write the comparison naturally, ie the whole
thing was likely an artifact of using float4. Will try it that way.

Thanks for the report!

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs