BUG #14623: pg_trgm doesn't correctly process some regexp with negative lookahead
The following bug has been logged on the website:
Bug reference: 14623
Logged by: Jeff Janes
Email address: jeff.janes@gmail.com
PostgreSQL version: 9.6.2
Operating system: Linux
Description:
I'm reporting this bug based on
https://dba.stackexchange.com/questions/170669/postgres-function-index-doesnt-work-correctly-with-regular-expressions/170897#170897
This simplified case is easy to reproduce back to at least 9.4, and is still
present in 10dev HEAD. Matches are missed when using the trgm index, but
not when doing the full table scan.
create table foobar (x text);
insert into foobar values ('eldrazi scion'),('eldrazi scio');
create extension pg_trgm ;
create index on foobar using gin (x gin_trgm_ops);
select * from foobar where x ~ 'eldrazi (?!scion)'; -- returns 1 row
set enable_seqscan TO off;
select * from foobar where x ~ 'eldrazi (?!scion)'; -- returns 0 rows
Building a gist rather than gin index exhibits the same bug.
The trigrams seem to be extracted correctly, but the graph stuffed into
extra_data is not correct. Looking at /tmp/packed.dot, there are no arrays
pointing to the successful termination state s1. Instead, I get lead to a
dead-end state s7.
If I change the regexp to 'eldrazi (?!s)', then bug goes away, and
/tmp/packed.dot shows the correct graph pointing to s1.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
jeff.janes@gmail.com writes:
This simplified case is easy to reproduce back to at least 9.4, and is still
present in 10dev HEAD. Matches are missed when using the trgm index, but
not when doing the full table scan.
select * from foobar where x ~ 'eldrazi (?!scion)'; -- returns 0 rows
The trigrams seem to be extracted correctly, but the graph stuffed into
extra_data is not correct. Looking at /tmp/packed.dot, there are no arrays
pointing to the successful termination state s1. Instead, I get lead to a
dead-end state s7.
Hm. I think what is happening here is that regexport.c is being too
simplistic by ignoring LACON arcs. Because it does so, there's actually
no path in the exported search NFA that can reach the success state,
which explains your observation of the lack of such a path in
/tmp/packed.dot. It's not entirely unreasonable for pg_trgm to be
assuming that such a path must exist.
What regexport.c should be doing is assuming that every LACON constraint
succeeds, hence treating such an arc as a traversable but
zero-input-consuming arc.
Since there's no notion of zero input consumption in the exported
representation, we're going to need some logic in regexport.c to
convert that (by traversing to the arc target state and emitting
its output arcs, possibly recursively).
If I change the regexp to 'eldrazi (?!s)', then bug goes away, and
/tmp/packed.dot shows the correct graph pointing to s1.
That seems to be because processlacons will simplify a single-character
LACON into a plain AHEAD/BEHIND constraint, which leads to an NFA that
doesn't confuse the export logic.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
I don’t know if this is related or a separate bug, but I am also having this problem with the `+` operator and character classes. Very similar test case:
CREATE TABLE foobar (x text);
INSERT INTO foobar VALUES ('Trying +1/+1'),('Also doing +20/+20');
CREATE EXTENSION pg_trgm;
CREATE INDEX ON foobar USING gin (x gin_trgm_ops);
SET enable_seqscan TO on;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns both rows
SET enable_seqscan TO off;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns 0 rows
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Import Notes
Resolved by subject fallback
Corey Csuhta <its+postgres@csuhta.com> writes:
I don’t know if this is related or a separate bug, but I am also having this problem with the `+` operator and character classes. Very similar test case:
CREATE TABLE foobar (x text);
INSERT INTO foobar VALUES ('Trying +1/+1'),('Also doing +20/+20');
CREATE EXTENSION pg_trgm;
CREATE INDEX ON foobar USING gin (x gin_trgm_ops);
SET enable_seqscan TO on;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns both rows
SET enable_seqscan TO off;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns 0 rows
Hmm ... seems unrelated, since there's no lookahead constraint in this
example. Also, the patch I committed earlier today doesn't fix it.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Does this still seem like a problem? Is it helpful if I submit a new bug # for tracking purposes?
On Apr 14, 2017, at 12:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Csuhta <its+postgres@csuhta.com> writes:
I don’t know if this is related or a separate bug, but I am also having this problem with the `+` operator and character classes. Very similar test case:
CREATE TABLE foobar (x text);
INSERT INTO foobar VALUES ('Trying +1/+1'),('Also doing +20/+20');
CREATE EXTENSION pg_trgm;
CREATE INDEX ON foobar USING gin (x gin_trgm_ops);
SET enable_seqscan TO on;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns both rows
SET enable_seqscan TO off;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns 0 rowsHmm ... seems unrelated, since there's no lookahead constraint in this
example. Also, the patch I committed earlier today doesn't fix it.regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Corey Csuhta <its+postgres@csuhta.com> writes:
Does this still seem like a problem? Is it helpful if I submit a new bug # for tracking purposes?
Yes, and not really. These days we track stuff by mail message ID.
But thanks for reporting the problem!
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Corey Csuhta <its+postgres@csuhta.com> writes:
I don’t know if this is related or a separate bug, but I am also having this problem with the `+` operator and character classes. Very similar test case:
CREATE TABLE foobar (x text);
INSERT INTO foobar VALUES ('Trying +1/+1'),('Also doing +20/+20');
CREATE EXTENSION pg_trgm;
CREATE INDEX ON foobar USING gin (x gin_trgm_ops);
SET enable_seqscan TO on;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns both rows
SET enable_seqscan TO off;
SELECT * FROM foobar WHERE x ~ '\+\d+\/\+\d+'; -- Returns 0 rows
I've pushed a fix for this. Thanks for the report!
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs