BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Started by PG Bug reporting formabout 5 years ago8 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16865
Logged by: Dimitri Nüscheler
Email address: dimitri.nuescheler@gmail.com
PostgreSQL version: 13.1
Operating system: Debian
Description:

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri Jul 24 15:26:51 2020 -0400

Fix assorted bugs by changing TS_execute's callback API to ternary

logic.

...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
Limit (cost=130.61..12789.24 rows=1 width=136)
-> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
width=136)
Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
Filter: ((metadata)::text ~~ '%Discontinued%'::text)
-> Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
= off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
4962
(1 Zeile)

Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
0
(1 Zeile)

Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
count
-------
13127
(1 Zeile)

So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

#2Dimitri Nüscheler
dimitri.nuescheler@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog
USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't
include the data file yet. The file "steps" explains the steps.:
https://www.violetsky.ch/postgres-issue.tar.gz

I could share the data file with a few individuals until I find a way to
anonymize or reduce the data set. My attempts so far, resulted in the bug
no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
noreply@postgresql.org>:

Show quoted text

The following bug has been logged on the website:

Bug reference: 16865
Logged by: Dimitri Nüscheler
Email address: dimitri.nuescheler@gmail.com
PostgreSQL version: 13.1
Operating system: Debian
Description:

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this
commit
(see the query and expected and actual result further below):

commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri Jul 24 15:26:51 2020 -0400

Fix assorted bugs by changing TS_execute's callback API to ternary

logic.

...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
Limit (cost=130.61..12789.24 rows=1 width=136)
-> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
width=136)
Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
Filter: ((metadata)::text ~~ '%Discontinued%'::text)
-> Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5
respectively
versions before that commit, but if I alter the plan using (SET
enable_*scan
= off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
4962
(1 Zeile)

Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
0
(1 Zeile)

Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
count
-------
13127
(1 Zeile)

So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

#3Dimitri Nüscheler
dimitri.nuescheler@gmail.com
In reply to: Dimitri Nüscheler (#2)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

It seems the issue is not related to negation, because I also get these
results:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT
LIKE '%iscontinu%';
count
-------
4499
(1 Zeile)

These results don't contain "discontinu" in the tsvector.

Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <
dimitri.nuescheler@gmail.com>:

Show quoted text

Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog
USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't
include the data file yet. The file "steps" explains the steps.:
https://www.violetsky.ch/postgres-issue.tar.gz

I could share the data file with a few individuals until I find a way to
anonymize or reduce the data set. My attempts so far, resulted in the bug
no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
noreply@postgresql.org>:

The following bug has been logged on the website:

Bug reference: 16865
Logged by: Dimitri Nüscheler
Email address: dimitri.nuescheler@gmail.com
PostgreSQL version: 13.1
Operating system: Debian
Description:

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this
commit
(see the query and expected and actual result further below):

commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri Jul 24 15:26:51 2020 -0400

Fix assorted bugs by changing TS_execute's callback API to ternary

logic.

...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
Limit (cost=130.61..12789.24 rows=1 width=136)
-> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
width=136)
Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
Filter: ((metadata)::text ~~ '%Discontinued%'::text)
-> Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5
respectively
versions before that commit, but if I alter the plan using (SET
enable_*scan
= off) so that the planner comes up with a sequential scan, I will get
the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
4962
(1 Zeile)

Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
0
(1 Zeile)

Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
count
-------
13127
(1 Zeile)

So it seems like the negated query manages to filter out some data, but
not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces
the
problem, but I also wonder if I manage to dive into the code and
understand
something about it :-)

Kind Regards
Dimitri Nüscheler

#4Dimitri Nüscheler
dimitri.nuescheler@gmail.com
In reply to: Dimitri Nüscheler (#3)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Meanwhile I managed to anonymize the data. I put it in this archive
https://www.violetsky.ch/postgres-issue-anonymized.tar.gz
It's called anonymous.csv and the scripts to reproduce it are also included
(See the file "steps").

The query that reproduces the issue is now a bit different with the data
anonymized:

SELECT * FROM sherlock_catalog WHERE textsearch @@ '''glonoin'':* &
!''urinarium'':*'::tsquery AND metadata LIKE '%urinarium%' LIMIT 2;

You will be able to see that the textsearch tsvector contains "urinarium"
when it shouldn't.
I also checked that the commit right before
(25244b8972a34b838c4033fe9efc1d31cba9d0e3) does not have the issue and
returns 0 rows instead only.

Am So., 14. Feb. 2021 um 11:17 Uhr schrieb Dimitri Nüscheler <
dimitri.nuescheler@gmail.com>:

Show quoted text

It seems the issue is not related to negation, because I also get these
results:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT
LIKE '%iscontinu%';
count
-------
4499
(1 Zeile)

These results don't contain "discontinu" in the tsvector.

Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <
dimitri.nuescheler@gmail.com>:

Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON
sherlock_catalog USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't
include the data file yet. The file "steps" explains the steps.:
https://www.violetsky.ch/postgres-issue.tar.gz

I could share the data file with a few individuals until I find a way to
anonymize or reduce the data set. My attempts so far, resulted in the bug
no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
noreply@postgresql.org>:

The following bug has been logged on the website:

Bug reference: 16865
Logged by: Dimitri Nüscheler
Email address: dimitri.nuescheler@gmail.com
PostgreSQL version: 13.1
Operating system: Debian
Description:

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a
tsquery
when matched to a tsvector using the @@ operator no longer works
properly,
so I started bisecting the commit history and tracked it down to this
commit
(see the query and expected and actual result further below):

commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri Jul 24 15:26:51 2020 -0400

Fix assorted bugs by changing TS_execute's callback API to ternary

logic.

...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata)))
STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS
varchar
AS $$
BEGIN
RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
Limit (cost=130.61..12789.24 rows=1 width=136)
-> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24
rows=1
width=136)
Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
Filter: ((metadata)::text ~~ '%Discontinued%'::text)
-> Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5
respectively
versions before that commit, but if I alter the plan using (SET
enable_*scan
= off) so that the planner comes up with a sequential scan, I will get
the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
4962
(1 Zeile)

Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
0
(1 Zeile)

Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
count
-------
13127
(1 Zeile)

So it seems like the negated query manages to filter out some data, but
not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces
the
problem, but I also wonder if I manage to dive into the code and
understand
something about it :-)

Kind Regards
Dimitri Nüscheler

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Nüscheler (#4)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes:

Meanwhile I managed to anonymize the data. I put it in this archive
https://www.violetsky.ch/postgres-issue-anonymized.tar.gz

Thanks. I've reproduced the issue and it boils down to doing the
wrong thing when there are GIN_MAYBE values in the input data for
checkcondition_gin, specifically when the bitmap holding the original
GIN index results has become lossy. We correctly get a TS_MAYBE
result out of the calculation in TS_execute_recurse, but then
TS_execute figures it can throw that detail away and just return
TS_YES. I think that this problem existed before 2f2007fbb, but
we were masking it by always forcing rechecks.

Anyway, this seems to put the final nail in the coffin of the idea
that it's sufficient for TS_execute to return a boolean. At least
the tsginidx.c callers really need to see the 3-way result. Hence
I propose the attached patch. It fixes the given test case, but
I wonder if you can try it and see if you see any remaining problems.

I wasted quite a bit of time trying to devise a test case that is
short enough to be reasonable to include in our regression tests,
without success ... you need quite a bit of data to make the GIN
bitmap become lossy. So no test case here, but I'm not super
comfortable with that.

regards, tom lane

Attachments:

fix-lossy-gin-index-behavior.patchtext/x-diff; charset=us-ascii; name=fix-lossy-gin-index-behavior.patchDownload+37-9
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

I wrote:

Anyway, this seems to put the final nail in the coffin of the idea
that it's sufficient for TS_execute to return a boolean. At least
the tsginidx.c callers really need to see the 3-way result. Hence
I propose the attached patch. It fixes the given test case, but
I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

(a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES. This is a correct result: the row
unquestionably matches the query. But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean. Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

regards, tom lane

Attachments:

fix-lossy-gin-index-behavior-2.patchtext/x-diff; charset=us-ascii; name=fix-lossy-gin-index-behavior-2.patchDownload+44-28
#7Dimitri Nüscheler
dimitri.nuescheler@gmail.com
In reply to: Tom Lane (#6)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Hi Tom

Wow, this is going fast. I applied the second patch and it fixes my
original test case on my private data.
Thank you also for the explanations. I'm happy it's possible to get insight
into PostgreSQL interna so quickly :-)

Let me know if there's more I can do (testing particular cases or patch
variants etc).

Kind regards
Dimitri

Am Di., 16. Feb. 2021 um 01:27 Uhr schrieb Tom Lane <tgl@sss.pgh.pa.us>:

Show quoted text

I wrote:

Anyway, this seems to put the final nail in the coffin of the idea
that it's sufficient for TS_execute to return a boolean. At least
the tsginidx.c callers really need to see the 3-way result. Hence
I propose the attached patch. It fixes the given test case, but
I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

(a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES. This is a correct result: the row
unquestionably matches the query. But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean. Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Nüscheler (#7)
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes:

Wow, this is going fast. I applied the second patch and it fixes my
original test case on my private data.

Pushed, thanks for testing!

regards, tom lane