bug of pg_trgm?

Started by Fujii Masaoover 13 years ago12 messages
#1Fujii Masao
masao.fujii@gmail.com
1 attachment(s)

Hi,

When I used pg_trgm, I encountered the problem that the search result of
SeqScan was the different from that of BitmapScan even if the search
keyword was the same. Is this a bug? Here is the test case:

---------------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE tbl (col text);
CREATE INDEX idx ON tbl USING gin (col gin_trgm_ops);
INSERT INTO tbl VALUES ('abc'), ('ab c');

SET enable_seqscan TO off;
SET enable_bitmapscan TO on;
SELECT * FROM tbl WHERE col LIKE E'%\\c%';
col
------
ab c
(1 row)

SET enable_seqscan TO on;
SET enable_bitmapscan TO off;
SELECT * FROM tbl WHERE col LIKE E'%\\c%';
col
------
abc
ab c
(2 rows)
---------------------------

The cause is ISTM that pg_trgm wrongly ignores the heading wildcard
character (i.e., %) when an escape (i.e., \\) follows the wildcard character.
Attached patch fixes this.

The patch fixes another problem: pg_trgm wrongly ignores the backslash \\
following the escape, i.e., \\\\. This problem might be harmless when
KEEPONLYALNUM is enabled because any characters other than
alphabets and digits are ignored. But, when KEEPONLYALNUM is disabled,
\\\\ should be interpreted as a backslash character itself, but
pg_trgm does not.

Regards,

--
Fujii Masao

Attachments:

trgm_bugfix_v1.patchapplication/octet-stream; name=trgm_bugfix_v1.patchDownload
*** a/contrib/pg_trgm/trgm_op.c
--- b/contrib/pg_trgm/trgm_op.c
***************
*** 284,293 **** get_wildcard_part(const char *str, int lenstr,
  	{
  		if (in_escape)
  		{
- 			in_escape = false;
- 			in_wildcard_meta = false;
  			if (iswordchr(beginword))
  				break;
  		}
  		else
  		{
--- 284,292 ----
  	{
  		if (in_escape)
  		{
  			if (iswordchr(beginword))
  				break;
+ 			in_escape = false;
  		}
  		else
  		{
***************
*** 334,340 **** get_wildcard_part(const char *str, int lenstr,
  	 */
  	endword = beginword;
  	in_wildcard_meta = false;
- 	in_escape = false;
  	while (endword - str < lenstr)
  	{
  		clen = pg_mblen(endword);
--- 333,338 ----
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#1)
Re: bug of pg_trgm?

Fujii Masao <masao.fujii@gmail.com> writes:

When I used pg_trgm, I encountered the problem that the search result of
SeqScan was the different from that of BitmapScan even if the search
keyword was the same. Is this a bug?

Surely.

The cause is ISTM that pg_trgm wrongly ignores the heading wildcard
character (i.e., %) when an escape (i.e., \\) follows the wildcard character.
Attached patch fixes this.

This patch doesn't seem quite right to me, though. I agree that given
'%\x...', we should exit the loop with in_wildcard_meta still true.
However, if we have say '%\+...', the loop will continue, and now we
must reset in_wildcard_meta, no? The next character is not adjacent to
a meta. So I think in the "if (in_escape)" block, *both* assignments
should be moved after the iswordchr check. Is there something I'm
missing?

Also, shouldn't we make a similar change in the second loop? I guess
it's not strictly necessary given that that loop will exit as soon as
it sets in_wildcard_meta, but if you want to depend on that then the
reset in the second "if (in_escape)" block is altogether useless. It
seems better to keep the logic of the two loops as similar as possible.

I'm also inclined to think that we should remove *both* flag resets
before the second loop. The logic here is that we are reprocessing
the same character seen in the last iteration of the first loop,
right? So the flag state ought to remain the same.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: bug of pg_trgm?

... btw, I think there is another problem here, which is that
generate_wildcard_trgm will restart get_wildcard_part at the
same place that the second loop exits, which means it would
do the wrong thing if what it returns is a pointer to the
second char of an escape pair. Consider for instance

foo\\%bar

The first call of get_wildcard_part will correctly extract "foo",
but then return a pointer to the second backslash. So the second
call will think that the % is escaped, which it is not, leading to
a wrong decision about whether to pad "bar".

Probably a minimal fix for this could be made by backing up "endword"
one byte before returning it if in_escape is true when the second
loop exits. That would not scale up to preserving the state of
in_wildcard_meta, but since the second loop never advances past a
meta char, that's okay for the moment.

regards, tom lane

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#2)
Re: bug of pg_trgm?

On Thu, Aug 9, 2012 at 2:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

When I used pg_trgm, I encountered the problem that the search result of
SeqScan was the different from that of BitmapScan even if the search
keyword was the same. Is this a bug?

Surely.

The cause is ISTM that pg_trgm wrongly ignores the heading wildcard
character (i.e., %) when an escape (i.e., \\) follows the wildcard character.
Attached patch fixes this.

This patch doesn't seem quite right to me, though. I agree that given
'%\x...', we should exit the loop with in_wildcard_meta still true.
However, if we have say '%\+...', the loop will continue, and now we
must reset in_wildcard_meta, no? The next character is not adjacent to
a meta. So I think in the "if (in_escape)" block, *both* assignments
should be moved after the iswordchr check. Is there something I'm
missing?

No. You're right. We must reset in_wildcard_meta and must ignore %\+
in '%\+...'.

Also, shouldn't we make a similar change in the second loop? I guess
it's not strictly necessary given that that loop will exit as soon as
it sets in_wildcard_meta, but if you want to depend on that then the
reset in the second "if (in_escape)" block is altogether useless. It
seems better to keep the logic of the two loops as similar as possible.

Yes. There is another useless reset of in_wildcard_meta in the second
loop. We should also remove that?

I'm also inclined to think that we should remove *both* flag resets
before the second loop. The logic here is that we are reprocessing
the same character seen in the last iteration of the first loop,
right? So the flag state ought to remain the same.

No. ISTM that in_wildcard_meta must be reset before the second loop.
Because the meaning of that flag in the first loop is different from that in
the second loop. The former and the latter indicate whether the search
string has *heading* and *tailing* wildcard character, respectively. No?

Regards,

--
Fujii Masao

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#3)
Re: bug of pg_trgm?

On Thu, Aug 9, 2012 at 3:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... btw, I think there is another problem here, which is that
generate_wildcard_trgm will restart get_wildcard_part at the
same place that the second loop exits, which means it would
do the wrong thing if what it returns is a pointer to the
second char of an escape pair. Consider for instance

foo\\%bar

The first call of get_wildcard_part will correctly extract "foo",
but then return a pointer to the second backslash. So the second
call will think that the % is escaped, which it is not, leading to
a wrong decision about whether to pad "bar".

Good catch!

Probably a minimal fix for this could be made by backing up "endword"
one byte before returning it if in_escape is true when the second
loop exits. That would not scale up to preserving the state of
in_wildcard_meta, but since the second loop never advances past a
meta char, that's okay for the moment.

Or what about extending get_wildcard_part() so that it accepts the pointer
to in_escape as an argument? generate_wildcard_trgm() can know the last
value of in_escape and specify it the next call of get_wildcard_part(). Looks
very simple.

Regards,

--
Fujii Masao

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#4)
Re: bug of pg_trgm?

Fujii Masao <masao.fujii@gmail.com> writes:

On Thu, Aug 9, 2012 at 2:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm also inclined to think that we should remove *both* flag resets
before the second loop. The logic here is that we are reprocessing
the same character seen in the last iteration of the first loop,
right? So the flag state ought to remain the same.

No. ISTM that in_wildcard_meta must be reset before the second loop.
Because the meaning of that flag in the first loop is different from that in
the second loop. The former and the latter indicate whether the search
string has *heading* and *tailing* wildcard character, respectively. No?

Oh, good point. Maybe it would be clearer to use two separate
flag variables?

The thought I'd had was that the flag would necessarily get reset
during the first iteration of the second loop, which means it all
ends up the same anyway. But if we want to think of the flag as
meaning two different things for the two loops, I'd be inclined to
use two variables.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#5)
Re: bug of pg_trgm?

Fujii Masao <masao.fujii@gmail.com> writes:

On Thu, Aug 9, 2012 at 3:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Probably a minimal fix for this could be made by backing up "endword"
one byte before returning it if in_escape is true when the second
loop exits. That would not scale up to preserving the state of
in_wildcard_meta, but since the second loop never advances past a
meta char, that's okay for the moment.

Or what about extending get_wildcard_part() so that it accepts the pointer
to in_escape as an argument? generate_wildcard_trgm() can know the last
value of in_escape and specify it the next call of get_wildcard_part(). Looks
very simple.

Yeah, I had considered pushing the state variables out to the caller.
If there were any prospect of wanting more state than just in_escape,
I'd be for that --- but I don't see any reason to possibly need more,
especially in view of your point that in_wildcard_meta isn't really
a single flag with an interpretation that remains fixed throughout.
I think it's probably better just to take care of the issue inside
get_wildcard_part, and not complicate its API.

regards, tom lane

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#7)
1 attachment(s)
Re: bug of pg_trgm?

On Sat, Aug 11, 2012 at 8:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

No. ISTM that in_wildcard_meta must be reset before the second loop.
Because the meaning of that flag in the first loop is different from that in
the second loop. The former and the latter indicate whether the search
string has *heading* and *tailing* wildcard character, respectively. No?

Oh, good point. Maybe it would be clearer to use two separate
flag variables?

Agreed. Attached patch uses two separate flag variables.

On Sat, Aug 11, 2012 at 8:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

On Thu, Aug 9, 2012 at 3:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Probably a minimal fix for this could be made by backing up "endword"
one byte before returning it if in_escape is true when the second
loop exits. That would not scale up to preserving the state of
in_wildcard_meta, but since the second loop never advances past a
meta char, that's okay for the moment.

Or what about extending get_wildcard_part() so that it accepts the pointer
to in_escape as an argument? generate_wildcard_trgm() can know the last
value of in_escape and specify it the next call of get_wildcard_part(). Looks
very simple.

Yeah, I had considered pushing the state variables out to the caller.
If there were any prospect of wanting more state than just in_escape,
I'd be for that --- but I don't see any reason to possibly need more,
especially in view of your point that in_wildcard_meta isn't really
a single flag with an interpretation that remains fixed throughout.
I think it's probably better just to take care of the issue inside
get_wildcard_part, and not complicate its API.

OK. Attached patch fixes the problem as you suggested, i.e., it backs up
"endword" if the second loop exits in an escape pair.

Regards,

--
Fujii Masao

Attachments:

trgm_bugfix_v2.patchapplication/octet-stream; name=trgm_bugfix_v2.patchDownload
*** a/contrib/pg_trgm/trgm_op.c
--- b/contrib/pg_trgm/trgm_op.c
***************
*** 272,278 **** get_wildcard_part(const char *str, int lenstr,
  	const char *beginword = str;
  	const char *endword;
  	char	   *s = buf;
! 	bool		in_wildcard_meta = false;
  	bool		in_escape = false;
  	int			clen;
  
--- 272,279 ----
  	const char *beginword = str;
  	const char *endword;
  	char	   *s = buf;
! 	bool		in_leading_wildcard_meta = false;
! 	bool		in_trailing_wildcard_meta = false;
  	bool		in_escape = false;
  	int			clen;
  
***************
*** 284,304 **** get_wildcard_part(const char *str, int lenstr,
  	{
  		if (in_escape)
  		{
- 			in_escape = false;
- 			in_wildcard_meta = false;
  			if (iswordchr(beginword))
  				break;
  		}
  		else
  		{
  			if (ISESCAPECHAR(beginword))
  				in_escape = true;
  			else if (ISWILDCARDCHAR(beginword))
! 				in_wildcard_meta = true;
  			else if (iswordchr(beginword))
  				break;
  			else
! 				in_wildcard_meta = false;
  		}
  		beginword += pg_mblen(beginword);
  	}
--- 285,305 ----
  	{
  		if (in_escape)
  		{
  			if (iswordchr(beginword))
  				break;
+ 			in_escape = false;
+ 			in_leading_wildcard_meta = false;
  		}
  		else
  		{
  			if (ISESCAPECHAR(beginword))
  				in_escape = true;
  			else if (ISWILDCARDCHAR(beginword))
! 				in_leading_wildcard_meta = true;
  			else if (iswordchr(beginword))
  				break;
  			else
! 				in_leading_wildcard_meta = false;
  		}
  		beginword += pg_mblen(beginword);
  	}
***************
*** 314,320 **** get_wildcard_part(const char *str, int lenstr,
  	 * meta-character.
  	 */
  	*charlen = 0;
! 	if (!in_wildcard_meta)
  	{
  		if (LPADDING > 0)
  		{
--- 315,321 ----
  	 * meta-character.
  	 */
  	*charlen = 0;
! 	if (!in_leading_wildcard_meta)
  	{
  		if (LPADDING > 0)
  		{
***************
*** 333,347 **** get_wildcard_part(const char *str, int lenstr,
  	 * string boundary.  Strip escapes during copy.
  	 */
  	endword = beginword;
- 	in_wildcard_meta = false;
- 	in_escape = false;
  	while (endword - str < lenstr)
  	{
  		clen = pg_mblen(endword);
  		if (in_escape)
  		{
  			in_escape = false;
- 			in_wildcard_meta = false;
  			if (iswordchr(endword))
  			{
  				memcpy(s, endword, clen);
--- 334,345 ----
***************
*** 349,355 **** get_wildcard_part(const char *str, int lenstr,
--- 347,361 ----
  				s += clen;
  			}
  			else
+ 			{
+ 				/*
+ 				 * Back up endword one byte if the last two characters are
+ 				 * an escape pair, so that subsequent get_wildcard_part will
+ 				 * restart from its escape character.
+ 				 */
+ 				endword--;
  				break;
+ 			}
  		}
  		else
  		{
***************
*** 357,363 **** get_wildcard_part(const char *str, int lenstr,
  				in_escape = true;
  			else if (ISWILDCARDCHAR(endword))
  			{
! 				in_wildcard_meta = true;
  				break;
  			}
  			else if (iswordchr(endword))
--- 363,369 ----
  				in_escape = true;
  			else if (ISWILDCARDCHAR(endword))
  			{
! 				in_trailing_wildcard_meta = true;
  				break;
  			}
  			else if (iswordchr(endword))
***************
*** 367,376 **** get_wildcard_part(const char *str, int lenstr,
  				s += clen;
  			}
  			else
- 			{
- 				in_wildcard_meta = false;
  				break;
- 			}
  		}
  		endword += clen;
  	}
--- 373,379 ----
***************
*** 379,385 **** get_wildcard_part(const char *str, int lenstr,
  	 * Add right padding spaces if last character wasn't wildcard
  	 * meta-character.
  	 */
! 	if (!in_wildcard_meta)
  	{
  		if (RPADDING > 0)
  		{
--- 382,388 ----
  	 * Add right padding spaces if last character wasn't wildcard
  	 * meta-character.
  	 */
! 	if (!in_trailing_wildcard_meta)
  	{
  		if (RPADDING > 0)
  		{
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#8)
Re: bug of pg_trgm?

Fujii Masao <masao.fujii@gmail.com> writes:

OK. Attached patch fixes the problem as you suggested, i.e., it backs up
"endword" if the second loop exits in an escape pair.

Applied with a bit of further adjustment of the comments. Thanks!

regards, tom lane

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#9)
Re: bug of pg_trgm?

On 08/20/2012 01:26 PM, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

OK. Attached patch fixes the problem as you suggested, i.e., it backs up
"endword" if the second loop exits in an escape pair.

Applied with a bit of further adjustment of the comments. Thanks!

When moving to a release with this change, will users need to reindex
trgm indexes?

cheers

andrew

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#10)
Re: bug of pg_trgm?

Andrew Dunstan <andrew@dunslane.net> writes:

On 08/20/2012 01:26 PM, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

OK. Attached patch fixes the problem as you suggested, i.e., it backs up
"endword" if the second loop exits in an escape pair.

Applied with a bit of further adjustment of the comments. Thanks!

When moving to a release with this change, will users need to reindex
trgm indexes?

No. This only changes extraction of search trigrams from a LIKE pattern,
not extraction of trigrams from text to be indexed.

regards, tom lane

#12Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#11)
Re: bug of pg_trgm?

On 09/06/2012 05:09 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 08/20/2012 01:26 PM, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

OK. Attached patch fixes the problem as you suggested, i.e., it backs up
"endword" if the second loop exits in an escape pair.

Applied with a bit of further adjustment of the comments. Thanks!

When moving to a release with this change, will users need to reindex
trgm indexes?

No. This only changes extraction of search trigrams from a LIKE pattern,
not extraction of trigrams from text to be indexed.

OK, Thanks.

cheers

andrew