old bug in full text parser

Started by Oleg Bartunov, almost 10 years ago, 6 messages

obartunov@gmail.com

almost 10 years ago

It looks like there is a very old bug in the full text parser (somebody
pointed it out to me), which appeared after moving tsearch2 into the core.
The problem is in how the full text parser processes hyphenated words. Our
original idea was to report the hyphenated word itself as well as its
parts, and to ignore the hyphen. That was how tsearch2 worked.

This behaviour was changed after moving tsearch2 into the core:
1. The hyphen is now reported by the parser, which is useless.
2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
differently from ones made of plain text words like 'four-dot': the
hyphenated word itself is not reported.

I think we should consider this a bug and produce a fix for all supported
versions.

After investigation we found this commit:

commit 73e6f9d3b61995525785b2f4490b465fe860196b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 27 19:03:45 2007 +0000

Change text search parsing rules for hyphenated words so that digit strings
containing decimal points aren't considered part of a hyphenated word.
Sync the hyphenated-word lookahead states with the subsequent part-by-part
reparsing states so that we don't get different answers about how much text
is part of the hyphenated word. Per my gripe of a few days ago.

8.2.23

8.3.23

Regards,
Oleg

Oleg Bartunov

obartunov@gmail.com

almost 10 years ago

In reply to: Oleg Bartunov (#1)

Re: old bug in full text parser

On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartunov@gmail.com> wrote:

8.2.23

select tok_type, description, token from ts_debug('dot-four');
  tok_type   |          description          |  token
-------------+-------------------------------+----------
 lhword      | Latin hyphenated word         | dot-four
 lpart_hword | Latin part of hyphenated word | dot
 lpart_hword | Latin part of hyphenated word | four
(3 rows)

select tok_type, description, token from ts_debug('dot-4');
  tok_type   |          description          | token
-------------+-------------------------------+-------
 hword       | Hyphenated word               | dot-4
 lpart_hword | Latin part of hyphenated word | dot
 uint        | Unsigned integer              | 4
(3 rows)

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)

8.3.23

select alias, description, token from ts_debug('dot-four');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 asciihword      | Hyphenated word, all ASCII      | dot-four
 hword_asciipart | Hyphenated word part, all ASCII | dot
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | four
(4 rows)

select alias, description, token from ts_debug('dot-4');
   alias   |   description   | token
-----------+-----------------+-------
 asciiword | Word, all ASCII | dot
 int       | Signed integer  | -4
(2 rows)

select alias, description, token from ts_debug('4-dot');
   alias   |   description    | token
-----------+------------------+-------
 uint      | Unsigned integer | 4
 blank     | Space symbols    | -
 asciiword | Word, all ASCII  | dot
(3 rows)

Oh, one more bug, which existed even in tsearch2.

Regards,
Oleg

Tom Lane

tgl@sss.pgh.pa.us

almost 10 years ago

In reply to: Oleg Bartunov (#1)

Re: old bug in full text parser

Oleg Bartunov <obartunov@gmail.com> writes:

> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after moving tsearch2 into the core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its
> parts, and to ignore the hyphen. That was how tsearch2 worked.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain text words like 'four-dot': the
> hyphenated word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.

I don't see anything here that looks like a bug, more like a definition
disagreement. As such, I'd be pretty dubious about back-patching a
change. But it's hard to debate the merits when you haven't said exactly
what you'd do instead.

I believe the commit you mention was intended to fix this inconsistency:

/messages/by-id/6269.1193184058@sss.pgh.pa.us

so I would be against simply reverting it. In any case, the examples
given there make it look like there was already inconsistency about mixed
words and numbers. Do we really think that "4-dot" should be considered
a hyphenated word? I'm not sure.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Mike Rylander

mrylander@gmail.com

almost 10 years ago

In reply to: Oleg Bartunov (#1)

Re: old bug in full text parser

On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:

> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after moving tsearch2 into the core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its
> parts, and to ignore the hyphen. That was how tsearch2 worked.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain text words like 'four-dot': the
> hyphenated word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.

The Evergreen project has long depended on tsearch2 (both as an
extension and in-core FTS), and one thing we've struggled with is date
range parsing such as birth and death years for authors in the form of
1979-2014, for instance. Strings like that end up being parsed as two
lexemes, "1979" and "-2014". We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step it would be a great help to us. Would such
strings be parsed "properly" into lexemes of the form "1979" and
"2014" with your proposed change?

Thanks!

--
Mike Rylander
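
A minimal sketch of the kind of pre-normalization Mike describes, written in Python purely for illustration. The function name `prenormalize_ranges` is hypothetical and this is not Evergreen's actual code. It rewrites each match of /(\d+)-(\d+)/ as the two numbers separated by a space, so the text search parser never sees "-2014" as a signed integer.

```python
import re

# Pattern taken from the message above: two digit runs joined by a hyphen.
_DIGIT_RANGE = re.compile(r"(\d+)-(\d+)")


def prenormalize_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a single space.

    Hypothetical helper, shown only to illustrate the workaround.
    A single pass rewrites non-overlapping matches, so a chain such as
    "1-2-3" is only partly rewritten.
    """
    return _DIGIT_RANGE.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(prenormalize_ranges("Smith, John, 1979-2014"))
```

The normalized string would then be passed to the text search functions in place of the original.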

Oleg Bartunov

obartunov@gmail.com

almost 10 years ago

In reply to: Mike Rylander (#4)

Re: old bug in full text parser

On Wed, Feb 10, 2016 at 7:45 PM, Mike Rylander <mrylander@gmail.com> wrote:

> On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
>
> > It looks like there is a very old bug in the full text parser (somebody
> > pointed it out to me), which appeared after moving tsearch2 into the core.
> > The problem is in how the full text parser processes hyphenated words. Our
> > original idea was to report the hyphenated word itself as well as its
> > parts, and to ignore the hyphen. That was how tsearch2 worked.
> >
> > This behaviour was changed after moving tsearch2 into the core:
> > 1. The hyphen is now reported by the parser, which is useless.
> > 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> > differently from ones made of plain text words like 'four-dot': the
> > hyphenated word itself is not reported.
> >
> > I think we should consider this a bug and produce a fix for all supported
> > versions.
>
> The Evergreen project has long depended on tsearch2 (both as an
> extension and in-core FTS), and one thing we've struggled with is date
> range parsing such as birth and death years for authors in the form of
> 1979-2014, for instance. Strings like that end up being parsed as two
> lexemes, "1979" and "-2014". We work around this by pre-normalizing
> strings matching /(\d+)-(\d+)/ into two numbers separated by a space
> instead of a hyphen, but if fixing this bug would remove the need for
> such a preprocessing step it would be a great help to us. Would such
> strings be parsed "properly" into lexemes of the form "1979" and
> "2014" with your proposed change?

I'd love to treat all hyphenated "words" the same way, regardless of
whether each part is a number or plain text; namely, 'w1-w2' should be
reported as {'w1-w2', 'w1', 'w2'}. The problem is in the definition of a
"word".

We'll definitely look at the parser again. Fortunately, we could just fork
the default parser and develop a new one so as not to break compatibility.
You have a chance to help us produce a "consistent" view of which tokens
the new parser should recognize and how to process them.
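
To make the proposed rule concrete, here is a rough Python sketch of reporting 'w1-w2' as {'w1-w2', 'w1', 'w2'}. It assumes, only for illustration, that a part is any run of ASCII letters or digits. The function name is hypothetical, and this is not the PostgreSQL parser, which distinguishes many more token types.

```python
import re

# Assumption: a "part" is any run of ASCII letters or digits, so numbers
# and plain words are treated alike.
_HYPHENATED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+")


def hyphenated_tokens(text):
    """Return, for each hyphenated compound, the whole word then its parts.

    The hyphen itself is never reported. Tokens that contain no hyphen
    are ignored by this sketch.
    """
    tokens = []
    for match in _HYPHENATED.finditer(text):
        whole = match.group(0)
        tokens.append(whole)
        tokens.extend(whole.split("-"))
    return tokens


if __name__ == "__main__":
    for sample in ("dot-four", "dot-4", "4-dot"):
        print(sample, "->", hyphenated_tokens(sample))
```

Under this rule 'dot-four', 'dot-4' and '4-dot' all yield the compound followed by its two parts, which is the consistency being asked for. Whether a date range such as '1979-2014' should also count as a "word" is exactly the open definitional question.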

Regards,
Oleg

Oleg Bartunov

obartunov@gmail.com

almost 10 years ago

In reply to: Tom Lane (#3)

Re: old bug in full text parser

On Wed, Feb 10, 2016 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Oleg Bartunov <obartunov@gmail.com> writes:
>
> > It looks like there is a very old bug in the full text parser (somebody
> > pointed it out to me), which appeared after moving tsearch2 into the core.
> > The problem is in how the full text parser processes hyphenated words. Our
> > original idea was to report the hyphenated word itself as well as its
> > parts, and to ignore the hyphen. That was how tsearch2 worked.
> >
> > This behaviour was changed after moving tsearch2 into the core:
> > 1. The hyphen is now reported by the parser, which is useless.
> > 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> > differently from ones made of plain text words like 'four-dot': the
> > hyphenated word itself is not reported.
> >
> > I think we should consider this a bug and produce a fix for all supported
> > versions.
>
> I don't see anything here that looks like a bug, more like a definition
> disagreement. As such, I'd be pretty dubious about back-patching a
> change. But it's hard to debate the merits when you haven't said exactly
> what you'd do instead.

Yeah, better to call it not a bug but an inconsistency. We definitely
should work on a better, "consistent" parser with predictable behaviour.

> I believe the commit you mention was intended to fix this inconsistency:
>
> /messages/by-id/6269.1193184058@sss.pgh.pa.us
>
> so I would be against simply reverting it. In any case, the examples
> given there make it look like there was already inconsistency about mixed
> words and numbers. Do we really think that "4-dot" should be considered
> a hyphenated word? I'm not sure.

I agree that we shouldn't just revert it. My idea is to work on a new
parser and leave the old one as is for compatibility reasons. Fortunately,
fts is flexible enough that we could add a new parser at any time as an
extension.
