to_tsvector() with hyphens

Started by Brian DeRocher almost 11 years ago, 2 messages, general
#1 Brian DeRocher
brian@derocher.org

Hey everyone,

I think it's great that the full text search parser breaks hyphenated words into multiple parts. I think this really could help, but something is not right.

rasmas_hackathon=> select * from ts_debug( 'gn-foo' );
      alias      |           description           |  token  |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+---------+----------------+--------------+----------
 asciihword      | Hyphenated word, all ASCII      | gn-foo  | {english_stem} | english_stem | {gn-foo}
 hword_asciipart | Hyphenated word part, all ASCII | gn      | {english_stem} | english_stem | {gn}
 blank           | Space symbols                   | -       | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | foo     | {english_stem} | english_stem | {foo}
 blank           | Space symbols                   |         | {}             |              |
(6 rows)

But why does to_tsquery() AND them?

rasmas_hackathon=> select * from to_tsquery( 'gn-foo | bandage' );
to_tsquery
------------------------------------
'gn-foo' & 'gn' & 'foo' | 'bandag'
(1 row)

Perhaps my vector is like this:

rasmas_hackathon=> select to_tsvector( 'gn series bandage' );
to_tsvector
-----------------------------
'bandag':3 'gn':1 'seri':2
(1 row)

The rank is so bad.

rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn-foo | bandage' ) );
ts_rank_cd
------------
0.1
(1 row)

Without the hyphen the rank is better, despite the process above.

rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn | bandage' ) );
ts_rank_cd
------------
0.2
(1 row)

So wouldn't this be a better query for hyphenated words?

'gn-foo' | 'gn' | 'foo'

Aside: as best I can tell, the parser is telling pushval_morph() to treat hyphenated words as "same variants".

thanks,
Brian

--
http://brian.derocher.org
http://mappingdc.org
http://about.me/brian.derocher

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian DeRocher (#1)
Re: to_tsvector() with hyphens

Brian DeRocher <brian@derocher.org> writes:

> But why does to_tsquery() AND them?
>
> rasmas_hackathon=> select * from to_tsquery( 'gn-foo | bandage' );
> to_tsquery
> ------------------------------------
> 'gn-foo' & 'gn' & 'foo' | 'bandag'
> (1 row)

Because what you're looking for is gn-foo, not either gn alone or foo
alone. Converting to "OR" would be the wrong thing.
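
The effect of the AND can be checked directly with the @@ match operator. This is a minimal sketch, assuming the english configuration (the thread uses the session default, which the english_stem output above suggests is english) and a second, made-up document that actually contains the hyphenated word:

```sql
-- 'gn-foo' expands to 'gn-foo' & 'gn' & 'foo', so a document that
-- contains only 'gn' does not satisfy the hyphenated term on its own.
SELECT to_tsvector('english', 'gn series bandage')
       @@ to_tsquery('english', 'gn-foo') AS matches_partial_word;

-- A document containing the hyphenated word is indexed with the whole
-- word and both parts, so all three ANDed lexemes are present.
-- ('gn-foo bandage' is a hypothetical document, not from the thread.)
SELECT to_tsvector('english', 'gn-foo bandage')
       @@ to_tsquery('english', 'gn-foo') AS matches_full_word;
```

The first query should report no match and the second a match, which is the behaviour described above: the query asks for gn-foo as a unit, not for either part alone.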

> The rank is so bad.
>
> rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn-foo | bandage' ) );
> ts_rank_cd
> ------------
> 0.1
> (1 row)
>
> Without the hyphen the rank is better, despite the process above.
>
> rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn | bandage' ) );
> ts_rank_cd
> ------------
> 0.2
> (1 row)

Don't see the problem. The first case doesn't match the query as well as
the second one does, so I'd fully expect a higher rank for the second.
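
If OR semantics across the parts are really what is wanted, one option is to spell the parts out in the query text instead of passing the hyphenated word. A rough sketch, again assuming the english configuration; the column aliases are only illustrative:

```sql
-- Default behaviour: the hyphenated term becomes an AND of three lexemes,
-- so only the 'bandag' branch of the OR is satisfied by this document.
SELECT ts_rank_cd(to_tsvector('english', 'gn series bandage'),
                  to_tsquery('english', 'gn-foo | bandage')) AS rank_hyphenated;

-- Parts written out by hand and ORed together. Note that 'gn-foo' itself
-- is left out, since to_tsquery would expand it into the AND form again.
SELECT ts_rank_cd(to_tsvector('english', 'gn series bandage'),
                  to_tsquery('english', 'gn | foo | bandage')) AS rank_explicit_or;

-- The same OR can be built from separate tsquery values with ||.
SELECT to_tsquery('english', 'gn') || to_tsquery('english', 'foo') AS ored_parts;
```

The trade-off is the one raised above: an ORed query also matches documents that mention only gn or only foo, so it is looser than a search for gn-foo.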

regards, tom lane
