integrated tsearch has different results than tsearch2

Started by Pavel Stehule over 18 years ago, 8 messages
#1 Pavel Stehule
pavel.stehule@gmail.com

Hello

I am testing full text search.

1. I am not able to use full text search with the latin2 encoding :( (the
documentation is missing a note that dictionaries must be in UTF-8).

2. With hspell dictionaries (a fresh copy from OpenOffice) I get
different and wrong results.

Original (old) result:

ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
 ts_name       | tok_type | description | token     | dict_name          | tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 rows)

New results:

postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION
postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias | Description   | Token     | Dictionaries    | Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2, but now returns false:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Pavel Stehule (#1)
Re: integrated tsearch has different results than tsearch2

Pavel,

I can't read your posting. Can you use plain text format?

Oleg
On Mon, 3 Sep 2007, Pavel Stehule wrote:

> Hello
> I am testing fulltext.
> [...]
> ts=# select * from ts_debug('P??li? ?lu?ou?k? k?? se napil ?lut? vody');
> [...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3 Teodor Sigaev
teodor@sigaev.ru
In reply to: Pavel Stehule (#1)
Re: integrated tsearch has different results than tsearch2

> 1. I am not able to use full text search with the latin2 encoding :( (the
> documentation is missing a note that dictionaries must be in UTF-8).

You can use any server encoding, but the dictionary's files should be in UTF-8;
the dictionary will convert the UTF-8 files into the server encoding.

> 2. With hspell dictionaries (a fresh copy from OpenOffice) I get
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)

Please post the output of:
select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Teodor Sigaev (#3)
Re: integrated tsearch has different results than tsearch2

2007/9/3, Teodor Sigaev <teodor@sigaev.ru>:

> Please post the output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
                        to_tsvector
-----------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference between the versions.

8.2.x:
postgres=# select lexize('cz_ispell','jablka');
  lexize
----------
 {jablko}
(1 row)

8.3:
postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)

postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)

Pavel Stehule

#5 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Pavel Stehule (#4)
Re: integrated tsearch has different results than tsearch2

Can you post a link to the ispell dictionary file you're using so I and
others can reproduce that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: integrated tsearch has different results than tsearch2

I used the dictionaries from the Fedora Core package

hunspell-cs-20060303-5.fc7.i386.rpm

and then converted them to UTF-8 with iconv.

Pavel

#7 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Pavel Stehule (#6)
1 attachment(s)
Re: integrated tsearch has different results than tsearch2

Pavel Stehule wrote:

> I used the dictionaries from the Fedora Core package
>
> hunspell-cs-20060303-5.fc7.i386.rpm
>
> and then converted them to UTF-8 with iconv.

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't
notice that some calls to STRNCMP used the non-lowercased version of the
input line. Patch attached.
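
A minimal sketch of the failure mode described above, under stated assumptions: STRNCMP is taken here to be a prefix comparison built on strncmp, and lower_copy and is_affix_keyword are hypothetical helpers written only for this illustration (they are not functions from spell.c). It shows why an uppercase keyword such as SFX can never match once the line has been lowercased.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: a prefix comparison in the spirit of the STRNCMP calls in the patch. */
#define STRNCMP(s, p) strncmp((s), (p), strlen(p))

/*
 * Hypothetical stand-in for lowerstr(): returns a malloc'd copy of s with
 * ASCII letters lowercased. The caller must free the result.
 */
static char *
lower_copy(const char *s)
{
	size_t		len = strlen(s);
	char	   *out = malloc(len + 1);
	size_t		i;

	if (out == NULL)
		return NULL;
	for (i = 0; i < len; i++)
		out[i] = (char) tolower((unsigned char) s[i]);
	out[len] = '\0';
	return out;
}

/*
 * Hypothetical helper: returns 1 if the line starts with an uppercase
 * affix keyword (PFX or SFX), 0 otherwise. The comparison is case-sensitive.
 */
static int
is_affix_keyword(const char *line)
{
	return STRNCMP(line, "PFX") == 0 || STRNCMP(line, "SFX") == 0;
}
```

For the line "SFX A Y 2", is_affix_keyword returns 1 on the original string and 0 on the copy produced by lower_copy. That would be consistent with the affix rules being skipped and ts_lexize finding no base form for inflected words such as jablka.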

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

spell-fix-1.patch (text/x-diff)
Index: src/backend/tsearch/spell.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tsearch/spell.c,v
retrieving revision 1.2
diff -c -r1.2 spell.c
*** src/backend/tsearch/spell.c	25 Aug 2007 00:03:59 -0000	1.2
--- src/backend/tsearch/spell.c	4 Sep 2007 12:31:55 -0000
***************
*** 733,739 ****
  	while ((recoded = t_readline(affix)) != NULL)
  	{
  		pstr = lowerstr(recoded);
- 		pfree(recoded);
  
  		lineno++;
  
--- 733,738 ----
***************
*** 813,820 ****
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(str, "COMPOUNDFLAG") == 0 || STRNCMP(str, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(str, "PFX") == 0 || STRNCMP(str, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
--- 812,819 ----
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(recoded, "COMPOUNDFLAG") == 0 || STRNCMP(recoded, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(recoded, "PFX") == 0 || STRNCMP(recoded, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
***************
*** 834,839 ****
--- 833,839 ----
  		NIAddAffix(Conf, flag, flagflags, mask, find, repl, suffixes ? FF_SUFFIX : FF_PREFIX);
  
  	nextline:
+ 		pfree(recoded);
  		pfree(pstr);
  	}
  	FreeFile(affix);

#8 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: integrated tsearch has different results than tsearch2

It works.

Thank you,
Pavel