to_tsvector in 8.2.3

Started by richardcraigabout 19 years ago14 messagesgeneral
Jump to latest
#1richardcraig
richard@v3fm.com

It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1 and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation and
restoring the same backup database file.

In 8.2.1
select to_tsvector('test text')
returns
'test':1 'text':2
as expected

In 8.2.3 the same code returns
'test text':1

Can anybody help?
--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592230
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: richardcraig (#1)
Re: to_tsvector in 8.2.3

What parse returns ? 8.1.5 and 8.3 return

www=# select parse('test text');
parse
----------
(1,test)
(12," ")
(1,text)
(3 rows)

Also, what is your configuration ?

On Wed, 21 Mar 2007, richardcraig wrote:

It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1 and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation and
restoring the same backup database file.

In 8.2.1
select to_tsvector('test text')
returns
'test':1 'text':2
as expected

In 8.2.3 the same code returns
'test text':1

Can anybody help?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3richardcraig
richard@v3fm.com
In reply to: Oleg Bartunov (#2)
Re: to_tsvector in 8.2.3

Oleg

Thanks for the quick response.

On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
(2,"test text")

Configuration?
Both databases are UTF8 encoded with language set to C
Standard windows installation except for tsearch2
I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK

As far as I can tell they are identical except for the postgresql versions.

It's been a long time since I did much with tsearch2 so I'm suspecting a
config problem with it that's I've long forgotten about.

Richard

Oleg Bartunov wrote:

What parse returns ? 8.1.5 and 8.3 return

www=# select parse('test text');
parse
----------
(1,test)
(12," ")
(1,text)
(3 rows)

Also, what is your configuration ?

On Wed, 21 Mar 2007, richardcraig wrote:

It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1
and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation
and
restoring the same backup database file.

In 8.2.1
select to_tsvector('test text')
returns
'test':1 'text':2
as expected

In 8.2.3 the same code returns
'test text':1

Can anybody help?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592631
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#4richardcraig
richard@v3fm.com
In reply to: richardcraig (#3)
Re: to_tsvector in 8.2.3

As an update, I've just copied the tsvector.dll file from an 8.2.1 computer
to the 8.2.3 computer and it's working OK now. I suspect a change in the dll
is causing the problem.

Richard

richardcraig wrote:

Oleg

Thanks for the quick response.

On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
(2,"test text")

Configuration?
Both databases are UTF8 encoded with language set to C
Standard windows installation except for tsearch2
I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK

As far as I can tell they are identical except for the postgresql
versions.

It's been a long time since I did much with tsearch2 so I'm suspecting a
config problem with it that's I've long forgotten about.

Richard

Oleg Bartunov wrote:

What parse returns ? 8.1.5 and 8.3 return

www=# select parse('test text');
parse
----------
(1,test)
(12," ")
(1,text)
(3 rows)

Also, what is your configuration ?

On Wed, 21 Mar 2007, richardcraig wrote:

It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1
and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation
and
restoring the same backup database file.

In 8.2.1
select to_tsvector('test text')
returns
'test':1 'text':2
as expected

In 8.2.3 the same code returns
'test text':1

Can anybody help?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592812
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#5Teodor Sigaev
teodor@sigaev.ru
In reply to: richardcraig (#3)
Re: to_tsvector in 8.2.3

I can't reproduce your problem, but I have not Windows box, can anybody
reproduce that?

contrib_regression=# select version();
version

PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC) 3.4.6
[FreeBSD] 20060305
(1 row)
contrib_regression=# show server_encoding ;
server_encoding
-----------------
UTF8
(1 row)

contrib_regression=# show lc_collate;
lc_collate
------------
C
(1 row)

contrib_regression=# show lc_ctype;
lc_ctype
----------
C
(1 row)

contrib_regression=# select to_tsvector('test text');
to_tsvector
-------------------
'test':1 'text':2
(1 row)

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#6Thomas Pundt
mlists@rp-online.de
In reply to: Teodor Sigaev (#5)
Re: to_tsvector in 8.2.3

On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote:
| I can't reproduce your problem, but I have not Windows box, can anybody
| reproduce that?

just a guess in the wild; I once had a similar phenomen and tracked it down
to a "non breaking space character" (0xA0). Since then I'm patching the
tsearch2 lexer:

--- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l
+++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l
@@ -78,8 +78,8 @@
 /* cyrillic koi8 char */
 CYRALNUM       [0-9\200-\377]
 CYRALPHA       [\200-\377]
-ALPHA          [a-zA-Z\200-\377]
-ALNUM          [0-9a-zA-Z\200-\377]
+ALPHA          [a-zA-Z\200-\237\241-\377]
+ALNUM          [0-9a-zA-Z\200-\237\241-\377]

HOSTNAME ([-_[:alnum:]]+\.)+[[:alpha:]]+
@@ -307,7 +307,7 @@
return UWORD;
}

-[ \r\n\t]+ {
+[ \240\r\n\t]+ {
token = tsearch2_yytext;
tokenlen = tsearch2_yyleng;
return SPACE;

Ciao,
Thomas

--
Thomas Pundt <thomas.pundt@rp-online.de> ---- http://rp-online.de/ ----

#7Teodor Sigaev
teodor@sigaev.ru
In reply to: Thomas Pundt (#6)
Re: to_tsvector in 8.2.3

8.2 has fully rewritten text parser based on POSIX is* functions.

Thomas Pundt wrote:

On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote:
| I can't reproduce your problem, but I have not Windows box, can anybody
| reproduce that?

just a guess in the wild; I once had a similar phenomen and tracked it down
to a "non breaking space character" (0xA0). Since then I'm patching the
tsearch2 lexer:

--- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l
+++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l
@@ -78,8 +78,8 @@
/* cyrillic koi8 char */
CYRALNUM       [0-9\200-\377]
CYRALPHA       [\200-\377]
-ALPHA          [a-zA-Z\200-\377]
-ALNUM          [0-9a-zA-Z\200-\377]
+ALPHA          [a-zA-Z\200-\237\241-\377]
+ALNUM          [0-9a-zA-Z\200-\237\241-\377]

HOSTNAME ([-_[:alnum:]]+\.)+[[:alpha:]]+
@@ -307,7 +307,7 @@
return UWORD;
}

-[ \r\n\t]+ {
+[ \240\r\n\t]+ {
token = tsearch2_yytext;
tokenlen = tsearch2_yyleng;
return SPACE;

Ciao,
Thomas

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#8Oleg Bartunov
oleg@sai.msu.su
In reply to: richardcraig (#4)
Re: to_tsvector in 8.2.3

On Wed, 21 Mar 2007, richardcraig wrote:

As an update, I've just copied the tsvector.dll file from an 8.2.1 computer
to the 8.2.3 computer and it's working OK now. I suspect a change in the dll
is causing the problem.

Interesting.

Richard

richardcraig wrote:

Oleg

Thanks for the quick response.

On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
(2,"test text")

Configuration?
Both databases are UTF8 encoded with language set to C
Standard windows installation except for tsearch2
I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK

As far as I can tell they are identical except for the postgresql
versions.

It's been a long time since I did much with tsearch2 so I'm suspecting a
config problem with it that's I've long forgotten about.

Richard

Oleg Bartunov wrote:

What parse returns ? 8.1.5 and 8.3 return

www=# select parse('test text');
parse
----------
(1,test)
(12," ")
(1,text)
(3 rows)

Also, what is your configuration ?

On Wed, 21 Mar 2007, richardcraig wrote:

It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1
and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation
and
restoring the same backup database file.

In 8.2.1
select to_tsvector('test text')
returns
'test':1 'text':2
as expected

In 8.2.3 the same code returns
'test text':1

Can anybody help?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#9Magnus Hagander
magnus@hagander.net
In reply to: Teodor Sigaev (#5)
Re: to_tsvector in 8.2.3

On Wed, Mar 21, 2007 at 04:25:30PM +0300, Teodor Sigaev wrote:

I can't reproduce your problem, but I have not Windows box, can anybody
reproduce that?

contrib_regression=# select version();
version

PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC)
3.4.6 [FreeBSD] 20060305
(1 row)
contrib_regression=# show server_encoding ;
server_encoding
-----------------
UTF8
(1 row)

contrib_regression=# show lc_collate;
lc_collate
------------
C
(1 row)

contrib_regression=# show lc_ctype;
lc_ctype
----------
C
(1 row)

contrib_regression=# select to_tsvector('test text');
to_tsvector
-------------------
'test':1 'text':2
(1 row)

postgres=# select version();show lc_collate;show lc_ctype;
version

--------------------------------------------------------------------------------
-------------
PostgreSQL 8.3devel on i686-pc-mingw32, compiled by GCC gcc.exe (GCC)
3.4.2 (mi
ngw-special)
(1 row)

lc_collate
------------
C
(1 row)

lc_ctype
----------
C
(1 row)

postgres=# select to_tsvector('test text');
to_tsvector
---------------
'test text':1
(1 row)

(encoding is UTF8 on a fresh database)

//Magnus

#10Teodor Sigaev
teodor@sigaev.ru
In reply to: Magnus Hagander (#9)
Re: to_tsvector in 8.2.3

postgres=# select to_tsvector('test text');
to_tsvector
---------------
'test text':1
(1 row)

Ok. that's related to
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
commit. Thomas pointed that it can be non-breakable space (0xa0) and that commit
assumes any character with C locale and multibyte encoding and > 0x7f is alpha.
To check theory, pls, apply attached patch.

If so, I'm confused, we can not assume that 0xa0 is a space symbol in any
multibyte encoding, even in Windows.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

nonbreak.patchtext/plain; name=nonbreak.patchDownload+8-0
#11Magnus Hagander
magnus@hagander.net
In reply to: Teodor Sigaev (#10)
Re: to_tsvector in 8.2.3

On Wed, Mar 21, 2007 at 09:13:55PM +0300, Teodor Sigaev wrote:

postgres=# select to_tsvector('test text');
to_tsvector
---------------
'test text':1
(1 row)

Ok. that's related to
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
commit. Thomas pointed that it can be non-breakable space (0xa0) and that
commit assumes any character with C locale and multibyte encoding and >
0x7f is alpha.
To check theory, pls, apply attached patch.

If so, I'm confused, we can not assume that 0xa0 is a space symbol in any
multibyte encoding, even in Windows.

Nope, same result with this patch.

//Magnus

#12Teodor Sigaev
teodor@sigaev.ru
In reply to: Magnus Hagander (#11)
Re: to_tsvector in 8.2.3

Solved, see attached patch. I had found old Celeron-300 box and install Windows
on it, and it was very slow :)

Nope, same result with this patch.

Thank you.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

ttttext/plain; name=tttDownload+2-2
#13richardcraig
richard@v3fm.com
In reply to: Teodor Sigaev (#12)
Re: to_tsvector in 8.2.3

Teodor

As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed
dll for this patch that I can try?

Thanks

Richard

Teodor Sigaev-2 wrote:

Solved, see attached patch. I had found old Celeron-300 box and install
Windows
on it, and it was very slow :)

Nope, same result with this patch.

Thank you.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW:
http://www.sigaev.ru/

*** ./contrib/tsearch2.orig/./wordparser/parser.c	Thu Mar 22 18:39:23 2007
--- ./contrib/tsearch2/./wordparser/parser.c	Thu Mar 22 18:51:23 2007
***************
*** 117,123 ****
{
if (lc_ctype_is_c())
{
! 			unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
/*
* any non-ascii symbol with multibyte encoding
--- 117,123 ----
{
if (lc_ctype_is_c())
{
! 			unsigned int c = *(prs->wstr + prs->state->poschar);

/*
* any non-ascii symbol with multibyte encoding

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9790691
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#14Teodor Sigaev
teodor@sigaev.ru
In reply to: richardcraig (#13)
Re: to_tsvector in 8.2.3

Sorry, no - I tested on CVS HEAD, so dll isn't compatible :(
Wait a bit for 8.2.4

richardcraig wrote:

Teodor

As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed
dll for this patch that I can try?

Thanks

Richard

Teodor Sigaev-2 wrote:

Solved, see attached patch. I had found old Celeron-300 box and install
Windows
on it, and it was very slow :)

Nope, same result with this patch.

Thank you.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW:
http://www.sigaev.ru/

*** ./contrib/tsearch2.orig/./wordparser/parser.c	Thu Mar 22 18:39:23 2007
--- ./contrib/tsearch2/./wordparser/parser.c	Thu Mar 22 18:51:23 2007
***************
*** 117,123 ****
{
if (lc_ctype_is_c())
{
! 			unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
/*
* any non-ascii symbol with multibyte encoding
--- 117,123 ----
{
if (lc_ctype_is_c())
{
! 			unsigned int c = *(prs->wstr + prs->state->poschar);

/*
* any non-ascii symbol with multibyte encoding

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/