tsvector/tsearch equality and/or portability issue?

Started by Stefan Kaltenbrunner over 19 years ago, 16 messages
#1Stefan Kaltenbrunner
stefan@kaltenbrunner.cc

We just had a complaint on IRC that:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

and that searches for certain values would not return all matches under
some circumstances.

A little bit of testing shows the following:

postgres=# create table foo (bla tsvector);
CREATE TABLE
postgres=# insert into foo values ('bla bla');
INSERT 0 1
postgres=# insert into foo values ('bla bla');
INSERT 0 1
postgres=# select bla from foo group by bla;
bla
-------
'bla'
(1 row)

postgres=# create index foo_idx on foo(bla);
CREATE INDEX
postgres=# set enable_seqscan to off;
SET
postgres=# select bla from foo group by bla;
bla
-------
'bla'
'bla'
(2 rows)

postgres=# set enable_seqscan to on;
SET
postgres=# select bla from foo group by bla;
bla
-------
'bla'
(1 row)

ouch :-(

I can reproduce that at least on OpenBSD/i386 and Debian Etch/x86_64.

It is also noteworthy that the existing regression tests for tsearch2 do
not seem to do any equality testing ...

Stefan

#2Andrew J. Kopciuch
akopciuch@bddf.ca
In reply to: Stefan Kaltenbrunner (#1)
Re: tsvector/tsearch equality and/or portability issue?

On Thursday 24 August 2006 10:34, Stefan Kaltenbrunner wrote:

We just had a complaint on IRC that:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

This could be an endianness issue?

This was probably the same person who posted this on the OpenFTS list.

He's compiled from source :

<snip>
dew=# select version();
PostgreSQL 8.1.4 on powerpc-apple-darwin8.6.0, compiled by GCC
powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build
5250)
</snip>

I don't have any access to an OSX box to verify things ATM. I am trying to
get access to one though. :S Can someone else verify this right now?

Andy

#3Teodor Sigaev
teodor@sigaev.ru
In reply to: Stefan Kaltenbrunner (#1)
Re: tsvector/tsearch equality and/or portability issue

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

Fixed in 8.1 and HEAD. Thank you

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#4AgentM
agentm@themactionfaction.com
In reply to: Andrew J. Kopciuch (#2)
Re: tsvector/tsearch equality and/or portability issue?

On Aug 24, 2006, at 12:58 , Andrew J. Kopciuch wrote:

On Thursday 24 August 2006 10:34, Stefan Kaltenbrunner wrote:

We just had a complaint on IRC that:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

This could be an endianness issue?

This was probably the same person who posted this on the OpenFTS list.

He's compiled from source :

<snip>
dew=# select version();
PostgreSQL 8.1.4 on powerpc-apple-darwin8.6.0, compiled by GCC
powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build
5250)
</snip>

I don't have any access to an OSX box to verify things ATM. I am trying to
get access to one though. :S Can someone else verify this right now?

Stefan said he reproduced on OpenBSD/i386 so it is unlikely to be an
endianness issue. Anyway, here's the comparison code- I guess it
doesn't use strcmp to avoid encoding silliness. (?)

static int
silly_cmp_tsvector(const tsvector * a, const tsvector * b)
{
    if (a->len < b->len)
        return -1;
    else if (a->len > b->len)
        return 1;
    else if (a->size < b->size)
        return -1;
    else if (a->size > b->size)
        return 1;
    else
    {
        unsigned char *aptr = (unsigned char *) (a->data) + DATAHDRSIZE;
        unsigned char *bptr = (unsigned char *) (b->data) + DATAHDRSIZE;

        while (aptr - ((unsigned char *) (a->data)) < a->len)
        {
            if (*aptr != *bptr)
                return (*aptr < *bptr) ? -1 : 1;
            aptr++;
            bptr++;
        }
    }
    return 0;
}

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew J. Kopciuch (#2)
Re: tsvector/tsearch equality and/or portability issue?

"Andrew J. Kopciuch" <akopciuch@bddf.ca> writes:

On Thursday 24 August 2006 10:34, Stefan Kaltenbrunner wrote:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

This could be an endianness issue?

Apparently not, it works for me on HPPA (big endian) and on Darwin/PPC
(ditto). I'm testing CVS HEAD though, not 8.1 branch.

However ... I also see that tsearch2's regression test is dumping
core on my OS X machine. I haven't cvs update'd for awhile on this
machine though --- will bring it to HEAD and report back.

Can some other people try this? We need to get a handle on which
machines show the problem.

regards, tom lane

#6Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Teodor Sigaev (#3)
Re: tsvector/tsearch equality and/or portability issue

Teodor Sigaev wrote:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

Fixed in 8.1 and HEAD. Thank you

Thanks for the fast response. Would it be worthwhile to add
regression tests for this kind of thing, though?

Stefan

#7Teodor Sigaev
teodor@sigaev.ru
In reply to: AgentM (#4)
Re: tsvector/tsearch equality and/or portability issue

Stefan said he reproduced on OpenBSD/i386 so it is unlikely to be an
endianness issue. Anyway, here's the comparison code- I guess it doesn't
use strcmp to avoid encoding silliness. (?)

I suppose the ordering for the tsvector type is somewhat strange and doesn't
really matter. For me, it's a mystery why it's needed :)
The cause of the bug: some internal parts of a tsvector must be short-aligned,
so there were unused padding bytes. The previous comparison function compared
them too...
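
To illustrate the kind of bug described here, the following is a minimal C sketch, not the actual tsearch2 code. It assumes a simplified layout in which a lexeme is followed by a short-aligned 16-bit counter; `DEMO_SHORTALIGN`, `demo_pack`, `demo_cmp_raw`, and `demo_cmp_fields` are made-up names for this example. The gap byte between the two fields holds whatever was already in memory, so a byte-by-byte comparison can report a difference between logically identical values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUFSZ 16
/* Round an offset up to the next multiple of 2 (hypothetical helper). */
#define DEMO_SHORTALIGN(x) (((x) + 1u) & ~(size_t) 1u)

/*
 * Write `lexeme` followed by a short-aligned uint16 into buf.
 * The buffer is first filled with `junk` to stand in for memory that
 * was never zeroed. The lexeme must be at most 13 bytes long.
 * Returns the number of bytes used.
 */
static size_t
demo_pack(unsigned char *buf, unsigned char junk,
          const char *lexeme, uint16_t npos)
{
    size_t len = strlen(lexeme);
    size_t off = DEMO_SHORTALIGN(len);

    memset(buf, junk, DEMO_BUFSZ);
    memcpy(buf, lexeme, len);
    memcpy(buf + off, &npos, sizeof(npos));
    return off + sizeof(npos);
}

/* Naive comparison: looks at every byte, including the alignment gap. */
static int
demo_cmp_raw(const unsigned char *a, const unsigned char *b, size_t used)
{
    return memcmp(a, b, used);
}

/*
 * Padding-aware comparison: looks only at the lexeme bytes and the
 * counter, skipping the gap. Returns 0 if and only if both are equal.
 */
static int
demo_cmp_fields(const unsigned char *a, const unsigned char *b,
                size_t lexlen)
{
    size_t off = DEMO_SHORTALIGN(lexlen);
    int    r = memcmp(a, b, lexlen);

    if (r != 0)
        return r;
    return memcmp(a + off, b + off, sizeof(uint16_t));
}
```

With the lexeme "bla" (3 bytes) there is exactly one gap byte, at offset 3. Packing the same value over buffers pre-filled with 0x00 and 0xAA makes `demo_cmp_raw` report a difference, while `demo_cmp_fields` returns 0.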

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#8Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#5)
Re: tsvector/tsearch equality and/or portability issue

Tom Lane wrote:

"Andrew J. Kopciuch" <akopciuch@bddf.ca> writes:

On Thursday 24 August 2006 10:34, Stefan Kaltenbrunner wrote:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

This could be an endianness issue?

Apparently not, it works for me on HPPA (big endian) and on Darwin/PPC
(ditto). I'm testing CVS HEAD though, not 8.1 branch.

However ... I also see that tsearch2's regression test is dumping
core on my OS X machine. I haven't cvs update'd for awhile on this
machine though --- will bring it to HEAD and report back.

Can some other people try this? We need to get a handle on which
machines show the problem.

I am trying on a current copy of HEAD. However:

jd@scratch:~/pgsqldev$ bin/psql -U postgres postgres <
share/contrib/tsearch2.sql
SET
BEGIN
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"pg_ts_dict_pkey" for table "pg_ts_dict"
CREATE TABLE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
INSERT 57434167 1
CREATE FUNCTION
CREATE FUNCTION
INSERT 57434170 1
ERROR: could not find function "snb_ru_init_koi8" in file
"/usr/local/pgsql/lib/tsearch2.so"
ERROR: current transaction is aborted, commands ignored until end of
transaction block
ERROR: current transaction is aborted, commands ignored until end of
transaction block

I will try on 8.1 in a moment.

Joshua D. Drake

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

#9Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#5)
Re: tsvector/tsearch equality and/or portability issue

Can some other people try this? We need to get a handle on which
machines show the problem.

jd@scratch:~/pgsqldev$ /usr/local/pgsql/bin/psql -U postgres postgres
Welcome to psql 8.1.3, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

postgres=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
t
(1 row)

postgres=#

AMD 64 X2, Ubuntu Dapper LTS.

Sincerely,

Joshua D. Drake

#10Joshua D. Drake
jd@commandprompt.com
In reply to: Joshua D. Drake (#8)
Re: tsvector/tsearch equality and/or portability issue

Can some other people try this? We need to get a handle on which
machines show the problem.

I am trying on current copy of HEAD.. however:

Ignore the below... This is an error with my linker/ld.so.conf

Joshua D. Drake

jd@scratch:~/pgsqldev$ bin/psql -U postgres postgres <
share/contrib/tsearch2.sql
SET
BEGIN
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"pg_ts_dict_pkey" for table "pg_ts_dict"
CREATE TABLE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
INSERT 57434167 1
CREATE FUNCTION
CREATE FUNCTION
INSERT 57434170 1
ERROR: could not find function "snb_ru_init_koi8" in file
"/usr/local/pgsql/lib/tsearch2.so"
ERROR: current transaction is aborted, commands ignored until end of
transaction block
ERROR: current transaction is aborted, commands ignored until end of
transaction block

I will try on 8.1 in a moment.

Joshua D. Drake

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#3)
Re: tsvector/tsearch equality and/or portability issue

Teodor Sigaev <teodor@sigaev.ru> writes:

Fixed in 8.1 and HEAD. Thank you

This appears to have created a regression test failure:

*** ./expected/tsearch2.out	Sun Jun 18 12:55:28 2006
--- ./results/tsearch2.out	Thu Aug 24 14:30:02 2006
***************
*** 2496,2503 ****
   f        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
   f        | '345':1 'qwerti':2 'copyright':3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
   f        | 'qq':7 'bar':2,8 'foo':1,3,6 'copyright':9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
-  f        | 'a':1A,2,3C 'b':5A,6B,7C,8B                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
   f        | 'a':1A,2,3B 'b':5A,6A,7C,8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
   f        | '7w' 'ch' 'd7' 'eo' 'gw' 'i4' 'lq' 'o6' 'qt' 'y0'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
   f        | 'ar' 'ei' 'kq' 'ma' 'qa' 'qh' 'qq' 'qz' 'rx' 'st'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
   f        | 'gs' 'i6' 'i9' 'j2' 'l0' 'oq' 'qx' 'sc' 'xe' 'yu'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
--- 2496,2503 ----
   f        | 
   f        | '345':1 'qwerti':2 'copyright':3
   f        | 'qq':7 'bar':2,8 'foo':1,3,6 'copyright':9
   f        | 'a':1A,2,3B 'b':5A,6A,7C,8
+  f        | 'a':1A,2,3C 'b':5A,6B,7C,8B
   f        | '7w' 'ch' 'd7' 'eo' 'gw' 'i4' 'lq' 'o6' 'qt' 'y0'
   f        | 'ar' 'ei' 'kq' 'ma' 'qa' 'qh' 'qq' 'qz' 'rx' 'st'
   f        | 'gs' 'i6' 'i9' 'j2' 'l0' 'oq' 'qx' 'sc' 'xe' 'yu'

======================================================================

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#10)
Re: tsvector/tsearch equality and/or portability issue

"Joshua D. Drake" <jd@commandprompt.com> writes:

Can some other people try this? We need to get a handle on which
machines show the problem.

I am trying on current copy of HEAD.. however:

Looks like Teodor already solved the problem, so no need for a fire
drill anymore.

regards, tom lane

#13Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#11)
Re: tsvector/tsearch equality and/or portability issue

Oops. Fixed.

Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

Fixed in 8.1 and HEAD. Thank you

This appears to have created a regression test failure:

*** ./expected/tsearch2.out	Sun Jun 18 12:55:28 2006
--- ./results/tsearch2.out	Thu Aug 24 14:30:02 2006
***************
*** 2496,2503 ****
f        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
f        | '345':1 'qwerti':2 'copyright':3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
f        | 'qq':7 'bar':2,8 'foo':1,3,6 'copyright':9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
-  f        | 'a':1A,2,3C 'b':5A,6B,7C,8B                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
f        | 'a':1A,2,3B 'b':5A,6A,7C,8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
f        | '7w' 'ch' 'd7' 'eo' 'gw' 'i4' 'lq' 'o6' 'qt' 'y0'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
f        | 'ar' 'ei' 'kq' 'ma' 'qa' 'qh' 'qq' 'qz' 'rx' 'st'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
f        | 'gs' 'i6' 'i9' 'j2' 'l0' 'oq' 'qx' 'sc' 'xe' 'yu'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
--- 2496,2503 ----
f        | 
f        | '345':1 'qwerti':2 'copyright':3
f        | 'qq':7 'bar':2,8 'foo':1,3,6 'copyright':9
f        | 'a':1A,2,3B 'b':5A,6A,7C,8
+  f        | 'a':1A,2,3C 'b':5A,6B,7C,8B
f        | '7w' 'ch' 'd7' 'eo' 'gw' 'i4' 'lq' 'o6' 'qt' 'y0'
f        | 'ar' 'ei' 'kq' 'ma' 'qa' 'qh' 'qq' 'qz' 'rx' 'st'
f        | 'gs' 'i6' 'i9' 'j2' 'l0' 'oq' 'qx' 'sc' 'xe' 'yu'

======================================================================

regards, tom lane

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#14Phil Frost
indigo@bitglue.com
In reply to: Teodor Sigaev (#3)
Re: tsvector/tsearch equality and/or portability issue

On Thu, Aug 24, 2006 at 09:40:13PM +0400, Teodor Sigaev wrote:

devel=# select 'blah foo bar'::tsvector = 'blah foo bar'::tsvector;
?column?
----------
f
(1 row)

Fixed in 8.1 and HEAD. Thank you

Things still seem to be broken for me. Among other things, the script at
<http://unununium.org/~indigo/testvectors.sql.bz2> fails. It performs two
tests: comparing 1000 random vectors with positions and random weights, and
comparing the same vectors, but stripped. Oddly, the unstripped comparisons all
pass, which is not consistent with what I am seeing in my database. However,
I have not yet been able to reproduce those problems.

It's worth noting that in running this script I have seen the number of
failures change, which seems to indicate that some uninitialized memory
is still being compared.

test=# \i testvectors.sql
BEGIN
CREATE FUNCTION
CREATE TABLE
total vectors in test set
---------------------------
1000
(1 row)

failing unstripped equality
-----------------------------
0
(1 row)

failing stripped equality
---------------------------
389
(1 row)

ROLLBACK
test=#

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phil Frost (#14)
Re: tsvector/tsearch equality and/or portability issue

Phil Frost <indigo@bitglue.com> writes:

Things still seem to be broken for me. Among other things, the script at
<http://unununium.org/~indigo/testvectors.sql.bz2> fails. It performs two
tests: comparing 1000 random vectors with positions and random weights, and
comparing the same vectors, but stripped. Oddly, the unstripped comparisons all
pass, which is not consistent with what I am seeing in my database. However,
I have not yet been able to reproduce those problems.

It looks to me like tsvector comparison may be too strong. The strip()
function evidently thinks that it's OK to rearrange the string chunks
into the same order as the WordEntry items, which suggests to me that
the "pos" fields are not really semantically significant. But
silly_cmp_tsvector() considers that a difference in pos values is
important. I don't understand the data structure well enough to know
which one to believe, but something's not consistent here.

regards, tom lane

#16Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#15)
Re: tsvector/tsearch equality and/or portability issue

comparing the same vectors, but stripped. Oddly, the unstripped comparisons all
pass, which is not consistent with what I am seeing in my database. However,
I have not yet been able to reproduce those problems.

Fixed: strncmp was called with the wrong length parameter.
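
The message does not say which length was passed, so the following C sketch only shows one plausible way such a mistake can go wrong; it is not the tsearch2 code. `DemoLexeme`, `demo_cmp_wrong`, and `demo_cmp_right` are hypothetical names, and the lexemes are assumed to be length-counted strings.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-counted lexeme, as used in this sketch only. */
typedef struct
{
    const char *str;
    size_t      len;
} DemoLexeme;

/*
 * Flawed: only a's length is used, so a lexeme that is a prefix of
 * the other one compares as equal ("foo" vs "foobar").
 */
static int
demo_cmp_wrong(const DemoLexeme *a, const DemoLexeme *b)
{
    return strncmp(a->str, b->str, a->len);
}

/*
 * Safer: compare the common prefix, then let the shorter lexeme sort
 * first. Returns -1, 0, or 1.
 */
static int
demo_cmp_right(const DemoLexeme *a, const DemoLexeme *b)
{
    size_t n = (a->len < b->len) ? a->len : b->len;
    int    r = memcmp(a->str, b->str, n);

    if (r != 0)
        return (r < 0) ? -1 : 1;
    if (a->len != b->len)
        return (a->len < b->len) ? -1 : 1;
    return 0;
}
```

For "foo" and "foobar", `demo_cmp_wrong` returns 0 because only the first three bytes are examined, while `demo_cmp_right` orders "foo" before "foobar".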

It looks to me like tsvector comparison may be too strong. The strip()
function evidently thinks that it's OK to rearrange the string chunks
into the same order as the WordEntry items, which suggests to me that
the "pos" fields are not really semantically significant. But
silly_cmp_tsvector() considers that a difference in pos values is
important. I don't understand the data structure well enough to know
which one to believe, but something's not consistent here.

You are right: pos is really just the position of the lexeme itself within the
tail of the tsvector structure. So it has been removed from the comparison.
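
As a rough illustration of why pos should not take part in the comparison, here is a small C sketch under simplified assumptions. `DemoWordEntry`, `DemoVector`, and `demo_equal` are hypothetical and do not match the real WordEntry definition. Two vectors hold the same lexemes, but their strings sit in a different order in the tail, as could happen after strip() rearranges them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry: length of a lexeme and its offset in the tail. */
typedef struct
{
    uint16_t len;
    uint16_t pos;
} DemoWordEntry;

/* Hypothetical vector: entry array plus the tail holding the strings. */
typedef struct
{
    const DemoWordEntry *entries;
    size_t               n;
    const char          *tail;
} DemoVector;

/*
 * Returns 1 if the vectors are equal, 0 otherwise. With compare_pos
 * set, a different storage offset counts as a difference; without it,
 * only lexeme length and content matter.
 */
static int
demo_equal(const DemoVector *a, const DemoVector *b, int compare_pos)
{
    size_t i;

    if (a->n != b->n)
        return 0;
    for (i = 0; i < a->n; i++)
    {
        const DemoWordEntry *ea = &a->entries[i];
        const DemoWordEntry *eb = &b->entries[i];

        if (ea->len != eb->len)
            return 0;
        if (compare_pos && ea->pos != eb->pos)
            return 0;
        if (memcmp(a->tail + ea->pos, b->tail + eb->pos, ea->len) != 0)
            return 0;
    }
    return 1;
}
```

With entries for 'bar' and 'foo' pointing into the tails "barfoo" and "foobar" respectively, the vectors differ when pos is compared and are equal when it is ignored.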

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/