Like vs '=' bug with indexing

Started by m walmost 25 years ago15 messages
#1m w
mttf2000@yahoo.com

I am reposting this because I'm not sure it actually
made it to the list.

I have a function to transform text into a
pseudo-metaphone variable, take this example:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 3 ;
song | metatext
-------------+----------
Born To Run | brntorn
Born To Run | brntorn
Born To Run | brntorn
(3 rows)

Here is the problem: Depending on whether there is an
index or not, 'like' behaves differently. Here is a
transcript:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
-------------+----------
Born To Run | brntorn
(1 row)

cddbsql=# create index cdsongs_meta_song on cdsongs
(metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
------+----------
(0
rows)

This happens in both 7.0 and
7.1.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#2Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: m w (#1)
AW: Like vs '=' bug with indexing

I am reposting this because I'm not sure it actually
made it to the list.

It did make it to the list, but can you give more details whether or which of the
following you use:
--enable-locale ?
--enable-multibyte ?
LANG=?
LC_COLLATE=?

I have a function to transform text into a
pseudo-metaphone variable, take this example:

Here is the problem: Depending on whether there is an
index or not, 'like' behaves differently. Here is a
transcript:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
-------------+----------
Born To Run | brntorn
(1 row)

cddbsql=# create index cdsongs_meta_song on cdsongs
(metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
------+----------
(0
rows)

While I do see, that this is bogous, I do not really understand why you
use like in this case when your metatext function does not return any
wildcards. A simple = should lead to the same result. Can you check that
with the index in place ?

Andreas

Show quoted text

This happens in both 7.0 and 7.1.

#3Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#2)
AW: AW: Like vs '=' bug with indexing

I am reposting this because I'm not sure it actually
made it to the list.

It did make it to the list, but can you give more
details whether or which of the
following you use:
--enable-locale ?
--enable-multibyte ?
LANG=?
LC_COLLATE=?

I did not change any of the language or local
information. My config like is:

./configure --with-tcl --with-odbc
--with-maxbackends=128 --prefix=/usr/local/pgsql

I have a function to transform text into a
pseudo-metaphone variable, take this example:

Here is the problem: Depending on whether there is

an

index or not, 'like' behaves differently. Here is

a

transcript:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
-------------+----------
Born To Run | brntorn
(1 row)

cddbsql=# create index cdsongs_meta_song on

cdsongs

(metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;
song | metatext
------+----------
(0
rows)

While I do see, that this is bogous, I do not really understand why you
use like in this case when your metatext function does not return any
wildcards. A simple = should lead to the same result. Can you check that
with the index in place ?

Oddly enough when I use '=' it works, but adding that
means I have to special case when someone adds '%' at
the end of the word, which means I will have to parse
the string comming in. If I add the '%' sign to all
queries, then I will not get the results intended.

Ah, I wonder wether it might be trailing blanks, that are involved here.
Is your return type of metatext() sql type text ? text and varchar are trailing
blank sensitive. Still sounds strange, that the seq scan ind index scan behaviors
are different.

Show quoted text

This happens in both 7.0 and 7.1.

#4m w
mttf2000@yahoo.com
In reply to: Zeugswetter Andreas SB (#3)
Re: AW: AW: Like vs '=' bug with indexing
--- Zeugswetter Andreas SB
<ZeugswetterA@wien.spardat.at> wrote:

I am reposting this because I'm not sure it

actually

made it to the list.

It did make it to the list, but can you give

more

details whether or which of the
following you use:
--enable-locale ?
--enable-multibyte ?
LANG=?
LC_COLLATE=?

I did not change any of the language or local
information. My config like is:

./configure --with-tcl --with-odbc
--with-maxbackends=128 --prefix=/usr/local/pgsql

I have a function to transform text into a
pseudo-metaphone variable, take this example:

Here is the problem: Depending on whether

there is

an

index or not, 'like' behaves differently. Here

is

a

transcript:

cddbsql=# select song, metatext(song) from

cdsongs

where metatext(song) like metatext('born to

run')

limit 1 ;
song | metatext
-------------+----------
Born To Run | brntorn
(1 row)

cddbsql=# create index cdsongs_meta_song on

cdsongs

(metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from

cdsongs

where metatext(song) like metatext('born to

run')

limit 1 ;
song | metatext
------+----------
(0
rows)

While I do see, that this is bogous, I do not

really understand why you

use like in this case when your metatext

function does not return any

wildcards. A simple = should lead to the same

result. Can you check that

with the index in place ?

Oddly enough when I use '=' it works, but adding

that

means I have to special case when someone adds '%'

at

the end of the word, which means I will have to

parse

the string comming in. If I add the '%' sign to

all

queries, then I will not get the results intended.

Ah, I wonder wether it might be trailing blanks,
that are involved here.
Is your return type of metatext() sql type text ?
text and varchar are trailing
blank sensitive. Still sounds strange, that the seq
scan ind index scan behaviors
are different.

That doesn't make sense, if there were trailing
blanks, '=' would not work. Also, since the same
function is being used on both ends, one presumes that
any trailing blanks would also be present on each. Why
would spaces kill a 'like' during and index scan, but
not during a table scan?

This happens in both 7.0 and 7.1.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: m w (#1)
Re: Like vs '=' bug with indexing

m w <mttf2000@yahoo.com> writes:

Here is the problem: Depending on whether there is an
index or not, 'like' behaves differently.

Please provide a complete, self-contained example with which
we can reproduce the problem.

regards, tom lane

#6m w
mttf2000@yahoo.com
In reply to: Tom Lane (#5)
Re: Like vs '=' bug with indexing
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

m w <mttf2000@yahoo.com> writes:

Here is the problem: Depending on whether there is

an

index or not, 'like' behaves differently.

Please provide a complete, self-contained example
with which
we can reproduce the problem.

I am trying to create a test function and some data
that reproduces the problem easily, but I think I know
what it is.

It is a two bug issue. I had a bug in my code in that
I added the "\0" to the end of my strings when I
converted from a C string to a postgres "text" object.

I think this exposes a bug in postgres where either
index or table scans (I'm not sure which just yet)
treat a zero differently than a non-zero in a varchar.

It looks as if some section of code is using the zero
to terminate a string and another section of code is
not. So at some point data which should be the same
differes either on length of data, or the trailing
zero compared to an uninitialized byte.

Removing the terminating zero from the postgres string
fixes the problem, but, if I understand postgres well
enough, this should not make a difference, and should
be reported as a bug anyway.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: m w (#6)
Re: Like vs '=' bug with indexing

m w <mttf2000@yahoo.com> writes:

I think this exposes a bug in postgres where either
index or table scans (I'm not sure which just yet)
treat a zero differently than a non-zero in a varchar.

Embedded zeroes aren't supported in char/varchar/text fields,
and cannot be supported in a portable fashion, since these datatypes
rely on functions like strcoll() that don't allow embedded nulls in
strings. It wouldn't surprise me too much if there are inconsistent
behaviors between indexscans and seqscans for such invalid data.

It doesn't seem real practical for us to examine the output of every
C-coded function to make sure it produces a valid value of the datatype.
Illegal returned values are a fault of the function, and ensuing
misbehaviors are still its fault ...

regards, tom lane

#8m w
mttf2000@yahoo.com
In reply to: Tom Lane (#7)
Re: Like vs '=' bug with indexing

Embedded zeroes aren't supported in

char/varchar/text fields,
and cannot be supported in a portable fashion, since
these datatypes
rely on functions like strcoll() that don't allow
embedded nulls in
strings. It wouldn't surprise me too much if there
are inconsistent
behaviors between indexscans and seqscans for such
invalid data.

It doesn't seem real practical for us to examine the
output of every
C-coded function to make sure it produces a valid
value of the datatype.
Illegal returned values are a fault of the function,
and ensuing
misbehaviors are still its fault ...

Fair enough, but I think it should raise a caution
flag when two different behaviors can be seen with the
same query. This may have a common cause with other
index vs non-index behavior.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#9Hannu Krosing
hannu@tm.ee
In reply to: m w (#6)
Re: Like vs '=' bug with indexing

Tom Lane wrote:

m w <mttf2000@yahoo.com> writes:

I think this exposes a bug in postgres where either
index or table scans (I'm not sure which just yet)
treat a zero differently than a non-zero in a varchar.

Embedded zeroes aren't supported in char/varchar/text fields,
and cannot be supported in a portable fashion, since these datatypes
rely on functions like strcoll() that don't allow embedded nulls in
strings.

Is there no simple (i.e. cheap) way to disallow \0 alltogether for
these types then ?

perhaps just strip them out in textin() (or is it text_in()) ?

It wouldn't surprise me too much if there are inconsistent
behaviors between indexscans and seqscans for such invalid data.

should'nt they both use the _same_ strcoll() and friends ?

It doesn't seem real practical for us to examine the output of every
C-coded function to make sure it produces a valid value of the datatype.
Illegal returned values are a fault of the function, and ensuing
misbehaviors are still its fault ...

Should we not examine "the _possible_ outputs of every C-coded function
to make sure it produces a valid value of the datatype" ;)

For me producing an invalid data for a datatype seems very much like
a bug and it _should_ be reported.

-------------
Hannu

#10m w
mttf2000@yahoo.com
In reply to: Hannu Krosing (#9)
Re: Like vs '=' bug with indexing
--- Hannu Krosing <hannu@tm.ee> wrote:

Should we not examine "the _possible_ outputs of
every C-coded function
to make sure it produces a valid value of the
datatype" ;)

For me producing an invalid data for a datatype
seems very much like
a bug and it _should_ be reported.

No, I think Tom is right, there should be no
validation on C functions incorporated into Postgres
by users. Who wants that overhead in a production
system?

However, I think when the same SQL query produces
different results when you add an index, speaks of an
inconsistency in the system, which could be the source
of other problems.

I have seen a couple posts where results from an index
scan are not the same as the results from a table
scan, granted they were language issues, but still, my
gut tells me if I set the length of a variable to x,
and a trailing zero is included, the system should
either fail consistently or work consistently. I don't
care which, it should just be consistent.

Inconsistent behavior indicates that a different
matching algorithm is used if one uses an index
instead of a table scan. That scares me.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#9)
Re: Like vs '=' bug with indexing

Hannu Krosing <hannu@tm.ee> writes:

Is there no simple (i.e. cheap) way to disallow \0 alltogether for
these types then ?
perhaps just strip them out in textin() (or is it text_in()) ?

They *are* stripped out in textin(), by virtue of the fact that
textin expects to see a null-terminated input string.

It wouldn't surprise me too much if there are inconsistent
behaviors between indexscans and seqscans for such invalid data.

should'nt they both use the _same_ strcoll() and friends ?

Irrelevant; the issue is that the various comparison operators may
produce inconsistent results given invalid input. For instance
texteq() short-circuits to a FALSE result if the lengths of the
inputs are different, which means that 'ab\0' = 'ab' will produce
false, even though a strcoll-based comparison will claim they are
equal. I don't think that means that texteq() is wrong to check the
lengths first.

Should we not examine "the _possible_ outputs of every C-coded function
to make sure it produces a valid value of the datatype" ;)

Go for it.

Possibly chr() should reject chr(0) ...

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: m w (#10)
Re: Like vs '=' bug with indexing

m w <mttf2000@yahoo.com> writes:

Inconsistent behavior indicates that a different
matching algorithm is used if one uses an index
instead of a table scan. That scares me.

A seq scan and an index scan are inherently different algorithms,
so I don't see exactly how you think we can avoid this risk.

In particular, if you are dealing with a btree index and a "WHERE
column = constant" query, then a seq scan is only going to be concerned
with the behavior of the '=' operator --- does it return TRUE or not
for any particular row? But an index search is inherently going to make
ordered comparisons (<, =, >). So there is always a potential for
inconsistent behavior if the ordering operators produce results that are
inconsistent with simple '='. We cannot design that away --- all we can
do is fix such bugs when one is discovered in a particular datatype.

regards, tom lane

#13Hannu Krosing
hannu@tm.ee
In reply to: m w (#6)
Re: Like vs '=' bug with indexing

Tom Lane wrote:

Hannu Krosing <hannu@tm.ee> writes:

Is there no simple (i.e. cheap) way to disallow \0 alltogether for
these types then ?
perhaps just strip them out in textin() (or is it text_in()) ?

They *are* stripped out in textin(), by virtue of the fact that
textin expects to see a null-terminated input string.

Ok, I was mistaken to think that pg_trigger.tgargs contained real \0's
and not fakes and I was able to get similar output from other char
types by using \\000 - You never can tell how many \\\\ are required to
input a single \ to next level.

Also I remember being told that bytea _can_ hold embedded \0, no ?

Should we not examine "the _possible_ outputs of every C-coded function
to make sure it produces a valid value of the datatype" ;)

Go for it.

Possibly chr() should reject chr(0) ...

there is no function chr() at least in 7.0.2.

and char is not usable from psql (gives out strange errors);

hannu=# select char(0);
ERROR: length for type 'bpchar' must be at least 1
hannu=# select char(32);
ERROR: parser: parse error at or near ";"
hannu=# select char('a');
ERROR: parser: parse error at or near "'"

could it be possible somehow distinguish between user callable (safe)
functions and "internal" ones ?

------------
Hannu

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#13)
Re: Like vs '=' bug with indexing

Hannu Krosing <hannu@tm.ee> writes:

Also I remember being told that bytea _can_ hold embedded \0, no ?

Yes. What has that got to do with text et al?

Possibly chr() should reject chr(0) ...

there is no function chr() at least in 7.0.2.

I think it used to be called ichar(), but then someone pointed out that
Oracle calls it chr().

and char is not usable from psql (gives out strange errors);

char(n) is a type name, not a function call.

regards, tom lane

#15Hannu Krosing
hannu@tm.ee
In reply to: m w (#6)
Re: Like vs '=' bug with indexing

Tom Lane wrote:

Hannu Krosing <hannu@tm.ee> writes:

Also I remember being told that bytea _can_ hold embedded \0, no ?

Yes. What has that got to do with text et al?

Possibly chr() should reject chr(0) ...

there is no function chr() at least in 7.0.2.

I think it used to be called ichar(), but then someone pointed out that
Oracle calls it chr().

and char is not usable from psql (gives out strange errors);

char(n) is a type name, not a function call.

\df list it as a function call taking either bpchar or text argument.

hannu=# \df
List of functions
Result | Function |
Arguments
---------------------+----------------------+------------------------------------------
...
char | char | bpchar
char | char | text
...

Maybe the cast functions (named similar to types) should be removed from
\df output
or at least marked as such, also in pg_proc?

------------
Hannu