Case-Insensitve Text Comparison

Started by David E. Wheeleralmost 18 years ago33 messageshackers
Jump to latest
#1David E. Wheeler
david@kineticode.com

Howdy,

I'm sure I'm just showing off my ignorance here, but here goes…

I really need case-insensitive string comparison in my database.
Ideally there'd be a nice ITEXT data type (and friends, ichar,
ivarchar, etc.). But of course there isn't, and for years I've just
used LOWER() on indexes and queries to get the same result.

Only it turns out that I'm of course not getting the same result. This
script:

#!/usr/local/bin/perl -w

use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '',
{ pg_enable_utf8 => 1 });
for my $char qw( À Á Â Ã Ä Å Ç Ć Č Ĉ Ċ Ď Đ A B C D ) {
print "$char: ", $dbh->selectrow_array('SELECT LOWER(?)', undef,
$char ), $/;
}

Yields this output:

À: À
Á: Á
Â: Â
Ã: Ã
Ä: Ä
Å: Å
Ç: Ç
Ć: Ć
Č: Č
Ĉ: Ĉ
Ċ: Ċ
Ď: Ď
Đ: Đ
A: a
B: b
C: c
D: d

So it doesn't really work on anything other than ASCII, it looks like.
So I have two questions:

1. Does the use of the tolower() C function in the citext data type on
pgfoundry basically give me the same results as using lower() in my
SQL has for all these years? IOW, does it convert letters to lowercase
in the same way that the LOWER() SQL function does? If so, I think I
might start to use it for my case-insensitive columns and simplify my
SQL a bit.

http://pgfoundry.org/projects/citext/

2. Isn't the ICU library distributed with PostgreSQL? And if so, could
it not be used to create proper case conversions in LOWER() and
friends and, ultimately, to create a case-insensitive text type in
core? I'm seeing that it has a constant named U_COMPARE_IGNORE_CASE
that can be used with its unorm_compare() function:

http://www.icu-project.org/apiref/icu4c/unorm_8h.html#6cc4c8b0d5be7ce1ac4b600ace7817f5
http://www.icu-project.org/apiref/icu4c/unorm_8h.html#991e0fe6f0d062dd6e8e924517f3f437

I don't really know C, but if that's stuff there, can't we take
advantage of it for proper case-insensitive comparisons (and
conversions)?

Thanks,

David

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#1)
Re: Case-Insensitve Text Comparison

"David E. Wheeler" <david@kineticode.com> writes:

I really need case-insensitive string comparison in my database.

Okay ... according to whose locale?

Ideally there'd be a nice ITEXT data type (and friends, ichar,
ivarchar, etc.). But of course there isn't, and for years I've just
used LOWER() on indexes and queries to get the same result.

Only it turns out that I'm of course not getting the same result.

I think that means you're not using the right locale.

1. Does the use of the tolower() C function in the citext data type on
pgfoundry basically give me the same results as using lower() in my
SQL has for all these years?

[ broken record... ] Kinda depends on your locale. However, tolower()
is 100% guaranteed not to work for multibyte encodings, so citext is
quite useless if you're using UTF8. This is fixable, no doubt, but
it's not fixed in the project as it stands.

2. Isn't the ICU library distributed with PostgreSQL?

Nope, it is not, and we have already pretty much determined that we
do not want to make Postgres depend on ICU. See the archives.

regards, tom lane

#3Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: David E. Wheeler (#1)
Re: Case-Insensitve Text Comparison

David E. Wheeler napsal(a):

Howdy,

I'm sure I'm just showing off my ignorance here, but here goes…

I really need case-insensitive string comparison in my database.

Collation per database level should be help you. It is now under development and
I hope it will be part of 8.4. You can see
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

Zdenek

#4David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#2)
Re: Case-Insensitve Text Comparison

On Jun 1, 2008, at 21:08, Tom Lane wrote:

"David E. Wheeler" <david@kineticode.com> writes:

I really need case-insensitive string comparison in my database.

Okay ... according to whose locale?

I'm using C. Of course you're correct that it depends on the locale, I
always forget that. But does not the Unicode standard offer up some
sort locale-independent case-insensitivity, so that it gets it right
some large percentage of the time?

Ideally there'd be a nice ITEXT data type (and friends, ichar,
ivarchar, etc.). But of course there isn't, and for years I've just
used LOWER() on indexes and queries to get the same result.

Only it turns out that I'm of course not getting the same result.

I think that means you're not using the right locale.

What locale is right? If I have a Web app, there could be data in many
different languages in a single table/column.

1. Does the use of the tolower() C function in the citext data type
on
pgfoundry basically give me the same results as using lower() in my
SQL has for all these years?

[ broken record... ] Kinda depends on your locale. However,
tolower()
is 100% guaranteed not to work for multibyte encodings, so citext is
quite useless if you're using UTF8. This is fixable, no doubt, but
it's not fixed in the project as it stands.

Right, okay; thanks. I'm thinking about using it for email addresses
and domain names, however, so it might be adequate for those
applications.

2. Isn't the ICU library distributed with PostgreSQL?

Nope, it is not, and we have already pretty much determined that we
do not want to make Postgres depend on ICU. See the archives.

Damn. Okay, thanks.

David

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#4)
Re: Case-Insensitve Text Comparison

"David E. Wheeler" <david@kineticode.com> writes:

On Jun 1, 2008, at 21:08, Tom Lane wrote:

Okay ... according to whose locale?

I'm using C. Of course you're correct that it depends on the locale, I
always forget that. But does not the Unicode standard offer up some
sort locale-independent case-insensitivity, so that it gets it right
some large percentage of the time?

Not really, and in any case the C locale completely disables any
knowledge of Unicode. C locale knows about 7-bit ASCII and nothing
more.

regards, tom lane

#6Oleg Bartunov
oleg@sai.msu.su
In reply to: David E. Wheeler (#1)
Re: Case-Insensitve Text Comparison

David,

we wrote contrib module (mchar) for one customer, which ports its
application from mssql to postgres. It does case-insensitive comparison
for new data type 'mchar' and linked with ICU for system independent locale.

Oleg

On Sun, 1 Jun 2008, David E. Wheeler wrote:

Howdy,

I'm sure I'm just showing off my ignorance here, but here goes

I really need case-insensitive string comparison in my database. Ideally
there'd be a nice ITEXT data type (and friends, ichar, ivarchar, etc.). But
of course there isn't, and for years I've just used LOWER() on indexes and
queries to get the same result.

Only it turns out that I'm of course not getting the same result. This
script:

#!/usr/local/bin/perl -w

use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '', { pg_enable_utf8
=> 1 });
for my $char qw( A B C D ) {
print "$char: ", $dbh->selectrow_array('SELECT LOWER(?)', undef, $char ),
$/;
}

Yields this output:

:
:
:
:
:
:
:
:
:
:
:
:
:
A: a
B: b
C: c
D: d

So it doesn't really work on anything other than ASCII, it looks like. So I
have two questions:

1. Does the use of the tolower() C function in the citext data type on
pgfoundry basically give me the same results as using lower() in my SQL has
for all these years? IOW, does it convert letters to lowercase in the same
way that the LOWER() SQL function does? If so, I think I might start to use
it for my case-insensitive columns and simplify my SQL a bit.

http://pgfoundry.org/projects/citext/

2. Isn't the ICU library distributed with PostgreSQL? And if so, could it not
be used to create proper case conversions in LOWER() and friends and,
ultimately, to create a case-insensitive text type in core? I'm seeing that
it has a constant named U_COMPARE_IGNORE_CASE that can be used with its
unorm_compare() function:

http://www.icu-project.org/apiref/icu4c/unorm_8h.html#6cc4c8b0d5be7ce1ac4b600ace7817f5
http://www.icu-project.org/apiref/icu4c/unorm_8h.html#991e0fe6f0d062dd6e8e924517f3f437

I don't really know C, but if that's stuff there, can't we take advantage of
it for proper case-insensitive comparisons (and conversions)?

Thanks,

David

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#7David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#5)
Re: Case-Insensitve Text Comparison

On Jun 1, 2008, at 22:18, Tom Lane wrote:

I'm using C. Of course you're correct that it depends on the
locale, I
always forget that. But does not the Unicode standard offer up some
sort locale-independent case-insensitivity, so that it gets it right
some large percentage of the time?

Not really, and in any case the C locale completely disables any
knowledge of Unicode. C locale knows about 7-bit ASCII and nothing
more.

And the locale can only be set by initdb?

I don't suppose that there are any collations that sort and index case-
insensitively, are there? I don't see anything suggestive in `locale -
a`…

Thanks,

David

#8David E. Wheeler
david@kineticode.com
In reply to: Oleg Bartunov (#6)
Re: Case-Insensitve Text Comparison

On Jun 1, 2008, at 22:21, Oleg Bartunov wrote:

David,

we wrote contrib module (mchar) for one customer, which ports its
application from mssql to postgres. It does case-insensitive
comparison for new data type 'mchar' and linked with ICU for system
independent locale.

That sounds promising. I don't suppose that it has been released, has
it?

Thanks,

David

#9Oleg Bartunov
oleg@sai.msu.su
In reply to: David E. Wheeler (#8)
Re: Case-Insensitve Text Comparison

On Sun, 1 Jun 2008, David E. Wheeler wrote:

On Jun 1, 2008, at 22:21, Oleg Bartunov wrote:

David,

we wrote contrib module (mchar) for one customer, which ports its
application from mssql to postgres. It does case-insensitive comparison for
new data type 'mchar' and linked with ICU for system independent locale.

That sounds promising. I don't suppose that it has been released, has it?

It's available as a part of patch, see (use google translate)
http://v8.1c.ru/overview/postgres_patches_notes.htm

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#10Andrew Sullivan
ajs@commandprompt.com
In reply to: David E. Wheeler (#4)
Re: Case-Insensitve Text Comparison

On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:

What locale is right? If I have a Web app, there could be data in many
different languages in a single table/column.

I think the above amounts to a need for per-session locale settings or
something, no?

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

#11David E. Wheeler
david@kineticode.com
In reply to: Andrew Sullivan (#10)
Re: Case-Insensitve Text Comparison

On Jun 2, 2008, at 06:51, Andrew Sullivan wrote:

On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:

What locale is right? If I have a Web app, there could be data in
many
different languages in a single table/column.

I think the above amounts to a need for per-session locale settings or
something, no?

Yes, that's what I was getting at.

Thanks,

David

#12David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#2)
Re: Case-Insensitve Text Comparison

On Jun 1, 2008, at 21:08, Tom Lane wrote:

1. Does the use of the tolower() C function in the citext data type
on
pgfoundry basically give me the same results as using lower() in my
SQL has for all these years?

[ broken record... ] Kinda depends on your locale. However,
tolower()
is 100% guaranteed not to work for multibyte encodings, so citext is
quite useless if you're using UTF8. This is fixable, no doubt, but
it's not fixed in the project as it stands.

Would the use of str_tolower() in formatting.c fix that?

Thanks,

David

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#12)
Re: Case-Insensitve Text Comparison

"David E. Wheeler" <david@kineticode.com> writes:

On Jun 1, 2008, at 21:08, Tom Lane wrote:

[ broken record... ] Kinda depends on your locale. However,
tolower()
is 100% guaranteed not to work for multibyte encodings, so citext is
quite useless if you're using UTF8. This is fixable, no doubt, but
it's not fixed in the project as it stands.

Would the use of str_tolower() in formatting.c fix that?

Yeah, you need something equivalent to that. I think that whole area
is due for refactoring, though --- we've got kind of a weird collection
of upper/lower/initcap APIs spread through a couple of different files.

regards, tom lane

#14David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#13)
Re: Case-Insensitve Text Comparison

On Jun 2, 2008, at 09:33, Tom Lane wrote:

Would the use of str_tolower() in formatting.c fix that?

Yeah, you need something equivalent to that. I think that whole area
is due for refactoring, though --- we've got kind of a weird
collection
of upper/lower/initcap APIs spread through a couple of different
files.

And I just ran into this on 8.3 when trying to install citext:

psql:citext.sql:350: ERROR: there is no built-in function named
"oid_text"

I'm assuming that this is because a lot of automatic casts were
removed in 8.3 or 8.2; There are a bunch of these:

CREATE FUNCTION citext(oid) RETURNS citext AS 'oid_text' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION citext(oid) TO PUBLIC;
COMMENT ON FUNCTION citext(oid) IS 'convert oid to citext';
CREATE FUNCTION oid(citext) RETURNS oid AS 'text_oid' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION oid(citext) TO PUBLIC;
COMMENT ON FUNCTION oid(citext) IS 'convert citext to oid';
CREATE CAST (citext AS oid) WITH FUNCTION oid(citext);
CREATE CAST (oid AS citext) WITH FUNCTION citext(oid);

CREATE FUNCTION citext(int2) RETURNS citext AS 'int2_text' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION citext(int2) TO PUBLIC;
COMMENT ON FUNCTION citext(int2) IS 'convert int2 to citext';
CREATE FUNCTION int2(citext) RETURNS int2 AS 'text_int2' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION int2(citext) TO PUBLIC;
COMMENT ON FUNCTION int2(citext) IS 'convert citext to int2';
CREATE CAST (citext AS int2) WITH FUNCTION int2(citext);
CREATE CAST (int2 AS citext) WITH FUNCTION citext(int2);

And on and on. Clearly this module needs updating for newer
PostgreSQLs. I tried removing them all in order to get the data type
and tried it out with this script:

my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '',
{ pg_enable_utf8 => 1 });
for my $char qw( À Á Â Ã Ä Å Ç Ć Č Ĉ Ċ Ď Đ A B C D ) {
print "$char: ", $dbh->selectrow_array('SELECT LOWER(?::citext)',
undef, $char ), $/;
}

Naturally it didn't work:

À: Ã
Á: á
Â: â
Ã: ã
Ä: ä
Å: Ã¥
Ç: ç
Ć: ć
Č: č
Ĉ: ĉ
Ċ: ċ
Ď: ď
Đ: đ
A: a
B: b
C: c
D: d

BTW, I rebuilt my cluster with --locale en_US.UTF-8 and the script
works on a text type, so having a locale is key.

Thanks,

David

#15Jeff Davis
pgsql@j-davis.com
In reply to: Andrew Sullivan (#10)
Re: Case-Insensitve Text Comparison

On Mon, 2008-06-02 at 09:51 -0400, Andrew Sullivan wrote:

On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:

What locale is right? If I have a Web app, there could be data in many
different languages in a single table/column.

I think the above amounts to a need for per-session locale settings or
something, no?

What if you had a CHECK constraint that was locale-sensitive? Would the
constraint only be non-false (true or null) for records inserted under
the same locale? That's not very useful.

I think if you want some special treatment of text for some users, it
should be explicit. Text in one locale is really a different type from
text in another locale, and so changing the locale of some text variable
is really a typecast. I don't think GUCs are the correct mechanism for
this.

Regards,
Jeff Davis

#16Martijn van Oosterhout
kleptog@svana.org
In reply to: Jeff Davis (#15)
Re: Case-Insensitve Text Comparison

On Mon, Jun 02, 2008 at 10:29:30AM -0700, Jeff Davis wrote:

I think if you want some special treatment of text for some users, it
should be explicit. Text in one locale is really a different type from
text in another locale, and so changing the locale of some text variable
is really a typecast. I don't think GUCs are the correct mechanism for
this.

The SQL COLLATE syntax handles this just fine. Either the original
COLLATE patch or the new one will let people tags strings with any
collation they like.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#17Jeff Davis
pgsql@j-davis.com
In reply to: David E. Wheeler (#4)
Re: Case-Insensitve Text Comparison

On Sun, 2008-06-01 at 22:13 -0700, David E. Wheeler wrote:

What locale is right? If I have a Web app, there could be data in many
different languages in a single table/column.

I think the values should be explicitly treated differently.

It would be nice if you could just typecast, like:
"lower(somevalue::text(fr_CA))"

which would then lowercase according to the fr_CA locale, regardless of
the locale of "somevalue".

Using typmod for localization was brought up here:
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00635.php

Has it been discussed further? I happen to like the idea of the TEXT
type taking a locale as a typmod. No typmod would, of course, fall back
to the cluster setting. And it would throw an exception if the encoding
couldn't represent that locale.

Regards,
Jeff Davis

#18Jeff Davis
pgsql@j-davis.com
In reply to: Martijn van Oosterhout (#16)
Re: Case-Insensitve Text Comparison

On Mon, 2008-06-02 at 19:55 +0200, Martijn van Oosterhout wrote:

The SQL COLLATE syntax handles this just fine. Either the original
COLLATE patch or the new one will let people tags strings with any
collation they like.

http://wiki.postgresql.org/wiki/Todo:Collate

The last reference I see on that page is from 2005. Is there any updated
information? Are there any major obstacles holding this up aside from
the platform issues mentioned on that page?

Regards,
Jeff Davis

#19Andrew Sullivan
ajs@commandprompt.com
In reply to: Jeff Davis (#15)
Re: Case-Insensitve Text Comparison

On Mon, Jun 02, 2008 at 10:29:30AM -0700, Jeff Davis wrote:

What if you had a CHECK constraint that was locale-sensitive? Would the
constraint only be non-false (true or null) for records inserted under
the same locale? That's not very useful.

It would seem that this is one of the important cases that needs to be
worked out. I wasn't suggesting that per-session locale (or whatever
we want to call it) is _easy_ or, for that matter, even possible; just
that it would solve a large number of the problems that people
complain about.

In fact, I suspect that what we really need is something a little more
like "in-database locale" or something.

I think if you want some special treatment of text for some users, it
should be explicit.

Yes. Also, not just text. Think of currency, numeric separators, &c.

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

#20Tino Wildenhain
tino@wildenhain.de
In reply to: Andrew Sullivan (#19)
Re: Case-Insensitve Text Comparison

Andrew Sullivan wrote:
...

I think if you want some special treatment of text for some users, it
should be explicit.

Yes. Also, not just text. Think of currency, numeric separators, &c.

Which imho, should not really be the business of the type interface
but instead something to_char() and to_{type} handles.

Tino

#21Shane Ambler
pgsql@Sheeky.Biz
In reply to: Martijn van Oosterhout (#16)
#22Martijn van Oosterhout
kleptog@svana.org
In reply to: Jeff Davis (#18)
#23Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Martijn van Oosterhout (#22)
#24David E. Wheeler
david@kineticode.com
In reply to: Zdenek Kotala (#23)
#25Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: David E. Wheeler (#24)
#26David E. Wheeler
david@kineticode.com
In reply to: Zdenek Kotala (#25)
#27Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: David E. Wheeler (#26)
#28Pavel Stehule
pavel.stehule@gmail.com
In reply to: Zdenek Kotala (#25)
#29Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Pavel Stehule (#28)
#30Jeff Davis
pgsql@j-davis.com
In reply to: Zdenek Kotala (#27)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#30)
#32Martijn van Oosterhout
kleptog@svana.org
In reply to: Jeff Davis (#30)
#33Jeff Davis
pgsql@j-davis.com
In reply to: Martijn van Oosterhout (#32)