PATCH: CITEXT 2.0

Started by David E. Wheeleralmost 18 years ago68 messageshackers

david@kineticode.com

almost 18 years ago

Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered,
I'm assuming because, with the uncompressed patch, the email was too
big for -hackers. So this is a re-send with the patch gzip'd. Sorry
for any duplication].

Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case-
insensitive type, but just converts strings to lowercase before
comparing them. I changed it to citext at Tom Lane's suggestion, to
ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
  + Locale-aware lowercasing of strings, rather than just lowercasing
    ASCII characters.
  + No implicit casts from text to citext except via assignment.
  + A few more functions overloaded
  + Works with 8.3 and CVS head

* Many thanks to whoever added str_tolower() to formatting.c. If I had
known about that, I could have saved myself a lot of grief! My
original implementation for 8.3.1 had copied a lot of code from
oracle_compat.c to get things working. With this patch, I've
eliminated a whole lot of code, as I can now just call str_tolower().
So thank you for that! I'll probably keep my original in my personal
Subversion repository, but don't now about releasing it if it will be
accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to
lowercase using str_tolower(). I've made no other optimizations,
though I'm sure someone with more experience with collations and such
could add them.

* The regression test uses a new module I've created, now on
pgFoundry, called pgtap. It should just work. sql/citext.sql adds
plpgsql to the database and then includes pgtap.sql, which has the
test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd
work with most West European languages, and maybe all languages other
than the C locale, but I'm not sure. YMMV. If there's a way to
generalize it and still be able to test the locale awareness, that
would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for
the use of LOWER() in ad-hoc queries, while also stipulating that this
is not a "true" case-insensitive text type, and is furthermore less
efficient than just TEXT (though I'm sure more efficient than ad-hock
LOWER()s). I've also mentioned a few other caveats, including casts
for TEXT that don't work for citext and non-case-insensitive matching
in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original
citext implementation (which is case-insensitive only for ASCII
characters) for inspiration and guidance. Thanks to Donald Fraser for
that original implementation.

I've compiled the CVS checkout, run its regressions, then built and
installed the citext module (hence my discovery of the deprecation of
wstring_lower and the addition of str_tolower -- should the
declaration of the former be removed from formatting.c?), and all
tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and
just about any mode of comment on this patch. Please let me know if I
need to provide any additional information.

Best,

David

Zdenek Kotala

Zdenek.Kotala@Sun.COM

almost 18 years ago

In reply to: David E. Wheeler (#1)

Re: PATCH: CITEXT 2.0

David E. Wheeler napsal(a):

Howdy,

Howdy,

Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:

What is benefit to have this type when collation per database will be
implemented? It seems to me that its overlapped feature? Definition of collation
should allow to setup case sensitivity. Only advantage what I see there is that
you can combine case sensitive and case insensitive text in one database.
However, it will be solved when collation per column will be implemented.

Zdenek

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: Zdenek Kotala (#2)

Re: PATCH: CITEXT 2.0

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

However, it will be solved when collation per column will be implemented.

Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.

regards, tom lane

PATCH: CITEXT 2.0

Attachments: