PATCH: CITEXT 2.0
Howdy,
[N.B.: I tried to send this a while ago but it didn't get delivered,
I'm assuming because, with the uncompressed patch, the email was too
big for -hackers. So this is a re-send with the patch gzip'd. Sorry
for any duplication].
Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:
* I had originally called it lctext, as it's not a true case-
insensitive type, but just converts strings to lowercase before
comparing them. I changed it to citext at Tom Lane's suggestion, to
ease compatibility for users of the original citext module on pgFoundry.
* Differences from the original citext are:
+ Locale-aware lowercasing of strings, rather than just lowercasing
ASCII characters.
+ No implicit casts from text to citext except via assignment.
+ A few more functions overloaded
+ Works with 8.3 and CVS head
* Many thanks to whoever added str_tolower() to formatting.c. If I had
known about that, I could have saved myself a lot of grief! My
original implementation for 8.3.1 had copied a lot of code from
oracle_compat.c to get things working. With this patch, I've
eliminated a whole lot of code, as I can now just call str_tolower().
So thank you for that! I'll probably keep my original in my personal
Subversion repository, but don't now about releasing it if it will be
accepted as a contrib module for 8.4.
* All comparisons simply convert the strings to be compared to
lowercase using str_tolower(). I've made no other optimizations,
though I'm sure someone with more experience with collations and such
could add them.
* The regression test uses a new module I've created, now on
pgFoundry, called pgtap. It should just work. sql/citext.sql adds
plpgsql to the database and then includes pgtap.sql, which has the
test functions in it.
* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd
work with most West European languages, and maybe all languages other
than the C locale, but I'm not sure. YMMV. If there's a way to
generalize it and still be able to test the locale awareness, that
would be great. What locales do the build farm servers use?
* In the documentation, I've pitched this type as a replacement for
the use of LOWER() in ad-hoc queries, while also stipulating that this
is not a "true" case-insensitive text type, and is furthermore less
efficient than just TEXT (though I'm sure more efficient than ad-hock
LOWER()s). I've also mentioned a few other caveats, including casts
for TEXT that don't work for citext and non-case-insensitive matching
in replace(), regexp_replace(), and a few others.
* I wrote all the code here myself, but of course used the original
citext implementation (which is case-insensitive only for ASCII
characters) for inspiration and guidance. Thanks to Donald Fraser for
that original implementation.
I've compiled the CVS checkout, run its regressions, then built and
installed the citext module (hence my discovery of the deprecation of
wstring_lower and the addition of str_tolower -- should the
declaration of the former be removed from formatting.c?), and all
tests passed as of an hour ago.
I of course welcome feedback, advice, insults, commiserations, and
just about any mode of comment on this patch. Please let me know if I
need to provide any additional information.
Best,
David
Attachments:
David E. Wheeler napsal(a):
Howdy,
Howdy,
Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:
What is benefit to have this type when collation per database will be
implemented? It seems to me that its overlapped feature? Definition of collation
should allow to setup case sensitivity. Only advantage what I see there is that
you can combine case sensitive and case insensitive text in one database.
However, it will be solved when collation per column will be implemented.
Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
However, it will be solved when collation per column will be implemented.
Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.
regards, tom lane
On Jul 1, 2008, at 07:38, Tom Lane wrote:
However, it will be solved when collation per column will be
implemented.Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.
And even more of us have written queries using LOWER(col) = LOWER(?),
which is just a PITA.
Best,
David
David E. Wheeler napsal(a):
On Jul 1, 2008, at 07:38, Tom Lane wrote:
However, it will be solved when collation per column will be
implemented.Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.And even more of us have written queries using LOWER(col) = LOWER(?),
which is just a PITA.
OK. I'm going to review your patch.
Zdenek
Yay!
Best,
David
Sent from my iPhone
On Jul 2, 2008, at 5:03, Zdenek Kotala <Zdenek.Kotala@Sun.COM> wrote:
Show quoted text
David E. Wheeler napsal(a):
On Jul 1, 2008, at 07:38, Tom Lane wrote:
However, it will be solved when collation per column will be
implemented.Well, yeah, but that seems a very long way off. In the meantime a
lot
of people use the existing pgfoundry citext module.And even more of us have written queries using LOWER(col) = LOWER
(?), which is just a PITA.OK. I'm going to review your patch.
Zdenek
David E. Wheeler napsal(a):
Howdy,
Howdy
Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:
I went through your code and I have following comments/questions:
1) formating
Please, do not type space before asterix:
char * lcstr, * rcstr -> char *lcstr, *rcstr
and do not put extra space in a brackets
citextcmp( left, right ) -> citextcmp(left, right)
Other seems to me OK (remove 8.2 version mention at the bottom)
2) citextcmp function returns int but citext_cmp returns int32. I think
citextcmp should returns int32 as well.
3) citext_larger, citext_smaller function have memory leak. You don't call
PG_FREE_IF_COPY on unused text.
I'm not sure if return value in these function should be duplicated or not.
4) Operator = citext_eq is not correct. See comment
http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
There must be difference between equality and collation for example in Czech
language 'láska' and 'laská' are different word it means that 'láska' !=
'laská'. But there is no difference in collation order. See Unicode Universal
Collation Algorithm for detail.
5) There are several commented out lines in CREATE OPERATOR statement mostly
related to NEGATOR. Is there some reason for that? Also OPERATOR || has probably
wrong negator.
6) You use pgTAP for unit test. It is something what should be widely discussed.
It is new approach and I'm not correct person to say if it good or not.
I see there two problems:
At first point there is not clear licensing
(https://svn.kineticode.com/pgtap/trunk/README.pgtap).
And, It should be implemented as a general framework by my opinion not only as a
part of citext contrib module.
Please, let me know when you will have updated code and I will start deep testing.
With regards Zdenek
Am Dienstag, 1. Juli 2008 schrieb Tom Lane:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
However, it will be solved when collation per column will be implemented.
Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.
Indeed, but why isn't this code put there as well?
I went through your code and I have following comments/questions:
one more comment:
7) Hash opclass is absent. Hash opclass framework is used for hash join.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Peter Eisentraut <peter_e@gmx.net> writes:
Am Dienstag, 1. Juli 2008 schrieb Tom Lane:
Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.
Indeed, but why isn't this code put there as well?
Well, actually, that might be the best thing to do with it. But it
would be sensible to give it a code review anyway, no?
regards, tom lane
On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:
I went through your code and I have following comments/questions:
Thanks. I'm on a short family vacation, so it will probably be Monday
before I can submit a new patch. I got a few replies below, though.
1) formating
Please, do not type space before asterix:
char * lcstr, * rcstr -> char *lcstr, *rcstrand do not put extra space in a brackets
citextcmp( left, right ) -> citextcmp(left, right)
Okay.
Other seems to me OK (remove 8.2 version mention at the bottom)
Um, are you referring to this (at the top):
+// PostgreSQL 8.2 Magic.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
That's gotta be there (though I can of course remove the comment).
2) citextcmp function returns int but citext_cmp returns int32. I
think citextcmp should returns int32 as well.
Yeah, I'm a bit confused by this. I followed what's in varlena.c on
this. The comparison functions are documented supposed to return 1, 0,
or -1, which of course would be covered by int. But varstr_cmp(),
which ultimately returns the value, returns all kinds of numbers. So,
perhaps someone could say what it's *supposed* to be?
3) citext_larger, citext_smaller function have memory leak. You
don't call PG_FREE_IF_COPY on unused text.I'm not sure if return value in these function should be duplicated
or not.
These I also duplicated from varlena.c, and I asked about a potential
memory leak in a previous email. If there's a leak here, would there
not also be a leak in varlena.c?
4) Operator = citext_eq is not correct. See comment http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
So should citextcmp() call strncmp() instead of varst_cmp()? The
latter is what I saw in varlena.c.
There must be difference between equality and collation for example
in Czech language 'láska' and 'laská' are different word it means
that 'láska' != 'laská'. But there is no difference in collation
order. See Unicode Universal Collation Algorithm for detail.
I'll leave the collation stuff to the functions I call (*far* from my
specialty), but I'll add a test for this and make sure it works as
expected. Um, although, with what collation should it be tested? The
tests I wrote assume en_US.UTF-8.
5) There are several commented out lines in CREATE OPERATOR
statement mostly related to NEGATOR. Is there some reason for that?
I copied it from the original citext.sql. Not sure what effect it has.
Also OPERATOR || has probably wrong negator.
Right, good catch.
6) You use pgTAP for unit test. It is something what should be
widely discussed. It is new approach and I'm not correct person to
say if it good or not.
Sure. I created a pgFoundry project for it here:
http://pgtap.projects.postgresql.org/
I see there two problems:
At first point there is not clear licensing (https://svn.kineticode.com/pgtap/trunk/README.pgtap
).
That's a standard revised BSD license.
And, It should be implemented as a general framework by my opinion
not only as a part of citext contrib module.
It is. I just put the file in citext so that the tests can use it.
It's distributed on pgFoundry and maintained at the SVN URL quoted in
the file.
Please, let me know when you will have updated code and I will start
deep testing.
Will do.
Best,
David
On Jul 2, 2008, at 09:39, Peter Eisentraut wrote:
Well, yeah, but that seems a very long way off. In the meantime a
lot
of people use the existing pgfoundry citext module.Indeed, but why isn't this code put there as well?
It could be, but this is *such* a common need (and few even know about
pgFoundry), that I thought it'd be worthwhile to get it distributed as
part of contrib.
Best,
David
On Jul 2, 2008, at 12:18, Teodor Sigaev wrote:
7) Hash opclass is absent. Hash opclass framework is used for hash
join.
Thanks. I haven't seen docs on this. The original citext only supports
btree, and I don't remember reading about creating a hash opclass in
the Douglass book, though I probably missed it. Anyone got a link for
me to read to make it happen?
Thanks,
David
On Jul 2, 2008, at 17:21, Tom Lane wrote:
Indeed, but why isn't this code put there as well?
Well, actually, that might be the best thing to do with it. But it
would be sensible to give it a code review anyway, no?
Obviously, I would argue that contrib is a good place for it, hence
the patch. I can say more on my reasoning if you'd like. But even if
it doesn't end up in contrib, I'm extremely grateful for the code
reviews.
Best,
David
"David E. Wheeler" <david@kineticode.com> writes:
On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:
Please, do not type space before asterix:
char * lcstr, * rcstr -> char *lcstr, *rcstrand do not put extra space in a brackets
citextcmp( left, right ) -> citextcmp(left, right)
Okay.
Note that this sort of stuff will mostly be fixed by pg_indent,
whether or not David does anything about it. But certainly
conforming to the project style to begin with will cause less
pain to reviewers' eyeballs.
Um, are you referring to this (at the top):
+// PostgreSQL 8.2 Magic. +#ifdef PG_MODULE_MAGIC +PG_MODULE_MAGIC; +#endif
Here however is an outright bug: we do not allow // comments, because we
still support ANSI-spec compilers that don't recognize them.
Yeah, I'm a bit confused by this. I followed what's in varlena.c on
this. The comparison functions are documented supposed to return 1, 0,
or -1, which of course would be covered by int. But varstr_cmp(),
which ultimately returns the value, returns all kinds of numbers. So,
perhaps someone could say what it's *supposed* to be?
btree cmp functions are allowed to return int32 negative, zero, or
positive. There is *not* any requirement that negative or positive
results be exactly -1 or +1. However, if you are comparing values
that are int32 or wider then you can't just return their difference
because it might overflow.
3) citext_larger, citext_smaller function have memory leak. You
don't call PG_FREE_IF_COPY on unused text.
These I also duplicated from varlena.c, and I asked about a potential
memory leak in a previous email. If there's a leak here, would there
not also be a leak in varlena.c?
The "leak" is irrelevant for larger/smaller. The only place where it's
actually useful to do PG_FREE_IF_COPY is in a btree or hash index
support function. In other cases you can assume that you're being
called in a memory context that's too short-lived for it to matter.
5) There are several commented out lines in CREATE OPERATOR
statement mostly related to NEGATOR. Is there some reason for that?
I copied it from the original citext.sql. Not sure what effect it has.
http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html
regards, tom lane
Douglass book, though I probably missed it. Anyone got a link for me to
read to make it happen?
Hash opclass is 5-times simpler that btree one :)
CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
OPERATOR 1 = (citext, citext),
FUNCTION 1 citext_hash(citext);
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
CREATE FUNCTION citext_hash(*citext*)
DEFAULT FOR TYPE *citext* USING hash AS
Oops, citext of course.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Jul 2, 2008, at 22:14, Tom Lane wrote:
Note that this sort of stuff will mostly be fixed by pg_indent,
whether or not David does anything about it. But certainly
conforming to the project style to begin with will cause less
pain to reviewers' eyeballs.
Yeah, I'll change it. I'm JAPH, so kind of made up the formatting as I
went, though I did try to copy the style in varlena.c.
+// PostgreSQL 8.2 Magic. +#ifdef PG_MODULE_MAGIC +PG_MODULE_MAGIC; +#endifHere however is an outright bug: we do not allow // comments,
because we
still support ANSI-spec compilers that don't recognize them.
Forgot about that. I'll change it for the next version.
btree cmp functions are allowed to return int32 negative, zero, or
positive. There is *not* any requirement that negative or positive
results be exactly -1 or +1. However, if you are comparing values
that are int32 or wider then you can't just return their difference
because it might overflow.
Thanks for the explanation. I'll make sure that they're both int32.
The "leak" is irrelevant for larger/smaller. The only place where
it's
actually useful to do PG_FREE_IF_COPY is in a btree or hash index
support function. In other cases you can assume that you're being
called in a memory context that's too short-lived for it to matter.
So would that be for any function used by
CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE CITEXT USING btree AS
OPERATOR 1 < (citext, citext),
OPERATOR 2 <= (citext, citext),
OPERATOR 3 = (citext, citext),
OPERATOR 4 >= (citext, citext),
OPERATOR 5 > (citext, citext),
FUNCTION 1 citext_cmp(citext, citext);
? (And then the btree operator and function to be added, of course.)
5) There are several commented out lines in CREATE OPERATOR
statement mostly related to NEGATOR. Is there some reason for that?I copied it from the original citext.sql. Not sure what effect it
has.http://developer.postgresql.org/pgdocs/postgres/xoper-
optimization.html
Thanks. Sounds like it'd be valuable to have them in there. I'll add
tests, as well.
Best,
David
On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:
Hash opclass is 5-times simpler that btree one :)
CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
OPERATOR 1 = (citext, citext),
FUNCTION 1 citext_hash(citext);
Thanks. What would citext_hash() look like? I don't see a text_hash()
to borrow from anywhere in src/.
Thanks,
David
David E. Wheeler wrote:
On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:
Hash opclass is 5-times simpler that btree one :)
CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
OPERATOR 1 = (citext, citext),
FUNCTION 1 citext_hash(citext);Thanks. What would citext_hash() look like? I don't see a text_hash() to
borrow from anywhere in src/.
See hash_any(). I assume the difficulty is making sure that
hash("FOO") = hash("foo") ...
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support