PATCH: CITEXT 2.0 v3

Started by David E. Wheeleralmost 18 years ago59 messageshackers
Jump to latest
#1David E. Wheeler
david@kineticode.com

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp().
Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
<>, <, >, <=, >=). These improve statistics estimations, increasing
the liklihood that indices will be used.

Thanks!

David

Attachments:

citext3.patch.gzapplication/x-gzip; name=citext3.patch.gz; x-unix-mode=0644Download
#2David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#1)
Re: PATCH: CITEXT 2.0 v3

I guess you're all just blown away by the perfection of this patch? ;-)

Best,

David

On Jul 7, 2008, at 18:03, David E. Wheeler wrote:

Show quoted text

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp().
Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators
(=, <>, <, >, <=, >=). These improve statistics estimations,
increasing the liklihood that indices will be used.

Thanks!

David

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David E. Wheeler (#2)
Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

I guess you're all just blown away by the perfection of this patch? ;-)

The problem is that we're in the middle of a commitfest, so everybody is
busy reviewing other patches (in theory at least).

One thing that jumps at me is pgTAP usage, as Zdenek said. I understand
that it's neat and all that, but we can't include the tests because they
won't run unless one installs pgTAP which seems a nonstarter. So if you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It's not even difficult to use. Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a separate
discussion.

The other possibility being, of course, that you are proposing citext to
live on pgFoundry.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#4David E. Wheeler
david@kineticode.com
In reply to: Alvaro Herrera (#3)
Re: PATCH: CITEXT 2.0 v3

On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

The problem is that we're in the middle of a commitfest, so
everybody is
busy reviewing other patches (in theory at least).

Of course.

One thing that jumps at me is pgTAP usage, as Zdenek said. I
understand
that it's neat and all that, but we can't include the tests because
they
won't run unless one installs pgTAP which seems a nonstarter. So if
you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It does use pg_regress. The test just load the included pgtap.sql file
to get the tap functions, and then away they go. If you run `make
installcheck` it works.

It's not even difficult to use. Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a
separate
discussion.

Agreed. I've sent a couple of messages in a thread about that, the
latest this morning.

The other possibility being, of course, that you are proposing
citext to
live on pgFoundry.

I'm not, actually. I mean, I have an updated version for 8.3, but it'd
be quite a pita to maintain them both, since the api for lowercasing
text is so much simpler in 8.4.

Best,

David

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David E. Wheeler (#4)
Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

One thing that jumps at me is pgTAP usage, as Zdenek said. I
understand that it's neat and all that, but we can't include the
tests because they won't run unless one installs pgTAP which seems a
nonstarter. So if you want the tests in the repository along the
rest of the stuff, they really should use pg_regress.

It does use pg_regress. The test just load the included pgtap.sql file
to get the tap functions, and then away they go. If you run `make
installcheck` it works.

Uh-huh, OK, that's fine then I guess.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#6Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: David E. Wheeler (#1)
Re: PATCH: CITEXT 2.0 v3

David E. Wheeler napsal(a):

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp(). Per
discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
<>, <, >, <=, >=). These improve statistics estimations, increasing the
liklihood that indices will be used.

I'm sorry for late response. I'm little bit busy this week. I quickly look on
your latest changes and it seems to me that everything is OK. I'm going to
change status to ready for commit. After discussion with Pavel Stehule I changed
meaning about pgFoundry vs. contrib. Contrib is OK for me.

Maybe some native speaker should review documentation.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#7David E. Wheeler
david@kineticode.com
In reply to: Zdenek Kotala (#6)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I
quickly look on your latest changes and it seems to me that
everything is OK. I'm going to change status to ready for commit.
After discussion with Pavel Stehule I changed meaning about
pgFoundry vs. contrib. Contrib is OK for me.

Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
just read the source?

Best,

David

#8Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: David E. Wheeler (#7)
Re: PATCH: CITEXT 2.0 v3

David E. Wheeler napsal(a):

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I quickly
look on your latest changes and it seems to me that everything is OK.
I'm going to change status to ready for commit. After discussion with
Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib
is OK for me.

Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
just read the source?

I tested version two on Solaris/SPARC and Sun studio compiler. I checked last
version only quickly (comparing your changes).

With regards Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#9David E. Wheeler
david@kineticode.com
In reply to: Zdenek Kotala (#8)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 13:02, Zdenek Kotala wrote:

Thank you, Zdenek. Have you had a chance to try citext yet? Or did
you just read the source?

I tested version two on Solaris/SPARC and Sun studio compiler. I
checked last version only quickly (comparing your changes).

Thanks.

I just updated my performance test script (attached) by increasing the
number of rows tested by an order of magnitude. So now it creates
1,000,000 rows, queries them, adds indexes, and then queries them
again. Unfortunately, CITEXT seems to have a memory leak somewhere,
because when I index the CITEXT column, it fails with "ERROR: out of
memory". So obviously something's not getting cleaned up. Here's the
btree indexing function:

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text *left = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32(result);
}

And here's citextcmp():

citextcmp (text *left, text *right)
{
char *lcstr, *rcstr;
int result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

pfree(lcstr);
pfree(rcstr);
return result;
}

Can anyone see where I'm failing to free up memory? Might it be in
some other function?

Thanks!

David

Attachments:

try.sqlapplication/octet-stream; name=try.sql; x-unix-mode=0644Download
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#9)
Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too). You need to do strlen's on the lowercased strings.

regards, tom lane

#11David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#10)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 16:45, Tom Lane wrote:

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too). You need to do strlen's on the lowercased strings.

Ah, great point. Thanks.

Anyone else got an idea on the memory leak?

Best,

David

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#9)
Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Unfortunately, CITEXT seems to have a memory leak somewhere,

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

regards, tom lane

#13David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#12)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 19:18, Tom Lane wrote:

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686-
apple-darwin9-gcc-4.0.1 (G

That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my
version for 8.3.3 from svn.

https://svn.kineticode.com/citext/trunk/

I'll give it a try myself with the patch against CVS in a bit.

Thanks,

David

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#13)
Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 11, 2008, at 19:18, Tom Lane wrote:

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8.

Oh, got one of those too ... [ time passes ... ] Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass. You're depending way too much
on platform-specific behavior there, I fear.

regards, tom lane

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#13)
Re: PATCH: CITEXT 2.0 v3

Here's an updated version of citext.c. Some changes cosmetic, some
not so much ...

regards, tom lane

#16David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#14)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 20:10, Tom Lane wrote:

Oh, got one of those too ... [ time passes ... ] Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Yeah, gotta love that, right?

So the issue must be with my version for 8.3.x, meaning, likely, my
custom cilower() function. I'll have another look at it. Thanks for
giving it a whirl on your end.

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass. You're depending way too much
on platform-specific behavior there, I fear.

Hrm. I wonder what that could be. I'll have to give it a shot on my
Ubuntu box and find out. Thanks.

David

#17David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#15)
Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 20:22, Tom Lane wrote:

Here's an updated version of citext.c. Some changes cosmetic, some
not so much ...

Thanks! Good catch on my missing all of the s/int/int32/ stuff related
to citextcmp(). Stupid oversites on my part. I'll submit a new patch
with these changes shortly.

Best,

David

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#9)
Re: PATCH: CITEXT 2.0 v3

BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

* The type declaration needs to say storage = extended, else the type
won't be toastable.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

* <= is surely not its own commutator. You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless AFAICS
to introduce regex functions taking citext pattern arguments.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

regards, tom lane

#19David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#18)
Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 12:17, Tom Lane wrote:

BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

Thanks a million for these, Tom. I greatly appreciate it.

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

I've added SQL comments. Were you talking about a COMMENT?

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

I wondered about that; those were copied from CITEXT 1. I've removed
all GRANTs and COMMENTs.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions are?

* The type declaration needs to say storage = extended, else the type
won't be toastable.

Ah, good, thanks.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

Where do I find that? I have trouble finding the SQL that creates the
core types. :-(

* <= is surely not its own commutator.

Changed to >=.

You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

Thanks. Added to my list.

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded
miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless
AFAICS
to introduce regex functions taking citext pattern arguments.

I added most of those as I wrote tests and they failed to find the
functions. Once I added the functions, they worked. But I'll do an
audit to make sure that I didn't inadvertantly leave in any unneeded
ones (I'm happy to have less code :-)).

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

Sorry, don't know what you're referring to here. CREATE OPERATOR
appears to require parens…

Thanks for the great feedback! I'll work on getting things all
straightened out and a new patch in tonight.

Best,

David

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#9)
Re: PATCH: CITEXT 2.0 v3

There was some discussion earlier about whether the proposed regression
tests for citext are suitable for use in contrib or not. After playing
with them for awhile, I have to come down very firmly on the side of
"not". I have these gripes:

1. The style is gratuitously different from every other regression test
in the system. This is not a good thing. If it were an amazingly
better way to do things, then maybe, but as far as I can tell the style
pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a table
result into a scalar. Plus it's redundant with the expected-output file.

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL. This is a very bad thing. Speed of
regression tests matters a lot to those of us who run them a dozen times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function. This is pretty
much doomed to failure, first because the regression tests are intended
to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday). Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works. The odds of ever finding
anything with those tests are not distinguishable from zero (unless the
underlying text function is busted, which is not your responsibility to
test). So I don't see any point in putting them into the standard
regression package. (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext rather
than text.)

I suggest something more like the attached as a suitable regression
test. I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.

regards, tom lane

#21David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#20)
#22David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#21)
#23David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#19)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#19)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#21)
#27David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#24)
#28David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#25)
#29David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#25)
#30David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#30)
#32David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#25)
#33David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#26)
#34David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#20)
#35Andrew Dunstan
andrew@dunslane.net
In reply to: David E. Wheeler (#33)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#32)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#33)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#34)
#40David E. Wheeler
david@kineticode.com
In reply to: Andrew Dunstan (#35)
#41David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#38)
#42David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#39)
#43David E. Wheeler
david@kineticode.com
In reply to: Alvaro Herrera (#36)
#44Andrew Dunstan
andrew@dunslane.net
In reply to: David E. Wheeler (#40)
#45David E. Wheeler
david@kineticode.com
In reply to: Andrew Dunstan (#44)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#40)
#47David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#46)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#41)
#49David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#49)
#51David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#50)
#52David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#21)
#53David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#37)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#53)
#55David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#54)
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#55)
#57David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#56)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#57)
#59David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#58)