PATCH: CITEXT 2.0 v3

Started by David E. Wheelerover 17 years ago59 messages

david@kineticode.com

over 17 years ago

1 attachment(s)

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp().
Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
<>, <, >, <=, >=). These improve statistics estimations, increasing
the liklihood that indices will be used.

Thanks!

David

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: David E. Wheeler (#1)

Re: PATCH: CITEXT 2.0 v3

I guess you're all just blown away by the perfection of this patch? ;-)

Best,

David

On Jul 7, 2008, at 18:03, David E. Wheeler wrote:

Show quoted text

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp().
Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators
(=, <>, <, >, <=, >=). These improve statistics estimations,
increasing the liklihood that indices will be used.

Thanks!

David

Alvaro Herrera

alvherre@commandprompt.com

over 17 years ago

In reply to: David E. Wheeler (#2)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

I guess you're all just blown away by the perfection of this patch? ;-)

The problem is that we're in the middle of a commitfest, so everybody is
busy reviewing other patches (in theory at least).

One thing that jumps at me is pgTAP usage, as Zdenek said. I understand
that it's neat and all that, but we can't include the tests because they
won't run unless one installs pgTAP which seems a nonstarter. So if you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It's not even difficult to use. Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a separate
discussion.

The other possibility being, of course, that you are proposing citext to
live on pgFoundry.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Alvaro Herrera (#3)

Re: PATCH: CITEXT 2.0 v3

On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

The problem is that we're in the middle of a commitfest, so
everybody is
busy reviewing other patches (in theory at least).

Of course.

One thing that jumps at me is pgTAP usage, as Zdenek said. I
understand
that it's neat and all that, but we can't include the tests because
they
won't run unless one installs pgTAP which seems a nonstarter. So if
you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It does use pg_regress. The test just load the included pgtap.sql file
to get the tap functions, and then away they go. If you run `make
installcheck` it works.

It's not even difficult to use. Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a
separate
discussion.

Agreed. I've sent a couple of messages in a thread about that, the
latest this morning.

The other possibility being, of course, that you are proposing
citext to
live on pgFoundry.

I'm not, actually. I mean, I have an updated version for 8.3, but it'd
be quite a pita to maintain them both, since the api for lowercasing
text is so much simpler in 8.4.

Best,

David

Alvaro Herrera

alvherre@commandprompt.com

over 17 years ago

In reply to: David E. Wheeler (#4)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

One thing that jumps at me is pgTAP usage, as Zdenek said. I
understand that it's neat and all that, but we can't include the
tests because they won't run unless one installs pgTAP which seems a
nonstarter. So if you want the tests in the repository along the
rest of the stuff, they really should use pg_regress.

It does use pg_regress. The test just load the included pgtap.sql file
to get the tap functions, and then away they go. If you run `make
installcheck` it works.

Uh-huh, OK, that's fine then I guess.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Zdenek Kotala

Zdenek.Kotala@Sun.COM

over 17 years ago

In reply to: David E. Wheeler (#1)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler napsal(a):

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp(). Per
discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
<>, <, >, <=, >=). These improve statistics estimations, increasing the
liklihood that indices will be used.

I'm sorry for late response. I'm little bit busy this week. I quickly look on
your latest changes and it seems to me that everything is OK. I'm going to
change status to ready for commit. After discussion with Pavel Stehule I changed
meaning about pgFoundry vs. contrib. Contrib is OK for me.

Maybe some native speaker should review documentation.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Zdenek Kotala (#6)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I
quickly look on your latest changes and it seems to me that
everything is OK. I'm going to change status to ready for commit.
After discussion with Pavel Stehule I changed meaning about
pgFoundry vs. contrib. Contrib is OK for me.

Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
just read the source?

Best,

David

Zdenek Kotala

Zdenek.Kotala@Sun.COM

over 17 years ago

In reply to: David E. Wheeler (#7)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler napsal(a):

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I quickly
look on your latest changes and it seems to me that everything is OK.
I'm going to change status to ready for commit. After discussion with
Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib
is OK for me.

Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
just read the source?

I tested version two on Solaris/SPARC and Sun studio compiler. I checked last
version only quickly (comparing your changes).

With regards Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Zdenek Kotala (#8)

1 attachment(s)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 13:02, Zdenek Kotala wrote:

Thank you, Zdenek. Have you had a chance to try citext yet? Or did
you just read the source?

I tested version two on Solaris/SPARC and Sun studio compiler. I
checked last version only quickly (comparing your changes).

Thanks.

I just updated my performance test script (attached) by increasing the
number of rows tested by an order of magnitude. So now it creates
1,000,000 rows, queries them, adds indexes, and then queries them
again. Unfortunately, CITEXT seems to have a memory leak somewhere,
because when I index the CITEXT column, it fails with "ERROR: out of
memory". So obviously something's not getting cleaned up. Here's the
btree indexing function:

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text *left = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32(result);
}

And here's citextcmp():

citextcmp (text *left, text *right)
{
char *lcstr, *rcstr;
int result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

pfree(lcstr);
pfree(rcstr);
return result;
}

Can anyone see where I'm failing to free up memory? Might it be in
some other function?

Thanks!

David

#10

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#9)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too). You need to do strlen's on the lowercased strings.

regards, tom lane

#11

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#10)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 16:45, Tom Lane wrote:

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too). You need to do strlen's on the lowercased strings.

Ah, great point. Thanks.

Anyone else got an idea on the memory leak?

Best,

David

#12

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#9)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Unfortunately, CITEXT seems to have a memory leak somewhere,

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

regards, tom lane

#13

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#12)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 19:18, Tom Lane wrote:

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686-
apple-darwin9-gcc-4.0.1 (G

That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my
version for 8.3.3 from svn.

https://svn.kineticode.com/citext/trunk/

I'll give it a try myself with the patch against CVS in a bit.

Thanks,

David

#14

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#13)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 11, 2008, at 19:18, Tom Lane wrote:

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8.

Oh, got one of those too ... [ time passes ... ] Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass. You're depending way too much
on platform-specific behavior there, I fear.

regards, tom lane

#15

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#13)

Re: PATCH: CITEXT 2.0 v3

Here's an updated version of citext.c. Some changes cosmetic, some
not so much ...

regards, tom lane

#16

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#14)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 20:10, Tom Lane wrote:

Oh, got one of those too ... [ time passes ... ] Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Yeah, gotta love that, right?

So the issue must be with my version for 8.3.x, meaning, likely, my
custom cilower() function. I'll have another look at it. Thanks for
giving it a whirl on your end.

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass. You're depending way too much
on platform-specific behavior there, I fear.

Hrm. I wonder what that could be. I'll have to give it a shot on my
Ubuntu box and find out. Thanks.

David

#17

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#15)

Re: PATCH: CITEXT 2.0 v3

On Jul 11, 2008, at 20:22, Tom Lane wrote:

Here's an updated version of citext.c. Some changes cosmetic, some
not so much ...

Thanks! Good catch on my missing all of the s/int/int32/ stuff related
to citextcmp(). Stupid oversites on my part. I'll submit a new patch
with these changes shortly.

Best,

David

#18

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#9)

Re: PATCH: CITEXT 2.0 v3

BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

* The type declaration needs to say storage = extended, else the type
won't be toastable.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

* <= is surely not its own commutator. You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless AFAICS
to introduce regex functions taking citext pattern arguments.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

regards, tom lane

#19

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#18)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 12:17, Tom Lane wrote:

BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

Thanks a million for these, Tom. I greatly appreciate it.

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

I've added SQL comments. Were you talking about a COMMENT?

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

I wondered about that; those were copied from CITEXT 1. I've removed
all GRANTs and COMMENTs.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions are?

* The type declaration needs to say storage = extended, else the type
won't be toastable.

Ah, good, thanks.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

Where do I find that? I have trouble finding the SQL that creates the
core types. :-(

* <= is surely not its own commutator.

Changed to >=.

You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

Thanks. Added to my list.

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded
miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless
AFAICS
to introduce regex functions taking citext pattern arguments.

I added most of those as I wrote tests and they failed to find the
functions. Once I added the functions, they worked. But I'll do an
audit to make sure that I didn't inadvertantly leave in any unneeded
ones (I'm happy to have less code :-)).

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

Sorry, don't know what you're referring to here. CREATE OPERATOR
appears to require parens…

Thanks for the great feedback! I'll work on getting things all
straightened out and a new patch in tonight.

Best,

David

#20

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#9)

Re: PATCH: CITEXT 2.0 v3

There was some discussion earlier about whether the proposed regression
tests for citext are suitable for use in contrib or not. After playing
with them for awhile, I have to come down very firmly on the side of
"not". I have these gripes:

1. The style is gratuitously different from every other regression test
in the system. This is not a good thing. If it were an amazingly
better way to do things, then maybe, but as far as I can tell the style
pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a table
result into a scalar. Plus it's redundant with the expected-output file.

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL. This is a very bad thing. Speed of
regression tests matters a lot to those of us who run them a dozen times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function. This is pretty
much doomed to failure, first because the regression tests are intended
to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday). Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works. The odds of ever finding
anything with those tests are not distinguishable from zero (unless the
underlying text function is busted, which is not your responsibility to
test). So I don't see any point in putting them into the standard
regression package. (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext rather
than text.)

I suggest something more like the attached as a suitable regression
test. I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.

regards, tom lane

#21

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#20)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 14:57, Tom Lane wrote:

There was some discussion earlier about whether the proposed
regression
tests for citext are suitable for use in contrib or not. After
playing
with them for awhile, I have to come down very firmly on the side of
"not". I have these gripes:

You're nothing if not thorough, Tom.

1. The style is gratuitously different from every other regression
test
in the system. This is not a good thing. If it were an amazingly
better way to do things, then maybe, but as far as I can tell the
style
pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a
table
result into a scalar. Plus it's redundant with the expected-output
file.

Sure. I wasn't aware of the existing regression test methodology when
I wrote pgTAP and those tests. I'm fine to change them and use pgTAP
for testing my app code, rather than PostgreSQL itself.

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL. This is a very bad thing. Speed of
regression tests matters a lot to those of us who run them a dozen
times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Hrm. I'm wonder why it's so slow? The test functions don't really do a
lot. Anyway, I agree that they should perform well.

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function. This is pretty
much doomed to failure, first because the regression tests are
intended
to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday).

Yes, I *just* ran the tests on Ubuntu and opened my mail to ask about
the bizarre differences when I saw this message. Thanks for answering
my question before I asked it. :-)

Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality? Are test database clusters all built
using SQL_ASCII and the C locale?

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works. The odds of ever finding
anything with those tests are not distinguishable from zero (unless
the
underlying text function is busted, which is not your responsibility
to
test). So I don't see any point in putting them into the standard
regression package. (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext
rather
than text.)

Well, I was doing test-driven development: I wrote the tests to ensure
that I had added the functions for CITEXT properly, and when they
passed, I could move on. As a unit tester, it'd feel weird for me to
drop these. I'm not testing the underlying functions; I'm making sure
that they work properly with CITEXT.

I suggest something more like the attached as a suitable regression
test. I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.

Thanks! I'll work this in.

Best,

David

PS: I confirmed late yesterday that the memory leak I saw was with my
version for 8.3, where I had my own str_tolower(). It clearly has a
leak that I'll have to fix, but it has no bearing on the contribution
to 8.4, which has no such leak. Thanks for running the test yourself
to confirm.

#22

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: David E. Wheeler (#21)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 15:13, David E. Wheeler wrote:

Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine
yourself
to tests using ASCII strings.

Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality? Are test database clusters all
built using SQL_ASCII and the C locale?

I just tried to take your modified tests and add multibyte tests that
run only on OS X with en_US.UTF-8. They worked like this:

CREATE OR REPLACE FUNCTION test_multibyte ()
RETURNS BOOLEAN AS $$
SELECT version() ~ 'apple-darwin'
AND (select setting from pg_settings where name = 'lc_collate')
= 'en_US.UTF-8';
$$ LANGUAGE SQL IMMUTABLE;

SELECT 'À'::citext = 'À'::citext WHERE test_multibyte() = true;
SELECT 'À'::citext = 'à'::citext WHERE test_multibyte() = true;

But then I realized that this would change the expected output
depending on the platform, and thus make the tests fail. This is one
reason why the inflexibility of the existing regression test model is
a drag: it limits you to testing only what works everywhere!

Grrr.

I'll remove all the multibyte character tests, but I have to say that,
as a result, the CITEXT 1 module would likely pass all such tests,
but it still isn't locale-aware. How can one add regressions to ensure
that something truly is locale-aware?

Thanks,

David

#23

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: David E. Wheeler (#19)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 14:50, David E. Wheeler wrote:

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

I've added SQL comments. Were you talking about a COMMENT?

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

I wondered about that; those were copied from CITEXT 1. I've removed
all GRANTs and COMMENTs.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions
are?

* The type declaration needs to say storage = extended, else the type
won't be toastable.

Ah, good, thanks.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

Where do I find that? I have trouble finding the SQL that creates
the core types. :-(

Duh, you just told me. Added, thanks.

* <= is surely not its own commutator.

Changed to >=.

You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

Thanks. Added to my list.

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded
miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless
AFAICS
to introduce regex functions taking citext pattern arguments.

I added most of those as I wrote tests and they failed to find the
functions. Once I added the functions, they worked. But I'll do an
audit to make sure that I didn't inadvertantly leave in any unneeded
ones (I'm happy to have less code :-)).

Of the size functions, I was able to remove only this one and keep all
of my pgTAP tests passing:

CREATE FUNCTION textlen(citext)
RETURNS int4 AS 'textlen'
LANGUAGE 'internal' IMMUTABLE STRICT;

When I deleted any of the others, I got errors like this:

psql:sql/citext.sql:865: ERROR: function length(citext) is not unique
LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
^
HINT: Could not choose a best candidate function. You might need to
add explicit type casts.

I think you can see now why I wrote the tests: I wanted to ensure that
CITEXT really does work (almost) just like TEXT.

I was able to eliminate *all* of the miscellaneous functions, but the
upper() and lower() functions now return TEXT instead of CITEXT, which
I don't think is exactly what we want, is it? For now, I'e left
upper() and lower() in. It just seems like more expected functionality.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

Sorry, don't know what you're referring to here. CREATE OPERATOR
appears to require parens…

Best,

David

#24

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#23)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

When I deleted any of the others, I got errors like this:

psql:sql/citext.sql:865: ERROR: function length(citext) is not unique
LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
^
HINT: Could not choose a best candidate function. You might need to
add explicit type casts.

Hmm. I think what that actually means is that the cast from citext to
bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening
is that the system can't figure out whether to use length(text) or
length(bpchar) when presented with a citext argument. I had been
thinking yesterday that it would automatically prefer length(text)
because text is a "preferred type", but after tracing through it I see
that that doesn't happen because citext is not thought to be of the
string category. (We really need a way to let user-defined types
specify their category...)

The fact that you need all these piggyback functions is a red flag
because what it implies is that citext will not work nicely for any
situation where both text and bpchar functions have been provided
--- and that includes user-added functions, so it's hopeless to think
that you can get to a solution this way.  Downgrading the cast seems
like the right thing to me.

The implicit cast to varchar is a bit worrisome because it creates the
same issue if someone has provided both varchar and text versions of a
function. However, that seems a bit pointless given the lack of
semantic difference, and I suspect that a lot of user-written functions
come only in varchar variants --- so on balance my recommendation is to
keep that one as implicit.

regards, tom lane

#25

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#19)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 12, 2008, at 12:17, Tom Lane wrote:

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions are?

No, those are text I/O. You need analogues of textsend and textrecv
too.

You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

Thanks. Added to my list.

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it present.
typ_sanity and oidjoins might possibly find issues too.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

Sorry, don't know what you're referring to here.

Some (not all) of your CREATE OPERATOR commands have things like

NEGATOR = OPERATOR(!~),

which seems unnecessary, and is certainly inconsistent.

regards, tom lane

#26

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#21)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 12, 2008, at 14:57, Tom Lane wrote:

Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality?

There's some stuff under src/test/locale and src/test/mb, though it
doesn't get exercised regularly. The contrib tests are a particularly
bad place for trying to do any locale-dependent testing, because we
only support "make installcheck" which means there is no way to set
the database locale --- you *have to* work with whatever locale is
predetermined by the postmaster you're pointed at.

I don't recall the reason for not supporting "make check" in the contrib
modules; maybe it was just that preparing a test installation for each
contrib module sounded too slow? In any case, what you'd need to pursue
is having some additional tests available that are not executed by the
standard contrib regression test sequence.

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)

regards, tom lane

#27

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#24)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 10:16, Tom Lane wrote:

Hmm. I think what that actually means is that the cast from citext to
bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is
happening
is that the system can't figure out whether to use length(text) or
length(bpchar) when presented with a citext argument. I had been
thinking yesterday that it would automatically prefer length(text)
because text is a "preferred type", but after tracing through it I see
that that doesn't happen because citext is not thought to be of the
string category. (We really need a way to let user-defined types
specify their category...)

That'd be nice.

The fact that you need all these piggyback functions is a red flag
because what it implies is that citext will not work nicely for any
situation where both text and bpchar functions have been provided
--- and that includes user-added functions, so it's hopeless to think
that you can get to a solution this way.  Downgrading the cast seems
like the right thing to me.

Yes, that works for me. I've downgraded it and can now remove the size
functions and all the tests still pass.

The implicit cast to varchar is a bit worrisome because it creates the
same issue if someone has provided both varchar and text versions of a
function. However, that seems a bit pointless given the lack of
semantic difference, and I suspect that a lot of user-written
functions
come only in varchar variants --- so on balance my recommendation is
to
keep that one as implicit.

Yes, I agree. Thanks for tracing this out, Tom, it never would have
ocurred to me -- and now I know more than I did before!

Best,

David

#28

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#25)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 10:19, Tom Lane wrote:

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 12, 2008, at 12:17, Tom Lane wrote:

* You should provide binary I/O (send/receive) functions, if you
want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions
are?

No, those are text I/O. You need analogues of textsend and textrecv
too.

Okay.

Thanks. Added to my list.

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it
present.
typ_sanity and oidjoins might possibly find issues too.

Also added to my list. :-)

Some (not all) of your CREATE OPERATOR commands have things like

NEGATOR = OPERATOR(!~),

which seems unnecessary, and is certainly inconsistent.

Oh, I hadn't even noticed those; I'd just copied them from citext 1.
Fixed!

Best,

David

#29

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#25)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 10:19, Tom Lane wrote:

I'm confused. Is that not what the citextin and citextout functions
are?

No, those are text I/O. You need analogues of textsend and textrecv
too.

Should those return bytea and citext, respectively? IOW, are these
right?

CREATE OR REPLACE FUNCTION citextrecv(bytea)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

Thanks,

David

#30

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: David E. Wheeler (#29)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 16:06, David E. Wheeler wrote:

Should those return bytea and citext, respectively? IOW, are these
right?

CREATE OR REPLACE FUNCTION citextrecv(bytea)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

To judge by the User-Defined Types docs, I was close. :-) I just
changed the argument to citextrecv to "internal".

Thanks,

David

#31

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#30)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

To judge by the User-Defined Types docs, I was close. :-) I just
changed the argument to citextrecv to "internal".

Right. The APIs for send and recv aren't inverses. (Oh, the sins
we'll commit in the name of performance ...)

regards, tom lane

#32

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#25)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 10:19, Tom Lane wrote:

You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

Thanks. Added to my list.

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it
present.
typ_sanity and oidjoins might possibly find issues too.

Um, stupid question (sorry, I'm not familiar with how the regression
tests are organized): serial_schedule doesn't look like a file into
which I can insert an SQL file. How would I do that?

Thanks,

David

#33

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#26)

Re: PATCH: CITEXT 2.0 v3

On Jul 13, 2008, at 10:31, Tom Lane wrote:

Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality?

There's some stuff under src/test/locale and src/test/mb, though it
doesn't get exercised regularly. The contrib tests are a particularly
bad place for trying to do any locale-dependent testing, because we
only support "make installcheck" which means there is no way to set
the database locale --- you *have to* work with whatever locale is
predetermined by the postmaster you're pointed at.

I don't recall the reason for not supporting "make check" in the
contrib
modules; maybe it was just that preparing a test installation for each
contrib module sounded too slow? In any case, what you'd need to
pursue
is having some additional tests available that are not executed by the
standard contrib regression test sequence.

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)

Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
and one for everything else, as described in the last three paragraphs
here?

http://www.postgresql.org/docs/current/static/regress-variant.html

That way, I can at least make sure that the multibyte stuff *does*
work. Even if it's tested on only one platform, the purpose is not to
test a particular collation, but to test that CITEXT is actually
sensitive to locale.

Thanks,

David

#34

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#20)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 14:57, Tom Lane wrote:

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works. The odds of ever finding
anything with those tests are not distinguishable from zero (unless
the
underlying text function is busted, which is not your responsibility
to
test). So I don't see any point in putting them into the standard
regression package. (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext
rather
than text.)

I'd like to keep these tests, since they ensure not just that the
functions work but that they work with citext. Given what we found
with length() and friends not working when there was an implicit cast
to bpchar, I think it's valuable to continue to ensure that these
functions work as expected with citext. Otherwise someone in the
future might come along and make the cast to bpchar implicit again,
and no tests would fail to tell him/her otherwise.

These tests make good regressions.

Thanks,

David

#35

Andrew Dunstan

andrew@dunslane.net

over 17 years ago

In reply to: David E. Wheeler (#33)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)

Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
and one for everything else, as described in the last three paragraphs
here?

http://www.postgresql.org/docs/current/static/regress-variant.html

That way, I can at least make sure that the multibyte stuff *does*
work. Even if it's tested on only one platform, the purpose is not to
test a particular collation, but to test that CITEXT is actually
sensitive to locale.

You can certainly add any tests you like. But the buildfarm does all its
regression tests using an install done with --no-locale. Any test that
requires some locale to work, or make sense, should probably not be in
your Makefile's REGRESS target.

Some time ago I raised the question of doing locale- and encoding-aware
regression tests, and IIRC the consensus seemed to be that it was not
worth the effort, but maybe we need to revisit that. We certainly seem
to be doing a lot more work now to get Postgres to where it needs to be
w.r.t. locale support, and we've cleaned up a lot of the encoding issues
we had, so maybe we need to reflect that in our testing regime more.

cheers

andrew

#36

Alvaro Herrera

alvherre@commandprompt.com

over 17 years ago

In reply to: Andrew Dunstan (#35)

testing locales and encodings

Andrew Dunstan wrote:

Some time ago I raised the question of doing locale- and encoding-aware
regression tests, and IIRC the consensus seemed to be that it was not
worth the effort, but maybe we need to revisit that. We certainly seem
to be doing a lot more work now to get Postgres to where it needs to be
w.r.t. locale support, and we've cleaned up a lot of the encoding issues
we had, so maybe we need to reflect that in our testing regime more.

I agree. Maybe we need to unearth the src/test/mb or src/test/locale;
perhaps have the buildfarm run those tests when they get to a useful
point.

I notice that src/test/locale is not VPATH-aware, and I'm unsure if it's
got a clear distinction of locales vs. encodings (it calls koi8-r a
locale).

src/test/mb is not VPATH aware either (and it doesn't have a Makefile at
all), and is now broken by the fact that createdb refuses to create a
database with mismatching encoding.

$ LC_ALL=C sh mbregress.sh
euc_jp .. createdb: database creation failed: ERROR: encoding EUC_JP does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
sjis .. failed
euc_kr .. createdb: database creation failed: ERROR: encoding EUC_KR does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_cn .. createdb: database creation failed: ERROR: encoding EUC_CN does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_tw .. createdb: database creation failed: ERROR: encoding EUC_TW does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
big5 .. failed
utf8 .. ok
mule_internal .. createdb: database creation failed: ERROR: encoding MULE_INTERNAL does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed

So this whole area would seem to need a lot of love ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#37

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#32)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 13, 2008, at 10:19, Tom Lane wrote:

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it
present.
typ_sanity and oidjoins might possibly find issues too.

Um, stupid question (sorry, I'm not familiar with how the regression
tests are organized): serial_schedule doesn't look like a file into
which I can insert an SQL file. How would I do that?

You're really making it into another test. Just copy the citext.sql
file into the sql/ subdirectory and add a "citext" entry to the schedule
file.

The last time I did this, I had to at least "touch" a corresponding
expected file (expected/citext.out) or pg_regress would go belly-up
without running the rest of the tests. There's no special need to
make the expected file match, of course, since you can just ignore
the "failure" report from that one test.

regards, tom lane

#38

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#33)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
and one for everything else, as described in the last three paragraphs
here?

The fallacy in that proposal is the assumption that there are only two
behaviors out there. Let me recalibrate your thoughts a bit: so far
I have tried citext on three different machines (Mac, Fedora 8, HPUX),
and I got three different answers from those tests. That's despite
endeavoring to make the database locales match ... which is less than
trivial in itself because they use three slightly different spellings of
"en_US.UTF8".

Given that you were more or less deliberately testing corner cases,
I think it's quite likely that the number of observable reactions from
N platforms would be more nearly O(N) than O(1).

In the real world, to the extent that we are able to control the locale
of the regression tests, we make it "C" --- and to a large extent we
can't control it at all, which means you have another uncontrolled
variable besides platform. So in the current universe there is
absolutely no value in submitting locale-specific tests for a contrib
module.

I see some discussion in the thread about improving the situation, but
until we are able to decouple database locale from environment locale,
I doubt we'll be able to do a whole lot about automating this kind
of test. There are too many variables at the moment.

regards, tom lane

#39

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#34)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 12, 2008, at 14:57, Tom Lane wrote:

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works.

I'd like to keep these tests, since they ensure not just that the
functions work but that they work with citext.

It might be reasonable to test a couple of them for that purpose.
If your agenda is "test every function in the system that comes or
might come in a bpchar variant", I think that's pointless.

regards, tom lane

#40

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Andrew Dunstan (#35)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

You can certainly add any tests you like. But the buildfarm does all
its regression tests using an install done with --no-locale. Any
test that requires some locale to work, or make sense, should
probably not be in your Makefile's REGRESS target.

Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?), the output would look like this, which
should more or less work (I'm hoping):

SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t WHERE
false and test_multibyte();
t
---
(0 rows)

Some time ago I raised the question of doing locale- and encoding-
aware regression tests, and IIRC the consensus seemed to be that it
was not worth the effort, but maybe we need to revisit that. We
certainly seem to be doing a lot more work now to get Postgres to
where it needs to be w.r.t. locale support, and we've cleaned up a
lot of the encoding issues we had, so maybe we need to reflect that
in our testing regime more.

Makes sense to me.

Best,

David

#41

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#38)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 07:24, Tom Lane wrote:

"David E. Wheeler" <david@kineticode.com> writes:

Could I supply two comparison files, one for Mac OS X with
en_US.UTF-8
and one for everything else, as described in the last three
paragraphs
here?

The fallacy in that proposal is the assumption that there are only two
behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the
type works properly with multibyte characters under multibyte-aware
locales. So I want to have tests to ensure that such is true by having
multibyte characters run under a very specific locale and platform. I
don't really care what platform or locale; the point is to make sure
that the type is actually multibyte-aware.

Let me recalibrate your thoughts a bit: so far
I have tried citext on three different machines (Mac, Fedora 8, HPUX),
and I got three different answers from those tests. That's despite
endeavoring to make the database locales match ... which is less than
trivial in itself because they use three slightly different
spellings of
"en_US.UTF8".

<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is
there no standardization of locales? I'm sure there are a lot of
opinions out there (should all uppercase chars should precede all
lowercase chars or be mixed in with lowercase chars), but I should
think that, in this day and age, there would be some sort of standard
defining locales and how they work -- and to allow such opinions to be
expressed by different locales, not in the same locale names on
different platforms.
</rant>

Given that you were more or less deliberately testing corner cases,
I think it's quite likely that the number of observable reactions from
N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific
platform/locale, does CITEXT respect the locale's rules?" I don't care
to test all platforms and locales (I'm not *that* stupid :-)).

In the real world, to the extent that we are able to control the
locale
of the regression tests, we make it "C" --- and to a large extent we
can't control it at all, which means you have another uncontrolled
variable besides platform. So in the current universe there is
absolutely no value in submitting locale-specific tests for a contrib
module.

Then how do we know that it will continue to be locale-aware over
time? Someone could replace the comparison function with one that just
lowercases ASCII characters, like CITEXT 1 does, and no tests would
fail. How do you prevent that from happening without being hyper-
vigilant (and never leaving the project, I might add)?

I see some discussion in the thread about improving the situation, but
until we are able to decouple database locale from environment locale,
I doubt we'll be able to do a whole lot about automating this kind
of test. There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to
happen anytime soon? Now that I've written CITEXT, I dare say that
such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I
retain the right to be opinionated, however. ;-)

Best,

David

#42

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#39)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 07:26, Tom Lane wrote:

I'd like to keep these tests, since they ensure not just that the
functions work but that they work with citext.

It might be reasonable to test a couple of them for that purpose.
If your agenda is "test every function in the system that comes or
might come in a bpchar variant", I think that's pointless.

Or a varchar variant, or where such a variant might be added in the
future. To my mind, it's important to have good coverage in my unit
tests to ensure that things continue to work exactly the same over time.

So, since the tests are already written, and are unlikely to add more
than a few milliseconds to test runtime, can you at least agree that
such tests are harmless?

Updated patch later today.

Thanks,

David

#43

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Alvaro Herrera (#36)

Re: testing locales and encodings

On Jul 14, 2008, at 06:49, Alvaro Herrera wrote:

So this whole area would seem to need a lot of love ...

Do the tests control for platforms, as well, since locales with the
same spellings can vary on different platforms? Or even on different
versions of the same platforms?

Thanks,

David

#44

Andrew Dunstan

andrew@dunslane.net

over 17 years ago

In reply to: David E. Wheeler (#40)

Re: PATCH: CITEXT 2.0 v3

David E. Wheeler wrote:

On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

You can certainly add any tests you like. But the buildfarm does all
its regression tests using an install done with --no-locale. Any test
that requires some locale to work, or make sense, should probably not
be in your Makefile's REGRESS target.

Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?),

No.

--no-locale is the same as --locale=C which means the encoding is SQL_ASCII.

cheers

andrew

#45

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Andrew Dunstan (#44)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 10:55, Andrew Dunstan wrote:

No.

--no-locale is the same as --locale=C which means the encoding is
SQL_ASCII.

I've used --no-locale --encoding UTF-8 many times. But if the
regressions run with SQL_ASCII…well, I'm just amazed that there
haven't been more unexpected side-effects to source code changes
without controlling for such variables in tests. Seems like a lot to
keep in one's head.

Anyway, are tests for contrib modules run on the build farms? If so, I
could still potentially get around this issue by turning off ECHO.

Best,

David

#46

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#40)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?),

No, they use C.

regards, tom lane

#47

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#46)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 10:58, Tom Lane wrote:

Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?),

No, they use C.

Um, you mean SQL_ASCII?

Best,

David

#48

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#41)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 14, 2008, at 07:24, Tom Lane wrote:

The fallacy in that proposal is the assumption that there are only two
behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the
type works properly with multibyte characters under multibyte-aware
locales. So I want to have tests to ensure that such is true by having
multibyte characters run under a very specific locale and platform.

[ shrug... ] Seems pretty useless to me: we already know that it works
for you. The point of a regression test in my mind is to make sure that
it works for everybody. Given the platform variations involved in
strcoll's behavior, the definition of "works for everybody" is going to
be pretty darn narrowly circumscribed anyway, and thus I don't have a
big problem with restricting the tests to ASCII cases.

Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll? No, you're not. So you might as well just test the
behavior of the code that's actually under your control.

regards, tom lane

#49

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#48)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 11:06, Tom Lane wrote:

[ shrug... ] Seems pretty useless to me: we already know that it
works
for you. The point of a regression test in my mind is to make sure
that
it works for everybody. Given the platform variations involved in
strcoll's behavior, the definition of "works for everybody" is going
to
be pretty darn narrowly circumscribed anyway, and thus I don't have a
big problem with restricting the tests to ASCII cases.

Neither do I, as long as there is *some* context to ensure that the
type remains locale-aware. We only know that it works for me because
I've written the tests.

Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll? No, you're not. So you might as well just test
the
behavior of the code that's actually under your control.

You don't seem to understand what I'm suggesting: I'm not talking
about testing strcoll. I'm talking about making sure that citext
*uses* strcoll. Whether or not strcoll actually works properly is not
my concern.

Best,

David

#50

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#49)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

On Jul 14, 2008, at 11:06, Tom Lane wrote:

Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll? No, you're not. So you might as well just test
the
behavior of the code that's actually under your control.

You don't seem to understand what I'm suggesting: I'm not talking
about testing strcoll. I'm talking about making sure that citext
*uses* strcoll.

This seems pointless, as well as essentially impossible to enforce by
black-box testing. Nobody is likely to consider a patch that removes
such obviously basic functionality of the module. I think you're
worrying about testing the wrong things --- and as evidence I'd note the
serious errors that slipped by your testing (including the fact that the
last submitted version would still claim that 'a' = 'ab'). What we
generally ask a regression test to do is catch portability problems
(which is certainly a real-enough issue here, but we don't know how
to address it well) or catch foreseeable breakage (such as someone
introducing a cast that breaks resolution of citext function calls).
The methodology can't catch everything, and trying to push it beyond
what it can do is just a time sink for effort that can more usefully
be spent other ways, such as code-reading.

regards, tom lane

#51

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#50)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 11:49, Tom Lane wrote:

You don't seem to understand what I'm suggesting: I'm not talking
about testing strcoll. I'm talking about making sure that citext
*uses* strcoll.

This seems pointless, as well as essentially impossible to enforce by
black-box testing. Nobody is likely to consider a patch that removes
such obviously basic functionality of the module.

Never underestimate the human capacity for incompetence. One of my
mottos.

I think you're
worrying about testing the wrong things --- and as evidence I'd note
the
serious errors that slipped by your testing (including the fact that
the
last submitted version would still claim that 'a' = 'ab').

Um, say what? I had that problem at one time, but it was when I was
writing my own lower() function, which was shitty (I wasn't creating a
nul-terminated string). I don't recall that being in any patch I sent
to the list, and indeed you wrote that very test in the revised test
script you sent in.

At any rate, I make no claims that my tests properly covered
everything. There is a lot I didn't know to test, but I'm learning
more all the time. That's why this code review has been so immensely
valuable, both for me and for CITEXT.

What we
generally ask a regression test to do is catch portability problems
(which is certainly a real-enough issue here, but we don't know how
to address it well) or catch foreseeable breakage (such as someone
introducing a cast that breaks resolution of citext function calls).
The methodology can't catch everything, and trying to push it beyond
what it can do is just a time sink for effort that can more usefully
be spent other ways, such as code-reading.

I guess what I'm thinking of is not regression tests but unit tests.
And with unit testing, one of the goals is coverage. Good coverage has
saved my ass many times in the past, even catching changes that, in
hindsight, should obviously never have been attempted. Perhaps it'd be
worth thinking about creating a project for unit testing PostgreSQL in
controlled environments? Maybe having tests that can contain proper
conditionals?

Anyway, it seems that, given the limitations of the current testing
system with regard to testing multibyte characters, the issue is moot.
I'll throw in a few tests that test multibyte characters, but I'll
comment them out and just uncomment them for individual test runs on
my box for the purposes of my own sanity going forward.

Thanks,

David

#52

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: David E. Wheeler (#21)

Re: PATCH: CITEXT 2.0 v3

On Jul 12, 2008, at 15:13, David E. Wheeler wrote:

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL. This is a very bad thing. Speed
of
regression tests matters a lot to those of us who run them a dozen
times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Hrm. I'm wonder why it's so slow? The test functions don't really do
a lot. Anyway, I agree that they should perform well.

Just as an FYI, I've just moved all the tests to regular SQL instead
of using pgTAP. The difference in runtime is:

psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 19% cpu 0.253
total
psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 4% cpu 1.298
total

So it's close to a factor of five, though subtract .125 for the time
to load the pgTAP functions. The pgTAP tests *are* doing a lot more
work, but I'm sure that they could be made a lot more efficient,
though of course the TAP functions will always introduce some
overhead. One just needs to decide whether the tradeoffs are worth it.

Best,

David

#53

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#37)

Re: PATCH: CITEXT 2.0 v3

On Jul 14, 2008, at 07:08, Tom Lane wrote:

You're really making it into another test. Just copy the citext.sql
file into the sql/ subdirectory and add a "citext" entry to the
schedule
file.

The last time I did this, I had to at least "touch" a corresponding
expected file (expected/citext.out) or pg_regress would go belly-up
without running the rest of the tests. There's no special need to
make the expected file match, of course, since you can just ignore
the "failure" report from that one test.

Okay, I copied citext.sql into src/test/regress/sql and then added
"test: citext" to the top of src/test/regress/serial_schedule. Then I
ran `make check`. All tests passed, but I don't think that citext was
tested.

Do I need to install the server, build a cluster, and run `make
installcheck`?

Thanks for the hand-holding.

Best,

David

#54

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#53)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Okay, I copied citext.sql into src/test/regress/sql and then added
"test: citext" to the top of src/test/regress/serial_schedule. Then I
ran `make check`. All tests passed, but I don't think that citext was
tested.
Do I need to install the server, build a cluster, and run `make
installcheck`?

Yeah, probably. I don't think the "make check" path will support it
because it doesn't install contrib into the temp installation.
(You'd also need to have put the extra entry in parallel_schedule
not serial_schedule, but it's not gonna work anyway.)

regards, tom lane

#55

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#54)

1 attachment(s)

Re: PATCH: CITEXT 2.0 v3

On Jul 15, 2008, at 07:09, Tom Lane wrote:

Yeah, probably. I don't think the "make check" path will support it
because it doesn't install contrib into the temp installation.
(You'd also need to have put the extra entry in parallel_schedule
not serial_schedule, but it's not gonna work anyway.)

Well now that was cool to see. I got some failures, of course, but
nothing stands out to me as an obvious bug. I attach the diffs file
(with the citext.sql failure removed) for your perusal. What would be
the best way for me to resolve those permission issues? Or do they
matter for sanity-checking citext?

Thanks,

David

Attachments:

regression.diffsapplication/octet-stream; name=regression.diffs; x-unix-mode=0644Download

--- ./expected/opr_sanity.out	Mon Jul 14 21:55:49 2008
+++ ./results/opr_sanity.out	Tue Jul 15 11:03:04 2008
@@ -87,8 +87,10 @@
      p1.provolatile != p2.provolatile OR
      p1.pronargs != p2.pronargs);
  oid | proname | oid | proname 
------+---------+-----+---------
-(0 rows)
+------+----------+-------+------------
+ 2414 | textrecv | 16478 | citextrecv
+ 2415 | textsend | 16479 | citextsend
+(2 rows)
 
 -- Look for uses of different type OIDs in the argument/result type fields
 -- for different aliases of the same built-in function.
@@ -110,8 +112,9 @@
  prorettype | prorettype 
 ------------+------------
          25 |       1043
+         25 |      16475
        1114 |       1184
-(2 rows)
+(3 rows)
 
 SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
 FROM pg_proc AS p1, pg_proc AS p2
@@ -124,10 +127,12 @@
 -------------+-------------
           25 |        1042
           25 |        1043
+          25 |       16475
+        1042 |       16475
         1114 |        1184
         1560 |        1562
         2277 |        2283
-(5 rows)
+(7 rows)
 
 SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
 FROM pg_proc AS p1, pg_proc AS p2
@@ -139,10 +144,11 @@
  proargtypes | proargtypes 
 -------------+-------------
           23 |          28
+          25 |       16475
         1114 |        1184
         1560 |        1562
         2277 |        2283
-(4 rows)
+(5 rows)
 
 SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
 FROM pg_proc AS p1, pg_proc AS p2
@@ -305,7 +311,8 @@
         142 |         25 |        0 | a
         142 |       1043 |        0 | a
         142 |       1042 |        0 | a
-(6 rows)
+      16475 |       1042 |        0 | a
+(7 rows)
 
 -- **************** pg_operator ****************
 -- Look for illegal values in pg_operator fields.

======================================================================

--- ./expected/copy.out	Tue Jul 15 11:03:00 2008
+++ ./results/copy.out	Tue Jul 15 11:03:06 2008
@@ -7,6 +7,7 @@
 COPY aggtest FROM '/Users/david/dev/pgsql.citext/src/test/regress/data/agg.data';
 COPY onek FROM '/Users/david/dev/pgsql.citext/src/test/regress/data/onek.data';
 COPY onek TO '/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data';
+ERROR:  could not open file "/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data" for writing: Permission denied
 DELETE FROM onek;
 COPY onek FROM '/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data';
 COPY tenk1 FROM '/Users/david/dev/pgsql.citext/src/test/regress/data/tenk.data';
@@ -45,16 +46,27 @@
 insert into copytest values('Mac',E'abc\rdef',3);
 insert into copytest values(E'esc\\ape',E'a\\r\\\r\\\n\\nb',4);
 copy copytest to '/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv' csv;
+ERROR:  could not open file "/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv" for writing: Permission denied
 create temp table copytest2 (like copytest);
 copy copytest2 from '/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv' csv;
+ERROR:  missing data for column "filler"
+CONTEXT:  COPY copytest2, line 1: "DOS,'abc"
 select * from copytest except select * from copytest2;
  style | test | filler 
--------+------+--------
-(0 rows)
+---------+----------+--------
+ DOS     | abc\r    |      1
+         : def        
+ Mac     | abc\rdef |      3
+ Unix    | abc      |      2
+         : def        
+ esc\ape | a\r\\r\  |      4
+         : \nb        
+(4 rows)
 
 truncate copytest2;
 --- same test but with an escape char different from quote char
 copy copytest to '/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv' csv quote '''' escape E'\\';
+ERROR:  could not open file "/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv" for writing: Permission denied
 copy copytest2 from '/Users/david/dev/pgsql.citext/src/test/regress/results/copytest.csv' csv quote '''' escape E'\\';
 select * from copytest except select * from copytest2;
  style | test | filler 

======================================================================

--- ./expected/misc.out	Tue Jul 15 11:03:00 2008
+++ ./results/misc.out	Tue Jul 15 11:03:15 2008
@@ -42,6 +42,7 @@
 -- copy
 --
 COPY onek TO '/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data';
+ERROR:  could not open file "/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data" for writing: Permission denied
 DELETE FROM onek;
 COPY onek FROM '/Users/david/dev/pgsql.citext/src/test/regress/results/onek.data';
 SELECT unique1 FROM onek WHERE unique1 < 2 ORDER BY unique1;
@@ -61,6 +62,7 @@
 (2 rows)
 
 COPY BINARY stud_emp TO '/Users/david/dev/pgsql.citext/src/test/regress/results/stud_emp.data';
+ERROR:  could not open file "/Users/david/dev/pgsql.citext/src/test/regress/results/stud_emp.data" for writing: Permission denied
 DELETE FROM stud_emp;
 COPY BINARY stud_emp FROM '/Users/david/dev/pgsql.citext/src/test/regress/results/stud_emp.data';
 SELECT * FROM stud_emp;

======================================================================

--- ./expected/largeobject.out	Tue Jul 15 11:03:00 2008
+++ ./results/largeobject.out	Tue Jul 15 11:03:25 2008
@@ -246,11 +246,7 @@
 
 END;
 SELECT lo_export(loid, '/Users/david/dev/pgsql.citext/src/test/regress/results/lotest.txt') FROM lotest_stash_values;
- lo_export 
------------
-         1
-(1 row)
-
+ERROR:  could not create server file "/Users/david/dev/pgsql.citext/src/test/regress/results/lotest.txt": Permission denied
 \lo_import 'results/lotest.txt'
 \set newloid :LASTOID
 -- just make sure \lo_export does not barf

======================================================================

--- ./expected/tablespace.out	Tue Jul 15 11:03:00 2008
+++ ./results/tablespace.out	Tue Jul 15 11:03:29 2008
@@ -1,51 +1,56 @@
 -- create a tablespace we can use
 CREATE TABLESPACE testspace LOCATION '/Users/david/dev/pgsql.citext/src/test/regress/testtablespace';
+ERROR:  could not set permissions on directory "/Users/david/dev/pgsql.citext/src/test/regress/testtablespace": Operation not permitted
 -- create a schema we can use
 CREATE SCHEMA testschema;
 -- try a table
 CREATE TABLE testschema.foo (i int) TABLESPACE testspace;
+ERROR:  tablespace "testspace" does not exist
 SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
     where c.reltablespace = t.oid AND c.relname = 'foo';
  relname |  spcname  
----------+-----------
- foo     | testspace
-(1 row)
+---------+---------
+(0 rows)
 
 INSERT INTO testschema.foo VALUES(1);
+ERROR:  relation "testschema.foo" does not exist
 INSERT INTO testschema.foo VALUES(2);
+ERROR:  relation "testschema.foo" does not exist
 -- tables from dynamic sources
 CREATE TABLE testschema.asselect TABLESPACE testspace AS SELECT 1;
+ERROR:  tablespace "testspace" does not exist
 SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
     where c.reltablespace = t.oid AND c.relname = 'asselect';
  relname  |  spcname  
-----------+-----------
- asselect | testspace
-(1 row)
+---------+---------
+(0 rows)
 
 PREPARE selectsource(int) AS SELECT $1;
 CREATE TABLE testschema.asexecute TABLESPACE testspace
     AS EXECUTE selectsource(2);
+ERROR:  tablespace "testspace" does not exist
 SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
     where c.reltablespace = t.oid AND c.relname = 'asexecute';
   relname  |  spcname  
------------+-----------
- asexecute | testspace
-(1 row)
+---------+---------
+(0 rows)
 
 -- index
 CREATE INDEX foo_idx on testschema.foo(i) TABLESPACE testspace;
+ERROR:  relation "testschema.foo" does not exist
 SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
     where c.reltablespace = t.oid AND c.relname = 'foo_idx';
  relname |  spcname  
----------+-----------
- foo_idx | testspace
-(1 row)
+---------+---------
+(0 rows)
 
 -- let's try moving a table from one place to another
 CREATE TABLE testschema.atable AS VALUES (1), (2);
 CREATE UNIQUE INDEX anindex ON testschema.atable(column1);
 ALTER TABLE testschema.atable SET TABLESPACE testspace;
+ERROR:  tablespace "testspace" does not exist
 ALTER INDEX testschema.anindex SET TABLESPACE testspace;
+ERROR:  tablespace "testspace" does not exist
 INSERT INTO testschema.atable VALUES(3);	-- ok
 INSERT INTO testschema.atable VALUES(1);	-- fail (checks index)
 ERROR:  duplicate key value violates unique constraint "anindex"
@@ -63,12 +68,9 @@
 ERROR:  tablespace "nosuchspace" does not exist
 -- Fail, not empty
 DROP TABLESPACE testspace;
-ERROR:  tablespace "testspace" is not empty
+ERROR:  tablespace "testspace" does not exist
 DROP SCHEMA testschema CASCADE;
-NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table testschema.foo
-drop cascades to table testschema.asselect
-drop cascades to table testschema.asexecute
-drop cascades to table testschema.atable
+NOTICE:  drop cascades to table testschema.atable
 -- Should succeed
 DROP TABLESPACE testspace;
+ERROR:  tablespace "testspace" does not exist

======================================================================

#56

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#55)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

Well now that was cool to see. I got some failures, of course, but
nothing stands out to me as an obvious bug. I attach the diffs file
(with the citext.sql failure removed) for your perusal. What would be
the best way for me to resolve those permission issues?

Don't run the tests in a read-only directory, perhaps.

Or do they matter for sanity-checking citext?

Hard to tell --- I'd suggest trying to get a clean run. As for what you
have, the first diff hunk suggests you've got the wrong function
properties for citextsend/citextrecv.

regards, tom lane

#57

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#56)

Re: PATCH: CITEXT 2.0 v3

On Jul 15, 2008, at 12:56, Tom Lane wrote:

Don't run the tests in a read-only directory, perhaps.

Yes, I changed the owner to the postgres system user and that did the
trick.

Or do they matter for sanity-checking citext?

Hard to tell --- I'd suggest trying to get a clean run. As for what
you
have, the first diff hunk suggests you've got the wrong function
properties for citextsend/citextrecv.

Here's the new diff:

*** ./expected/opr_sanity.out	Mon Jul 14 21:55:49 2008
--- ./results/opr_sanity.out	Tue Jul 15 17:41:03 2008
***************
*** 87,94 ****
        p1.provolatile != p2.provolatile OR
        p1.pronargs != p2.pronargs);
    oid | proname | oid | proname
! -----+---------+-----+---------
! (0 rows)

   -- Look for uses of different type OIDs in the argument/result type  
fields
   -- for different aliases of the same built-in function.
--- 87,96 ----
        p1.provolatile != p2.provolatile OR
        p1.pronargs != p2.pronargs);
    oid  | proname  |  oid  |  proname
! ------+----------+-------+------------
!  2414 | textrecv | 87258 | citextrecv
!  2415 | textsend | 87259 | citextsend
! (2 rows)

-- Look for uses of different type OIDs in the argument/result type
fields
-- for different aliases of the same built-in function.
***************
*** 110,117 ****
prorettype | prorettype
------------+------------
25 | 1043
1114 | 1184
! (2 rows)

   SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
   FROM pg_proc AS p1, pg_proc AS p2
--- 112,120 ----
    prorettype | prorettype
   ------------+------------
            25 |       1043
+          25 |      87255
          1114 |       1184
! (3 rows)

SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 124,133 ****
-------------+-------------
25 | 1042
25 | 1043
1114 | 1184
1560 | 1562
2277 | 2283
! (5 rows)

   SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
   FROM pg_proc AS p1, pg_proc AS p2
--- 127,138 ----
   -------------+-------------
             25 |        1042
             25 |        1043
+           25 |       87255
+         1042 |       87255
           1114 |        1184
           1560 |        1562
           2277 |        2283
! (7 rows)

SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 139,148 ****
proargtypes | proargtypes
-------------+-------------
23 | 28
1114 | 1184
1560 | 1562
2277 | 2283
! (4 rows)

   SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
   FROM pg_proc AS p1, pg_proc AS p2
--- 144,154 ----
    proargtypes | proargtypes
   -------------+-------------
             23 |          28
+           25 |       87255
           1114 |        1184
           1560 |        1562
           2277 |        2283
! (5 rows)

SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 305,311 ****
142 | 25 | 0 | a
142 | 1043 | 0 | a
142 | 1042 | 0 | a
! (6 rows)

   -- **************** pg_operator ****************
   -- Look for illegal values in pg_operator fields.
--- 311,318 ----
           142 |         25 |        0 | a
           142 |       1043 |        0 | a
           142 |       1042 |        0 | a
!       87255 |       1042 |        0 | a
! (7 rows)

-- **************** pg_operator ****************
-- Look for illegal values in pg_operator fields.

======================================================================

So I guess my question is: what is wrong with the properties for
citextsend/citextrecv and what else might these failures be indicating
is wrong?

CREATE OR REPLACE FUNCTION citextrecv(internal)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE TYPE citext (
INPUT = citextin,
OUTPUT = citextout,
RECEIVE = citextrecv,
SEND = citextsend,
INTERNALLENGTH = VARIABLE,
STORAGE = extended
);

Thanks,

David

#58

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David E. Wheeler (#57)

Re: PATCH: CITEXT 2.0 v3

"David E. Wheeler" <david@kineticode.com> writes:

So I guess my question is: what is wrong with the properties for
citextsend/citextrecv

[ checks catalogs... ] textsend and textrecv are marked STABLE not
IMMUTABLE. I am not totally sure about the reasoning offhand --- it
might be because their behavior depends on client_encoding.

and what else might these failures be indicating
is wrong?

I think the other diffs are okay, they just reflect the fact that you're
depending on binary equivalence of text and citext.

regards, tom lane

#59

David E. Wheeler

david@kineticode.com

over 17 years ago

In reply to: Tom Lane (#58)

1 attachment(s)

Re: PATCH: CITEXT 2.0 v3

On Jul 15, 2008, at 20:26, Tom Lane wrote:

"David E. Wheeler" <david@kineticode.com> writes:

So I guess my question is: what is wrong with the properties for
citextsend/citextrecv

[ checks catalogs... ] textsend and textrecv are marked STABLE not
IMMUTABLE. I am not totally sure about the reasoning offhand --- it
might be because their behavior depends on client_encoding.

Thanks. Looks like maybe the xtypes docs need to be updated?

http://www.postgresql.org/docs/8.3/static/xtypes.html

Anyway, changing them to "STABLE STRICT" appears to have done the
trick (diff attached).

and what else might these failures be indicating
is wrong?

I think the other diffs are okay, they just reflect the fact that
you're
depending on binary equivalence of text and citext.

Great, thanks. And with that, I think I'm just about ready to submit a
new version of the patch, coming up shortly.

Best,

David

Attachments:

regression.diffsapplication/octet-stream; name=regression.diffs; x-unix-mode=0644Download

*** ./expected/opr_sanity.out	Mon Jul 14 21:55:49 2008
--- ./results/opr_sanity.out	Tue Jul 15 21:49:24 2008
***************
*** 110,117 ****
   prorettype | prorettype 
  ------------+------------
           25 |       1043
         1114 |       1184
! (2 rows)
  
  SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
  FROM pg_proc AS p1, pg_proc AS p2
--- 110,118 ----
   prorettype | prorettype 
  ------------+------------
           25 |       1043
+          25 |     110858
         1114 |       1184
! (3 rows)
  
  SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
  FROM pg_proc AS p1, pg_proc AS p2
***************
*** 124,133 ****
  -------------+-------------
            25 |        1042
            25 |        1043
          1114 |        1184
          1560 |        1562
          2277 |        2283
! (5 rows)
  
  SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
  FROM pg_proc AS p1, pg_proc AS p2
--- 125,136 ----
  -------------+-------------
            25 |        1042
            25 |        1043
+           25 |      110858
+         1042 |      110858
          1114 |        1184
          1560 |        1562
          2277 |        2283
! (7 rows)
  
  SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
  FROM pg_proc AS p1, pg_proc AS p2
***************
*** 139,148 ****
   proargtypes | proargtypes 
  -------------+-------------
            23 |          28
          1114 |        1184
          1560 |        1562
          2277 |        2283
! (4 rows)
  
  SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
  FROM pg_proc AS p1, pg_proc AS p2
--- 142,152 ----
   proargtypes | proargtypes 
  -------------+-------------
            23 |          28
+           25 |      110858
          1114 |        1184
          1560 |        1562
          2277 |        2283
! (5 rows)
  
  SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
  FROM pg_proc AS p1, pg_proc AS p2
***************
*** 305,311 ****
          142 |         25 |        0 | a
          142 |       1043 |        0 | a
          142 |       1042 |        0 | a
! (6 rows)
  
  -- **************** pg_operator ****************
  -- Look for illegal values in pg_operator fields.
--- 309,316 ----
          142 |         25 |        0 | a
          142 |       1043 |        0 | a
          142 |       1042 |        0 | a
!      110858 |       1042 |        0 | a
! (7 rows)
  
  -- **************** pg_operator ****************
  -- Look for illegal values in pg_operator fields.

======================================================================