Comparison semantics of CHAR data type

Started by Thomas Fanghaenelover 12 years ago8 messages

tfanghaenel@salesforce.com

over 12 years ago

I was wondering about the proper semantics of CHAR comparisons in some
corner cases that involve control characters with values that are less than
0x20 (space).

Consider the following testcase:

===
create table t (a int, b char(10));

insert into t values (1, 'foo');
insert into t values (2, 'foo ');
insert into t values (3, E'foo\t');
insert into t values (4, E'foo\n');
insert into t values (5, E'foo \n');
insert into t values (6, 'foobar');

select * from t order by b;
===

What's the proper order of these string values in the CHAR domain? The way
I interpret the SQL Standard (and assuming that \t and \n collate lower
than a space), it's supposed to be this:

(3) < (4) < (5) < (1) <= (2) < (6)

Postgres comes up with this:

(1) <= (2) < (3) < (4) < (5) < (6)

The reason is that the bpchar functions that implement the relative
comparison operators for CHAR(n) effectively strip trailing whitespaces
before doing the comparison. One might argue that doing this is not
correct. The standard seems to mandate that all CHAR(n) values are
actually considered to be of width n, and that trailing spaces are indeed
relevant for comparison. In other words, stripping them would only be
possible if it can be guaranteed that there are no characters in the
character set that collate lower than a space.

Any thoughts on this? I searched the mailing list archives, but couldn't
find any relevant discussion. There were plenty of threads that argue
whether or not it's semantically correct to strip trailing spaces from
CHAR(n) values, but the issue of characters collating below a space does
not seem to have brought up in any of those discussions before.

Cheers,

-- Thomas

Bruce Momjian

bruce@momjian.us

over 12 years ago

In reply to: Thomas Fanghaenel (#1)

Re: [SQL] Comparison semantics of CHAR data type

On Sun, Sep 22, 2013 at 08:51:26PM -0400, Thomas Fanghaenel wrote:

I was wondering about the proper semantics of CHAR comparisons in some corner
cases that involve control characters with values that are less than 0x20
(space).

Consider the following testcase:

===
create table t (a int, b char(10));

insert into t values (1, 'foo');
insert into t values (2, 'foo ');
insert into t values (3, E'foo\t');
insert into t values (4, E'foo\n');
insert into t values (5, E'foo \n');
insert into t values (6, 'foobar');

select * from t order by b;
===

What's the proper order of these string values in the CHAR domain? The way I
interpret the SQL Standard (and assuming that \t and \n collate lower than a
space), it's supposed to be this:

(3) < (4) < (5) < (1) <= (2) < (6)

Postgres comes up with this:

(1) <= (2) < (3) < (4) < (5) < (6)

The reason is that the bpchar functions that implement the relative comparison
operators for CHAR(n) effectively strip trailing whitespaces before doing the
comparison. One might argue that doing this is not correct. The standard
seems to mandate that all CHAR(n) values are actually considered to be of width
n, and that trailing spaces are indeed relevant for comparison. In other
words, stripping them would only be possible if it can be guaranteed that there
are no characters in the character set that collate lower than a space.

Any thoughts on this? I searched the mailing list archives, but couldn't find
any relevant discussion. There were plenty of threads that argue whether or
not it's semantically correct to strip trailing spaces from CHAR(n) values, but
the issue of characters collating below a space does not seem to have brought
up in any of those discussions before.

[I am moving this thread to hackers because I think it needs internals
review.]

You have some good questions here, though there are two interrelated
things going on here. First is collation, and the second is the
trimming of spaces from char() comparisons. Let's look at collation
first:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?
----------
ab
a c
(2 rows)

You will notice spaces are not considered important in a UTF8 collation.
If we do this in the C collation, we get a different result:

test=> CREATE DATABASE test2 WITH LC_COLLATE = 'C' TEMPLATE template0;
CREATE DATABASE
test=> \c test2
You are now connected to database "test2" as user "postgres".
test2=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?
----------
a c
ab
(2 rows)

Also, when using ORDER BY, it isn't clear if the values are ordered that
way due to being greater/less-than, or just randomly. For example, I
found your example above gave different ordering if I inserted the
values differently, using a UTF8 collation.

Let me use comparisons instead using UTF8 for clarity:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select 'a c' < 'ab';
?column?
----------
f
(1 row)

and "C":

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select 'a c' < 'ab';
?column?
----------
t
(1 row)

Now, let's look at ASCII characters less than space, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'a\nb' < E'a b';
?column?
----------
f
(1 row)

and in C:

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'a\nb' < E'a b';
?column?
----------
t
(1 row)

You can see that newline is greater than space in UTF8, but not in C.

Now, on to the trailing space issue using the default TEXT value for
strings, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n' < E'ab ';
?column?
----------
f
(1 row)

then in "C":

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n' < E'ab ';
?column?
----------
t
(1 row)

This matches the \n/space issue we saw above. Now, here is where CHAR()
starts to show the unusual behavior you saw, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

then in C:

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

You can see the UTF8 case is fine because \n is considered greater than
space, but in the C locale, where \n is less than space, the false
return value shows the problem with internal_bpchar_pattern_compare()
trimming the string and first comparing on lengths. This is exactly the
problem you outline, where space trimming assumes everything is less
than a space.

I am sorry for this long email, but I would be interested to see what
other hackers think about this issue.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kevin Grittner

kgrittn@ymail.com

over 12 years ago

In reply to: Bruce Momjian (#2)

Re: [SQL] Comparison semantics of CHAR data type

Bruce Momjian <bruce@momjian.us> wrote:

Thomas Fanghaenel wrote:

I was wondering about the proper semantics of CHAR comparisons in some corner
cases that involve control characters with values that are less than 0x20
(space).

What matters in general isn't where the characters fall when
comparing individual bytes, but how the strings containing them
sort according to the applicable collation. That said, my
recollection of the spec is that when two CHAR(n) values are
compared, the shorter should be blank-padded before making the
comparison. *That* said, I think the general advice is to stay
away from CHAR(n) in favor or VARCHAR(n) or TEXT, and I think that
is good advice.

I am sorry for this long email, but I would be interested to see what
other hackers think about this issue.

Since we only have the CHAR(n) type to improve compliance with the
SQL specification, and we don't generally encourage its use, I
think we should fix any non-compliant behavior. That seems to mean
that if you take two CHAR values and compare them, it should give
the same result as comparing the same two values as VARCHAR using
the same collation with the shorter value padded with spaces.

So this is correct:

test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

... because it matches:

test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
?column?
----------
t
(1 row)

But this is incorrect:

test=# select 'ab'::char(3) collate "C" < E'ab\n'::char(3) collate "C";
?column?
----------
t
(1 row)

... because it doesn't match:

test=# select 'ab '::varchar(3) collate "C" < E'ab\n'::varchar(3) collate "C";
?column?
----------
f
(1 row)

Of course, I have no skin in the game, because it took me about two
weeks after my first time converting a database with CHAR columns
to PostgreSQL to change them all to VARCHAR, and do that as part of
all future conversions.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Thomas Fanghaenel

tfanghaenel@salesforce.com

about 12 years ago

In reply to: Kevin Grittner (#3)

Re: [SQL] Comparison semantics of CHAR data type

On Fri, Oct 11, 2013 at 3:44 PM, Bruce Momjian <bruce@momjian.us> wrote:

You have some good questions here, though there are two interrelated
things going on here. First is collation, and the second is the
trimming of spaces from char() comparisons.

Sorry, I should have probably mentioned more explicitly that I'm only
concerned about the latter. I might get yelled at for saying this, but
I think the rather ubiquitous right-trimming of bpchar values is wrong,
and does - in combination with memcmp() for comparison - lead to some
non-standard behavior.

Now, on to the trailing space issue using the default TEXT value for
strings, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n' < E'ab ';
?column?
----------
f
(1 row)

then in "C":

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n' < E'ab ';
?column?
----------
t
(1 row)

This matches the \n/space issue we saw above. Now, here is where
CHAR() starts to show the unusual behavior you saw, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

then in C:

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

You can see the UTF8 case is fine because \n is considered greater
than space, but in the C locale, where \n is less than space, the
false return value shows the problem with
internal_bpchar_pattern_compare() trimming the string and first
comparing on lengths. This is exactly the problem you outline, where
space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

For example, this seems to be the case on OS X:

select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";
?column?
----------
t
(1 row)

select 'ab '::varchar(10) collate "en_US" <
E'ab\n'::varchar(10) collate "en_US";
?column?
----------
f
(1 row)

select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
collate "C";
?column?
----------
f
(1 row)

So here there's actually not only the same \n/space issue as in the C
collation (which would go away if the bpchar value weren't trimmed).
It also shows that there might be slight differences in behavior,
depending which platform you're running on.

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

What matters in general isn't where the characters fall when comparing
individual bytes, but how the strings containing them sort according
to the applicable collation. That said, my recollection of the spec
is that when two CHAR(n) values are compared, the shorter should be
blank-padded before making the comparison.

Not necessarily. The SQL Standard actually ties this to the collation
sequence that is in use. Without a lot of context, this is from
Subclause 8.2, "<comparison predicate>", General Rule 3)b):

b) If the length in characters of X is not equal to the length in
characters of Y, then the shorter string is effectively replaced,
for the purposes of comparison, with a copy of itself that has been
extended to the length of the longer string by concatenation on the
right of one or more pad characters, where the pad character is
chosen based on CS. If CS has the NO PAD characteristic, then the
pad character is an implementation-dependent character different
from any character in the character set of X and Y that collates
less than any string under CS. Otherwise, the pad character is a
<space>.

In my opinion, that's just a lot of handwaving, to the extent that in
practice different vendors interpret this clause differently. It seems
that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
has uses NO PAD semantics whenever there's a VARCHAR type involved in
the comparison.

But all that is actually a whole different can of worms, and slightly
besides the point of my original question. How to properly compare
strings with different lentgths has been discussed before, see for
instance the thread in [1]/messages/by-id/A737B7A37273E048B164557ADEF4A58B0579A7AB@ntex2010a.host.magwien.gv.at. My intention was not to get that started
again. As far as I can see, the consensus seems to be that when using
the C locale, string comparisons should be done using NO PAD semantics.
(It sure gives some strange semantics if you have varchars with trailing
spaces, but it's perfectly legal.)

The point is that my testcase deals with strings of the same length.
Thus, the above clause doesn't really apply. The standard, to my
understanding, says that fixed-length character values are padded when
the row is constructed. And once that happens, those spaces become part
of the value. It's invalid to strip them, unless done explicitly.

Since we only have the CHAR(n) type to improve compliance with the SQL
specification, and we don't generally encourage its use, I think we
should fix any non-compliant behavior. That seems to mean that if you
take two CHAR values and compare them, it should give the same result
as comparing the same two values as VARCHAR using the same collation
with the shorter value padded with spaces.

So this is correct:

test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

... because it matches:

test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
?column?
----------
t
(1 row)

Again, this touches on the same point as Bruce's example above. Right
now these two queries might produce identical results on Linux, because
of the way strcoll() behaves. On OS X you get different results:

select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
collate "en_US";
?column?
----------
f
(1 row)

I have no idea who's right. But doesn't this count as evidence that
right-trimming pbchars is not even safe with the en_US collation?

-- Thomas

[1]: /messages/by-id/A737B7A37273E048B164557ADEF4A58B0579A7AB@ntex2010a.host.magwien.gv.at

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Bruce Momjian <bruce@momjian.us> wrote:

Thomas Fanghaenel wrote:

I was wondering about the proper semantics of CHAR comparisons in some corner
cases that involve control characters with values that are less than 0x20
(space).

What matters in general isn't where the characters fall when
comparing individual bytes, but how the strings containing them
sort according to the applicable collation. That said, my
recollection of the spec is that when two CHAR(n) values are
compared, the shorter should be blank-padded before making the
comparison. *That* said, I think the general advice is to stay
away from CHAR(n) in favor or VARCHAR(n) or TEXT, and I think that
is good advice.

I am sorry for this long email, but I would be interested to see what
other hackers think about this issue.

Since we only have the CHAR(n) type to improve compliance with the
SQL specification, and we don't generally encourage its use, I
think we should fix any non-compliant behavior. That seems to mean
that if you take two CHAR values and compare them, it should give
the same result as comparing the same two values as VARCHAR using
the same collation with the shorter value padded with spaces.

So this is correct:

test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

... because it matches:

test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
?column?
----------
t
(1 row)

But this is incorrect:

test=# select 'ab'::char(3) collate "C" < E'ab\n'::char(3) collate "C";
?column?
----------
t
(1 row)

... because it doesn't match:

test=# select 'ab '::varchar(3) collate "C" < E'ab\n'::varchar(3) collate "C";
?column?
----------
f
(1 row)

Of course, I have no skin in the game, because it took me about two
weeks after my first time converting a database with CHAR columns
to PostgreSQL to change them all to VARCHAR, and do that as part of
all future conversions.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Bruce Momjian

bruce@momjian.us

about 12 years ago

In reply to: Thomas Fanghaenel (#4)

Re: [SQL] Comparison semantics of CHAR data type

You can see the UTF8 case is fine because \n is considered greater
than space, but in the C locale, where \n is less than space, the
false return value shows the problem with
internal_bpchar_pattern_compare() trimming the string and first
comparing on lengths. This is exactly the problem you outline, where
space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

For example, this seems to be the case on OS X:

select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";
?column?
----------
t
(1 row)

select 'ab '::varchar(10) collate "en_US" <
E'ab\n'::varchar(10) collate "en_US";
?column?
----------
f
(1 row)

The above query returns true on Linux, so there certainly is a
platform-specific difference there. The others are the same.

select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
collate "C";
?column?
----------
f
(1 row)

So here there's actually not only the same \n/space issue as in the C
collation (which would go away if the bpchar value weren't trimmed).
It also shows that there might be slight differences in behavior,
depending which platform you're running on.

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

What matters in general isn't where the characters fall when comparing
individual bytes, but how the strings containing them sort according
to the applicable collation. That said, my recollection of the spec
is that when two CHAR(n) values are compared, the shorter should be
blank-padded before making the comparison.

Not necessarily. The SQL Standard actually ties this to the collation
sequence that is in use. Without a lot of context, this is from
Subclause 8.2, "<comparison predicate>", General Rule 3)b):

b) If the length in characters of X is not equal to the length in
characters of Y, then the shorter string is effectively replaced,
for the purposes of comparison, with a copy of itself that has been
extended to the length of the longer string by concatenation on the
right of one or more pad characters, where the pad character is
chosen based on CS. If CS has the NO PAD characteristic, then the
pad character is an implementation-dependent character different
from any character in the character set of X and Y that collates
less than any string under CS. Otherwise, the pad character is a
<space>.

In my opinion, that's just a lot of handwaving, to the extent that in
practice different vendors interpret this clause differently. It seems
that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
has uses NO PAD semantics whenever there's a VARCHAR type involved in
the comparison.

But all that is actually a whole different can of worms, and slightly
besides the point of my original question. How to properly compare
strings with different lentgths has been discussed before, see for
instance the thread in [1]. My intention was not to get that started
again. As far as I can see, the consensus seems to be that when using
the C locale, string comparisons should be done using NO PAD semantics.
(It sure gives some strange semantics if you have varchars with trailing
spaces, but it's perfectly legal.)

The point is that my testcase deals with strings of the same length.
Thus, the above clause doesn't really apply. The standard, to my
understanding, says that fixed-length character values are padded when
the row is constructed. And once that happens, those spaces become part
of the value. It's invalid to strip them, unless done explicitly.

Yes, there are three types of comparisons that are important here:

1. 'a'::CHAR(3) < 'a'::CHAR(3)
2. 'a '::CHAR(3) < E'a\n'::CHAR(3)
3. 'a'::CHAR(3) < 'a'::CHAR(4)

You are saying it is only #3 where we can substitute the special
always-lower pad character, while it appears that Postgres does this in
cases #2 and #3.

Since we only have the CHAR(n) type to improve compliance with the SQL
specification, and we don't generally encourage its use, I think we
should fix any non-compliant behavior. That seems to mean that if you
take two CHAR values and compare them, it should give the same result
as comparing the same two values as VARCHAR using the same collation
with the shorter value padded with spaces.

So this is correct:

test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

... because it matches:

test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
?column?
----------
t
(1 row)

Again, this touches on the same point as Bruce's example above. Right
now these two queries might produce identical results on Linux, because
of the way strcoll() behaves. On OS X you get different results:

select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
collate "en_US";
?column?
----------
f
(1 row)

The above returns true on Linux.

I have no idea who's right. But doesn't this count as evidence that
right-trimming pbchars is not even safe with the en_US collation?

Well, one problem in fixing this is that there are existing CHAR()
indexes that embed this behavior, and unless we want to break pg_upgrade
for CHAR() indexes, I am not sure what options we have except to
document this.

Let me also add that we don't propogate the CHAR() length through
any/most/all? function calls and operators, so the length comparison
wouldn't work in all cases anyway, e.g.

SELECT 'a'::CHAR(4) || 'b'::CHAR(4);
?column?
----------
ab
(1 row)

While space truncation is odd, it is predictable.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Bruce Momjian

bruce@momjian.us

almost 12 years ago

In reply to: Bruce Momjian (#5)

1 attachment(s)

Re: [SQL] Comparison semantics of CHAR data type

On Wed, Oct 16, 2013 at 02:17:11PM -0400, Bruce Momjian wrote:

You can see the UTF8 case is fine because \n is considered greater
than space, but in the C locale, where \n is less than space, the
false return value shows the problem with
internal_bpchar_pattern_compare() trimming the string and first
comparing on lengths. This is exactly the problem you outline, where
space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

I have added the attached C comment to explain the problem, and added a
TODO item to fix it if we ever break binary upgrading.

Does anyone think this warrants a doc mention?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Attachments:

char.difftext/x-diff; charset=us-asciiDownload

diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
new file mode 100644
index 502ca44..284b5d1
*** a/src/backend/utils/adt/varchar.c
--- b/src/backend/utils/adt/varchar.c
*************** bpcharcmp(PG_FUNCTION_ARGS)
*** 846,851 ****
--- 846,863 ----
  				len2;
  	int			cmp;
  
+ 	/*
+ 	 * Trimming trailing spaces off of both strings can cause a string
+ 	 * with a character less than a space to compare greater than a
+ 	 * space-extended string, e.g. this returns false:
+ 	 *		SELECT E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
+ 	 * even though '\n' is less than the space if CHAR(10) was
+ 	 * space-extended.  The correct solution would be to trim only
+ 	 * the longer string to be the same length of the shorter, if
+ 	 * possible, then do the comparison.  However, changing this
+ 	 * might break existing indexes, breaking binary upgrades.
+ 	 * For details, see http://www.postgresql.org/message-id/CAK+WP1xdmyswEehMuetNztM4H199Z1w9KWRHVMKzyyFM+hV=zA@mail.gmail.com
+ 	 */
  	len1 = bcTruelen(arg1);
  	len2 = bcTruelen(arg2);

Bruce Momjian

bruce@momjian.us

almost 12 years ago

In reply to: Bruce Momjian (#6)

1 attachment(s)

Re: [SQL] Comparison semantics of CHAR data type

On Thu, Feb 13, 2014 at 09:47:01PM -0500, Bruce Momjian wrote:

On Wed, Oct 16, 2013 at 02:17:11PM -0400, Bruce Momjian wrote:

You can see the UTF8 case is fine because \n is considered greater
than space, but in the C locale, where \n is less than space, the
false return value shows the problem with
internal_bpchar_pattern_compare() trimming the string and first
comparing on lengths. This is exactly the problem you outline, where
space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

I have added the attached C comment to explain the problem, and added a
TODO item to fix it if we ever break binary upgrading.

Does anyone think this warrants a doc mention?

I have done some more thinking on this and I found a way to document
this, which reduces our need to actually fix it some day. I am afraid
the behavioral change needed to fix this might break so many
applications that the fix will never be done, though I will keep the
TODO item until I get more feedback on that. Patch attached.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Attachments:

char.difftext/x-diff; charset=us-asciiDownload

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
new file mode 100644
index 30fd9bb..9635004
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*************** SELECT '52093.89'::money::numeric::float
*** 1072,1081 ****
     <para>
      Values of type <type>character</type> are physically padded
      with spaces to the specified width <replaceable>n</>, and are
!     stored and displayed that way.  However, the padding spaces are
!     treated as semantically insignificant.  Trailing spaces are
!     disregarded when comparing two values of type <type>character</type>,
!     and they will be removed when converting a <type>character</type> value
      to one of the other string types.  Note that trailing spaces
      <emphasis>are</> semantically significant in
      <type>character varying</type> and <type>text</type> values, and
--- 1072,1084 ----
     <para>
      Values of type <type>character</type> are physically padded
      with spaces to the specified width <replaceable>n</>, and are
!     stored and displayed that way.  However, trailing spaces are treated as
!     semantically insignificant and disregarded when comparing two values
!     of type <type>character</type>.  In collations where whitespace
!     is significant, this behavior can produce unexpected results,
!     e.g. <command>SELECT 'a '::CHAR(2) collate "C" < 'a\n'::CHAR(2)
!     returns true.
!     Trailing spaces are removed when converting a <type>character</type> value
      to one of the other string types.  Note that trailing spaces
      <emphasis>are</> semantically significant in
      <type>character varying</type> and <type>text</type> values, and
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
new file mode 100644
index 284b5d1..502ca44
*** a/src/backend/utils/adt/varchar.c
--- b/src/backend/utils/adt/varchar.c
*************** bpcharcmp(PG_FUNCTION_ARGS)
*** 846,863 ****
  				len2;
  	int			cmp;
  
- 	/*
- 	 * Trimming trailing spaces off of both strings can cause a string
- 	 * with a character less than a space to compare greater than a
- 	 * space-extended string, e.g. this returns false:
- 	 *		SELECT E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
- 	 * even though '\n' is less than the space if CHAR(10) was
- 	 * space-extended.  The correct solution would be to trim only
- 	 * the longer string to be the same length of the shorter, if
- 	 * possible, then do the comparison.  However, changing this
- 	 * might break existing indexes, breaking binary upgrades.
- 	 * For details, see http://www.postgresql.org/message-id/CAK+WP1xdmyswEehMuetNztM4H199Z1w9KWRHVMKzyyFM+hV=zA@mail.gmail.com
- 	 */
  	len1 = bcTruelen(arg1);
  	len2 = bcTruelen(arg2);
  
--- 846,851 ----

Bruce Momjian

bruce@momjian.us

almost 12 years ago

In reply to: Bruce Momjian (#7)

Re: [SQL] Comparison semantics of CHAR data type

On Fri, Feb 14, 2014 at 05:02:26PM -0500, Bruce Momjian wrote:

On Thu, Feb 13, 2014 at 09:47:01PM -0500, Bruce Momjian wrote:

On Wed, Oct 16, 2013 at 02:17:11PM -0400, Bruce Momjian wrote:

You can see the UTF8 case is fine because \n is considered greater
than space, but in the C locale, where \n is less than space, the
false return value shows the problem with
internal_bpchar_pattern_compare() trimming the string and first
comparing on lengths. This is exactly the problem you outline, where
space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

I have added the attached C comment to explain the problem, and added a
TODO item to fix it if we ever break binary upgrading.

Does anyone think this warrants a doc mention?

I have done some more thinking on this and I found a way to document
this, which reduces our need to actually fix it some day. I am afraid
the behavioral change needed to fix this might break so many
applications that the fix will never be done, though I will keep the
TODO item until I get more feedback on that. Patch attached.

Patch applied.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers