String comparison and the SQL standard

Started by Laurenz Albeabout 13 years ago7 messagesgeneral
Jump to latest
#1Laurenz Albe
laurenz.albe@cybertec.at

While researching a problem with a different database system,
I came across the following in the SQL standard ISO/IEC 9075-2:2003,
Section 8.2 (<comparison predicate>), General Rules:

3) The comparison of two character strings is determined as follows:
a) Let CS be the collation as determined by Subclause 9.13,
"Collation determination", for the declared types of the
two character strings.
b) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of itself
that has been extended to the length of the longer string by
concatenation on the right of one or more pad characters,
where the pad character is chosen based on CS.
If CS has the NO PAD characteristic, then the pad character is
an implementation-dependent character different from
any character in the character set of X and Y that collates
less than any string under CS.
Otherwise, the pad character is a <space>.

That would effectively mean that 'a'='a ' is TRUE for
all character string types.

Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
that very result, while PostgreSQL and Oracle gave me FALSE.

Does anybody know if we deviate from the standard on purpose
in this case? I searched the archives, but all I could find was
/messages/by-id/20051019154026.X995@ganymede.hub.org

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Laurenz Albe (#1)
Re: String comparison and the SQL standard

I wrote:

While researching a problem with a different database system,
I came across the following in the SQL standard ISO/IEC 9075-2:2003,
Section 8.2 (<comparison predicate>), General Rules:

3) [...]

That would effectively mean that 'a'='a ' is TRUE for
all character string types.

Searching further, I found ISO/IEC 9075-2:2003,
Section 9.8 (Determination of identical values),
General Rules:

2) Case:
a) If V1 and V2 are both null, then V1 is identical to V2.
b) If V1 is null and V2 is not null, or if V1 is not null and
V2 is null, then V1 is not identical to V2.
c) If V1 and V2 are of comparable predefined types, then
Case:
i) If V1 and V2 are character strings, then let L be
CHARACTER_LENGTH(V1).
Case:
1) If CHARACTER_LENGTH(V2) equals L, and if for all i,
1 (one) ≤ i ≤ L, the i-th character of V1 corresponds
to the same character position of ISO/IEC 10646 as
the i-th character of V2, then V1 is identical to V2.
2) Otherwise, V1 is not identical to V2.

That seems slightly contradictory to the above; I can only resolve
this to mean that 'a' and 'a ' are not identical in SQL, but
the equality comparison operator should still treat them as equal.

Does anybody have deeper insight into this?

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#1)
Re: String comparison and the SQL standard

Albe Laurenz <laurenz.albe@wien.gv.at> writes:

While researching a problem with a different database system,
I came across the following in the SQL standard ISO/IEC 9075-2:2003,
Section 8.2 (<comparison predicate>), General Rules:

3) The comparison of two character strings is determined as follows:
a) Let CS be the collation as determined by Subclause 9.13,
"Collation determination", for the declared types of the
two character strings.
b) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of itself
that has been extended to the length of the longer string by
concatenation on the right of one or more pad characters,
where the pad character is chosen based on CS.
If CS has the NO PAD characteristic, then the pad character is
an implementation-dependent character different from
any character in the character set of X and Y that collates
less than any string under CS.
Otherwise, the pad character is a <space>.

The PAD case is specifying the way that CHAR(n) comparison should work.
(We don't expose the PAD/NO PAD distinction in any other way than
CHAR vs VARCHAR/TEXT types.)

AFAICS, the NO PAD case is ignorable BS: they are basically specifying
implementation not semantics there, and in a way that is totally
brain-dead. There isn't necessarily any such character as the one they
blithely posit. Moreover, the whole description seems to assume that
string comparison is single-pass left-to-right, which has little to do
with any modern collation specification. We just rely on strcmp to
decide that shorter strings are "less" than longer ones, which is the
point of this spec AFAICT.

Note that we don't actually do CHAR(n) comparison like that either,
but instead choose to strip trailing spaces before the comparison.

In any case, the most significant word in that whole paragraph is
"effectively", which means you can do it however you want as long
as you get an equivalent comparison result.

That would effectively mean that 'a'='a ' is TRUE for
all character string types.

In the PAD case, yes. Else no.

Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
that very result, while PostgreSQL and Oracle gave me FALSE.

This probably has more to do with what these systems think the
data type of an undecorated literal is, than with whether they do
trailing-space-insensitive comparison all the time.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#3)
Re: String comparison and the SQL standard

Tom Lane wrote:

This probably has more to do with what these systems think the
data type of an undecorated literal is, than with whether they do
trailing-space-insensitive comparison all the time.

I suspect so. Keep in mind that PostgreSQL does not comply with the
standard in this regard, but this has been discussed before and it
is a concious decision to deviate.  See section section 5.3
<literal> syntax 17:

| The declared type of a <character string literal> is fixed-length
| character string.

... so according to the standard 'a' is CHAR(1) and 'a   ' is
CHAR(4), and they should compare as equal. To make literals for
user-defined types more convenient to write, PostgreSQL treats
those literals as type UNKOWN until forced to resolve them to a
type, and in the absence of any other clues uses text instead of
char(n). The reason for that is that the community finds the
semantics of char(n) strange and generally encourages use of text
or varchar(n) for character-based types.

I'm not sure that this is as clearly and prominently documented as
it should be.

-Kevin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#3)
Re: String comparison and the SQL standard

Tom Lane wrote:

b) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of itself
that has been extended to the length of the longer string by
concatenation on the right of one or more pad characters,
where the pad character is chosen based on CS.
If CS has the NO PAD characteristic, then the pad character is
an implementation-dependent character different from
any character in the character set of X and Y that collates
less than any string under CS.
Otherwise, the pad character is a <space>.

The PAD case is specifying the way that CHAR(n) comparison should work.
(We don't expose the PAD/NO PAD distinction in any other way than
CHAR vs VARCHAR/TEXT types.)

AFAICS, the NO PAD case is ignorable BS: [...]

In any case, the most significant word in that whole paragraph is
"effectively", which means you can do it however you want as long
as you get an equivalent comparison result.

That would effectively mean that 'a'='a ' is TRUE for
all character string types.

In the PAD case, yes. Else no.

Thanks for the clarification.

Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
that very result, while PostgreSQL and Oracle gave me FALSE.

This probably has more to do with what these systems think the
data type of an undecorated literal is, than with whether they do
trailing-space-insensitive comparison all the time.

I tested not only with string literals, but also comparing
table columns of the respective types.

I came up with the following table of semantics used for
comparisons:

| CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
-----------+-----------------+-----------------------+--------------------+
Oracle | PAD SPACE | NO PAD | NO PAD |
-----------+-----------------+-----------------------+--------------------+
PostgreSQL | PAD SPACE | NO PAD | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
MySQL | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
SQL Server | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#5)
Re: String comparison and the SQL standard

Albe Laurenz <laurenz.albe@wien.gv.at> writes:

I tested not only with string literals, but also comparing
table columns of the respective types.

I came up with the following table of semantics used for
comparisons:

| CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
-----------+-----------------+-----------------------+--------------------+
Oracle | PAD SPACE | NO PAD | NO PAD |
-----------+-----------------+-----------------------+--------------------+
PostgreSQL | PAD SPACE | NO PAD | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
MySQL | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
SQL Server | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+

Interesting. Did you determine which type is assigned to an
unmarked literal string by each system?

BTW, the last entry for PG surprised me a bit, because I would've
expected the varchar semantics to "win". Some experimentation shows
that you're correct about comparing char and varchar: we coerce the
varchar to char and use bpchareq, which ignores trailing blanks.
But if you compare char and text, we coerce the char to text (stripping
any trailing blanks as we do so) and then apply texteq. So in that
scenario, trailing blanks in the char datum are ignored, but trailing
blanks in the text datum are not, which is the behavior I was
remembering. It's a bit surprising that the two cases are resolved
differently --- I think that's coming out of the "most exact matches"
rule for ambiguous-operator resolution, because the available operators
are declared char = char and text = text.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#6)
Re: String comparison and the SQL standard

Tom Lane wrote:

I tested not only with string literals, but also comparing
table columns of the respective types.

I came up with the following table of semantics used for
comparisons:

| CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
-----------+-----------------+-----------------------+--------------------+
Oracle | PAD SPACE | NO PAD | NO PAD |
-----------+-----------------+-----------------------+--------------------+
PostgreSQL | PAD SPACE | NO PAD | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
MySQL | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+
SQL Server | PAD SPACE | PAD SPACE | PAD SPACE |
-----------+-----------------+-----------------------+--------------------+

Interesting. Did you determine which type is assigned to an
unmarked literal string by each system?

In Oracle it is treated like a CHAR(n):
http://docs.oracle.com/cd/E11882_01/server.112/e26088/sql_elements002.htm#SQLRF51039

"Oracle uses blank-padded comparison semantics only when
both values in the comparison are either expressions of
data type CHAR, NCHAR, text literals, or values returned
by the USER function."

I don't know about MySQL and SQL Server, but since they pad
strings with space for comparison in all cases, it probably
does not make a difference.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general