Re: [HACKERS] 'a' == 'a ' (Was: RE: [pgsql-advocacy] Oracle buysInnobase)
-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
Sent: Wednesday, October 19, 2005 2:34 PM
To: Dann Corbit
Cc: Terry Fielder; Tino Wildenhain; Marc G. Fournier;
Richard_D_Levine@raytheon.com; pgsql-hackers@postgresql.org; pgsql-
general@postgresql.org
Subject: Re: [HACKERS] 'a' == 'a ' (Was: RE: [pgsql-advocacy]
[GENERAL]
Oracle buysInnobase)
On Wed, 19 Oct 2005, Dann Corbit wrote:
-----Original Message-----
From: Terry Fielder [mailto:terry@ashtonwoodshomes.com]
Sent: Wednesday, October 19, 2005 2:05 PM
To: Dann Corbit
Cc: Tino Wildenhain; Marc G. Fournier;
Richard_D_Levine@raytheon.com;
pgsql-hackers@postgresql.org; pgsql-general@postgresql.org
Subject: Re: 'a' == 'a ' (Was: RE: [pgsql-advocacy] [GENERAL]
Oracle
buysInnobase)
Dann Corbit wrote:
Try this query in Oracle, SQL*Server, DB/2, Informix, etc.:
connxdatasync=# select 1 where cast('a' as varchar(30)) =
cast('a '
as
varchar(30));
?column?
----------
(0 rows)I see how you can interpret the SQL Standard to make the above
response
a correct one. But is it the response that you would like?
When the compared datatypes are VARCHAR: YES
What is the value of doing that?
I can see plenty of harm and absolutely no return. We are talking
about
blank padding before comparison. Do you really want 'Danniel '
considered distinct from 'Danniel ' in a comparison? In real life,
what does that buy you?It buys you the ability to store things where trailing spaces are
signficant (for example passwords) within the existing limitations of
not
having a full set of the collation behavior.
I suppose that there may be some value in having 'J&^% ' be different
from 'J&^% '.
I would expect to insert a password with trailing blanks to be the same
as inserting a password with no trailing blanks.
I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.
Dann,
I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.
AFAIT, the standard says "implementation-specific". So we're standard.
The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.
My general perspective on this is that if trailing blanks are a significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.
Changing the behaviour would break backwards compatibility for some users.
For that matter, I've been subscribed to 8 PostgreSQL mailing lists since
1999, and this is the first time I can recall someone complaining about
this comparison behavior. So it's obviously not a widespread issue.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
If there is a significant performance benefit to not expanding text columns in comparison operations, then it seems it should be OK.
I probably read the standard wrong, but it seems to me that varchar, char, and bpchar columns should all behave the same (e.g. if you do not expand with <blank> or the PAD character (whatever that is) then all char type columns should behave the same. I guess that there could be different default collations for different column types though (that is clearly allowed in the standard). Perhaps it just needs to be documented in such a way that even a blockhead like me can comprehend it easily.
Show quoted text
-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Wednesday, October 19, 2005 5:06 PM
To: pgsql-hackers@postgresql.org
Cc: Dann Corbit; Stephan Szabo; Terry Fielder; Tino Wildenhain; Marc G.
Fournier; Richard_D_Levine@raytheon.com; pgsql-general@postgresql.org
Subject: Re: [HACKERS] 'a' == 'a 'Dann,
I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.AFAIT, the standard says "implementation-specific". So we're standard.
The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.My general perspective on this is that if trailing blanks are a
significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.Changing the behaviour would break backwards compatibility for some users.
For that matter, I've been subscribed to 8 PostgreSQL mailing lists since
1999, and this is the first time I can recall someone complaining about
this comparison behavior. So it's obviously not a widespread issue.--
--JoshJosh Berkus
Aglio Database Solutions
San Francisco
Import Notes
Resolved by subject fallback
Josh Berkus wrote:
Dann,
I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.AFAIT, the standard says "implementation-specific". So we're standard.
The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.
If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set. Assuming that the standard
doesn't have anything to do with any character sets, it should be
possible to make this available for those who want it as an initdb
option. Whether or not this is important enough to offer or not is
another matter.
Personally my questions are:
1) How many people have been bitten by this badly?
2) How many people have been bitten by joins that depend on padding?
Personally, unlike case folding, this seems to be an area where a bit of
documentation (i.e. all collation sets have are assumed to have the NO
PAD option in the SQL standard) would be sufficient to answer to
questions of standards-compliance.
My general perspective on this is that if trailing blanks are a significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.
In general I agree. But I am not willing to jump to the conclusion that
it will never be warranted to add this as an initdb option. I am more
interested in what cases people see where this would be required. But I
agree that the bar is much higher than it is in many other cases.
Best Wishes,
Chris Travers
Metatron Technology Consulting
Chris Travers <chris@travelamericas.com> writes:
If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set.
The spec associates padding behavior with collations, which per spec are
separate from the datatypes --- that is, you should be able to able to
specify a collation for each string-type table column (whether char(N)
or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).
AFAICS this choice is allowed by the spec since the default collation is
implementation-defined.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Chris Travers <chris@travelamericas.com> writes:
If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set.The spec associates padding behavior with collations, which per spec are
separate from the datatypes --- that is, you should be able to able to
specify a collation for each string-type table column (whether char(N)
or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).AFAICS this choice is allowed by the spec since the default collation is
implementation-defined.
Does it even make sense for char(N) to not be space padded? I had the
impression char(N) was always N characters long, not more or less. I can't
picture any other character being used for padding, then you would need a more
flexible rtrim function.
And I can understand the collation order determining whether 'a' and 'a '
compare equal. But surely if you store 'a' in a varchar(N) you have to get 'a'
back out, not some other string! Does the spec really allow varchar to
actually be padded and not just compare ignoring trailing space?
(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)
--
greg
On 10/20/2005 2:17 AM, Greg Stark wrote:
(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)
They must have buried that "bow to COBOL" so deep in the code that they
had no choice but to abuse their power and stuff this cruft into the
standard.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On Thu, 2005-10-20 at 08:14 -0400, Jan Wieck wrote:
On 10/20/2005 2:17 AM, Greg Stark wrote:
(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)They must have buried that "bow to COBOL" so deep in the code that they
had no choice but to abuse their power and stuff this cruft into the
standard.
MOVE SPACES TO 0101
You just gotta love it
Andy
At 05:33 PM 10/19/2005 -0700, Dann Corbit wrote:
If there is a significant performance benefit to not expanding text
columns in comparison operations, then it seems it should be OK.I probably read the standard wrong, but it seems to me that varchar, char,
and bpchar columns should all behave the same (e.g. if you do not expand
with <blank> or the PAD character (whatever that is) then all char type
columns should behave the same. I guess that there could be different
default collations for different column
I am not a DB guru. BUT IMO they should NOT behave the same.
Varchars should NOT be padded.
For the very reason when you select text out of varchar fields the result
is not padded.
If I insert a string with a single trailing space into a varchar, I _want_
that single trailing space to still be there when I retrieve it, and not
followed by more spaces. Otherwise I will have to pick a different database ;).
So similarly, I would expect that varchars 'a ' and 'a ' when compared
should be different.
However, in the case of _chars_ which are padded, then 'a ' should be
padded so that it can be compared with 'a '.
Otherwise there will be no reason to do equality comparisons of char(5)
fields with char(8) fields - they can NEVER be the same :).
But would that mean that when one does equality comparisons of varchars
with chars, one would probably want padding? Or only varchars of the same
length as the char would have a chance of matching?
Hmm.. I think I better leave this one to the DB gurus :). But I really
don't ever want 'a ' to be the same as 'a ' for varchars.
Link.
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Greg Stark
Sent: Wednesday, October 19, 2005 11:17 PM
To: Tom Lane
Cc: Chris Travers; josh@agliodbs.com; pgsql-hackers@postgresql.org;
Dann
Corbit; Stephan Szabo; Terry Fielder; Tino Wildenhain; Marc G.
Fournier;
Richard_D_Levine@raytheon.com; pgsql-general@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a 'Tom Lane <tgl@sss.pgh.pa.us> writes:
Chris Travers <chris@travelamericas.com> writes:
If I understand the spec correctly, it seems to indicate that this
is
specific to the locale/character set.
The spec associates padding behavior with collations, which per spec
are
separate from the datatypes --- that is, you should be able to able
to
specify a collation for each string-type table column (whether
char(N)
or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).AFAICS this choice is allowed by the spec since the default
collation is
implementation-defined.
Does it even make sense for char(N) to not be space padded? I had the
impression char(N) was always N characters long, not more or less. I
can't
picture any other character being used for padding, then you would
need a
more
flexible rtrim function.And I can understand the collation order determining whether 'a' and
'a '
compare equal. But surely if you store 'a' in a varchar(N) you have to
get
'a'
back out, not some other string! Does the spec really allow varchar to
actually be padded and not just compare ignoring trailing space?(I can't believe anyone really wants varchar to be space padded. Space
padding
always seemed like a legacy feature for databases with fixed record
length
data types. Why would anyone want a string data type that can't
represent
all
strings?)
Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.
Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.
Hence, you will see that 'Danniel' = 'Danniel ' is true in most cases.
Now, this really does not have any connection with storage or varchar or
bpchar or char or text or anything like that.
It is only the action to be taken when a comparison operation is
performed.
Import Notes
Resolved by subject fallback
Dann Corbit wrote:
Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.
IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.
Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would be
answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.
Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.
Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb option?
How often does this behavior cause problems?
Best Wishes,
Chris Travers
Metatron Technology Consulting
-----Original Message-----
From: Chris Travers [mailto:chris@travelamericas.com]
Sent: Thursday, October 20, 2005 11:53 AM
To: Dann Corbit
Cc: Greg Stark; Tom Lane; Chris Travers; josh@agliodbs.com; pgsql-
hackers@postgresql.org; Stephan Szabo; Terry Fielder; Tino Wildenhain;
Marc G. Fournier; Richard_D_Levine@raytheon.com; pgsql-
general@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a 'Dann Corbit wrote:
Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.IIrc, varchar and bpchar are stored in a similar way, but are
presented
differently when retrieved. I.e. storage is separate from
presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would
be
answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.
I believe that this is a reasonable response. In order to comply with
the standard, bpchar and varchar would have to be stored with different
default collating sequences (which is fine with me). If (indeed) that
is the case, the only action needed would be to document the collating
sequences used.
Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the
same
length before performing the comparison.
Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb
option?
How often does this behavior cause problems?
I do not even know if it is a good idea. I was just pointing out that
the behavior of PostgreSQL is different from all the big database
vendors in this area and according to my reading of the standard, the
behavior was not compliant.
As to how often it causes a problem, I can't say. It has caused me
puzzlement on a few occasions, but no end of the world disasters.
Import Notes
Resolved by subject fallback
I will happily reiterate that I am the troll who started this mess by
whining about how *Oracle* handles this. Tom's explanation that CHAR is
has a PAD collation and VARCHAR has a NO PAD collation have restored my
faith that there is goodness in the world. My whining was out of
ignorance. I wouldn't change the proper way PostgreSQL works. Documenting
it is good. I will use this new found knowledge from now on in my database
designs.
Cheers,
Rick
Chris Travers <chris@travelamericas.com> wrote on 10/20/2005 01:52:36 PM:
Show quoted text
Dann Corbit wrote:
Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would be
answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb option?
How often does this behavior cause problems?Best Wishes,
Chris Travers
Metatron Technology Consulting
[Removed all the non-list addresses]
Dann Corbit wrote:
Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the
same
length before performing the comparison.Hence, you will see that 'Danniel' = 'Danniel ' is true in most cases.
Now, this really does not have any connection with storage or varchar
or
bpchar or char or text or anything like that.
Is this really true??? My understanding of the spec was that this was
=exactly= the difference between char(N) and varchar(N) - the former is
padded to length N when you store it, or at least the DB has to act as
if this is the case. Can someone quote the appropriate chapter and
verse?
Thanks.
- John D. Burger
MITRE
Look back in the stack and you will find that I have quoted chapter and
verse (see the attached html file in a previous email that I sent).
This is in relation to the comparison operator.
-----Original Message-----
From: John D. Burger [mailto:john@mitre.org]
Sent: Thursday, October 20, 2005 12:53 PM
To: Dann Corbit
Cc: <pgsql-hackers@postgresql.org>; pgsql-general General
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '[Removed all the non-list addresses]
Dann Corbit wrote:
Let me make something clear:
When we are talking about padding here it is only in the context of
a
comparison operator and NOT having anything to do with storage.
Given two strings of different in a comparison, most database
systems
(by default) will blank pad the shorter string so that they are the
same
length before performing the comparison.Hence, you will see that 'Danniel' = 'Danniel ' is true in most
cases.
Now, this really does not have any connection with storage or
varchar
or
bpchar or char or text or anything like that.Is this really true??? My understanding of the spec was that this was
=exactly= the difference between char(N) and varchar(N) - the former
is
padded to length N when you store it, or at least the DB has to act as
if this is the case. Can someone quote the appropriate chapter and
verse?
Again, the operations that I was talking about literally have nothing to
do with the internal data type of the string, as long as it is
fundamentally a character type. I was referring to what happens when
two strings are involved in a comparison operation. The standard spells
it out very clearly. Whether something is blank padded or not during a
comparison has nothing whatever to do with the type of the string and
everything to do with the collating sequence.
Show quoted text
Thanks.
- John D. Burger
MITRE
Import Notes
Resolved by subject fallback
Chris Travers <chris@travelamericas.com> writes:
IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.
This is not so, although I've occasionally wondered whether we shouldn't
try to make it so. Certainly we don't rely on char(N) to be physically
fixed-width (and can't, at least not with variable-width character
encodings) so there can be no performance advantage to actually storing
the insignificant spaces. The hard part would be in figuring out how
the output routine could know how many spaces to add back.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>
The hard part would be in figuring out how
the output routine could know how many spaces to add back.
The length is in the metadata for the column, or am I being dense?
Show quoted text
regards, tom lane
-----Original Message-----
From: Richard_D_Levine@raytheon.com
[mailto:Richard_D_Levine@raytheon.com]
Sent: Thursday, October 20, 2005 2:12 PM
To: Tom Lane
Cc: Chris Travers; Dann Corbit; Greg Stark; josh@agliodbs.com; pgsql-
general@postgresql.org; pgsql-hackers@postgresql.org; Marc G.
Fournier;
Stephan Szabo; Terry Fielder; Tino Wildenhain
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a 'Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>The hard part would be in figuring out how
the output routine could know how many spaces to add back.The length is in the metadata for the column, or am I being dense?
I guess that what Tom is saying is that it would be nice to store
everything as VARCHAR. But with (for instance) BPCHAR, the returned
string is blank padded. So if you store
'Danniel'
in BPCHAR(20), you will get back:
'Danniel '
But if you store
'Danniel'
In VARCHAR(20)
You will get back exactly what you put in.
I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks. I am not arguing pro nor con this way of
doing things.
Import Notes
Resolved by subject fallback
Richard_D_Levine@raytheon.com writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
The hard part would be in figuring out how
the output routine could know how many spaces to add back.
The length is in the metadata for the column, or am I being dense?
The output routine hasn't got access to that metadata. In general,
in fact, a random Datum can't be linked to any column since it may have
been generated on-the-fly.
It'd be easy if we stuck the column width into the stored representation
of bpchar, but this would defeat one of the purposes, ie, to make the
stored representations of text and bpchar alike.
regards, tom lane
"Dann Corbit" <DCorbit@connx.com> wrote on 10/20/2005 04:24:26 PM:
-----Original Message-----
From: Richard_D_Levine@raytheon.com[mailto:Richard_D_Levine@raytheon.com]
Sent: Thursday, October 20, 2005 2:12 PM
To: Tom Lane
Cc: Chris Travers; Dann Corbit; Greg Stark; josh@agliodbs.com; pgsql-
general@postgresql.org; pgsql-hackers@postgresql.org; Marc G.Fournier;
Stephan Szabo; Terry Fielder; Tino Wildenhain
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a 'Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>The hard part would be in figuring out how
the output routine could know how many spaces to add back.The length is in the metadata for the column, or am I being dense?
I guess that what Tom is saying is that it would be nice to store
everything as VARCHAR.
<snip>
I get that part, but he asked how to determine the proper output length
based on the PAD semantics, and I was saying to just pad whatever is stored
to the length available in the column metadata for those collations that
require padding.
I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks. I am not arguing pro nor con this way of
doing things.
For PAD correlations, Oracle stores the blanks. Interbase does not. Going
from Interbase to Oracle I switched from CHAR to VARCHAR2 because of this,
shot my self in the foot by not understanding PAD, whined about it, and you
know the rest.
As to how it is output, I know the language interface has an effect.
Embedded SQL in C will put the null terminator in different places for
different databases and different switch settings particular to each
database.