Re: [HACKERS] 'a' == 'a ' (Was: RE: [pgsql-advocacy] Oracle buysInnobase)

Started by Dann Corbitabout 20 years ago34 messages
#1Dann Corbit
DCorbit@connx.com

-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
Sent: Wednesday, October 19, 2005 2:34 PM
To: Dann Corbit
Cc: Terry Fielder; Tino Wildenhain; Marc G. Fournier;
Richard_D_Levine@raytheon.com; pgsql-hackers@postgresql.org; pgsql-
general@postgresql.org
Subject: Re: [HACKERS] 'a' == 'a ' (Was: RE: [pgsql-advocacy]

[GENERAL]

Oracle buysInnobase)

On Wed, 19 Oct 2005, Dann Corbit wrote:

-----Original Message-----
From: Terry Fielder [mailto:terry@ashtonwoodshomes.com]
Sent: Wednesday, October 19, 2005 2:05 PM
To: Dann Corbit
Cc: Tino Wildenhain; Marc G. Fournier;

Richard_D_Levine@raytheon.com;

pgsql-hackers@postgresql.org; pgsql-general@postgresql.org
Subject: Re: 'a' == 'a ' (Was: RE: [pgsql-advocacy] [GENERAL]

Oracle

buysInnobase)

Dann Corbit wrote:

Try this query in Oracle, SQL*Server, DB/2, Informix, etc.:

connxdatasync=# select 1 where cast('a' as varchar(30)) =

cast('a '

as

varchar(30));
?column?
----------
(0 rows)

I see how you can interpret the SQL Standard to make the above

response

a correct one. But is it the response that you would like?

When the compared datatypes are VARCHAR: YES

What is the value of doing that?

I can see plenty of harm and absolutely no return. We are talking

about

blank padding before comparison. Do you really want 'Danniel '
considered distinct from 'Danniel ' in a comparison? In real life,
what does that buy you?

It buys you the ability to store things where trailing spaces are
signficant (for example passwords) within the existing limitations of

not

having a full set of the collation behavior.

I suppose that there may be some value in having 'J&^% ' be different
from 'J&^% '.

I would expect to insert a password with trailing blanks to be the same
as inserting a password with no trailing blanks.

I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.

#2Josh Berkus
josh@agliodbs.com
In reply to: Dann Corbit (#1)
Re: [HACKERS] 'a' == 'a '

Dann,

I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine.  It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area.  When I read the standard, it looked to me like
PostgreSQL was not performing correctly.  It is not unlikely that I read
it wrong.

AFAIT, the standard says "implementation-specific". So we're standard.

The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.

My general perspective on this is that if trailing blanks are a significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.

Changing the behaviour would break backwards compatibility for some users.
For that matter, I've been subscribed to 8 PostgreSQL mailing lists since
1999, and this is the first time I can recall someone complaining about
this comparison behavior. So it's obviously not a widespread issue.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#3Dann Corbit
DCorbit@connx.com
In reply to: Josh Berkus (#2)
Re: [HACKERS] 'a' == 'a '

If there is a significant performance benefit to not expanding text columns in comparison operations, then it seems it should be OK.

I probably read the standard wrong, but it seems to me that varchar, char, and bpchar columns should all behave the same (e.g. if you do not expand with <blank> or the PAD character (whatever that is) then all char type columns should behave the same. I guess that there could be different default collations for different column types though (that is clearly allowed in the standard). Perhaps it just needs to be documented in such a way that even a blockhead like me can comprehend it easily.

Show quoted text

-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Wednesday, October 19, 2005 5:06 PM
To: pgsql-hackers@postgresql.org
Cc: Dann Corbit; Stephan Szabo; Terry Fielder; Tino Wildenhain; Marc G.
Fournier; Richard_D_Levine@raytheon.com; pgsql-general@postgresql.org
Subject: Re: [HACKERS] 'a' == 'a '

Dann,

I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine.  It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area.  When I read the standard, it looked to me like
PostgreSQL was not performing correctly.  It is not unlikely that I read
it wrong.

AFAIT, the standard says "implementation-specific". So we're standard.

The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.

My general perspective on this is that if trailing blanks are a
significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.

Changing the behaviour would break backwards compatibility for some users.
For that matter, I've been subscribed to 8 PostgreSQL mailing lists since
1999, and this is the first time I can recall someone complaining about
this comparison behavior. So it's obviously not a widespread issue.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#4Chris Travers
chris@travelamericas.com
In reply to: Josh Berkus (#2)
Re: [HACKERS] 'a' == 'a '

Josh Berkus wrote:

Dann,

I think that whatever is done ought to be whatever the standard says.
If I misinterpret the standard and PostgreSQL is doing it right, then
that is fine. It is just that PostgreSQL is very counter-intuitive
compared to other database systems that I have used in this one
particular area. When I read the standard, it looked to me like
PostgreSQL was not performing correctly. It is not unlikely that I read
it wrong.

AFAIT, the standard says "implementation-specific". So we're standard.

The main cost for comparing trimmed values is performance; factoring an
rtrim into every comparison will add significant overhead to the already
CPU-locked process of, for example, creating indexes. We're looking for
ways to make the comparison operators lighter-weight, not heavier.

If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set. Assuming that the standard
doesn't have anything to do with any character sets, it should be
possible to make this available for those who want it as an initdb
option. Whether or not this is important enough to offer or not is
another matter.

Personally my questions are:

1) How many people have been bitten by this badly?
2) How many people have been bitten by joins that depend on padding?

Personally, unlike case folding, this seems to be an area where a bit of
documentation (i.e. all collation sets have are assumed to have the NO
PAD option in the SQL standard) would be sufficient to answer to
questions of standards-compliance.

My general perspective on this is that if trailing blanks are a significant
hazard for your application, then trim them on data input. That requires
a *lot* less peformance overhead than doing it every time you compare
something.

In general I agree. But I am not willing to jump to the conclusion that
it will never be warranted to add this as an initdb option. I am more
interested in what cases people see where this would be required. But I
agree that the bar is much higher than it is in many other cases.

Best Wishes,
Chris Travers
Metatron Technology Consulting

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Travers (#4)
Re: [HACKERS] 'a' == 'a '

Chris Travers <chris@travelamericas.com> writes:

If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set.

The spec associates padding behavior with collations, which per spec are
separate from the datatypes --- that is, you should be able to able to
specify a collation for each string-type table column (whether char(N)
or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).

AFAICS this choice is allowed by the spec since the default collation is
implementation-defined.

regards, tom lane

#6Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#5)
Re: [HACKERS] 'a' == 'a '

Tom Lane <tgl@sss.pgh.pa.us> writes:

Chris Travers <chris@travelamericas.com> writes:

If I understand the spec correctly, it seems to indicate that this is
specific to the locale/character set.

The spec associates padding behavior with collations, which per spec are
separate from the datatypes --- that is, you should be able to able to
specify a collation for each string-type table column (whether char(N)
or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).

AFAICS this choice is allowed by the spec since the default collation is
implementation-defined.

Does it even make sense for char(N) to not be space padded? I had the
impression char(N) was always N characters long, not more or less. I can't
picture any other character being used for padding, then you would need a more
flexible rtrim function.

And I can understand the collation order determining whether 'a' and 'a '
compare equal. But surely if you store 'a' in a varchar(N) you have to get 'a'
back out, not some other string! Does the spec really allow varchar to
actually be padded and not just compare ignoring trailing space?

(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)

--
greg

#7Jan Wieck
JanWieck@Yahoo.com
In reply to: Greg Stark (#6)
Re: [HACKERS] 'a' == 'a '

On 10/20/2005 2:17 AM, Greg Stark wrote:

(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)

They must have buried that "bow to COBOL" so deep in the code that they
had no choice but to abuse their power and stuff this cruft into the
standard.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#8Andrew Kelly
akelly@corisweb.org
In reply to: Jan Wieck (#7)
Re: [HACKERS] 'a' == 'a '

On Thu, 2005-10-20 at 08:14 -0400, Jan Wieck wrote:

On 10/20/2005 2:17 AM, Greg Stark wrote:

(I can't believe anyone really wants varchar to be space padded. Space padding
always seemed like a legacy feature for databases with fixed record length
data types. Why would anyone want a string data type that can't represent all
strings?)

They must have buried that "bow to COBOL" so deep in the code that they
had no choice but to abuse their power and stuff this cruft into the
standard.

MOVE SPACES TO 0101

You just gotta love it

Andy

#9Lincoln Yeoh
lyeoh@pop.jaring.my
In reply to: Dann Corbit (#3)
Re: [HACKERS] 'a' == 'a '

At 05:33 PM 10/19/2005 -0700, Dann Corbit wrote:

If there is a significant performance benefit to not expanding text
columns in comparison operations, then it seems it should be OK.

I probably read the standard wrong, but it seems to me that varchar, char,
and bpchar columns should all behave the same (e.g. if you do not expand
with <blank> or the PAD character (whatever that is) then all char type
columns should behave the same. I guess that there could be different
default collations for different column

I am not a DB guru. BUT IMO they should NOT behave the same.

Varchars should NOT be padded.

For the very reason when you select text out of varchar fields the result
is not padded.

If I insert a string with a single trailing space into a varchar, I _want_
that single trailing space to still be there when I retrieve it, and not
followed by more spaces. Otherwise I will have to pick a different database ;).

So similarly, I would expect that varchars 'a ' and 'a ' when compared
should be different.

However, in the case of _chars_ which are padded, then 'a ' should be
padded so that it can be compared with 'a '.

Otherwise there will be no reason to do equality comparisons of char(5)
fields with char(8) fields - they can NEVER be the same :).

But would that mean that when one does equality comparisons of varchars
with chars, one would probably want padding? Or only varchars of the same
length as the char would have a chance of matching?

Hmm.. I think I better leave this one to the DB gurus :). But I really
don't ever want 'a ' to be the same as 'a ' for varchars.

Link.

#10Dann Corbit
DCorbit@connx.com
In reply to: Lincoln Yeoh (#9)
Re: [HACKERS] 'a' == 'a '

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Greg Stark
Sent: Wednesday, October 19, 2005 11:17 PM
To: Tom Lane
Cc: Chris Travers; josh@agliodbs.com; pgsql-hackers@postgresql.org;

Dann

Corbit; Stephan Szabo; Terry Fielder; Tino Wildenhain; Marc G.

Fournier;

Richard_D_Levine@raytheon.com; pgsql-general@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Tom Lane <tgl@sss.pgh.pa.us> writes:

Chris Travers <chris@travelamericas.com> writes:

If I understand the spec correctly, it seems to indicate that this

is

specific to the locale/character set.

The spec associates padding behavior with collations, which per spec

are

separate from the datatypes --- that is, you should be able to able

to

specify a collation for each string-type table column (whether

char(N)

or varchar(N)) and even for each literal string constant. We do not
currently have that capability, and accordingly fall back to binding
PAD SPACE behavior to char(N) and NO PAD behavior to varchar(N).

AFAICS this choice is allowed by the spec since the default

collation is

implementation-defined.

Does it even make sense for char(N) to not be space padded? I had the
impression char(N) was always N characters long, not more or less. I

can't

picture any other character being used for padding, then you would

need a

more
flexible rtrim function.

And I can understand the collation order determining whether 'a' and

'a '

compare equal. But surely if you store 'a' in a varchar(N) you have to

get

'a'
back out, not some other string! Does the spec really allow varchar to
actually be padded and not just compare ignoring trailing space?

(I can't believe anyone really wants varchar to be space padded. Space
padding
always seemed like a legacy feature for databases with fixed record

length

data types. Why would anyone want a string data type that can't

represent

all
strings?)

Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.

Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.

Hence, you will see that 'Danniel' = 'Danniel ' is true in most cases.

Now, this really does not have any connection with storage or varchar or
bpchar or char or text or anything like that.

It is only the action to be taken when a comparison operation is
performed.

#11Chris Travers
chris@travelamericas.com
In reply to: Dann Corbit (#10)
Re: [HACKERS] 'a' == 'a '

Dann Corbit wrote:

Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.

IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.

Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would be
answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.

Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.

Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb option?
How often does this behavior cause problems?

Best Wishes,
Chris Travers
Metatron Technology Consulting

#12Dann Corbit
DCorbit@connx.com
In reply to: Chris Travers (#11)
Re: [HACKERS] 'a' == 'a '

-----Original Message-----
From: Chris Travers [mailto:chris@travelamericas.com]
Sent: Thursday, October 20, 2005 11:53 AM
To: Dann Corbit
Cc: Greg Stark; Tom Lane; Chris Travers; josh@agliodbs.com; pgsql-
hackers@postgresql.org; Stephan Szabo; Terry Fielder; Tino Wildenhain;
Marc G. Fournier; Richard_D_Levine@raytheon.com; pgsql-
general@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Dann Corbit wrote:

Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.

IIrc, varchar and bpchar are stored in a similar way, but are

presented

differently when retrieved. I.e. storage is separate from

presentation

in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.

Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would

be

answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.

I believe that this is a reasonable response. In order to comply with
the standard, bpchar and varchar would have to be stored with different
default collating sequences (which is fine with me). If (indeed) that
is the case, the only action needed would be to document the collating
sequences used.

Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the

same

length before performing the comparison.

Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb

option?

How often does this behavior cause problems?

I do not even know if it is a good idea. I was just pointing out that
the behavior of PostgreSQL is different from all the big database
vendors in this area and according to my reading of the standard, the
behavior was not compliant.

As to how often it causes a problem, I can't say. It has caused me
puzzlement on a few occasions, but no end of the world disasters.

#13Noname
Richard_D_Levine@raytheon.com
In reply to: Chris Travers (#11)
Re: [HACKERS] 'a' == 'a '

I will happily reiterate that I am the troll who started this mess by
whining about how *Oracle* handles this. Tom's explanation that CHAR is
has a PAD collation and VARCHAR has a NO PAD collation have restored my
faith that there is goodness in the world. My whining was out of
ignorance. I wouldn't change the proper way PostgreSQL works. Documenting
it is good. I will use this new found knowledge from now on in my database
designs.

Cheers,

Rick

Chris Travers <chris@travelamericas.com> wrote on 10/20/2005 01:52:36 PM:

Show quoted text

Dann Corbit wrote:

Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.

IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.

Again, I am happy "solving" this simply by documenting it since any
questions of interpretation and implimentation of the standard would be
answered. So far what I (and I am sure others) have not heard is a
strong case for changing the behavior, given that it is in line with a
reasonable interpretation of the standards.

Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the same
length before performing the comparison.

Understood, but what gain do you have in a case like this that might
justify the effort that would go into making it, say, an initdb option?
How often does this behavior cause problems?

Best Wishes,
Chris Travers
Metatron Technology Consulting

#14John D. Burger
john@mitre.org
In reply to: Dann Corbit (#10)
Re: [HACKERS] 'a' == 'a '

[Removed all the non-list addresses]

Dann Corbit wrote:

Let me make something clear:
When we are talking about padding here it is only in the context of a
comparison operator and NOT having anything to do with storage.

Given two strings of different in a comparison, most database systems
(by default) will blank pad the shorter string so that they are the
same
length before performing the comparison.

Hence, you will see that 'Danniel' = 'Danniel ' is true in most cases.

Now, this really does not have any connection with storage or varchar
or
bpchar or char or text or anything like that.

Is this really true??? My understanding of the spec was that this was
=exactly= the difference between char(N) and varchar(N) - the former is
padded to length N when you store it, or at least the DB has to act as
if this is the case. Can someone quote the appropriate chapter and
verse?

Thanks.

- John D. Burger
MITRE

#15Dann Corbit
DCorbit@connx.com
In reply to: John D. Burger (#14)
Re: [HACKERS] 'a' == 'a '

Look back in the stack and you will find that I have quoted chapter and
verse (see the attached html file in a previous email that I sent).

This is in relation to the comparison operator.

-----Original Message-----
From: John D. Burger [mailto:john@mitre.org]
Sent: Thursday, October 20, 2005 12:53 PM
To: Dann Corbit
Cc: <pgsql-hackers@postgresql.org>; pgsql-general General
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

[Removed all the non-list addresses]

Dann Corbit wrote:

Let me make something clear:
When we are talking about padding here it is only in the context of

a

comparison operator and NOT having anything to do with storage.

Given two strings of different in a comparison, most database

systems

(by default) will blank pad the shorter string so that they are the
same
length before performing the comparison.

Hence, you will see that 'Danniel' = 'Danniel ' is true in most

cases.

Now, this really does not have any connection with storage or

varchar

or
bpchar or char or text or anything like that.

Is this really true??? My understanding of the spec was that this was
=exactly= the difference between char(N) and varchar(N) - the former

is

padded to length N when you store it, or at least the DB has to act as
if this is the case. Can someone quote the appropriate chapter and
verse?

Again, the operations that I was talking about literally have nothing to
do with the internal data type of the string, as long as it is
fundamentally a character type. I was referring to what happens when
two strings are involved in a comparison operation. The standard spells
it out very clearly. Whether something is blank padded or not during a
comparison has nothing whatever to do with the type of the string and
everything to do with the collating sequence.

Show quoted text

Thanks.

- John D. Burger
MITRE

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Travers (#11)
Re: [HACKERS] 'a' == 'a '

Chris Travers <chris@travelamericas.com> writes:

IIrc, varchar and bpchar are stored in a similar way, but are presented
differently when retrieved. I.e. storage is separate from presentation
in this case. I.e. the padding in bpchar occurs when it is presented
and stripped when it is stored.

This is not so, although I've occasionally wondered whether we shouldn't
try to make it so. Certainly we don't rely on char(N) to be physically
fixed-width (and can't, at least not with variable-width character
encodings) so there can be no performance advantage to actually storing
the insignificant spaces. The hard part would be in figuring out how
the output routine could know how many spaces to add back.

regards, tom lane

#17Noname
Richard_D_Levine@raytheon.com
In reply to: Tom Lane (#16)
Re: [HACKERS] 'a' == 'a '

Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>

The hard part would be in figuring out how
the output routine could know how many spaces to add back.

The length is in the metadata for the column, or am I being dense?

Show quoted text

regards, tom lane

#18Dann Corbit
DCorbit@connx.com
In reply to: Noname (#17)
Re: [GENERAL] 'a' == 'a '

-----Original Message-----
From: Richard_D_Levine@raytheon.com

[mailto:Richard_D_Levine@raytheon.com]

Sent: Thursday, October 20, 2005 2:12 PM
To: Tom Lane
Cc: Chris Travers; Dann Corbit; Greg Stark; josh@agliodbs.com; pgsql-
general@postgresql.org; pgsql-hackers@postgresql.org; Marc G.

Fournier;

Stephan Szabo; Terry Fielder; Tino Wildenhain
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>

The hard part would be in figuring out how
the output routine could know how many spaces to add back.

The length is in the metadata for the column, or am I being dense?

I guess that what Tom is saying is that it would be nice to store
everything as VARCHAR. But with (for instance) BPCHAR, the returned
string is blank padded. So if you store
'Danniel'
in BPCHAR(20), you will get back:
'Danniel '
But if you store
'Danniel'
In VARCHAR(20)
You will get back exactly what you put in.

I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks. I am not arguing pro nor con this way of
doing things.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#17)
Re: [HACKERS] 'a' == 'a '

Richard_D_Levine@raytheon.com writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:

The hard part would be in figuring out how
the output routine could know how many spaces to add back.

The length is in the metadata for the column, or am I being dense?

The output routine hasn't got access to that metadata. In general,
in fact, a random Datum can't be linked to any column since it may have
been generated on-the-fly.

It'd be easy if we stuck the column width into the stored representation
of bpchar, but this would defeat one of the purposes, ie, to make the
stored representations of text and bpchar alike.

regards, tom lane

#20Noname
Richard_D_Levine@raytheon.com
In reply to: Dann Corbit (#18)
Re: [GENERAL] 'a' == 'a '

"Dann Corbit" <DCorbit@connx.com> wrote on 10/20/2005 04:24:26 PM:

-----Original Message-----
From: Richard_D_Levine@raytheon.com

[mailto:Richard_D_Levine@raytheon.com]

Sent: Thursday, October 20, 2005 2:12 PM
To: Tom Lane
Cc: Chris Travers; Dann Corbit; Greg Stark; josh@agliodbs.com; pgsql-
general@postgresql.org; pgsql-hackers@postgresql.org; Marc G.

Fournier;

Stephan Szabo; Terry Fielder; Tino Wildenhain
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Tom Lane <tgl@sss.pgh.pa.us> wrote on 10/20/2005 03:11:23 PM:
<snip>

The hard part would be in figuring out how
the output routine could know how many spaces to add back.

The length is in the metadata for the column, or am I being dense?

I guess that what Tom is saying is that it would be nice to store
everything as VARCHAR.

<snip>

I get that part, but he asked how to determine the proper output length
based on the PAD semantics, and I was saying to just pad whatever is stored
to the length available in the column metadata for those collations that
require padding.

I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks. I am not arguing pro nor con this way of
doing things.

For PAD correlations, Oracle stores the blanks. Interbase does not. Going
from Interbase to Oracle I switched from CHAR to VARCHAR2 because of this,
shot my self in the foot by not understanding PAD, whined about it, and you
know the rest.

As to how it is output, I know the language interface has an effect.
Embedded SQL in C will put the null terminator in different places for
different databases and different switch settings particular to each
database.

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dann Corbit (#18)
Re: [GENERAL] 'a' == 'a '

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

regards, tom lane

#22Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#21)
Re: [GENERAL] 'a' == 'a '

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 20, 2005 2:54 PM
To: Dann Corbit
Cc: Richard_D_Levine@raytheon.com; general@postgresql.org; pgsql-
hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional

spaces to

the end. Many database systems solve this by trimming the

characters

from the end of the string upon storage and the returned string will

not

have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

This is SQL*Server:

drop table test_char
go
create table test_char(
fixed_30 char(30),
varch_30 varchar(30),
nchar_30 nchar(30),
nvarc_30 nvarchar(30)
)
go
insert into test_char values('Dann ', 'Dann ', 'Dann ', 'Dann ')
go
select len(fixed_30), len(varch_30), len(nchar_30), len(nvarc_30)
from test_char
go

Result set:
4 4 4 4

#23Andrew Dunstan
andrew@dunslane.net
In reply to: Dann Corbit (#22)
Re: [GENERAL] 'a' == 'a '

Dann Corbit wrote:

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 20, 2005 2:54 PM
To: Dann Corbit
Cc: Richard_D_Levine@raytheon.com; general@postgresql.org; pgsql-
hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional

spaces to

the end. Many database systems solve this by trimming the

characters

from the end of the string upon storage and the returned string will

not

have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

This is SQL*Server:

drop table test_char
go
create table test_char(
fixed_30 char(30),
varch_30 varchar(30),
nchar_30 nchar(30),
nvarc_30 nvarchar(30)
)
go
insert into test_char values('Dann ', 'Dann ', 'Dann ', 'Dann ')
go
select len(fixed_30), len(varch_30), len(nchar_30), len(nvarc_30)
from test_char
go

Result set:
4 4 4 4

Yech.

What does one do in such a system if you want trailing blanks to be
significant, or even kept?

Anyway, the consensus seems to be that Postgresql's behavious is
consistent with a reasonable reading of the standard, so is there
anything really left to discuss, other than a possible addition to
documentation?

cheers

andrew

#24Wilkin, Kurt
Kurt.Wilkin@fnzc.co.nz
In reply to: Andrew Dunstan (#23)
Re: [GENERAL] 'a' == 'a '

Dann Corbit wrote:

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 20, 2005 2:54 PM
To: Dann Corbit
Cc: Richard_D_Levine@raytheon.com; general@postgresql.org; pgsql-
hackers@postgresql.org Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional
spaces to the end. Many database systems solve this by trimming
the characters from the end of the string upon storage and the
returned

string will
not

have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

This is SQL*Server:

drop table test_char
go
create table test_char(
fixed_30 char(30),
varch_30 varchar(30),
nchar_30 nchar(30),
nvarc_30 nvarchar(30)
)
go
insert into test_char values('Dann ', 'Dann ', 'Dann ', 'Dann ') go
select len(fixed_30), len(varch_30), len(nchar_30), len(nvarc_30)
from test_char go

Result set:
4 4 4 4

For SQL Server, you are seeing the behaviour of the len function,
none of the data has been trimmed.
Using the same tables:

select len('x' + (fixed_30) + 'x'),
len('x' + varch_30 + 'x'),
len('x' + nchar_30 + 'x'),
len('x' + nvarc_30 + 'x')
from test_char

Result set :

32 7 32 7

Cheers, Kurt.

--
This electronic message together with any attachments is confidential and
intended for the named recipient's use only. If you are not the intended
recipient (i) do not copy, disclose or use the contents in any way, (ii)
please let us know by return email immediately then destroy the message, and
any hard copies of the message, and any attachments. The sender of this
message is not responsible for any changes made to this message and/or any
attachments and/or connection linkages to the Internet referred to in this
message after it has been sent. Unless otherwise stated, any pricing
information given in this message and/or attachments is indicative only, is
subject to change and does not constitute an offer to buy or sell securities
or derivatives at any price quoted. Any reference to the terms of executed
transactions should be treated as preliminary only and subject to separate
formal written notification. Where reference is made to research material
and/or research recommendations, the basis of the provision of such research
material and/or recommendations is set out in the relevant disclaimer.

#25Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Wilkin, Kurt (#24)
Re: [GENERAL] 'a' == 'a '

I wonder how widespread the MicroSoft behavior is Sybase ASE,
for example, gives this result set:

30 5 30 5

That seems more appropriate to me.

-Kevin

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional

spaces to

the end. Many database systems solve this by trimming the

characters

from the end of the string upon storage and the returned string will

not

have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

This is SQL*Server:

drop table test_char
go
create table test_char(
fixed_30 char(30),
varch_30 varchar(30),
nchar_30 nchar(30),
nvarc_30 nvarchar(30)
)
go
insert into test_char values('Dann ', 'Dann ', 'Dann ', 'Dann ')
go
select len(fixed_30), len(varch_30), len(nchar_30), len(nvarc_30)
from test_char
go

Result set:
4 4 4 4

#26Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#21)
Re: [GENERAL] 'a' == 'a '

Is there any TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional spaces to
the end. Many database systems solve this by trimming the characters
from the end of the string upon storage and the returned string will not
have any trailing blanks.

Can you document that? ISTM that that would effectively make char(n)
and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#27Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#26)
Re: [GENERAL] 'a' == 'a '

Document the collating sequences used for the character types.

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Monday, October 24, 2005 11:01 AM
To: Tom Lane
Cc: Dann Corbit; Richard_D_Levine@raytheon.com;

general@postgresql.org;

pgsql-hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Is there any TODO here?

------------------------------------------------------------------------
--

-

Tom Lane wrote:

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional

spaces

to

the end. Many database systems solve this by trimming the

characters

from the end of the string upon storage and the returned string

will

not

have any trailing blanks.

Can you document that? ISTM that that would effectively make

char(n)

and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do

not

Show quoted text

match

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania
19073
#28Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Dann Corbit (#27)
Re: [GENERAL] 'a' == 'a '

Dann Corbit wrote:

Document the collating sequences used for the character types.

Sorry, I don't understand that. What does it mean?

---------------------------------------------------------------------------

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Monday, October 24, 2005 11:01 AM
To: Tom Lane
Cc: Dann Corbit; Richard_D_Levine@raytheon.com;

general@postgresql.org;

pgsql-hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Is there any TODO here?

------------------------------------------------------------------------
--

-

Tom Lane wrote:

"Dann Corbit" <DCorbit@connx.com> writes:

I guess that additional ambiguity arises if you add additional

spaces

to

the end. Many database systems solve this by trimming the

characters

from the end of the string upon storage and the returned string

will

not

have any trailing blanks.

Can you document that? ISTM that that would effectively make

char(n)

and varchar(n) exactly equivalent, which is ... um ... a bit stupid.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do

not

match

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania
19073
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#29Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#28)
Re: [GENERAL] 'a' == 'a '

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Monday, October 24, 2005 5:57 PM
To: Dann Corbit
Cc: Tom Lane; Richard_D_Levine@raytheon.com; general@postgresql.org;
pgsql-hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Dann Corbit wrote:

Document the collating sequences used for the character types.

Sorry, I don't understand that. What does it mean?

According to the standard, comparisons against character types are
supposed to PAD with either <space> or <PAD CHAR> according to the
collating sequence and not according to the data type.

Since the padding behavior of PosgreSQL comparisons for character types
is different according to the data type and not the collating sequence,
what is needed is simply to define the collating sequence used for
different character types.

#30Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Dann Corbit (#29)
Re: [GENERAL] 'a' == 'a '

Dann Corbit wrote:

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Monday, October 24, 2005 5:57 PM
To: Dann Corbit
Cc: Tom Lane; Richard_D_Levine@raytheon.com; general@postgresql.org;
pgsql-hackers@postgresql.org
Subject: Re: [GENERAL] [HACKERS] 'a' == 'a '

Dann Corbit wrote:

Document the collating sequences used for the character types.

Sorry, I don't understand that. What does it mean?

According to the standard, comparisons against character types are
supposed to PAD with either <space> or <PAD CHAR> according to the
collating sequence and not according to the data type.

Since the padding behavior of PosgreSQL comparisons for character types
is different according to the data type and not the collating sequence,
what is needed is simply to define the collating sequence used for
different character types.

But isn't collating sequence related to ordering? How does this relate
to padding?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#31Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#30)
Re: [GENERAL] 'a' == 'a '

Dann Corbit wrote:

But isn't collating sequence related to ordering? How does this

relate

to padding?

Right. Collating sequence is how ordering is defined. But when you
compare two character types, they are supposed to pad according to the
collating sequence. So whether you blank fill or pad with some special
character when performing a comparison is defined by the collating
sequence and not by the character type. Since we see (for instance)
that bpchar(n) and varchar(n) pad differently when performing a
comparison, we must assume that they have a different collating
sequence. So the question is "what is it?"

It is always possible that I have misread the standard.

OK, I understand now. It is tempting to think that the difference
between char() and varchar() is that internally they use a different
collating sequences, but that isn't the case. If it were, space would
be ignored during comparisons any place in the string, when in fact, is
it is only trailing space that char() ignores, e.g.:

test=> SELECT 'a '::CHAR(10) = 'a'::CHAR(10);
?column?
----------
t
(1 row)

test=> SELECT 'a '::VARCHAR(10) = 'a'::VARCHAR(10);
?column?
----------
f
(1 row)

test=> SELECT 'a'::CHAR(10) = ' a'::CHAR(10);
?column?
----------
f
(1 row)

test=> SELECT 'a'::VARCHAR(10) = ' a'::VARCHAR(10);
?column?
----------
f
(1 row)

Our docs already have:

http://candle.pha.pa.us/main/writings/pgsql/sgml/datatype-character.html

Values of type character are physically padded with spaces to the
specified width n, and are stored and displayed that way. However, the
padding spaces are treated as semantically insignificant. Trailing
spaces are disregarded when comparing two values of type character, and
they will be removed when converting a character value to one of the
other string types. Note that trailing spaces are semantically
significant in character varying and text values.

What additional documentation is needed?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#31)
Re: [GENERAL] 'a' == 'a '

Bruce Momjian <pgman@candle.pha.pa.us> writes:

OK, I understand now. It is tempting to think that the difference
between char() and varchar() is that internally they use a different
collating sequences, but that isn't the case. If it were, space would
be ignored during comparisons any place in the string, when in fact, is
it is only trailing space that char() ignores, e.g.:

No, you don't understand. The standard defines PAD SPACE as making
*trailing* spaces irrelevant to comparisons, not embedded or leading
spaces. And they regard PAD SPACE as an attribute of a collation rather
than of the string datatype. This seems pretty wacko to me (in
particular it's hard to see how NO PAD is useful with char(N) storage)
but that's what they did.

What additional documentation is needed?

Some specific discussion of the relationship to the standard would be
helpful, perhaps.

regards, tom lane

#33Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#32)
Re: [GENERAL] 'a' == 'a '

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

OK, I understand now. It is tempting to think that the difference
between char() and varchar() is that internally they use a different
collating sequences, but that isn't the case. If it were, space would
be ignored during comparisons any place in the string, when in fact, is
it is only trailing space that char() ignores, e.g.:

No, you don't understand. The standard defines PAD SPACE as making
*trailing* spaces irrelevant to comparisons, not embedded or leading
spaces. And they regard PAD SPACE as an attribute of a collation rather
than of the string datatype. This seems pretty wacko to me (in
particular it's hard to see how NO PAD is useful with char(N) storage)
but that's what they did.

So you can have two collating sequences where in one trailing space is
significant, and another that isn't? Strange.

What additional documentation is needed?

Some specific discussion of the relationship to the standard would be
helpful, perhaps.

I guess, but explaining it seems pretty complex in itself, and I am
unsure what value it adds.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#34Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#33)
Re: [GENERAL] 'a' == 'a '

Bruce Momjian wrote:

What additional documentation is needed?

Some specific discussion of the relationship to the standard would be
helpful, perhaps.

I guess, but explaining it seems pretty complex in itself, and I am
unsure what value it adds.

It will give us something to keep the language lawyers occupied :-)

cheers

andrew