char(n) to varchar or text conversion should strip trailing spaces

Started by Tom Laneover 23 years ago8 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

I've gotten really tired of explaining to newbies why stuff involving
char(n) fields doesn't work like they expect. Our current behavior is
not valid per SQL92 anyway, I believe.

I think there is a pretty simple solution now that we have pg_cast:
we could stop treating char(n) as binary-equivalent to varchar/text,
and instead define it as requiring a runtime conversion (which would
be essentially the rtrim() function). The cast in the other direction
would be assignment-only, so that any expression that involves mixed
char(n) and varchar/text operations would be evaluated in varchar
rules after stripping char's insignificant trailing blanks.

If we did this, then operations like
WHERE UPPER(charcolumn) = 'FOO'
would work as a newbie expects. I believe that we'd come a lot closer
to spec compliance on the behavior of char(n), too.

Comments?

regards, tom lane

#2Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#1)
Re: char(n) to varchar or text conversion should strip trailing spaces

I've gotten really tired of explaining to newbies why stuff involving
char(n) fields doesn't work like they expect. Our current behavior is
not valid per SQL92 anyway, I believe.

I think there is a pretty simple solution now that we have pg_cast:
we could stop treating char(n) as binary-equivalent to varchar/text,
and instead define it as requiring a runtime conversion (which would
be essentially the rtrim() function). The cast in the other direction
would be assignment-only, so that any expression that involves mixed
char(n) and varchar/text operations would be evaluated in varchar
rules after stripping char's insignificant trailing blanks.

If we did this, then operations like
WHERE UPPER(charcolumn) = 'FOO'
would work as a newbie expects. I believe that we'd come a lot closer
to spec compliance on the behavior of char(n), too.

I am all for it. That would much more closely match what I would expect.

One alternate possible approach would maybe be to change the on-disk
representation to really be binary compatible and change the input
output and operator functions ? IIRC fixed width optimizations do not gain as
much as in earlier versions anyway. Then char(n) would have the benefit of
beeing trailing blank insensitive and having the optimal storage format.

Andreas

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#2)
Re: char(n) to varchar or text conversion should strip trailing spaces

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

One alternate possible approach would maybe be to change the on-disk
representation to really be binary compatible and change the input
output and operator functions ?

Hmm ... now that's an interesting thought. So the input converter would
actively strip trailing blanks, output would add them back, and only in
a few char(n)-specific functions would we need to pretend they are there.

This would mean that char(n)-to-text is binary compatible but the
reverse direction is not (it must strip trailing blanks) --- but that
coercion could be folded into the existing length-limit checker for
char(n) columns.

IIRC fixed width optimizations do not gain as
much as in earlier versions anyway.

There are no fixed-width optimizations for char(n) at all, now that we
have multibyte enabled all the time.

Seems like a great idea to me.

regards, tom lane

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
Re: char(n) to varchar or text conversion should strip

Tom Lane writes:

Hmm ... now that's an interesting thought. So the input converter would
actively strip trailing blanks, output would add them back,

But how would the output know how many to put back?

--
Peter Eisentraut peter_e@gmx.net

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: char(n) to varchar or text conversion should strip trailing spaces

I said:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

One alternate possible approach would maybe be to change the on-disk
representation to really be binary compatible and change the input
output and operator functions ?

Seems like a great idea to me.

On further thought I've got some reservations about it. The main
problem is that the physical contents of a char(n) Datum wouldn't
provide the complete semantic meaning: you must know the typmod as well
to know the number of padding spaces that are supposed to be there.
We have special provisions to allow input and output functions to know
the typmod, but in more general cases functions will not receive a
typmod, and in any case they cannot deliver a typmod. So for any
operation that wanted to behave as though the padding spaces are real,
there'd be a problem.

We could maybe play some games with having char(n) values be expanded
(space-padded) during expression evaluation and then trimmed for
storage, but this strikes me as awfully messy, not to mention redundant
with the compression done by TOAST.

Also: I've been reading through the spec in more detail, and what I now
realize is that they expect trailing blanks to be ignored by default in
both char(n) and varchar(n) comparisons! In fact, it's not really the
datatype that determines this, but the collation attribute, and they
specify that the default collation must have the PAD SPACE attribute
(which essentially means that trailing spaces are not significant).
What's more, AFAICT padding spaces in char(n) are treated as real data
by every operation except comparison --- for example, they are real
data in concatenation.

I don't think we really want to meet the letter of the spec here :-(
It seems quite schizoid to treat pad spaces as real data for everything
except comparison. Certainly I do not want to do that for varchar or
text datatypes.

The idea of having char(n)-to-text conversion strip trailing blanks
still appeals to me, but I have to withdraw the claim that it'd improve
our spec compliance; it wouldn't.

I'm now wondering whether it wouldn't be better to leave the data
representation as-is (padding spaces are stored), and still allow binary
compatibility both ways, but add a collection of duplicate pg_proc and
pg_operator entries so that char-ness is preserved where appropriate.
For example, we'd need both these pg_proc entries for UPPER():
upper(text) returns text
upper(character) returns character
They could point at the same C routine, but the parser would select the
first when the input is text or varchar, and the second when the input
is character. This would solve the original complaint about
upper('foo '::char(6)) = 'FOO'
needing to yield TRUE. The extra pg_proc entries would be a tad
tedious, but I think we'd only need a couple dozen to satisfy the spec's
requirements. Some (perhaps not all) of these functions would need to
be duplicated:

btrim(text)
btrim(text,text)
convert(text,name)
convert(text,name,name)
convert_using(text,text)
initcap(text)
lower(text)
lpad(text,integer)
lpad(text,integer,text)
ltrim(text)
ltrim(text,text)
max(text)
min(text)
overlay(text,text,integer)
overlay(text,text,integer,integer)
repeat(text,integer)
replace(text,text,text)
rpad(text,integer)
rpad(text,integer,text)
rtrim(text)
rtrim(text,text)
split_part(text,text,integer)
substr(text,integer)
substr(text,integer,integer)
substring(text,integer)
substring(text,integer,integer)
text_larger(text,text)
text_smaller(text,text)
textcat(text,text)
translate(text,text,text)
upper(text)

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#4)
Re: char(n) to varchar or text conversion should strip

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane writes:

Hmm ... now that's an interesting thought. So the input converter would
actively strip trailing blanks, output would add them back,

But how would the output know how many to put back?

The output routine would need access to the column typmod. Which it
would have, in simple "SELECT columnname" cases, but this is a serious
weakness of the scheme in general. See my followup post of a few
minutes ago.

regards, tom lane

#7Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#6)
Re: char(n) to varchar or text conversion should strip

Hmm ... now that's an interesting thought. So the input converter would
actively strip trailing blanks, output would add them back,

But how would the output know how many to put back?

The output routine would need access to the column typmod. Which it
would have, in simple "SELECT columnname" cases, but this is a serious
weakness of the scheme in general.

I think it is not really a weakness of the scheme, but a weakness that typmod
is not available in some places where it would actually be needed.
One effect of this is, that all the varlena datatypes have a redundant
length info per row, even for such types that have the same length for
all rows of one table (e.g. numeric(8,2), and char(n)). In a lot of cases
that means 50-100% more disk space than actually needed.

I can see, that making typmod availabe in more places would probably
be major work (or not feasible at all), but I think it would generally be
of (great) value.

To the problem of concatenation:
Would it be feasible to alter the concatenation method to concatenate
the results of the output functions of the relevant expressions ?
Imho that would actually return the expected results more often than it
currently does, and it would fix the typmod issue for char(n) concatenation.

Andreas

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#7)
Re: char(n) to varchar or text conversion should strip

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I think it is not really a weakness of the scheme, but a weakness that typmod
is not available in some places where it would actually be needed.
One effect of this is, that all the varlena datatypes have a redundant
length info per row, even for such types that have the same length for
all rows of one table (e.g. numeric(8,2), and char(n)).

That argument doesn't hold any water, seeing that neither of your
examples are actually fixed-width... numeric doesn't store leading
or trailing zeroes, and char(n) is not a fixed-width item when dealing
with multibyte encodings. And on top of that, there's TOAST to
think about. So I don't think there's any shot at getting rid of the
varlen word.

Would it be feasible to alter the concatenation method to concatenate
the results of the output functions of the relevant expressions ?
Imho that would actually return the expected results more often than it
currently does, and it would fix the typmod issue for char(n) concatenation.

No, it wouldn't really fix the typmod issue; you still have the problem
of where is the output function going to get the typmod from? If the
concatenation argument is anything but a simple column reference, you've
still got trouble.

regards, tom lane