Identifier case folding notes

Started by Peter Eisentrautalmost 18 years ago20 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

I have had some idle thoughts on the issue of identifier case folding. Maybe
we can collect our ideas and inch closer to a solution sometime. Or we
determine that it's useless and impossible, but then I can at least collect
that result in the wiki and point future users to it.

Background: PostgreSQL folds unquoted identifiers to lower case. SQL
specifies folding to upper case, and most other SQL DBMS do it that way. If
an application mixes referring to an object using quoted and unquoted
versions of an identifier, then incompatibilities arise. We have always
stressed to users that one should refer to an object always unquoted or
always quoted. While that remains a good suggestion for a number of reasons,
we have seen occasional complaints that some closed source applications
violate this rule and therefore cannot be run on PostgreSQL.

A number of solutions have been proposed over time, which I summarize here:

1. Change the lexer to fold to upper case, as it is supposed to do according
to the SQL standard. This will break almost everything, because almost all
built-in objects have lower-case names and thus couldn't be referred to
anymore except by quoting. Changing the names of all the internal objects to
upper-case names would involve vast code changes, probably break just as much,
and make everything uglier. So this approach is unworkable.

2. Fold to upper case, but not when referring built-in objects. The lexer
generally doesn't know what a name will refer to, so this is not possible to
implement, at least without resorting to lots of hard-coding or horrid
kludges. Also, a behavior like this will probably create all kinds of weird
inconsistencies, resulting from putting catalog knowledge in the lexer.

3 and 4 -- Two variants of "ignore case altogether":

3. Fold all identifiers to lower case, even quoted ones. This would probably
in fact fix the breakage of many of the above-mentioned problem applications,
and it would in general be very easy to understand for a user. And it could
be implemented in about three lines. One disadvantage is that one could no
longer have objects that have names different only by case, but that is
probably rare and incredibly stupid and can be neglected. The main
disadvantage is that the case of identifiers and in particular column labels
is lost. So applications and programming interfaces that look up result
columns in a case-sensitive manner would fail. And things like SELECT expr AS
"Nice Heading" won't work properly anymore.

4. Compare the "name" data type in a case-insensitive manner. This would
probably address most problem cases. Again, you can't have objects with names
different in case only. One condition to implementing this would be that this
behavior would have be tied down globally at initdb, because it affects system
indexes and shared catalogs. That might be impractical for some, because
you'd need different instances for different behaviors, especially when you
want to host multiple applications or want to port an affected application to
the native PostgreSQL behavior over time.

5. One additional approach I thought of is that you swap the case of
identifiers as you lex them (upper to lower, lower to upper), and then swap
them back when you send them to the client. This needs a small change in the
lexer, one for sending the RowDescription, and support in pg_dump and a few
other places if desired. There will, however, be a number of weird, albeit
self-imposed, side-effects. I have implemented a little test patch for this.
It's weird, but it works in basic ways.

Obviously, no solution will ever work completely. And we probably don't want
such a solution, because it would create two different and incompatible
PostgreSQL universes. If we are aiming for a solution that would allow most
affected applications to hobble along, we would probably serve most users.
Implementing some or all of 3, 4, and 5 would probably achieve that.

Comments?

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Peter Eisentraut (#1)
Re: Identifier case folding notes

On Tue, 2008-07-08 at 19:25 +0200, Peter Eisentraut wrote:

4. Compare the "name" data type in a case-insensitive manner. This
would probably address most problem cases. Again, you can't have
objects with names
different in case only. One condition to implementing this would be
that this
behavior would have be tied down globally at initdb, because it
affects system
indexes and shared catalogs. That might be impractical for some,
because
you'd need different instances for different behaviors, especially
when you
want to host multiple applications or want to port an affected
application to
the native PostgreSQL behavior over time.

That sounds the most workable, given your descriptions.

If objects are never different solely by case alone, then you will have
the same index ordering as if you had sent them all to lower case.

Surely it is possible to mix the two approaches somehow?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: Identifier case folding notes

Peter Eisentraut <peter_e@gmx.net> writes:

I have had some idle thoughts on the issue of identifier case folding.
...
Comments?

IMHO, practically the only solid argument for changing from the way
we do things now is "to meet the letter of the spec". The various sorts
of gamesmanship you list would most definitely not meet the letter of
the spec; between that and the inevitability of breaking some apps,
I'm inclined to reject them all on sight.

What I think would perhaps be worth investigating is a compile-time
(or at latest initdb-time) option that flips the case folding behavior
to SQL-spec-compliant and also changes all the built-in catalog entries
to upper case. We would then have a solution we could offer to people
who really need to run apps that depend on SQL-spec case folding ...
and if the upper case hurts their eyes, or breaks some other apps that
they wish they could run in the same DB, well it's their problem.

Of course there would be large amounts of work to try to make psql,
pg_dump, etc behave as nicely as possible with either case-folding rule,
but it doesn't strike me as being so obviously "unworkable" as to be
dismissed at once.

regards, tom lane

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
Re: Identifier case folding notes

Am Dienstag, 8. Juli 2008 schrieb Tom Lane:

IMHO, practically the only solid argument for changing from the way
we do things now is "to meet the letter of the spec".

Well no. As I have mentioned, there have actually been occasional complaints
by people who can't run their code generated by closed-source applications,
because they handle the case differently.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#4)
Re: Identifier case folding notes

Peter Eisentraut <peter_e@gmx.net> writes:

Am Dienstag, 8. Juli 2008 schrieb Tom Lane:

IMHO, practically the only solid argument for changing from the way
we do things now is "to meet the letter of the spec".

Well no. As I have mentioned, there have actually been occasional complaints
by people who can't run their code generated by closed-source applications,
because they handle the case differently.

Sure, otherwise we wouldn't really be worrying about this. But if
someone comes to us and says "this closed source app requires some
weird non-spec-compliant case folding rule, please make Postgres
do that", we're going to say no. Their argument only has weight
if they say their app expects the SQL-spec behavior.

regards, tom lane

#6Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#3)
Re: Identifier case folding notes

Tom,

IMHO, practically the only solid argument for changing from the way
we do things now is "to meet the letter of the spec". The various sorts
of gamesmanship you list would most definitely not meet the letter of
the spec; between that and the inevitability of breaking some apps,
I'm inclined to reject them all on sight.

Actually, there are a number of *very* popular database tools, particularly
in the Java world (such as Netbeans and BIRT) which do mix quoted and
unquoted identifiers. In general, users of those tools reject PostgreSQL
as "broken" for our nonstandard behavoir rather than trying to work around
it.

So it's not just a standards issue; this problem really *is* hurting us in
adoption.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#7Andreas Joseph Krogh
andreak@officenet.no
In reply to: Josh Berkus (#6)
Re: Identifier case folding notes

On Tuesday 08 July 2008 23:04:51 Josh Berkus wrote:

Tom,

IMHO, practically the only solid argument for changing from the way
we do things now is "to meet the letter of the spec". The various sorts
of gamesmanship you list would most definitely not meet the letter of
the spec; between that and the inevitability of breaking some apps,
I'm inclined to reject them all on sight.

Actually, there are a number of *very* popular database tools, particularly
in the Java world (such as Netbeans and BIRT) which do mix quoted and
unquoted identifiers. In general, users of those tools reject PostgreSQL
as "broken" for our nonstandard behavoir rather than trying to work around
it.

So it's not just a standards issue; this problem really *is* hurting us in
adoption.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Right. From a user's perspective 4) sounds best. I often run into problems having keywords as column-names:

andreak=# create table test(user varchar);
ERROR: syntax error at or near "user"
LINE 1: create table test(user varchar);
^
andreak=# create table test("user" varchar);
CREATE TABLE
andreak=# insert into test("USER") values('testuser');
ERROR: column "USER" of relation "test" does not exist
LINE 1: insert into test("USER") values('testuser');
^
andreak=# insert into test(user) values('testuser');
ERROR: syntax error at or near "user"
LINE 1: insert into test(user) values('testuser');
^
andreak=# insert into test("user") values('testuser');
INSERT 0 1

As you know, the only way of referring to the "user"-column is to qoute it in lowercase, which many apps and tools don't do.

--
Andreas Joseph Krogh <andreak@officenet.no>
Senior Software Developer / Manager

#8Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Josh Berkus (#6)
Re: Identifier case folding notes

Josh Berkus <josh@agliodbs.com> wrote:

Actually, there are a number of *very* popular database tools,

particularly

in the Java world (such as Netbeans and BIRT) which do mix quoted and

unquoted identifiers. In general, users of those tools reject

PostgreSQL

as "broken" for our nonstandard behavoir rather than trying to work

around

it.

Do these tools expect an unquoted identifier to be treated according
to the standard? As I read it, an unquoted identifier should be
treated identically to the same identifier folded to uppercase and
wrapped in quotes, except that it will be guaranteed to be considered
an identifier, rather than possibly considered as a reserved word,
etc.

From our perspective, we're OK with the status quo since we always

quote all identifiers. I don't think any of the suggestions would
bite us (if implemented bug-free) because we also forbid names which
differ only in capitalization. We help out our programmers by letting
them ignore quoting (except identifiers which are reserved words) and
capitalization when they write queries in our tool; we correct the
capitalization and wrap the identifiers in quotes as we generate the
Java query classes. Doing something like that in psql autocompletion
and in other PostgreSQL tools would be a nice feature, if practicable.

-Kevin

#9Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#3)
Re: Identifier case folding notes

Tom Lane wrote:

What I think would perhaps be worth investigating is a compile-time
(or at latest initdb-time) option that flips the case folding behavior
to SQL-spec-compliant and also changes all the built-in catalog entries
to upper case. We would then have a solution we could offer to people
who really need to run apps that depend on SQL-spec case folding ...
and if the upper case hurts their eyes, or breaks some other apps that
they wish they could run in the same DB, well it's their problem.

+1 for a compile-time option for spec-compliant behavior. Even
where the spec is stupid (timestamp with time zone literals) it'd
be nice to have the option; both for feature completeness
checklists and for teachers who want to teach targeting the spec.

#10Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Ron Mayer (#9)
Re: Identifier case folding notes

Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

+1 for a compile-time option for spec-compliant behavior. Even
where the spec is stupid (timestamp with time zone literals) it'd
be nice to have the option; both for feature completeness
checklists and for teachers who want to teach targeting the spec.

In my world it would be even more important for feature completeness
itself, and for production applications written to the spec for
portability. But, agreed: +1

-Kevin

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#8)
Re: Identifier case folding notes

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Josh Berkus <josh@agliodbs.com> wrote:

Actually, there are a number of *very* popular database tools,
particularly
in the Java world (such as Netbeans and BIRT) which do mix quoted and
unquoted identifiers.

Do these tools expect an unquoted identifier to be treated according
to the standard?

Unfortunately, they almost certainly don't. I'd bet long odds that
what they expect is mysql's traditional behavior, which is not even
within hailing distance of being spec compliant. (In a quick test,
it looks like mysql 5.0's default behavior is never to fold case
at all; and then there's the problem that they use the wrong kind
of quotes ...)

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Joseph Krogh (#7)
Re: Identifier case folding notes

Andreas Joseph Krogh <andreak@officenet.no> writes:

Right. From a user's perspective 4) sounds best. I often run into problems having keywords as column-names:

None of the proposals on the table will remove the need to use quotes in
that case.

regards, tom lane

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#1)
Re: Identifier case folding notes

Peter Eisentraut wrote:

I have had some idle thoughts on the issue of identifier case folding. Maybe
we can collect our ideas and inch closer to a solution sometime. Or we
determine that it's useless and impossible, but then I can at least collect
that result in the wiki and point future users to it.

Background: PostgreSQL folds unquoted identifiers to lower case. SQL
specifies folding to upper case, and most other SQL DBMS do it that way. If
an application mixes referring to an object using quoted and unquoted
versions of an identifier, then incompatibilities arise. We have always
stressed to users that one should refer to an object always unquoted or
always quoted. While that remains a good suggestion for a number of reasons,
we have seen occasional complaints that some closed source applications
violate this rule and therefore cannot be run on PostgreSQL.

A number of solutions have been proposed over time, which I summarize here:

1. Change the lexer to fold to upper case, as it is supposed to do according
to the SQL standard. This will break almost everything, because almost all
built-in objects have lower-case names and thus couldn't be referred to
anymore except by quoting. Changing the names of all the internal objects to
upper-case names would involve vast code changes, probably break just as much,
and make everything uglier. So this approach is unworkable.

2. Fold to upper case, but not when referring built-in objects. The lexer
generally doesn't know what a name will refer to, so this is not possible to
implement, at least without resorting to lots of hard-coding or horrid
kludges. Also, a behavior like this will probably create all kinds of weird
inconsistencies, resulting from putting catalog knowledge in the lexer.

3 and 4 -- Two variants of "ignore case altogether":

3. Fold all identifiers to lower case, even quoted ones. This would probably
in fact fix the breakage of many of the above-mentioned problem applications,
and it would in general be very easy to understand for a user. And it could
be implemented in about three lines. One disadvantage is that one could no
longer have objects that have names different only by case, but that is
probably rare and incredibly stupid and can be neglected. The main
disadvantage is that the case of identifiers and in particular column labels
is lost. So applications and programming interfaces that look up result
columns in a case-sensitive manner would fail. And things like SELECT expr AS
"Nice Heading" won't work properly anymore.

4. Compare the "name" data type in a case-insensitive manner. This would
probably address most problem cases. Again, you can't have objects with names
different in case only. One condition to implementing this would be that this
behavior would have be tied down globally at initdb, because it affects system
indexes and shared catalogs. That might be impractical for some, because
you'd need different instances for different behaviors, especially when you
want to host multiple applications or want to port an affected application to
the native PostgreSQL behavior over time.

5. One additional approach I thought of is that you swap the case of
identifiers as you lex them (upper to lower, lower to upper), and then swap
them back when you send them to the client. This needs a small change in the
lexer, one for sending the RowDescription, and support in pg_dump and a few
other places if desired. There will, however, be a number of weird, albeit
self-imposed, side-effects. I have implemented a little test patch for this.
It's weird, but it works in basic ways.

Obviously, no solution will ever work completely. And we probably don't want
such a solution, because it would create two different and incompatible
PostgreSQL universes. If we are aiming for a solution that would allow most
affected applications to hobble along, we would probably serve most users.
Implementing some or all of 3, 4, and 5 would probably achieve that.

I'm not sure if you've read all the archive history on this. Here are
the pointers from the TODO list:

http://archives.postgresql.org/pgsql-hackers/2004-04/msg00818.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01527.php
http://archives.postgresql.org/pgsql-hackers/2008-03/msg00849.php

The fact is that we have substantial groups of users who want different
things:
. current users who want no change so there is no breakage in existing apps
. users on other DBs who want Spec compliance
. users on yet other DBs who want case preservation

The last group should not be lightly dismissed - it is quite common
behaviour on MSSQL as well as on MySQL, so we have some incentive to
make this possible to encourage migration.

I'm strongly of the opinion therefore that this should be behaviour
determined at initdb time (can't make it later because of shared
catalogs). I suspect that we won't be able to do all this by simple
transformations in the lexer, unlike what we do now. But I do think it's
worth doing.

cheers

andrew

#14Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#1)
Re: Identifier case folding notes

"Peter Eisentraut" <peter_e@gmx.net> writes:

One disadvantage is that one could no longer have objects that have names
different only by case, but that is probably rare and incredibly stupid and
can be neglected.

Certainly not if you hope to claim being within a mile of spec -- which seems
like the only point of fiddling with this. Breaking this would take as further
from spec-compliance than we are today.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

#15Russell Smith
mr-russ@pws.com.au
In reply to: Andrew Dunstan (#13)
Re: Identifier case folding notes

Andrew Dunstan wrote:

I'm not sure if you've read all the archive history on this. Here are
the pointers from the TODO list:

http://archives.postgresql.org/pgsql-hackers/2004-04/msg00818.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01527.php
http://archives.postgresql.org/pgsql-hackers/2008-03/msg00849.php

The fact is that we have substantial groups of users who want
different things:
. current users who want no change so there is no breakage in existing
apps
. users on other DBs who want Spec compliance
. users on yet other DBs who want case preservation

The last group should not be lightly dismissed - it is quite common
behaviour on MSSQL as well as on MySQL, so we have some incentive to
make this possible to encourage migration.

I'm strongly of the opinion therefore that this should be behaviour
determined at initdb time (can't make it later because of shared
catalogs). I suspect that we won't be able to do all this by simple
transformations in the lexer, unlike what we do now. But I do think
it's worth doing.

cheers

andrew

Hi,

as part of the
http://archives.postgresql.org/pgsql-hackers/2008-03/msg00849.php
thread, I did a reasonable amount of discovery work on making the 3
options andrew presents a reality. As I'm not skilled enough I never
got far enough to make them all work at once. I did however get lower
case and case preservation working.

To make those tow work the catalogs need no changes. Some of the
regressions tests expect case folding, so they did need changing to
operate correctly. I was unable to make the input files to initdb
correctly fold the case of system catalogs for the upper case version.
I'm sure somebody with more experience would not find it as difficult as
I did. Function names tended to be where all the gotchas were. Count()
vs count() vs COUNT() for example.

Once the db was up and running, the issue becomes all the supporting
tools. psql was made to autocomplete with case preservation, I was
going to make pg_dump just quote everything. I then got to the point of
adding a fixed GUC like LC_LOCALE that allows psql to read the case
folding situation and act according. That is where my progress ended.

Attached is what i had worked in. It's a patch against 8.3.1. I know
it's not CVS head, but it is what I was using at the time to experiment.

Regards

Russell

Attachments:

case-folding-patch-WIP.difftext/x-diff; name=case-folding-patch-WIP.diffDownload+204-130
#16Andreas Joseph Krogh
andreak@officenet.no
In reply to: Tom Lane (#12)
Re: Identifier case folding notes

On Wednesday 09 July 2008 00:35:07 Tom Lane wrote:

Andreas Joseph Krogh <andreak@officenet.no> writes:

Right. From a user's perspective 4) sounds best. I often run into problems having keywords as column-names:

None of the proposals on the table will remove the need to use quotes in
that case.

I know, but then tools/frameworks won't fail when they produce queries like SELECT "USER" FROM test, because it sends "USER" and not "user".

--
Andreas Joseph Krogh <andreak@officenet.no>
Senior Software Developer / Manager
------------------------+---------------------------------------------+
OfficeNet AS | The most difficult thing in the world is to |
Karenslyst Allé 11 | know how to do a thing and to watch |
PO. Box 529 Skøyen | somebody else doing it wrong, without |
0214 Oslo | comment. |
NORWAY | |
Tlf: +47 24 15 38 90 | |
Fax: +47 24 15 38 91 | |
Mobile: +47 909 56 963 | |
------------------------+---------------------------------------------+

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#14)
Re: Identifier case folding notes

Am Mittwoch, 9. Juli 2008 schrieb Gregory Stark:

"Peter Eisentraut" <peter_e@gmx.net> writes:

One disadvantage is that one could no longer have objects that have names
different only by case, but that is probably rare and incredibly stupid
and can be neglected.

Certainly not if you hope to claim being within a mile of spec

Which I don't.

-- which seems like the only point of fiddling with this.

No, the point is making more applications run.

Breaking this would take as further from spec-compliance than we are today.

As far as standards compliance goes, I don't think there is a well-defined
distance measure, but if you can come up with one, I would actually be quite
keen on debating this point. :-) In any case, any of this would only be an
option anyway.

#18Tino Wildenhain
tino@wildenhain.de
In reply to: Peter Eisentraut (#1)
Re: Identifier case folding notes

Hi,

Peter Eisentraut wrote:
...

4. Compare the "name" data type in a case-insensitive manner. This would
probably address most problem cases. Again, you can't have objects with names
different in case only. One condition to implementing this would be that this
behavior would have be tied down globally at initdb, because it affects system
indexes and shared catalogs. That might be impractical for some, because
you'd need different instances for different behaviors, especially when you
want to host multiple applications or want to port an affected application to
the native PostgreSQL behavior over time.

The whole stuff as I understand is to fix the behavior with applications
creating objects without quotes and accessing them "QUOTEDUPPERCASE"?

Would a small script fixing the schema by using rename not fix this for
many applications?

T.

#19Bruce Momjian
bruce@momjian.us
In reply to: Tino Wildenhain (#18)
Re: Identifier case folding notes

"Tino Wildenhain" <tino@wildenhain.de> writes:

The whole stuff as I understand is to fix the behavior with applications
creating objects without quotes and accessing them "QUOTEDUPPERCASE"?

Would a small script fixing the schema by using rename not fix this for
many applications?

Well there are other circumstances where this can arise

select "FOO" from (select col as foo from bar)

Not to mention temporary objects which are created and dropped inside
functions.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#20Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#11)
Re: Identifier case folding notes

Tom,

Unfortunately, they almost certainly don't. I'd bet long odds that
what they expect is mysql's traditional behavior,

Nope. They're looking for Oracle, which is spec-complaint since they
wrote that spec.

--Josh