Bringing PostgreSQL torwards the standard regarding case folding

Started by Shachar Shemeshover 21 years ago19 messages

psql@shemesh.biz

over 21 years ago

I'm opening a new thread, as the previous one was too nested, and
contained too much emotions.

I'll start by my understanding of a summary of the thread so far. The
solution we are seeking would have to satisfy the following conditions:
1. Setting should be on a per-database level. A per-server option is not
good enough, and a per-session option is too difficult to implement,
with no apparent justifiable return.
2. Old applications already working with PG's lowercase folding should
have an option to continue working unmodified for the foreseeable future.

Solutions offered so far, and their status:
1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
2. Dual state. Fold lower or upper. Break if client is broken.
3. Create a database conversion tool to change existing case.

Solution 1:
As currently the case folding is performed disjointed from the actual
use of the identifier, this solution requires quite a big amount of
work. On the other hand, and on second thought, it's main benefit -
gradual transition of applications from one to the other, is not really
necessary once you declare the current behaviour as there to stay.
Existing applications can simply choose to continue using whatever
method they currently use. No need for migration.

As such, I think we can simply state that tri-state migration path
solution can be discarded for the time being.

Solution 2:
Obviously, this is the way to go. We will have a dabase attribute that
states whether things are lower or upper case there.

Solution 3:
(unrelated to the above)
There seems to be some ambiguity about how to handle the translation.
Such a tool seems to require guessing which identifiers are accessed
quoted, unquoted, or both. The last option, of course, will never work.

We may need such a tool, for some projects may wish to transform from
one way to the other. It seems to me, however, that such a tool can wait
a little.

Open issues:
1. What do we do with identifiers in "template1" upon database creation?
2. How do we handle queries to tables belonging the catalog that are
shared between databases?

Observation: on a lowercase folding DB, any identifier that is not
composed only of lowercase characters MUST can be automatically assumed
to be accessed only through quoted mode.

I therefor suggest the following path to a solution:
1. CreateDB will be able to create databases from either type.
2. template1 will be defined to be one or the other. For the sake of
this discussion, let's assume it's lowercase (current situation)
3. CreateDB, upon being asked to create a new DB that has uppercase
folding, will copy over template1, as it currently does.
4. While copying, it will check each identifier. If the identifier is
not lowercase only, it is safe to copy it verbatim.
5. If the identifier is lowercase only, convert it to uppercase only. I
am assuming here that the authors of the client code chose an
uppercase-folding database, so they should know what they are doing when
accessing stuff from the standard offering.
6. I'm not sure what are the shared tables from the catalog. I don't
think it so unreasonable to ask anyone doing catalog work to assume that
catalog entries are case-sensitive. As such, maybe it's best to just
leave the data as is.
7. Column headers, however, will have to have a solution. A point still
open in current design.

I'm hoping this summary helps in furthering the discussion.

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/

Rod Taylor

pg@rbt.ca

over 21 years ago

In reply to: Shachar Shemesh (#1)

Re: Bringing PostgreSQL torwards the standard regarding

5. If the identifier is lowercase only, convert it to uppercase only. I
am assuming here that the authors of the client code chose an
uppercase-folding database, so they should know what they are doing when
accessing stuff from the standard offering.

You've just broken one of my databases.

In one project I quote nearly everything and do so in lower case only.
This was done to ensure portability between PostgreSQL, Oracle, etc. --
but with my preference of lower case names.

If someone copied this database with the wrong case folding option, it
would break a (reasonably) spec compliant application that is regularly
installed on environments where we have little to no control over the
database settings.

I think copied attributes need to be left alone. Train the PostgreSQL
utilities to always quote the identifiers instead.

If you want case to be folded, run an external utility to does a bunch
of ALTER ... RENAMEs.

Shachar Shemesh

psql@shemesh.biz

over 21 years ago

In reply to: Rod Taylor (#2)

Re: Bringing PostgreSQL torwards the standard regarding

Rod Taylor wrote:

5. If the identifier is lowercase only, convert it to uppercase only. I
am assuming here that the authors of the client code chose an
uppercase-folding database, so they should know what they are doing when
accessing stuff from the standard offering.

You've just broken one of my databases.

In one project I quote nearly everything and do so in lower case only.
This was done to ensure portability between PostgreSQL, Oracle, etc. --
but with my preference of lower case names.

I'm not sure you understood me.

First, if we don't convert lower->upper, how can anyone expect the
following query to work:
select lower(id) from table;

Even if you quote everything, you'd still probably have:
select lower("id") from "table";

Noone can expect you to do:
select "lower"("id") from "table";

The problem is that "lower" is defined in template1. If we don't
uppercase it when we create the database, the above won't work. Then
again, I'm fairly sure that the identifiers you placed as lowercase in
your database are not defined by template1.

In short, I don't think this suggestion broke your database.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/

Rod Taylor

pg@rbt.ca

over 21 years ago

In reply to: Shachar Shemesh (#3)

Re: Bringing PostgreSQL torwards the standard regarding

I'm not sure you understood me.

Perhaps not, here is what we have:

\c template_db
CREATE TABLE "example" ( "col" integer);

CREATE DATABASE newdb WITH TEMPLATE template_db UPPERCASE IDENTIFIERS;

\c newdb
SELECT "col" FROM "example";

In short, I don't think this suggestion broke your database.

I understood that since "example", when copied, would turn into
"EXAMPLE" -- but that if it was "Example" it would be copied as
"Example".

Andrew Dunstan

andrew@dunslane.net

over 21 years ago

In reply to: Shachar Shemesh (#1)

Re: Bringing PostgreSQL torwards the standard regarding

Shachar Shemesh wrote:

I'm opening a new thread, as the previous one was too nested, and
contained too much emotions.

I'll start by my understanding of a summary of the thread so far. The
solution we are seeking would have to satisfy the following conditions:
1. Setting should be on a per-database level. A per-server option is
not good enough, and a per-session option is too difficult to
implement, with no apparent justifiable return.

I am not convinced on this point. Why is per-server not good enough? The
obvious place to make these changes seems to me to be during or
immediatly after the bootstrap phase of initdb. It would avoid a host of
later troubles.

2. Old applications already working with PG's lowercase folding should
have an option to continue working unmodified for the foreseeable future.

Solutions offered so far, and their status:
1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
2. Dual state. Fold lower or upper. Break if client is broken.
3. Create a database conversion tool to change existing case.

I don't think we should rush at this. All of these solutions are based
on the existing structures. I have started thinking about a solution
that would involve keeping two versions of catalog names: a canonical
name and a "name as supplied at creation". There would be heaps of
wrinkles, but it might get us where we want to be. But I have not had
time to sort it out in my head yet, let alone make any experiments.
Let's keep getting more ideas.

cheers

andrew

Alvaro Herrera

alvherre@dcc.uchile.cl

over 21 years ago

In reply to: Andrew Dunstan (#5)

Re: Bringing PostgreSQL torwards the standard regarding

On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:

I don't think we should rush at this. All of these solutions are based
on the existing structures. I have started thinking about a solution
that would involve keeping two versions of catalog names: a canonical
name and a "name as supplied at creation".

Why do you want two names? Just keep the original casing, and a boolean
saying if it's quoted or not.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseï¿½ar algo." (Jean B. Say)

Andrew Dunstan

andrew@dunslane.net

over 21 years ago

In reply to: Alvaro Herrera (#6)

Re: Bringing PostgreSQL torwards the standard regarding

Alvaro Herrera wrote:

On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:

I don't think we should rush at this. All of these solutions are based
on the existing structures. I have started thinking about a solution
that would involve keeping two versions of catalog names: a canonical
name and a "name as supplied at creation".

Why do you want two names? Just keep the original casing, and a boolean
saying if it's quoted or not.

Because I was not just considering the upper/lower issue - refer to the
thread that relates to unquoted names with case preserved.

cheers

andrew

Andrew Dunstan

andrew@dunslane.net

over 21 years ago

In reply to: Andrew Dunstan (#7)

Re: Bringing PostgreSQL torwards the standard regarding

I wrote:

Alvaro Herrera wrote:

On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:

I don't think we should rush at this. All of these solutions are
based on the existing structures. I have started thinking about a
solution that would involve keeping two versions of catalog names: a
canonical name and a "name as supplied at creation".

Why do you want two names? Just keep the original casing, and a boolean
saying if it's quoted or not.

Because I was not just considering the upper/lower issue - refer to
the thread that relates to unquoted names with case preserved.

Sorry - brain malfunction - yes, original casing plus boolean would
work. In effect you could derive the canonical form from those two.

cheers

andrew

Dennis Bjorklund

db@zigo.dhs.org

over 21 years ago

In reply to: Andrew Dunstan (#8)

Re: Bringing PostgreSQL torwards the standard regarding

On Sun, 25 Apr 2004, Andrew Dunstan wrote:

Why do you want two names? Just keep the original casing, and a boolean
saying if it's quoted or not.

Sorry - brain malfunction - yes, original casing plus boolean would
work. In effect you could derive the canonical form from those two.

Say that you have this in the table with the identifier

name quoted
---- ------
Foo False

Now you want to add the name "FOO"

FOO True

should you be allowed or is it a clash with the above?

What if you also add "foo"

foo True

One of these two should be forbidden. And what about a quoted "FOO":

FOO False
FOO True

This case says it is not enough with an expressional unique index on
(upper(name), quoted). It would be easier to enforce uniqueness if one
store both the converted name and the original name:

name orig_name
---- ---------
FOO NULL <-- quoted one
FOO FOO <-- unquoted one

and the first case

FOO Foo <-- unquoted
FOO NULL <-- clashes with the first, good
foo NULL <-- no clash, works fine

With this one can always use upper case translation as per sql spec and
psql can optionally show all unquoted identifiers as upper, lower or mixed
case.

Then we also have the INFORMATION_SCHEMA that should show the names in
UPPER CASE when not quoted, this since applications that are written for
the standard might depend on that (probably no application do today but it
would be a valid case of use of the information schema).

--
/Dennis Bjï¿½rklund

#10

Andrew Dunstan

andrew@dunslane.net

over 21 years ago

In reply to: Dennis Bjorklund (#9)

Re: Bringing PostgreSQL torwards the standard regarding

Dennis Bjorklund said:

On Sun, 25 Apr 2004, Andrew Dunstan wrote:

Why do you want two names? Just keep the original casing, and a
boolean saying if it's quoted or not.

Sorry - brain malfunction - yes, original casing plus boolean would
work. In effect you could derive the canonical form from those two.

Dennis,

Ideas still swirling a bit, but I was thinking that there would be a per
database flag (which could indeed be set at db creation time) which would
specify the flavor of canonical names being used - upper, or lower, or we
could also consider exact (i.e. full case sensitivity, which I seem to
recall is a mode that SQLServer allows, possibly even the default, but my
memory could be rusty).

The canonical form of an unquoted name is dictated by this setting, while
the canonical form of a quoted name is the name as supplied. Two names
clash if their canonical forms are identical, quoted or not.

Assuming that we have a database with the flag set to use upper case
canonical names, as per the standard, then ...

Say that you have this in the table with the identifier

name quoted
---- ------
Foo False

Now you want to add the name "FOO"

FOO True

should you be allowed or is it a clash with the above?

It's a clash. The canonical for of both is "FOO"

What if you also add "foo"

foo True

No clash - "FOO" <> "foo"

One of these two should be forbidden. And what about a quoted "FOO":

FOO False
FOO True

clash

This case says it is not enough with an expressional unique index on
(upper(name), quoted). It would be easier to enforce uniqueness if one
store both the converted name and the original name:

The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
(name) END.

The advantage of using a boolean is that a lot less work would need to be
done to use whatever flag was being used for the DB. Possibly a reindex
after the files are copied. It might fail on some highly pathological
cases, but should never fail on our standard template databases.

name orig_name
---- ---------
FOO NULL <-- quoted one
FOO FOO <-- unquoted one

and the first case

FOO Foo <-- unquoted
FOO NULL <-- clashes with the first, good foo
NULL <-- no clash, works fine

With this one can always use upper case translation as per sql spec and
psql can optionally show all unquoted identifiers as upper, lower or
mixed case.

My thought was that there would be a user setting that would allow
resultset labels to use either canonical or literal names.

Then we also have the INFORMATION_SCHEMA that should show the names in
UPPER CASE when not quoted, this since applications that are written
for the standard might depend on that (probably no application do today
but it would be a valid case of use of the information schema).

I see 2 possibilities: either use the upper case canonical setting I
envisioned above, or change the information schema setup to force upper
case labels via AS clauses in the views.

cheers

andrew

#11

Dennis Bjorklund

db@zigo.dhs.org

over 21 years ago

In reply to: Andrew Dunstan (#10)

Re: Bringing PostgreSQL torwards the standard regarding

On Mon, 26 Apr 2004, Andrew Dunstan wrote:

Ideas still swirling a bit

Sure, I'm thinking in public as well. Not something you want to do if you
are afraid of being wrong and showing it :-) But I'm not.

The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
(name) END.

That's simple enough (and pretty straight forward).

--
/Dennis Bjï¿½rklund

#12

Josh Berkus

josh@agliodbs.com

over 21 years ago

In reply to: Shachar Shemesh (#1)

Re: Bringing PostgreSQL torwards the standard regarding case folding

Shachar,

I've been giving this some more thought. Here are my contributions:

1. Setting should be on a per-database level. A per-server option is not
good enough, and a per-session option is too difficult to implement,
with no apparent justifiable return.

I disagree with this. I think doing case-folding per database would be
preposterously difficult, and that per-server is adequate. Per database
settings bring up a whole raft of logical conflicts, particularly around the
system catalogs and dblink, that aren't necessarily worth navigating.

I also didn't follow the discussion of why a client-side implementation was
technically impossible; this seems like the most obvious course to me, and to
have *considerable* benefit. It's also consistent with our other statement
variables, such as datestyle, which are all client-side, per-session
settings.

A server-side implementation would possibly reqire touching every single
source code file in Postgres, something that would justify a lot of effort to
avoid.

2. Old applications already working with PG's lowercase folding should
have an option to continue working unmodified for the foreseeable future.

Si.

1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.

Can't see this being possible.

2. Dual state. Fold lower or upper. Break if client is broken.

Best, I think. But it should be client-side.

3. Create a database conversion tool to change existing case.

No thanks.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Import Notes

Reply to msg id not found: auto-000004944907@davinci.ethosmedia.comReference msg id not found: auto-000004944907@davinci.ethosmedia.com | Resolved by subject fallback

#13

Shachar Shemesh

psql@shemesh.biz

over 21 years ago

In reply to: Josh Berkus (#12)

Re: Bringing PostgreSQL torwards the standard regarding case folding

Josh Berkus wrote:

I also didn't follow the discussion of why a client-side implementation was
technically impossible; this seems like the most obvious course to me, and to
have *considerable* benefit. It's also consistent with our other statement
variables, such as datestyle, which are all client-side, per-session
settings.

But they are not client side, are they? The date is formatted by the
server. The client is simply receiving whatever datestyle itselected.
That is, assuming I understand the behaviour correctly.

I would catagorize it as a server side per-session configuration. If
that's what you mean, we thought it was too insane because it doesn't
explain how you are supposed to handle all the catalog and other stuff
where you are implictly assuming quoting is not necessary.

I'l reiterate the example. Do you really expect to have to write
select "upper"("field") from "table"
and should it be "upper" or "UPPER"?

A server-side implementation would possibly reqire touching every single
source code file in Postgres, something that would justify a lot of effort to
avoid.

I think the concensus was that the runtime part was aprox. four lines
where the case folding currently takes place. Obviously, you would have
to get a var, and propogate that var to that place, but not actually
change program flow.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/

#14

Josh Berkus

josh@agliodbs.com

over 21 years ago

In reply to: Shachar Shemesh (#13)

Re: Bringing PostgreSQL torwards the standard regarding case folding

Shachar,

I think the concensus was that the runtime part was aprox. four lines
where the case folding currently takes place. Obviously, you would have
to get a var, and propogate that var to that place, but not actually
change program flow.

That's only if you ignore the system catalogs entirely, which maybe you're
prepared to do. If you want to change case folding for the system catalogs,
though, you'll need to update code in thousands of places, becuase the
back-end code is expecting lower-case identifiers ....

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#15

Shachar Shemesh

psql@shemesh.biz

over 21 years ago

In reply to: Josh Berkus (#14)

Re: Bringing PostgreSQL torwards the standard regarding case folding

Josh Berkus wrote:

Shachar,

I think the concensus was that the runtime part was aprox. four lines
where the case folding currently takes place. Obviously, you would have
to get a var, and propogate that var to that place, but not actually
change program flow.

That's only if you ignore the system catalogs entirely, which maybe you're
prepared to do. If you want to change case folding for the system catalogs,
though, you'll need to update code in thousands of places, becuase the
back-end code is expecting lower-case identifiers ....

IF you want per session setting, yes.

If you want per database setting, you only need to worry about the
shared catalogs

If you want server wide setting, you just create the catalogs with the
correct name, and get it over with.

That's why I said that per-session setting seems like too much trouble.

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/

#16

scott.marlowe

scott.marlowe@ihs.com

over 21 years ago

In reply to: Josh Berkus (#14)

Re: Bringing PostgreSQL torwards the standard regarding

On Mon, 26 Apr 2004, Josh Berkus wrote:

Shachar,

I think the concensus was that the runtime part was aprox. four lines
where the case folding currently takes place. Obviously, you would have
to get a var, and propogate that var to that place, but not actually
change program flow.

That's only if you ignore the system catalogs entirely, which maybe you're
prepared to do. If you want to change case folding for the system catalogs,
though, you'll need to update code in thousands of places, becuase the
back-end code is expecting lower-case identifiers ....

As someone who has discussed this with Tom in the past, I seem to remember
that there were major issues with handling the system catalogs, because
internally, the backends treat the identifiers as if they have already
been quoted.

I think the answer to all of this would require a lot of code being
touched to either make it case fold, costing performance, or the
replacement of the default lower cased catalog with upper cased catalog.

i.e. no simple switch setting, but an initdb option that would be set like
locale currently is, for the life of the cluster.

A more comprehensive solution, one which allowed switching from upper
folding to lower folding to no folding, to case insensitive, or some
subset of those possibilities results in

a: slower backend performance, due to folding case for system catalogs
b: touching a helluva lot of backend code to make it possible to fold up
or down.

I'm not 100% sure on this all, but that seems to be the point Tom and I
came to in our discussion, and neither of the two solutions seemed very
good at the time.

#17

Shachar Shemesh

psql@shemesh.biz

over 21 years ago

In reply to: scott.marlowe (#16)

Re: Bringing PostgreSQL torwards the standard regarding

scott.marlowe wrote:

I think the answer to all of this would require a lot of code being
touched to either make it case fold, costing performance, or the
replacement of the default lower cased catalog with upper cased catalog.

I'm not the one to decide, but it seems to me that this is not a good
time to enter such a major change, and may never be. How terrible will
it be if the system is per server (i.e. - one given to initdb?).

A more comprehensive solution, one which allowed switching from upper
folding to lower folding to no folding, to case insensitive, or some
subset of those possibilities results in

a: slower backend performance, due to folding case for system catalogs
b: touching a helluva lot of backend code to make it possible to fold up
or down.

Well, if it's only the shared catalogs that are affected, this may not
be such a big problem (very hopeful tone?)

I'm not 100% sure on this all, but that seems to be the point Tom and I
came to in our discussion, and neither of the two solutions seemed very
good at the time.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/

#18

Hannu Krosing

hannu@tm.ee

over 21 years ago

In reply to: scott.marlowe (#16)

Re: Bringing PostgreSQL torwards the standard regarding

scott.marlowe kirjutas T, 27.04.2004 kell 20:43:

As someone who has discussed this with Tom in the past, I seem to remember
that there were major issues with handling the system catalogs, because
internally, the backends treat the identifiers as if they have already
been quoted.

why not leave it at that and mandate their *external* doublequoted use ?

not only for system catalogs but also for system columns.

I already have had negative experience with a proprietary tool not being
able to use postgres (over ODBC) because it has a system column called
tmin (or was it tid). If things were case-folded to upper automatically,
there would not have been such problem.

--------------
Hannu

#19

Andrew Dunstan

andrew@dunslane.net

over 21 years ago

In reply to: Dennis Bjorklund (#9)

Re: Bringing PostgreSQL torwards the standard regarding

Jan Wieck wrote:

This is the reason why the setting has to be at least per database and
cannot be changed after DB creation.

I think there's overwhelming consensus that db creation time is the
latest you can specify the canonical name setting for it. There's
probably a good case to be made for it to be when you initdb, so that it
is set as expected for shared tables.

(Is anyone actually doing anything on this?)

cheers

andrew

Show quoted text

What has to change is the behaviour of the name type operators, which
will automatically change the uniqueness behaviour of the catalog
indexes.

In an UPPERCASE database

foo/Foo/FOO false = FOO true

In a lowercase database

foo/Foo/FOO false = foo true

In both of them

foo/Foo/FOO false <> Foo true
foo/Foo/FOO false = foo/Foo/FOO false

Jan

Dennis Bjorklund wrote:

On Sun, 25 Apr 2004, Andrew Dunstan wrote:

Why do you want two names? Just keep the original casing, and a

boolean

saying if it's quoted or not.

Sorry - brain malfunction - yes, original casing plus boolean would
work. In effect you could derive the canonical form from those two.

Say that you have this in the table with the identifier

name quoted
---- ------
Foo False

Now you want to add the name "FOO"

FOO True

should you be allowed or is it a clash with the above?

What if you also add "foo"

foo True

One of these two should be forbidden. And what about a quoted "FOO":

FOO False
FOO True

This case says it is not enough with an expressional unique index on
(upper(name), quoted). It would be easier to enforce uniqueness if one
store both the converted name and the original name:

name orig_name
---- ---------
FOO NULL <-- quoted one
FOO FOO <-- unquoted one

and the first case

FOO Foo <-- unquoted
FOO NULL <-- clashes with the first, good
foo NULL <-- no clash, works fine

With this one can always use upper case translation as per sql spec and
psql can optionally show all unquoted identifiers as upper, lower or
mixed
case.

Then we also have the INFORMATION_SCHEMA that should show the names in
UPPER CASE when not quoted, this since applications that are written for
the standard might depend on that (probably no application do today
but it
would be a valid case of use of the information schema).

Import Notes

Reply to msg id not found: 40A242F7.7030608@Yahoo.com