Folding of case of identifiers

Started by Niels Jespersen, over 4 years ago, 2 messages, general

NJN@dst.dk

over 4 years ago

Hello all

According to https://www.postgresql.org/docs/current/sql-syntax-lexical.html, "Key words and unquoted identifiers are case insensitive." And "SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($)."

So far so good. Non-Latin letters are included, which I take to also include the Danish letters æøå/ÆØÅ.

However, name-folding is odd for these letters. Of these three create table statements, the first two succeed and the last one does not (G and g are equivalent, Æ and æ are not).

create table æblegrød (a int, køn text);
create table ÆblegrØd (a int, køn text);
create table ÆbleGrØd (a int, køn text);

Can anyone explain the logic that rules this?

Regards Niels Jespersen

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Niels Jespersen (#1)

Re: Folding of case of identifiers

Niels Jespersen <NJN@dst.dk> writes:

> According to https://www.postgresql.org/docs/current/sql-syntax-lexical.html, "Key words and unquoted identifiers are case insensitive." And "SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($)."
>
> So far so good. Non-Latin letters are included, which I take to also include the Danish letters æøå/ÆØÅ.
>
> However, name-folding is odd for these letters. Of these three create table statements, the first two succeed and the last one does not (G and g are equivalent, Æ and æ are not).

Whether non-ASCII characters get downcased is very context dependent.
You've not mentioned the database encoding or the locale (LC_CTYPE)
setting, but both of those are relevant. Basically, in a single-byte
encoding we'll apply tolower() to identifier characters; but we don't
attempt to case-fold multi-byte characters at all. This logic is pretty
hoary, dating from before Unicode became widespread, but I'd be hesitant
to change it now.

regards, tom lane
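
The rule described in the reply above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not PostgreSQL's actual code: the function name fold_identifier and the SINGLE_BYTE_ENCODINGS set are made up for illustration, Python's str.lower() stands in for the C tolower() call under a matching LC_CTYPE, and the database encoding is passed in as a plain codec name.

```python
# Hypothetical model of unquoted-identifier folding as described in the thread.
# Assumption: these codec names stand for "single-byte database encodings".
SINGLE_BYTE_ENCODINGS = {"latin-1", "iso8859-1", "iso8859-15"}


def fold_identifier(name: str, encoding: str = "utf-8") -> str:
    """Fold an unquoted identifier byte by byte, the way the reply describes."""
    single_byte = encoding in SINGLE_BYTE_ENCODINGS
    folded = bytearray()
    for byte in name.encode(encoding):
        if 0x41 <= byte <= 0x5A:
            # ASCII A-Z is always downcased, whatever the encoding.
            folded.append(byte + 0x20)
        elif single_byte and byte >= 0x80:
            # Stand-in for tolower(): lowercase the one character this byte encodes.
            lowered = bytes([byte]).decode(encoding).lower().encode(encoding, "ignore")
            folded.append(lowered[0] if len(lowered) == 1 else byte)
        else:
            # Bytes of multi-byte characters are left untouched.
            folded.append(byte)
    return folded.decode(encoding)


names = ["æblegrød", "ÆblegrØd", "ÆbleGrØd"]
print([fold_identifier(n, "utf-8") for n in names])
# ['æblegrød', 'ÆblegrØd', 'ÆblegrØd']
print([fold_identifier(n, "latin-1") for n in names])
# ['æblegrød', 'æblegrød', 'æblegrød']
```

Under this model, a UTF-8 database folds the second and third names to the same identifier while the first stays distinct, which is consistent with the behaviour reported in the first message. With a single-byte encoding all three names would fold to the same identifier, so only the first statement would be expected to succeed.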