Question regarding UTF-8 data and "C" collation on definition of field of table

Started by Dionisis Kontominasabout 3 years ago11 messagesgeneral

dkontominas@gmail.com

about 3 years ago

Hello all,

I have a question regarding the definition of the type of a character
field in a table and more specifically about its collation and UTF-8
characters and strings.

Let's say that the definition is for example as follows:

name character varying(8) COLLATE pg_catalog."C" NOT NULL

and also assume that the database default encoding is UTF8 and also the
Collate and Ctype is "C"". I plan to store strings of various languages in
this field.

Are these the correct settings that I should have used on creation of
the database?.

Thank you in Advance!

Kindest regards,

Dionisis Kontominas

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Dionisis Kontominas (#1)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Dionisis Kontominas <dkontominas@gmail.com> writes:

Let's say that the definition is for example as follows:
name character varying(8) COLLATE pg_catalog."C" NOT NULL
and also assume that the database default encoding is UTF8 and also the
Collate and Ctype is "C"". I plan to store strings of various languages in
this field.

Are these the correct settings that I should have used on creation of
the database?.

Well, it won't crash or anything, but sorting will be according
to byte-by-byte values. So the sort order of non-ASCII text is
likely to look odd. How much do you care about that?

regards, tom lane

Dionisis Kontominas

dkontominas@gmail.com

about 3 years ago

In reply to: Tom Lane (#2)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Hello Tom,

Thank you for your response.

I suppose that affects the outcome of ORDER BY clauses on the field,
along with the content of the indexes. Is this right?

Assuming that the requirement exists, to store UTF-8 characters on a
field that can be from multiple languages, and the database default
encoding is UTF8 which is the right thing I suppose (please verify), what
do you think should be the values of the Collation and Ctype for the
database to behave correctly? I could not find something specific in the
documentation.

What I did find interesting though is the below statement:

24.2.2.1. Standard Collations
"Additionally, the SQL standard collation name ucs_basic is available for
encoding UTF8. It is equivalent to C and sorts by Unicode code point."

Is this the right collation in the creation of the database in this use
case? If so, what would be the corresponding suitable Ctype?

Regards,
Dionisis

On Mon, 6 Feb 2023 at 00:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Dionisis Kontominas <dkontominas@gmail.com> writes:

Let's say that the definition is for example as follows:
name character varying(8) COLLATE pg_catalog."C" NOT NULL
and also assume that the database default encoding is UTF8 and also the
Collate and Ctype is "C"". I plan to store strings of various languages

in

this field.

Are these the correct settings that I should have used on creation of
the database?.

Well, it won't crash or anything, but sorting will be according
to byte-by-byte values. So the sort order of non-ASCII text is
likely to look odd. How much do you care about that?

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Dionisis Kontominas (#3)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Dionisis Kontominas <dkontominas@gmail.com> writes:

I suppose that affects the outcome of ORDER BY clauses on the field,
along with the content of the indexes. Is this right?

Yeah.

Assuming that the requirement exists, to store UTF-8 characters on a
field that can be from multiple languages, and the database default
encoding is UTF8 which is the right thing I suppose (please verify), what
do you think should be the values of the Collation and Ctype for the
database to behave correctly?

Um ... so define "correct". If you have a mishmash of languages in the
same column, it's likely that they have conflicting rules about sorting,
and there may be no ordering that's not surprising to somebody.

If there's a predominant language in the data, selecting a collation
matching that seems like your best bet. Otherwise, maybe you should
just shrug your shoulders and stick with C collation. It's likely
to be faster than any alternative.

regards, tom lane

Dionisis Kontominas

dkontominas@gmail.com

about 3 years ago

In reply to: Tom Lane (#4)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Hi Tom,

1. Regarding the different languages in the same column, that is normal
if the column is a UTF-8 one, i.e. should be able to hold for example
English, Greek and Chinese characters. In this case what is the best
approach to define the collation and lctype of the column? Either C
or ucs_basic maybe or something else or it does not matter, the characters
would be stored correctly as long as the database is UTF8 encoding?
2. Is there a real difference between C and ucs_basic collation/lctype?
I have looked at the documentation and on the internet and most information
converges that they are actually the same as behavior is concerned;
actually no difference. What I suspect though as a non-similar is that C
after the ASCII part sorts the characters according to byte code point,
whereas the ucs_basic sorts those characters according to the Unicode code
point which may be different I suppose. Can you confirm?
3. In case the previous is correct I think that the most successful
collation for unicode columns would e to set the collation to
ucs_basic collation type and lctype accordingly if exists.

Regards,
Dionisis

On Mon, 6 Feb 2023 at 01:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Dionisis Kontominas <dkontominas@gmail.com> writes:

I suppose that affects the outcome of ORDER BY clauses on the field,
along with the content of the indexes. Is this right?

Yeah.

Assuming that the requirement exists, to store UTF-8 characters on a
field that can be from multiple languages, and the database default
encoding is UTF8 which is the right thing I suppose (please verify), what
do you think should be the values of the Collation and Ctype for the
database to behave correctly?

Um ... so define "correct". If you have a mishmash of languages in the
same column, it's likely that they have conflicting rules about sorting,
and there may be no ordering that's not surprising to somebody.

If there's a predominant language in the data, selecting a collation
matching that seems like your best bet. Otherwise, maybe you should
just shrug your shoulders and stick with C collation. It's likely
to be faster than any alternative.

regards, tom lane

Ron

ronljohnsonjr@gmail.com

about 3 years ago

In reply to: Dionisis Kontominas (#1)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Why are you specifying the collation to be "C" when the default db encoding
is UTF8, and UTF-8 has Greek, Chinese and English encodings?

On 2/5/23 17:08, Dionisis Kontominas wrote:

Hello all,

I have a question regarding the definition of the type of a character
field in a table and more specifically about its collation and UTF-8
characters and strings.

Let's say that the definition is for example as follows:

name character varying(8) COLLATE pg_catalog."C" NOT NULL

and also assume that the database default encoding is UTF8 and also the
Collate and Ctype is "C"". I plan to store strings of various languages in
this field.

Are these the correct settings that I should have used on creation of
the database?.

Thank you in Advance!

Kindest regards,

Dionisis Kontominas

--
Born in Arizona, moved to Babylonia.

Dionisis Kontominas

dkontominas@gmail.com

about 3 years ago

In reply to: Ron (#6)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Because if I don't specify the collation/lctype it seems to get the default
from the OS, which in my case is : English_Netherlands.1252 (database
encoding UTF8). That might not be best for truly unicode content columns,
so I investigated the "C" option, which also seems not to work; might be
worse.

To reframe my question, when you expect multilingual data in a column and
the database encoding is utf8, which seems to accommodate the need for
storage, what could be considered as best practice (if it can exist really)
for collation and lctype?

On Mon, 6 Feb 2023 at 01:57, Ron <ronljohnsonjr@gmail.com> wrote:

Show quoted text

Why are you specifying the collation to be "C" when the default db
encoding
is UTF8, and UTF-8 has Greek, Chinese and English encodings?

On 2/5/23 17:08, Dionisis Kontominas wrote:

Hello all,

I have a question regarding the definition of the type of a character
field in a table and more specifically about its collation and UTF-8
characters and strings.

Let's say that the definition is for example as follows:

name character varying(8) COLLATE pg_catalog."C" NOT NULL

and also assume that the database default encoding is UTF8 and also the
Collate and Ctype is "C"". I plan to store strings of various languages

in

this field.

Are these the correct settings that I should have used on creation of
the database?.

Thank you in Advance!

Kindest regards,

Dionisis Kontominas

--
Born in Arizona, moved to Babylonia.

Peter Geoghegan

pg@bowt.ie

about 3 years ago

In reply to: Tom Lane (#4)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

On Sun, Feb 5, 2023 at 4:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If there's a predominant language in the data, selecting a collation
matching that seems like your best bet. Otherwise, maybe you should
just shrug your shoulders and stick with C collation. It's likely
to be faster than any alternative.

FWIW there are certain "compromise locales" supported by ICU/CLDR.
These include "English (Europe)", and, most notably, EOR (European
Ordering Rules):

https://en.wikipedia.org/wiki/European_ordering_rules

I'm not sure how widely used those are. EOR seems to have been
standardized by the EU or by an adjacent institution, so not sure how
widely used it really is.

It's also possible to use a custom collation with ICU, which is almost
infinitely flexible:

http://www.unicode.org/reports/tr10/#Customization

As an example, the rules about the relative ordering of each script
can be changed this way. There is also something called merged
tailorings.

The OP should see the Postgres ICU docs for hints on how to use these
facilities to make a custom collation that matches whatever their
requirements are:

https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING

--
Peter Geoghegan

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Dionisis Kontominas (#5)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Dionisis Kontominas <dkontominas@gmail.com> writes:

1. Regarding the different languages in the same column, that is normal
if the column is a UTF-8 one, i.e. should be able to hold for example
English, Greek and Chinese characters. In this case what is the best
approach to define the collation and lctype of the column? Either C
or ucs_basic maybe or something else or it does not matter, the characters
would be stored correctly as long as the database is UTF8 encoding?

The characters will be stored correctly, yes. Collation only affects
sort order. Ctype affects some other functions like upper/lowercase
folding and what is considered a "letter" in regexps. If you use "C"
for lc_ctype then only ASCII letters will be folded or recognized as
letters.

2. Is there a real difference between C and ucs_basic collation/lctype?

No, at least not in Postgres.

regards, tom lane

#10

Peter J. Holzer

hjp-pgsql@hjp.at

about 3 years ago

In reply to: Ron (#6)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

On 2023-02-05 18:57:13 -0600, Ron wrote:

Why are you specifying the collation to be "C" when the default db encoding
is UTF8, and UTF-8 has Greek, Chinese and English encodings?

C is equally bad for Greek, Chinese and English ;-)

#11

Jehan-Guillaume de Rorthais

jgdr@dalibo.com

about 3 years ago

In reply to: Peter Geoghegan (#8)

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

On Sun, 5 Feb 2023 17:14:44 -0800
Peter Geoghegan <pg@bowt.ie> wrote:

...

The OP should see the Postgres ICU docs for hints on how to use these
facilities to make a custom collation that matches whatever their
requirements are:

https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING

As you are talking about ICU customization to match whatever the requirement we
want, we were wondering if this would be that easy with ICU to build/create such
custom and odd collation (and not just move numbers after latin)? Even
being able to order letter by letter?

For the record, I helped on an issue last week to sort data using the ebcdic
order. ICU was just a pain, especially with this buggy and annoying bug when
sorting numbers after letters[1]remember ? /messages/by-id/20200903105727.064665ce@firost and the fact that it takes whatever we feed it
with without complaining for badly formed or impossible collation. We just gave
up.

One of our recommendation was to write a glibc collation file, built/installed
it using localedef and "import" it in PostgreSQL using
pg_import_system_collations(). The customer actually did it and it works like a
charm.

Regards,

[1]: remember ? /messages/by-id/20200903105727.064665ce@firost
/messages/by-id/20200903105727.064665ce@firost