Locale, encoding, sort order confusion

Started by John Guntherover 19 years ago5 messagesgeneral
Jump to latest
#1John Gunther
postgresql@bucksvsbytes.com

I've been reading about locales, encodings, sort orders, the to_ascii
function but I'm more confused than enlightened.

What I want is very simple:
1) I want the database to correctly accept, store, and display
alphabetic characters, including European accented characters, entered
and viewed in HTML
forms.
2) I want sorting to ignore the diacritical marks so that, for example,
u, u-accent, and u-umlaut are all sorted as if they were plain u.
3) I want sorting to ignore non-alphanumerics, letter case, and white
space.

To illustrate, the following data is in sorted order:

St-�mile
stendahl
st �nders
St. Epson

Can someone tell me what combination of PostgreSQL and Linux settings I
need for this? Or point me somewhere that it's well explained. It seems
like a very basic question, but I'm just dense,
I guess. I've tried a half dozen time-consuming configs without success.

Thank you.

John Gunther

#2Peter Eisentraut
peter_e@gmx.net
In reply to: John Gunther (#1)
Re: Locale, encoding, sort order confusion

John Gunther wrote:

Can someone tell me what combination of PostgreSQL and Linux settings
I need for this? Or point me somewhere that it's well explained. It
seems like a very basic question, but I'm just dense,
I guess.

Pretty much any locale (say, en_US for you) with a matching character
set should work. Unless you go out of your way, this should be the
default setting.

I've tried a half dozen time-consuming configs without
success.

Like what?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#3John Gunther
postgresql@bucksvsbytes.com
In reply to: Peter Eisentraut (#2)
Re: Locale, encoding, sort order confusion

Pretty much any locale (say, en_US for you) with a matching character
set should work. Unless you go out of your way, this should be the
default setting.

I've tried a half dozen time-consuming configs without
success.

Like what?

For example, with all LC_* parameters set to "en_US.UTF8", I get the following incorrect "order:by":

B�arn
B�cancour
Beaupr�

B�cancour should be last.

Incidentally, in psql via a putty.exe session, the only character set translation I can find that displays the accented characters is CP437. Does this seem right?

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: John Gunther (#3)
Re: Locale, encoding, sort order confusion

John Gunther wrote:

Pretty much any locale (say, en_US for you) with a matching character
set should work. Unless you go out of your way, this should be the
default setting.

I've tried a half dozen time-consuming configs without
success.

Like what?

For example, with all LC_* parameters set to "en_US.UTF8", I get the
following incorrect "order:by":

B�arn
B�cancour
Beaupr�

Did you initdb with locale en_US.UTF8, and also createdb with encoding
UTF8? While you can certainly choose mismatching values in createdb and
initdb, you shouldn't because it doesn't work. See the docs here:

http://www.postgresql.org/docs/8.1/static/multibyte.html

Important: Although you can specify any encoding you want for a
database, it is unwise to choose an encoding that is not what is
expected by the locale you have selected. The LC_COLLATE and
LC_CTYPE settings imply a particular encoding, and locale-dependent
operations (such as sorting) are likely to misinterpret data that is
in an incompatible encoding.

Since these locale settings are frozen by initdb, the apparent
flexibility to use different encodings in different databases of a
cluster is more theoretical than real. It is likely that these
mechanisms will be revisited in future versions of PostgreSQL.

Incidentally, in psql via a putty.exe session, the only character set
translation I can find that displays the accented characters is CP437. Does
this seem right?

You should probably set client_encoding in the psql session (using
\encoding) if you want to change the charset in putty.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5John Gunther
postgresql@bucksvsbytes.com
In reply to: Alvaro Herrera (#4)
Re: Locale, encoding, sort order confusion

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Alvaro Herrera wrote:
<blockquote cite="mid20060821001057.GB1130@alvh.no-ip.org" type="cite">
<blockquote type="cite">
<pre wrap="">For example, with all LC_* parameters set to "en_US.UTF8", I get the
following incorrect "order:by":

B&eacute;arn
B&eacute;cancour
Beaupr&eacute;
</pre>
</blockquote>
<pre wrap=""><!---->
Did you initdb with locale en_US.UTF8, and also createdb with encoding
UTF8? While you can certainly choose mismatching values in createdb and
initdb, you shouldn't because it doesn't work.</pre>
</blockquote>
That's an interesting question. Are the LC_* variables set by initdb or
createdb? In other words, does their value indicate what initdb
settings I used? If I do a default createdb, will the new database
automatically be consistent with the cluster's initdb?<br>
</body>
</html>