More message encoding woes

Started by Heikki Linnakangas · about 17 years ago · 39 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

latin1db=# SELECT version();
version

-----------------------------------------------------------------------------------
PostgreSQL 8.3.7 on i686-pc-linux-gnu, compiled by GCC gcc (Debian
4.3.3-5) 4.3.3
(1 row)

latin1db=# SELECT name, setting FROM pg_settings where name like 'lc%'
OR name like '%encoding';
name | setting
-----------------+---------
client_encoding | utf8
lc_collate | C
lc_ctype | C
lc_messages | es_ES
lc_monetary | C
lc_numeric | C
lc_time | C
server_encoding | LATIN1
(8 rows)

latin1db=# SELECT * FROM foo;
ERROR: no existe la relaci�n �foo�

The accented characters are garbled. When I try the same with a database
that's in UTF8 in the same cluster, it works:

utf8db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR
name like '%encoding';
name | setting
-----------------+---------
client_encoding | UTF8
lc_collate | C
lc_ctype | C
lc_messages | es_ES
lc_monetary | C
lc_numeric | C
lc_time | C
server_encoding | UTF8
(8 rows)

utf8db=# SELECT * FROM foo;
ERROR: no existe la relación «foo»

What is happening is that gettext() returns the message in the encoding
determined by LC_CTYPE, while we expect it to return it in the database
encoding. Starting with PG 8.3 we enforce that the encoding specified in
LC_CTYPE matches the database encoding, but not for the C locale.

In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
which fixes that, but we only do it on Windows. In earlier versions we
called it on all platforms, but only for UTF-8. It seems that we should
call bind_textdomain_codeset on all platforms and all encodings.
However, there seems to be a reason why we only do it for Windows on CVS
HEAD: we need a mapping from our encoding ID to the OS codeset name, and
the OS codeset names vary.
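
For illustration, the kind of call in question looks roughly like this. This is a minimal standalone sketch, not backend code: the "postgres" domain and locale directory are real gettext conventions, but the hard-coded "LATIN1" stands in for whatever name the backend would derive from the database encoding ID.

```c
#include <libintl.h>

/* Sketch: pin gettext's output codeset to the database encoding so it
 * no longer follows LC_CTYPE.  With no matching catalog installed (or
 * in the C locale), gettext() just echoes the untranslated msgid. */
static const char *pin_message_codeset(const char *db_encoding_name)
{
    bindtextdomain("postgres", "/usr/share/locale");
    bind_textdomain_codeset("postgres", db_encoding_name);
    textdomain("postgres");
    return gettext("syntax error");
}
```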

How can we make this more robust?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
which fixes that, but we only do it on Windows. In earlier versions we
called it on all platforms, but only for UTF-8. It seems that we should
call bind_textdomain_codeset on all platforms and all encodings.

Yes, this problem has been recognized for some time.

However, there seems to be a reason why we only do it for Windows on CVS
HEAD: we need a mapping from our encoding ID to the OS codeset name, and
the OS codeset names vary.

How can we make this more robust?

One possibility is to assume that the output of nl_langinfo(CODESET)
will be recognized by bind_textdomain_codeset(). Whether that actually
works can only be determined by experiment.
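
The probe itself is a one-liner (standalone sketch; whether its result is acceptable to bind_textdomain_codeset is exactly the experiment in question):

```c
#include <langinfo.h>
#include <locale.h>

/* Ask the C library which codeset the current LC_CTYPE implies,
 * e.g. "UTF-8", or "ANSI_X3.4-1968" for the C locale on glibc. */
static const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");    /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}
```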

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds. The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name. (I guess we could look at the source
code.)

regards, tom lane

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: More message encoding woes

Tom Lane wrote:

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds. The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name. (I guess we could look at the source
code.)

Unfortunately it doesn't give any error. The value passed to it is just
stored, and isn't used until gettext(). Quick testing shows that if you
give an invalid encoding name, gettext will simply refrain from
translating anything and revert to English.

We could exploit that to determine if the codeset name we gave
bind_textdomain_codeset was valid: pick a string that is known to be
translated in all translations, like "syntax error", and see if
gettext("syntax error") returns the original string. Something along the
lines of:

const char *teststring = "syntax error";
const struct encoding_match *m;

for (m = encoding_match_list; m->system_enc_name; m++)
{
    if (m->pg_enc_code != GetDatabaseEncoding())
        continue;
    bind_textdomain_codeset("postgres", m->system_enc_name);
    if (gettext(teststring) != teststring)
        break;                  /* found! */
}

This feels rather hacky, but if we only do that with the combination of
LC_CTYPE=C and LC_MESSAGES=other than C that we have a problem with, I
think it would be ok. The current behavior is highly unlikely to give
correct results, so I don't think we can do much worse than that.

Another possibility is to just refrain from translating anything if
LC_CTYPE=C. If the above loop fails to find anything that works, that's
what we should fall back to IMHO.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds. The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name. (I guess we could look at the source
code.)

Unfortunately it doesn't give any error.

(Man, why are the APIs in this problem space so universally awful?)

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

regards, tom lane

#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#4)
Re: More message encoding woes

Tom Lane wrote:

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess. We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

regards, tom lane

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: More message encoding woes

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess. We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough. The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

What we need is an API equivalent to "iconv --list", but I'm not seeing
one :-(. Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...
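
A loose comparison that ignores case and separator characters could bridge most of those spelling differences. A standalone sketch (the helper name is made up here, not an existing API):

```c
#include <ctype.h>

/* Compare two encoding names ignoring case and any non-alphanumeric
 * separators, so that "UTF-8", "utf8" and "UTF_8" all match. */
static int enc_names_match(const char *a, const char *b)
{
    for (;;)
    {
        while (*a && !isalnum((unsigned char) *a))
            a++;
        while (*b && !isalnum((unsigned char) *b))
            b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;    /* match only if both are exhausted */
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
}
```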

regards, tom lane

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#8)
Re: More message encoding woes

Tom Lane wrote:

What we need is an API equivalent to "iconv --list", but I'm not seeing
one :-(.

There's also "locale -m". Looking at its implementation, it just lists
what's in /usr/share/i18n/charmaps. Not too portable either...

Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...

And doing that at every backend startup is too slow.

I would be happy to just revert to English if the OS doesn't recognize
the name we use for the encoding. What sucks most about that is that the
user has no way to specify the right encoding name even if he knows it.
I don't think we want to introduce a new GUC for that.

One idea is to extract the encoding from LC_MESSAGES. Then call
pg_get_encoding_from_locale() on that and check that it matches
server_encoding. If it does, great, pass it to
bind_textdomain_codeset(). If it doesn't, throw an error.

It stretches the conventional meaning of LC_MESSAGES/LC_CTYPE a bit,
since LC_CTYPE usually specifies the codeset to use, but I think it's
quite intuitive.
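
The first step of that idea is just string surgery on the locale name. A sketch, under the assumption that locale names follow the usual "language_territory.codeset" pattern (a real version would also strip any "@modifier" suffix before handing the result to pg_get_encoding_from_locale()):

```c
#include <string.h>

/* Return the codeset suffix of a locale name like "fi_FI.ISO8859-1",
 * or NULL if the name carries no explicit codeset (e.g. "C"). */
static const char *locale_codeset(const char *locale_name)
{
    const char *dot = strchr(locale_name, '.');
    return dot ? dot + 1 : NULL;
}
```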

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#8)
Re: More message encoding woes

On Mon, 30 Mar 2009 at 14:04 -0400, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough. The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

The same problem exists with collations when you try to restore a
database on a different OS. :(

Zdenek

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#9)
Re: More message encoding woes

Heikki Linnakangas wrote:

One idea is to extract the encoding from LC_MESSAGES. Then call
pg_get_encoding_from_locale() on that and check that it matches
server_encoding. If it does, great, pass it to
bind_textdomain_codeset(). If it doesn't, throw an error.

I tried to implement this but it gets complicated. First of all, we can
only throw an error when lc_messages is set interactively. If it's set
in postgresql.conf, it might be valid for some databases but not for
others with a different encoding. And that makes a per-user lc_messages
setting quite hard too.

Another complication is what to do if e.g. plpgsql or a 3rd-party module
has called pg_bindtextdomain while lc_messages=C and we don't yet know
the system name for the database encoding, and you later set
lc_messages='fi_FI.iso8859-1' in a latin1 database. In order to
retroactively set the codeset, we'd have to remember all the calls to
pg_bindtextdomain. Not impossible, for sure, but more work.

I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it just works, and it doesn't seem that complicated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#11)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it just works, and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation? OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all". Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context. (I can see "syntax error" being
problematic in some translations, since translators will know it is
always just a fragment of a larger message ...)

regards, tom lane

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#12)
Re: More message encoding woes

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it just works, and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation? OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all". Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.

Yep, something like that. There seems to be a magic empty string
translation at the beginning of every po file that returns the
meta-information about the translation, like translation author and
date. Assuming that works reliably, I'll use that.
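
The check would rely on the fact that msgid "" maps to the catalog's header entry. A sketch (the helper name is made up; with no catalog loaded, or in the C locale, gettext("") just echoes the empty msgid back):

```c
#include <libintl.h>

/* Returns 1 if a translation catalog answered for the current text
 * domain: msgid "" maps to the PO header ("Project-Id-Version: ...",
 * "Content-Type: ...", etc.), which is non-empty, whereas an unloaded
 * catalog returns the empty msgid unchanged. */
static int catalog_answered(void)
{
    const char *header = gettext("");
    return header[0] != '\0';
}
```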

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#13)
Re: More message encoding woes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.

Yep, something like that. There seems to be a magic empty string
translation at the beginning of every po file that returns the
meta-information about the translation, like translation author and
date. Assuming that works reliably, I'll use that.

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding. I suppose that would result in
failure, when we'd prefer it not to. A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

regards, tom lane

#15 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#14)
Re: More message encoding woes

Tom Lane wrote:

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding. I suppose that would result in
failure, when we'd prefer it not to. A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#15)
Re: More message encoding woes

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding. I suppose that would result in
failure, when we'd prefer it not to. A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.

regards, tom lane

#17 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#8)
Re: More message encoding woes

On Monday 30 March 2009 21:04:00 Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough. The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

I seem to recall that the encoding names are normalized by the C library
somewhere, but I can't find the documentation now. It might be worth trying
anyway -- the above might not in fact be a problem.

#18 Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#5)
Re: More message encoding woes

On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:

Tom Lane wrote:

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#18)
Re: More message encoding woes

Peter Eisentraut <peter_e@gmx.net> writes:

On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?

(1) what would you constrain it to?

(2) historically we've allowed C locale to be used with any encoding,
and there are a *lot* of users depending on that (particularly in the
Far East, I gather).

regards, tom lane

#20 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#16)
Re: More message encoding woes

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.

I doubt that documenting the issue will be enough (in fact I'm pretty
sure it won't). Maybe we can just supply the string translated in our
POT files, and add a comment that the translator is not supposed to
touch it. This doesn't seem all that difficult -- I think it just
requires that we add a msgmerge step to "make update-po" that uses a
file on which the message has already been translated.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#21 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#14)

#22 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Heikki Linnakangas (#21)

#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#22)

#24 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#23)

#25 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Heikki Linnakangas (#21)

#26 Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#1)

#27 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#25)

#28 Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#1)

#29 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#28)

#30 Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#29)

#31 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hiroshi Inoue (#25)

#32 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#30)

#33 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Heikki Linnakangas (#31)

#34 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hiroshi Inoue (#33)

#35 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#34)

#36 Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#32)

#37 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#36)

#38 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#35)

#39 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#36)