handling unconvertible error messages
Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
(built with NLS). Let's say for some reason, I have client encoding set
to LATIN1. All error messages come back like this:
test=> select * from notthere;
ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
equivalent in encoding "LATIN1"
There is no straightforward way for the client to learn that there is a
real error message, but it could not be converted.
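The failure is easy to reproduce outside the server. The following is only a sketch: Python's codecs stand in for the backend's UTF8-to-LATIN1 conversion, and the Russian text is an illustrative sample, not a string taken from ru.po. The function name convert_for_client is made up for this example.

```python
# Stand-in for the server-side conversion of a translated message.
# The Russian text is an illustrative sample, not the actual catalog entry.
translated = "ОШИБКА: отношение не существует"


def convert_for_client(message: str, client_encoding: str) -> bytes:
    """Convert a message to the client encoding; raise on unmappable characters."""
    return message.encode(client_encoding)


conversion_failed = False
offending_bytes = b""
try:
    convert_for_client(translated, "latin-1")
except UnicodeEncodeError as exc:
    conversion_failed = True
    # First character that LATIN1 cannot represent, shown as its UTF-8 bytes.
    offending_bytes = translated[exc.start].encode("utf-8")

print(conversion_failed, offending_bytes.hex(" "))
```

The first unmappable character is the Cyrillic capital "О" (U+041E), whose UTF-8 form is the 0xd0 0x9e reported in the error above. The same text converts without error to an encoding that covers Cyrillic, such as cp1251.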
I think ideally we could make this better in two ways:
1) Send the original error message untranslated. That would require
saving the original error message in errmsg(), errdetail(), etc. That
would be a lot of work for only the occasional use. But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).
2) Send an indication that there was an encoding problem. Maybe a
NOTICE, or an error context? Wiring all this into elog.c looks a bit
tricky, however.
Ideas?
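To make the two options concrete, here is a rough sketch of what they could look like if combined. Everything in it is hypothetical (the ErrorReport record, the render_for_client function, the returned flag); it is not how elog.c works today, only an illustration of keeping the untranslated text around and signalling the fallback.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ErrorReport:
    """Hypothetical error record that keeps both forms of the message."""
    untranslated: str  # original message text as written in the source
    translated: str    # result of the NLS lookup


def render_for_client(report: ErrorReport, client_encoding: str) -> Tuple[bytes, bool]:
    """Return (bytes to send, fell_back).

    Option 1: if the translated text cannot be converted, send the
    untranslated text instead.
    Option 2: the boolean tells the caller that an encoding problem
    occurred, so it could attach a NOTICE or an error context.
    """
    try:
        return report.translated.encode(client_encoding), False
    except UnicodeEncodeError:
        # Assumes the untranslated text itself is representable,
        # which holds for plain ASCII messages.
        return report.untranslated.encode(client_encoding), True


report = ErrorReport(
    untranslated='relation "notthere" does not exist',
    translated='отношение "notthere" не существует',  # illustrative sample
)
payload, fell_back = render_for_client(report, "latin-1")
print(payload, fell_back)
```

The untranslated text can still fail to convert if it embeds a non-ASCII identifier, so a real implementation would need a further fallback for that case.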
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
(built with NLS). Let's say for some reason, I have client encoding set
to LATIN1. All error messages come back like this:
test=> select * from notthere;
ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
equivalent in encoding "LATIN1"
There is no straightforward way for the client to learn that there is a
real error message, but it could not be converted.
I think ideally we could make this better in two ways:
1) Send the original error message untranslated. That would require
saving the original error message in errmsg(), errdetail(), etc. That
would be a lot of work for only the occasional use. But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).
2) Send an indication that there was an encoding problem. Maybe a
NOTICE, or an error context? Wiring all this into elog.c looks a bit
tricky, however.
We have a similar problem with the server logs. But there there's also an
additional problem: if there isn't any character mapping issue we just
totally ignore text encoding concerns and log in whatever encoding the
client asked the backend to use into the log files. So log files can be a
line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings
someone asks for. There is *no* way to correctly read such a file since
lines don't have any marking as to their encoding and no tools out there
support line-by-line differently encoded text files anyway.
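A small sketch of why such a file cannot be read back reliably. It assumes two sessions that log the same word, one with a UTF-8 client encoding and one with ISO-8859-1; the log content is invented for the illustration.

```python
# Two sessions write the same word to one log file, each in its own encoding.
log = "café\n".encode("utf-8") + "café\n".encode("latin-1")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Report whether the whole buffer is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Reading the whole file as UTF-8 fails on the ISO-8859-1 line.
utf8_ok = decodes_cleanly(log, "utf-8")

# Reading it as ISO-8859-1 never fails, but garbles the UTF-8 line.
as_latin1 = log.decode("latin-1").splitlines()

print(utf8_ok, as_latin1)
```

Neither choice of encoding recovers both lines, and nothing in the file says which line used which encoding.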
I'm not sure how closely it ties in to the issue you mention, but I think
it's at least related enough to keep in mind while considering the
client_encoding issue.
I suggest (3) "log the message with unmappable characters masked". Though I
would definitely like to be able to also send the raw original, along with
a field indicating the encoding of the original (it won't be the
client_encoding), since we need some way to get to the info.
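As a sketch of what masking could look like, the following uses Python's "replace" error handler as a stand-in for a masking conversion in the backend. The function name mask_unmappable and the sample strings are invented for the example.

```python
def mask_unmappable(message: str, client_encoding: str) -> bytes:
    """Convert to the client encoding, replacing unmappable characters with '?'."""
    return message.encode(client_encoding, errors="replace")


# A fully Cyrillic word is masked completely (sample text, not from ru.po).
masked_word = mask_unmappable("ОШИБКА", "latin-1")

# ASCII parts such as severity tags and quoted identifiers survive unchanged.
masked_mixed = mask_unmappable('ERROR: "notthere" не существует', "latin-1")

print(masked_word, masked_mixed)
```

The masked text keeps its length and any ASCII fragments, so how useful it is depends on how much of the message is ASCII to begin with.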
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,
At Wed, 27 Jul 2016 19:53:01 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YFL0b1886tMYF9RPeDdpWryG1cr8ew3pYfiXgrJofpHjA@mail.gmail.com>
On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
(built with NLS). Let's say for some reason, I have client encoding set
to LATIN1. All error messages come back like this:
test=> select * from notthere;
ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
equivalent in encoding "LATIN1"
There is no straightforward way for the client to learn that there is a
real error message, but it could not be converted.
I think ideally we could make this better in two ways:
1) Send the original error message untranslated. That would require
saving the original error message in errmsg(), errdetail(), etc. That
would be a lot of work for only the occasional use. But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).
2) Send an indication that there was an encoding problem. Maybe a
NOTICE, or an error context? Wiring all this into elog.c looks a bit
tricky, however.
We have a similar problem with the server logs. But there there's also an
additional problem: if there isn't any character mapping issue we just
totally ignore text encoding concerns and log in whatever encoding the
client asked the backend to use into the log files. So log files can be a
line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings
someone asks for. There is *no* way to correctly read such a file since
lines don't have any marking as to their encoding and no tools out there
support line-by-line differently encoded text files anyway.
Cyrillic messages that hit such a conversion failure look like just a
series of '?' separated by spaces. The same happens for Japanese
(or CJK in general), which contains (almost) no letters in common
with ASCII. We are sometimes obliged to count the '?'s to identify
messages like the following:
$ LANG=C postgres
?????????: ??????? ?? ???? ?????????: 2016-07-28 14:08:32 JST
?????????: ?????? ?? ????????? ???????????????? ?????? ????????
?????????: ??????? ?? ?????? ????????? ???????????
?????????: ??????? ??????? ??????????? ??????
I'm not sure how closely it ties in to the issue you mention, but I think
it's at least related enough to keep in mind while considering the
client_encoding issue.
The issue this thread is about is a failure of the character
conversion performed by backend code; the other one is
gettext(3)'s behavior according to LC_CTYPE.
I think that data in tables *must* follow the specified encoding
and should result in an error for incompatible characters, but I
don't think the same applies to messages from PostgreSQL.
We Japanese already see such log messages very early in
postmaster startup.
LOG: データベースシステムは 2016-07-28 14:14:06 JST にシャットダウンしました
LOG: MultiXact member wraparound protections are now enabled
LOG: データベースシステムの接続受付準備が整いました。
The reason for the second line is that it simply has no
corresponding translation in ja.po. That is far more acceptable than
the sequence of question marks shown above.
I suggest (3) "log the message with unmappable characters masked". Though I
would definitely like to be able to also send the raw original, along with
a field indicating the encoding of the original since it won't be the
client_encoding, since we need some way to get to the info.
So, I don't think (3) will do much for these
languages. I prefer (1) for this issue. Putting aside the log
issue, the error system of PostgreSQL already does a very similar
thing in err_sendstring for error-recursion cases.
It seems possible to add a silent fallback for conversion failure
there.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 25 Jul 2016 10:43:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
Example: I have a database cluster initialized with
--locale=ru_RU.UTF-8 (built with NLS). Let's say for some reason, I
have client encoding set to LATIN1. All error messages come back
like this:
test=> select * from notthere;
ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has
no equivalent in encoding "LATIN1"
There is no straightforward way for the client to learn that there is
a real error message, but it could not be converted.
Really, the situation is a bit worse. There is at least one case where
the error message reaches the client unreadable even if the encodings
are compatible.
For example, the server default locale is ru_RU.UTF-8 and the client
requests encoding WIN1251, which is able to handle Cyrillic.
If an error occurs while processing the StartupMessage protocol message,
e.g. the client requests a connection to a nonexistent database, the
ErrorResponse contains the message in the server default locale,
despite the client encoding being specified in the StartupMessage.
If the session is established successfully with such parameters, error
messages are displayed correctly.
I haven't yet investigated whether it is just delayed initialization of
the backend locale system, or whether the backend is not yet forked when
this message is generated and the wrongly encoded message is sent by
the postmaster.
On Mon, 25 Jul 2016 10:43:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
There is no straightforward way for the client to learn that there is
a real error message, but it could not be converted.
I think ideally we could make this better in two ways:
1) Send the original error message untranslated. That would require
saving the original error message in errmsg(), errdetail(), etc. That
would be a lot of work for only the occasional use. But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).
2) Send an indication that there was an encoding problem. Maybe a
NOTICE, or an error context? Wiring all this into elog.c looks a bit
tricky, however.
Ideas?
I think there are two more ways
(3 was in Craig's message):
4. At session startup, try to reinitialize the LC_MESSAGES locale
category with the combination of the server (or, better, client-sent)
language and region and the client-supplied encoding, and if this fails,
use untranslated error messages. Obviously, an attempt to set the locale
to ru_RU.ISO8859-1 would fail, so if a client asks a server with the
ru_RU.UTF-8 default locale to use LATIN1 encoding, the server would fall
back to untranslated messages.
This approach would have problems on Windows, where the locale is strictly
tied to the ANSI encoding of the given language/territory. Even if we
made UTF-8 a special case, an attempt to connect with encoding KOI8 or
LATIN5 to a Windows PostgreSQL server running in the
Russian_Russia.1251 locale would result in a fallback to untranslated
messages. But I think this case is marginal, and it is better to present
untranslated messages to the people (or applications) that require a
non-default 8-bit encoding, even if it is possible to represent
translated messages in this encoding, than to present unreadable
translated messages to anybody.
5. Use transliteration in case of an encoding problem. Some iconv
implementations (such as Linux glibc iconv and GNU portable libiconv)
support a //TRANSLIT suffix for the encoding; if this suffix is specified,
they replace unrepresentable symbols with a phonetically similar
approximation.
I don't know how well it would work for Japanese, but for Russian it is
definitely better than lots of question marks.
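To illustrate option 5 without depending on a particular iconv build, here is a sketch that uses a tiny hand-written transliteration table in place of //TRANSLIT. The table covers only the letters of the sample word and is an assumption of this example; what glibc or libiconv actually produce may differ.

```python
# Hand-written Cyrillic-to-Latin table, covering only the sample word below.
# This is a stand-in for iconv's //TRANSLIT, not a reproduction of its output.
TRANSLIT = str.maketrans(
    {"О": "O", "Ш": "SH", "И": "I", "Б": "B", "К": "K", "А": "A"}
)


def to_client(message: str, client_encoding: str) -> bytes:
    """Convert to the client encoding, transliterating only when needed."""
    try:
        return message.encode(client_encoding)
    except UnicodeEncodeError:
        approximated = message.translate(TRANSLIT)
        # Anything the table does not cover still degrades to '?'.
        return approximated.encode(client_encoding, errors="replace")


print(to_client("ОШИБКА", "latin-1"))
print(to_client("ОШИБКА", "cp1251") == "ОШИБКА".encode("cp1251"))
```

When the client encoding can represent the text, the message passes through untouched; transliteration is applied only as a fallback.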
Victor Wagner <vitus@wagner.pp.ru> writes:
If an error occurs while processing the StartupMessage protocol message,
e.g. the client requests a connection to a nonexistent database, the
ErrorResponse contains the message in the server default locale,
despite the client encoding being specified in the StartupMessage.
Yeah. I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones. Or is that just my
English-centricity showing?
regards, tom lane
On Thu, 04 Aug 2016 09:42:10 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Victor Wagner <vitus@wagner.pp.ru> writes:
If an error occurs while processing the StartupMessage protocol message,
e.g. the client requests a connection to a nonexistent database, the
ErrorResponse contains the message in the server default locale,
despite the client encoding being specified in the StartupMessage.
Yeah. I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones. Or is that just my
English-centricity showing?
From my Russian point of view, English messages are definitely better
than a transliteration of Russian into Latin letters (although that is
not completely unreadable), not to mention a wrong encoding or lots of
question marks.
Really, if this response is sent after the backend has been forked, the
problem can probably be fixed in a better way: the StartupMessage contains
information about the desired client encoding, so this information just
needs to be processed earlier than any other information from this
message that can cause errors (such as the database name).
If these errors are sent from the postmaster itself, things are worse,
because I don't think the locale subsystem is designed to be
reinitialized many times in the same process.
But the postmaster itself can use non-localized messaging. Its messages in
the logs are typically analyzed by more or less qualified DBAs and
system administrators, not by end users.
Victor Wagner <vitus@wagner.pp.ru> writes:
On Thu, 04 Aug 2016 09:42:10 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah. I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Really, if this response is sent after the backend has been forked, the
problem can probably be fixed in a better way: the StartupMessage contains
information about the desired client encoding, so this information just
needs to be processed earlier than any other information from this
message that can cause errors (such as the database name).
I think that's wishful thinking. There will *always* be errors that
come out before we can examine the contents of the startup message.
Moreover, until we've done authentication, we should be very wary of
applying client-specified settings at all: they might be malicious.
regards, tom lane
On Thu, 04 Aug 2016 14:25:52 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Victor Wagner <vitus@wagner.pp.ru> writes:
Really, if this response is sent after the backend has been forked, the
problem can probably be fixed in a better way: the StartupMessage
contains information about the desired client encoding, so this
information just needs to be processed earlier than any other
information from this message that can cause errors (such as the
database name).
I think that's wishful thinking. There will *always* be errors that
come out before we can examine the contents of the startup message.
Moreover, until we've done authentication, we should be very wary of
applying client-specified settings at all: they might be malicious.
I think this case can be an exception to the rule "don't apply
settings from an untrusted source".
Let's consider a possible threat model:
1. We parse the StartupMessage before authentication anyway. There is
nothing we can do about that, so the parser has to be robust enough to
handle untrusted input. As far as I can see from a quick glance, it is.
2. When the encoding name is parsed, it is used to search the array of
supported encodings. No attack is possible here: either it is valid or not.
3. As far as I know, we don't allow the client to change the language, only
the encoding, so an attacker cannot even make the messages
in the log unreadable for the system administrator.
So, if we fix the problem reported by Peter Eisentraut at the
beginning of this thread, and fall back to untranslated messages
whenever the client-requested encoding is unable to represent messages in
the server default language, this solution would be no worse than
your solution.
There would be a fallback to the C locale in any case of doubt, but in
cases where NLS messages can be made readable, they would be readable.
Really, there is at least one case where the fallback to the C locale
should be done unconditionally: a CancelRequest. In this case the client
cannot send an encoding, so the C locale should be used.
As far as I understand, this is not the case with SSLRequest. Although,
like CancelRequest, it doesn't contain encoding information, errors
in the subsequent SSL negotiation would be reported by the client-side SSL
library, not by the server.
On 8/4/16 2:45 AM, Victor Wagner wrote:
4. At session startup, try to reinitialize the LC_MESSAGES locale
category with the combination of the server (or, better, client-sent)
language and region and the client-supplied encoding, and if this fails,
use untranslated error messages. Obviously, an attempt to set the locale
to ru_RU.ISO8859-1 would fail, so if a client asks a server with the
ru_RU.UTF-8 default locale to use LATIN1 encoding, the server would fall
back to untranslated messages.
I think this is basically my solution (1), with the same problems.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/4/16 9:42 AM, Tom Lane wrote:
I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones. Or is that just my
English-centricity showing?
Well, most of the time this all works; you might have problems only if
the client and server settings differ. We wouldn't want to
partially disable the NLS feature for the normal case.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 5 Aug 2016 11:23:37 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 8/4/16 9:42 AM, Tom Lane wrote:
I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's
better than sending completely-unreadable ones. Or is that just my
English-centricity showing?
Well, most of the time this all works, only if there are different
client and server settings you might have problems. We wouldn't want
to partially disable the NLS feature for the normal case.
There are cases where the client cannot tell the server which encoding it
wants to use, and the server cannot tell which encoding it uses, but it can
send error messages. For example, CancelRequest.
The only way to ensure that the message is readable in this case is to fall
back to some encoding definitely known to both client and server.
For now, that is US-ASCII.
That is, as far as I understand, what Tom is proposing:
fall back to untranslated messages at the beginning of the session, and
return to NLS only when the encoding has been successfully negotiated
between client and server.
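A minimal sketch of that behavior as a small state holder. The class BackendSession and its methods are hypothetical names for this illustration and do not correspond to actual backend code.

```python
from typing import Optional


class BackendSession:
    """Hypothetical model of a backend's message locale during startup."""

    def __init__(self, server_lc_messages: str) -> None:
        self.server_lc_messages = server_lc_messages
        # Start untranslated: nothing has been negotiated with the client yet.
        self.message_locale = "C"
        self.client_encoding: Optional[str] = None

    def startup_complete(self, client_encoding: str) -> None:
        """Called once the startup packet settings have been absorbed."""
        self.client_encoding = client_encoding
        self.message_locale = self.server_lc_messages

    def emit(self, untranslated: str, translated: str) -> str:
        """Pick the message text according to the current message locale."""
        if self.message_locale == "C":
            return untranslated
        return translated


session = BackendSession("ru_RU.UTF-8")
early = session.emit("database does not exist", "база данных не существует")
session.startup_complete("UTF8")
late = session.emit("division by zero", "деление на ноль")
print(early, "|", late)
```

A CancelRequest never reaches startup_complete in this model, so any message produced for it stays untranslated.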
Maybe there is another solution: prepare the client to
accept UTF-8 messages from the server regardless of encoding, i.e. if a
message starts with a BOM (the U+FEFF Unicode character, the EF BB BF byte
sequence in UTF-8), interpret it as UTF-8. It would require the client to
support some kind of encoding conversion, and in some 8-bit
environments it would pose problems with displaying these messages.
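On the client side, the BOM check could look roughly like this. It is only a sketch: decode_server_message is a made-up helper, not part of libpq or any existing driver.

```python
import codecs


def decode_server_message(raw: bytes, client_encoding: str) -> str:
    """Decode a server message, treating a leading UTF-8 BOM as an override."""
    if raw.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode(client_encoding)


# Server could not negotiate an encoding, so it marks the message as UTF-8.
marked = codecs.BOM_UTF8 + "ОШИБКА".encode("utf-8")
print(decode_server_message(marked, "latin-1"))

# Unmarked messages are decoded in the negotiated client encoding as before.
print(decode_server_message(b"ERROR", "latin-1"))
```

One caveat with this scheme: in most 8-bit encodings the bytes EF BB BF are also valid ordinary text, so a message that legitimately starts with them would be misread.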
On Fri, 5 Aug 2016 11:21:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 8/4/16 2:45 AM, Victor Wagner wrote:
4. At session startup, try to reinitialize the LC_MESSAGES locale
category with the combination of the server (or, better, client-sent)
language and region and the client-supplied encoding, and if this fails,
use untranslated error messages. Obviously, an attempt to set the locale
to ru_RU.ISO8859-1 would fail, so if a client asks a server with the
ru_RU.UTF-8 default locale to use LATIN1 encoding, the server would
fall back to untranslated messages.
I think this is basically my solution (1), with the same problems.
I think there is a big difference from the server's point of view.
You propose that both the translated and the untranslated message be
passed around inside the backend. That has some benefits, but requires
considerable reworking of server internals.
My solution doesn't require keeping both the original message and
the translated one during the whole call stack unwinding. It just checks
whether the combination of language and encoding is supported by the NLS
subsystem, and if not, falls back to untranslated messages for the entire
session.
It is a much more local change, comparable in complexity to the one
proposed by Tom Lane.
At Mon, 8 Aug 2016 10:19:10 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in <20160808101910.49beeed6@fafnir.local.vm>
On Fri, 5 Aug 2016 11:21:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 8/4/16 2:45 AM, Victor Wagner wrote:
4. At session startup, try to reinitialize the LC_MESSAGES locale
category with the combination of the server (or, better, client-sent)
language and region and the client-supplied encoding, and if this fails,
use untranslated error messages. Obviously, an attempt to set the locale
to ru_RU.ISO8859-1 would fail, so if a client asks a server with the
ru_RU.UTF-8 default locale to use LATIN1 encoding, the server would
fall back to untranslated messages.
I think this is basically my solution (1), with the same problems.
I think, that there is a big difference from server point of view.
You propose that both translated and untranslated message should be
passed around inside backend. It has some benefits, but requires
considerable reworking of server internals.
Agreed.
My solution doesn't require keeping both original message and
translated one during all call stack unwinding. It just checks if
combination of language and encoding is supported by the NLS subsystem,
and if not, falls back to untranslated message for entire session.
Looking at check_client_encoding(), the comment says the following:
| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs. If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along. After
We should overcome this to implement a startup-time check for
conversion procs.
It is much more local change and is comparable by complexity with one,
proposed by Tom Lane.
I'm not sure what messages may be raised before authentication,
but it can be a more generic solution (adding a check during
the session).
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
Looking at check_client_encoding(), the comment says the following:
| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs. If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along. After
We should overcome this to implement a startup-time check for
conversion procs.
Somewhat wrong. The core problem is that the procedures offered by
PrepareClientEncoding are chosen only on an encoding-to-encoding
basis, without considering character set compatibility. So, currently
this is not detectable before actually converting a
character stream.
Conversely, providing a means to check character-set
compatibility would naturally fix this. A check at session startup
(an out-of-transaction check?) is still a separate problem.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I'm not sure what messages may be raised before authentication
but it can be a more generic-solution. (Adding check during
on-session.)
Definitely, there can be an authentication error message, which is sent if
authentication did not succeed. Also, as far as I understand, the message
"Database ... does not exist" is also sent before authentication.
Also, there are CancelRequests, where normal authentication is not
used, and the server key provided in another session is used instead.
At Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160808.181154.252052789.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
Looking at check_client_encoding(), the comment says the following:
| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs. If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along. After
We should overcome this to implement a startup-time check for
conversion procs.
Somewhat wrong. The core problem is that the procedures offered by
PrepareClientEncoding are chosen only on an encoding-to-encoding
basis, without considering character set compatibility. So, currently
this is not detectable before actually converting a
character stream.
Conversely, providing a means to check character-set
compatibility would naturally fix this. A check at session startup
(an out-of-transaction check?) is still a separate problem.
I don't see charset compatibility as easily detectable,
because the locale (or character set) is not PostgreSQL's concern
(except for some encodings bound to one particular character
set)... So the conversion fallback might be the only available
solution.
Thoughts?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
Somewhat wrong. The core problem is that the procedures offered by
PrepareClientEncoding are chosen only on an encoding-to-encoding
basis, without considering character set compatibility. So, currently
this is not detectable before actually converting a
character stream.
Yes, my idea was to check language/encoding compatibility: make sure
that NLS messages can be represented in the client-specified encoding
in a readable way. As far as I know, there is no platform-independent,
bulletproof way to do so.
On Unix you can try to initialize a locale with the given language and
encoding, but that can fail even if the encoding is compatible with the
language, simply because the corresponding locale has not been generated
on this system.
But this seems to be a matter of system administration and can be left
to local sysadmins.
Once you have correctly initialized LC_MESSAGES, you don't need
encoding conversion routines for the NLS messages. You can use the
bind_textdomain_codeset function to provide messages in the
client-desired encoding (but this can cause problems with server logs,
where messages from different sessions would come in different
encodings).
On Windows things are more complicated. There is just one ANSI code
page associated with a given language, and locale initialization would
fail with any other code page, including UTF-8.
regards,
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I don't see charset compatibility to be easily detectable,
In the worst case we can hardcode an explicit compatibility table.
There is a limited set of languages that have translated error messages,
and a limited (albeit wide) set of encodings supported by PostgreSQL. So
it is possible to define a complete list of encodings compatible with
each translation, and fall back to untranslated messages if the client
encoding is not in this list.
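Such a table could be sketched as follows. The entries are illustrative and incomplete, and their correctness has not been verified; the names COMPATIBLE_ENCODINGS and message_language are invented for this example.

```python
# Illustrative, unverified table: translation language -> client encodings
# assumed to be able to represent that translation.
COMPATIBLE_ENCODINGS = {
    "ru": {"UTF8", "KOI8R", "WIN1251", "WIN866", "ISO_8859_5"},
    "ja": {"UTF8", "EUC_JP", "SJIS"},
    "de": {"UTF8", "LATIN1", "LATIN9", "WIN1252"},
}


def message_language(server_language: str, client_encoding: str) -> str:
    """Return the language to use for messages, or "C" for untranslated."""
    compatible = COMPATIBLE_ENCODINGS.get(server_language, set())
    if client_encoding in compatible:
        return server_language
    # Unknown language or encoding not listed: fall back to untranslated.
    return "C"


print(message_language("ru", "WIN1251"))
print(message_language("ru", "LATIN1"))
```

Any combination missing from the table falls back to untranslated messages, so an incomplete table costs translations but never produces an unreadable message.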
because locale (or character set) is not a matter of PostgreSQL
(except for some encodings bound to one particular character
set)... So the conversion-fallback might be a only available
solution.
Conversion fallback may be a solution for data. For NLS messages I think
it is better to fall back to English (untranslated) messages than to use
transliteration or something similar.
I think that for now we can assume that the best effort has already been
made for the data, and think about how to improve the situation with
messages.
Hello,
(I've recovered the lost Cc recipients so far)
At Mon, 8 Aug 2016 12:52:11 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in <20160808125211.1361cc0f@fafnir.local.vm>
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I don't see charset compatibility to be easily detectable,
In the worst case we can hardcode explicit compatibility table.
We could have lists of languages compatible with particular
language-bound encodings. For example, for LATIN1 (ISO/IEC 8859-1),
see Wikipedia
(https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
According to that list, we might have the following compatibility
list of locales, maybe without region:
{{"UTF8", "LATIN1"}, "af", "sq", "eu", "da", "en", "fo", "en"}... and so on.
The biggest problem with this is that at least *I* cannot confirm the
validity of the list: neither that LATIN1 fully covers
all languages in the list, nor that no coverable language has been
omitted. Nonetheless, we could use such lists
if we accept the possible imperfection, which would eventually
result in the original error (conversion failure) or in an unnecessary
fallback for languages that are in fact convertible; unfortunately the
latter would be unacceptable for table data.
There is limited set of languages, which have translated error messages,
and limited (albeit wide) set of encodings, supported by PostgreSQL. So
Yes, we could have a negative list of combinations already known to be
incompatible.
{{"UTF8", "LATIN1"}, "ru", .. er..what else?}
ISO 639-1 seems to have about 190 languages, and most of them are
apparently incompatible with the LATIN1 encoding. It doesn't seem
good to me to have a haphazardly made negative list.
it is possible to define a complete list of encodings compatible with
each translation, and fall back to untranslated messages if the client
encoding is not in this list.
because locale (or character set) is not a matter of PostgreSQL
(except for some encodings bound to one particular character
set)... So the conversion-fallback might be a only available
solution.
Conversion fallback may be a solution for data. For NLS messages I think
it is better to fall back to English (untranslated) messages than to use
transliteration or something similar.
I suppose that 'fallback' means "have a try, then use English if it
fails", so I think it is suitable for messages rather than for
data, and it doesn't need any a priori information about
compatibility. It seems to me that PostgreSQL refuses to ignore
or conceal conversion errors and return broken or unwanted byte
sequences for data. Things are different for error messages: it
is preferable for them to be somehow readable than to be abandoned
entirely.
I think that for now we can assume that the best effort is already done
for the data, and think how to improve situation with messages.
Is there any source for the compatibility of every combination
of language and encoding? We probably need some basis for the list.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center