BUG #6510: A simple prompt is displayed using wrong charset
The following bug has been logged on the website:
Bug reference: 6510
Logged by: Alexander LAW
Email address: exclusion@gmail.com
PostgreSQL version: 9.1.3
Operating system: Windows
Description:
I'm using PostgreSQL on Windows with a Russian locale and get unreadable
messages when the postgres utilities prompt me for input.
Please look at the screenshot:
http://oi44.tinypic.com/aotje8.jpg
(psql prints an unreadable message when prompting for the password.)
But at the same time the following message (WARNING) is displayed correctly.
I believe it's related to setlocale and the difference between the OEM and
ANSI encodings that we have on Windows with the Russian locale.
The startup code of psql sets the locale with the call setlocale(LC_ALL, ""),
and the MSDN documentation says of that call:
"Sets the locale to the default, which is the user-default ANSI code page
obtained from the operating system."
After the call, all strings printed with printf(stdout) go
through the ANSI->OEM conversion.
But in the simple_prompt function strings are written to "con", and such
writes go without conversion.
I've made a little test to illustrate this:
#include <stdio.h>
#include <locale.h>

int main(void)
{
    printf("ОК\n");                   /* before setlocale: printed as-is */
    setlocale(LC_ALL, "");            /* switch to the user-default ANSI code page */
    fprintf(stdout, "ОК\n");          /* stdout output now goes through ANSI->OEM */
    FILE *termin = fopen("con", "w"); /* direct console writes bypass the conversion */
    fprintf(termin, "ОК\n");
    fflush(termin);
    fclose(termin);
    return 0;
}
where "ОК" is "OK" in Russian letters.
This test gives the following result:
http://oi39.tinypic.com/35jgljs.jpg
The second line is readable, while the others are not.
If it would help in understanding the issue, I can perform other tests.
Thanks in advance,
Alexander
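For reference, the ANSI/OEM mismatch described above can be reproduced without Windows. A minimal Python sketch, assuming cp1251 as the ANSI code page of a Russian Windows system and cp866 as the OEM code page used by its console:

```python
s = "ОК"  # "OK" in Russian letters, as in the test above

ansi = s.encode("cp1251")  # bytes produced under the ANSI code page
oem = s.encode("cp866")    # bytes the console actually expects

# The byte sequences differ, so ANSI-encoded bytes written straight to
# the console device (as simple_prompt does with "con") render as garbage.
print(ansi, oem)
assert ansi != oem
```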
Excerpts from exclusion's message of sáb mar 03 15:44:37 -0300 2012:
Were you able to come up with some way to make this work?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I see two ways to resolve the issue.
The first is to use CharToOemBuff when writing a string to "con" and
OemToCharBuff when reading input from it.
The other is to always use stderr/stdin for Win32, as was done for
msys before. I think it's more straightforward.
I tested the attached patch (built the source with MSVC) and it fixes
the issue. If it looks acceptable, then probably DEVTTY should not be
used on Windows at all.
I found two other references to DEVTTY at
psql/command.c
success = saveHistory(fname ? fname : DEVTTY, -1, false, false);
and
contrib/pg_upgrade/option.c
log_opts.debug_fd = fopen(DEVTTY, "w");
By the way, is there any reason to use stderr for the prompt output, not
stdout?
Regards,
Alexander
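The first approach can be modeled outside the Windows API. A minimal Python sketch of what a CharToOemBuff-style conversion does, assuming cp1251 as the ANSI code page and cp866 as the OEM code page (the hypothetical helper name is mine):

```python
def ansi_to_oem(ansi_bytes: bytes) -> bytes:
    """Model of CharToOemBuff: re-encode ANSI (cp1251) bytes into the
    console's OEM code page (cp866) so that a direct write to the
    console device displays correctly."""
    return ansi_bytes.decode("cp1251").encode("cp866")

prompt = "Пароль: ".encode("cp1251")  # "Password: " as psql would hold it
print(ansi_to_oem(prompt))            # what should actually reach "con"
```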
On 16.03.2012 23:13, Alvaro Herrera wrote:
Attachment: sprompt.diff (text/x-patch)
Excerpts from Alexander LAW's message of dom mar 18 06:04:51 -0300 2012:
Using console directly instead of stdin/out/err is more appropriate when
asking for passwords and reading them back, because you can redirect the
rest of the output to/from files or pipes, without the prompt
interfering with that. This also explains why stderr is used instead of
stdout.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Thanks, I've understood your point.
Please look at the patch. It implements the first way and it makes psql
work too.
Regards,
Alexander
On 20.03.2012 00:05, Alvaro Herrera wrote:
Attachment: sprompt.diff (text/x-patch)
Excerpts from Alexander LAW's message of mar mar 20 16:50:14 -0300 2012:
Great, thanks. Hopefully somebody with Windows-compile abilities will
have a look at this.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hello,
The dump file itself is correct. The issue is only with the non-ASCII
object names in pg_dump messages.
The message text (which is non-ASCII too) is displayed consistently in the
right encoding (i.e. the OS encoding, thanks to libintl/gettext), but the
encoding of database object names depends on the dump encoding, so they
become unreadable when a different encoding is used.
The same can be reproduced on Linux (where the console encoding is UTF-8)
when dumping with Windows-1251 or Latin1 (for Western European
languages).
Thanks,
Alexander
The following bug has been logged on the website:
Bug reference: 6742
Logged by: Alexander LAW
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 9.1.4
Operating system: Windows
Description:
When I try to dump a database with UTF-8 encoding on Windows, I get
unreadable object names.
Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). In the
left window all the pg_dump messages are displayed correctly (except for the
password prompt (bug #6510)), but the non-ASCII object name is gibberish. In
the right window (where the dump is done with the Windows-1251 encoding, the
OS encoding for the Russian locale) everything is right.
Did you check the dump file using an editor that can handle UTF-8?
The Windows console is not known for properly handling that encoding.
Thomas
Hello!
May I propose a solution and step up?
I've read the discussion of bug #5800 and here are my 2 cents.
To make things clear, let me give an example.
I am a PostgreSQL hosting provider and I let my customers create any databases they wish.
I have clients all over the world (so they can create databases with different encodings).
The question is: what do I (as admin) want to see in my postgresql log, which contains errors from all the databases?
IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second, it should be as useful and complete as possible.
Now I see the following solutions.
A. We have different logfiles for each database, with different encodings.
Then all our logs will be readable, but we have to look at them one by one, which is inconvenient to say the least.
Moreover, our log reader should understand what encoding to use for each file.
B. We have one logfile with the operating system encoding.
The first downside is that the logs can differ between OSes.
The second is that Windows has a non-Unicode system encoding,
and such an encoding can't represent all national characters. So at best I will get ??? in the log.
C. We have one logfile with UTF-8.
Pros: Log messages from all our clients can fit in it. We can use any generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some conversion function.
I think that the last solution is the solution. What is your opinion?
In fact the problem exists even with a simple installation on Windows when you use a non-English locale.
So the solution would be useful for many of us.
Best regards,
Alexander
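Option C amounts to a normalization step in the log writer. A minimal sketch (the helper name is mine; it assumes each message's source encoding is known, as the backend knows its database encoding):

```python
def to_log_utf8(message: bytes, src_encoding: str) -> bytes:
    # Convert a log message from its source encoding to UTF-8;
    # undecodable bytes are replaced rather than dropped, so the
    # log line survives even if the conversion is imperfect.
    return message.decode(src_encoding, errors="replace").encode("utf-8")

# Messages arriving in different encodings end up in one readable file.
lines = [
    to_log_utf8("ОШИБКА: отношение не существует".encode("cp1251"), "cp1251"),
    to_log_utf8(b"ERROR: relation does not exist", "ascii"),
]
for line in lines:
    print(line.decode("utf-8"))
```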
On 05/23/2012 09:15 AM, yi huang wrote:
I'm using postgresql 9.1.3 from debian squeeze-backports with
zh_CN.UTF-8 locale, i find my main log (which is
"/var/log/postgresql/postgresql-9.1-main.log") contains "???" which
indicate some sort of charset encoding problem.
It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the
system locale, and the PostgreSQL backends log in whatever encoding
their database is in. They all write to the same log file, producing a
log file full of mixed encoding data that'll choke many text editors.
If you force your editor to re-interpret the file according to the
encoding your database(s) are in, this may help.
In the future it's possible that this may be fixed by logging output to
different files on a per-database basis or by converting the text
encoding of log messages, but no agreement has been reached on the
correct approach and nobody has stepped up to implement it.
--
Craig Ringer
Hello!
P.S. Sorry for the wrong subject in my previous message, sent to
pgsql-general.
I am thinking about a variant of C.
The problem with C is that converting from other encodings to UTF-8 is not
cheap, because it requires huge conversion tables. This may be a
serious problem on a busy server. Also, it is possible that some information
is lost in this conversion, because there is no
guarantee that there is a one-to-one mapping between UTF-8 and other
encodings. Another problem with UTF-8 is that you have to choose *one*
locale when using your editor. This may or may not affect the handling of
strings in your editor.
My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:
1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also, no information loss happens in this
conversion.
2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.
3) No need to worry about locale. The mule-internal encoding carries enough
information about the language.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Tatsuo Ishii <ishii@postgresql.org> writes:
Um ... but ...
(1) nothing whatsoever can read MULE, except emacs and xemacs.
(2) there is more than one version of MULE (emacs versus xemacs,
not to mention any possible cross-version discrepancies).
(3) from a log volume standpoint, this could be pretty disastrous.
I'm not for a write-only solution, which is pretty much what this
would be.
regards, tom lane
I'm not sure how long xemacs will survive (the last stable release of
xemacs was in 2009). Anyway, I'm not too worried about your
points, since it's easy to convert mule-internal-encoded log files back
to the original mixed-encoding log file. No information
will be lost. Even converting to UTF-8 should be possible. My point
is, once the log file is converted to UTF-8, there's no way to convert
back to the original-encoding log file.
Probably we should treat mule-internal-encoded log files as an internal
format, and have a utility which converts from mule-internal to
UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 07/18/2012 11:16 PM, Alexander Law wrote:
Implementing any of these isn't trivial - especially making sure
messages emitted to stderr from things like segfaults and dynamic linker
messages are always correct. Ensuring that the logging collector knows
when setlocale() has been called to change the encoding and translation
of system messages, handling the different logging output methods, etc -
it's going to be fiddly.
I have some performance concerns about the transcoding required for (b)
or (c), but realistically it's already the norm to convert all the data
sent to and from clients. Conversion for logging should not be a
significant additional burden. Conversion can be short-circuited out
when source and destination encodings are the same for the common case
of logging in utf-8 or to a dedicated file.
I suspect the eventual choice will be "all of the above":
- Default to (b) or (c), both have pros and cons. I favour (c) with a
UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are all
in the system locale.
- Allow (a) for people who have many different DBs in many different
encodings, do high volume logging, and want to avoid conversion
overhead. Let them deal with the mess, just provide an additional % code
for the encoding so they can name their per-DB log files to indicate the
encoding.
The main issue is just that code needs to be prototyped, cleaned up, and
submitted. So far nobody's cared enough to design it, build it, and get
it through patch review. I've just foolishly volunteered myself to work
on an automated crash-test system for virtual plug-pull testing, so I'm
not stepping up.
--
Craig Ringer
Hello,
I believe that postgres has such conversion functions anyway. They
are used for data conversion when we have clients (and databases) with
different encodings. So if they can be used for data, why not use
them for the relatively small volume of log messages?
And regarding the mule internal encoding - reading about Mule at
http://www.emacswiki.org/emacs/UnicodeEncoding I found:
"In future (probably Emacs 22), Mule will use an internal encoding which
is a UTF-8 encoding of a superset of Unicode."
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please
provide some examples of those that can result in lossy conversion?
Choosing UTF-8 in a viewer/editor is no big deal either. Most of them
detect UTF-8 automagically, and for the others a BOM can be added.
Best regards,
Alexander
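The BOM idea mentioned above is simple to sketch (the helper is hypothetical; the three bytes EF BB BF at the start of a file let most editors auto-detect UTF-8):

```python
import codecs

def start_utf8_log(path: str) -> None:
    # Begin a new log file with a UTF-8 byte-order mark so that
    # editors pick the right encoding automatically.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)

# The BOM is the UTF-8 encoding of U+FEFF.
print(codecs.BOM_UTF8)
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
```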
Hello,
> Implementing any of these isn't trivial - especially making sure
> messages emitted to stderr from things like segfaults and dynamic
> linker messages are always correct. Ensuring that the logging
> collector knows when setlocale() has been called to change the
> encoding and translation of system messages, handling the different
> logging output methods, etc - it's going to be fiddly.
>
> I have some performance concerns about the transcoding required for
> (b) or (c), but realistically it's already the norm to convert all the
> data sent to and from clients. Conversion for logging should not be a
> significant additional burden. Conversion can be short-circuited out
> when source and destination encodings are the same for the common case
> of logging in utf-8 or to a dedicated file.
The initial issue was that the log file contains messages in different
encodings. So transcoding is performed already, but it's not consistent,
and in my opinion this is the main problem.
> I suspect the eventual choice will be "all of the above":
> - Default to (b) or (c), both have pros and cons. I favour (c) with a
> UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are
> all in the system locale.
As I understand it, UTF-8 is the default encoding for databases. And even
when a database is in the system encoding, translated postgres messages
still come in UTF-8 and go through a UTF-8 -> system locale
conversion within gettext.
> - Allow (a) for people who have many different DBs in many different
> encodings, do high volume logging, and want to avoid conversion
> overhead. Let them deal with the mess, just provide an additional %
> code for the encoding so they can name their per-DB log files to
> indicate the encoding.
I think that solution (a) can be an evolution of the logging mechanism
if there turns out to be a need for it.
> The main issue is just that code needs to be prototyped, cleaned up,
> and submitted. So far nobody's cared enough to design it, build it,
> and get it through patch review. I've just foolishly volunteered
> myself to work on an automated crash-test system for virtual plug-pull
> testing, so I'm not stepping up.
I see your point, and I can prepare a prototype if the proposed (c)
solution seems reasonable enough and can be accepted.
Best regards,
Alexander
> I believe that postgres has such conversion functions anyway. They
> are used for data conversion when we have clients (and databases) with
> different encodings. So if they can be used for data, why not use
> them for the relatively small volume of log messages?
Frontend/backend encoding conversion only happens when the two encodings
are different, while conversion for logs would *always* happen. A busy
database could produce tons of logs (it is not unusual to log all SQL
statements for auditing purposes).
> And regarding the mule internal encoding - reading about Mule at
> http://www.emacswiki.org/emacs/UnicodeEncoding I found:
> "In future (probably Emacs 22), Mule will use an internal encoding
> which is a UTF-8 encoding of a superset of Unicode."
> So I still see UTF-8 as a common denominator for all the encodings.
> I am not aware of any characters absent from Unicode. Can you please
> provide some examples of those that can result in lossy conversion?
You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find an example. In this case PostgreSQL just throws
an error. For frontend/backend encoding conversion this is fine, but
what should we do for logs? Apparently we cannot throw an error there.
"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is: if kanji A in China,
B in Japan and C in Korea look "similar", unify A, B and C into D. This
is a great space saving :-) The price of this is the inability to do a
round-trip conversion. You can convert A, B or C to D, but you cannot
convert D back to A, B or C.
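The "no equivalent" failure mode is easy to demonstrate in the opposite direction too. In Python (the euro sign is absent from the JIS X 0208/0212 repertoire that EUC-JP covers, so the codec must refuse):

```python
try:
    "€".encode("euc_jp")       # no EUC-JP representation exists
except UnicodeEncodeError as e:
    # A frontend/backend conversion can raise like this;
    # a log writer cannot simply raise and drop the message.
    print("lossy:", e.reason)
```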
BTW, I'm not wedded to the mule-internal encoding. What we need here is a
"super" encoding which can include all existing encodings without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Hello,
> The initial issue was that the log file contains messages in different
> encodings. So transcoding is performed already, but it's not
This is not true. Transcoding happens only when PostgreSQL is built
with the --enable-nls option (the default is no NLS).
> consistent and in my opinion this is the main problem.
> As I understand it, UTF-8 is the default encoding for databases. And
> even when a database is in the system encoding, translated postgres
> messages still come in UTF-8 and go through a UTF-8 -> system locale
> conversion within gettext.
Again, this is not always true.
> > The initial issue was that the log file contains messages in
> > different encodings. So transcoding is performed already, but it's
> > not consistent.
> This is not true. Transcoding happens only when PostgreSQL is built
> with the --enable-nls option (the default is no NLS).
I'll restate the initial issue as I see it.
I have Windows and I'm installing PostgreSQL for Windows (latest
version, downloaded from EnterpriseDB). Then I create a database with
default settings (with UTF-8 encoding), do something wrong in my DB, and
get a log file in two different encodings (UTF-8 and
Windows-1251 (ANSI)) and with localized postgres messages.
Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by default)
to postgresql.conf, and use this encoding consistently within the
logging collector.
If this encoding is not available, then fall back to 7-bit ASCII.
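The proposed fallback behaviour can be sketched as follows ("log_encoding" is the hypothetical parameter suggested above, not an actual postgresql.conf setting, and the helper name is mine):

```python
def recode_for_log(message: str, log_encoding: str = "utf-8") -> bytes:
    try:
        return message.encode(log_encoding)
    except (LookupError, UnicodeEncodeError):
        # Encoding unknown or message not representable in it:
        # fall back to 7-bit ASCII, marking the lost characters.
        return message.encode("ascii", errors="replace")

print(recode_for_log("журнал"))                 # UTF-8 bytes
print(recode_for_log("журнал", "no-such-enc"))  # ASCII fallback: b'??????'
```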
> Ok, maybe the time of a real universal encoding has not yet come. Then
> maybe we should just add a new parameter "log_encoding" (UTF-8 by
> default) to postgresql.conf, and use this encoding consistently within
> the logging collector.
> If this encoding is not available, then fall back to 7-bit ASCII.
What do you mean by "not available"?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp