BUG #6510: A simple prompt is displayed using wrong charset

Started by Noname, almost 14 years ago, 70 messages
#1 Noname
exclusion@gmail.com

The following bug has been logged on the website:

Bug reference: 6510
Logged by: Alexander LAW
Email address: exclusion@gmail.com
PostgreSQL version: 9.1.3
Operating system: Windows
Description:

I'm using PostgreSQL on Windows with a Russian locale and get unreadable
messages when the postgres utilities prompt me for input.
Please look at the screenshot:
http://oi44.tinypic.com/aotje8.jpg
(psql writes the unreadable message when prompting for the password.)
But at the same time the following message (WARNING) is displayed correctly.

I believe it's related to setlocale and the difference between the OEM and ANSI
encodings, which we have on Windows with the Russian locale.
The startup code of psql sets the locale with the call setlocale(LC_ALL, "") and
the MSDN documentation says of that call:
Sets the locale to the default, which is the user-default ANSI code page
obtained from the operating system.

After the call, all the strings printed with printf(stdout) will go
through the ANSI->OEM conversion.

But in the simple_prompt function, strings are written to "con", and such
writes go without conversion.

I've made a little test to illustrate this:

#include "stdafx.h"
#include <locale.h>

int _tmain(int argc, _TCHAR* argv[])
{
    printf("ОК\n");              /* before setlocale: printed as-is */
    setlocale(LC_ALL, "");
    fprintf(stdout, "ОК\n");     /* after setlocale: goes through ANSI->OEM conversion */
    FILE *termin = fopen("con", "w");
    fprintf(termin, "ОК\n");     /* direct console write: no conversion */
    fflush(termin);
    return 0;
}

where "ОК" is "OK" in Russian letters.
This test gives the following result:
http://oi39.tinypic.com/35jgljs.jpg

The second line is readable, while the others are not.

If it can be helpful for understanding the issue, I can perform other tests.

Thanks in advance,
Alexander

#2 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Noname (#1)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Excerpts from exclusion's message of Sat Mar 03 15:44:37 -0300 2012:

I'm using PostgreSQL on Windows with a Russian locale and get unreadable
messages when the postgres utilities prompt me for input.
Please look at the screenshot:
http://oi44.tinypic.com/aotje8.jpg
(psql writes the unreadable message when prompting for the password.)
But at the same time the following message (WARNING) is displayed correctly.

I believe it's related to setlocale and the difference between the OEM and ANSI
encodings, which we have on Windows with the Russian locale.
The startup code of psql sets the locale with the call setlocale(LC_ALL, "") and
the MSDN documentation says of that call:
Sets the locale to the default, which is the user-default ANSI code page
obtained from the operating system.

After the call, all the strings printed with printf(stdout) will go
through the ANSI->OEM conversion.

But in the simple_prompt function, strings are written to "con", and such
writes go without conversion.

Were you able to come up with some way to make this work?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3 Alexander LAW
exclusion@gmail.com
In reply to: Alvaro Herrera (#2)
1 attachment(s)
Re: BUG #6510: A simple prompt is displayed using wrong charset

I see two ways to resolve the issue.
The first is to use CharToOemBuff when writing a string to "con" and
OemToCharBuff when reading input from it.
The other is to always use stderr/stdin on Win32, as was done for
msys before. I think it's more straightforward.
I tested the attached patch (building the source with msvc) and it fixes
the issue. If it looks acceptable, then probably DEVTTY should not be
used on Windows at all.
I found two other references to DEVTTY, at
psql/command.c
success = saveHistory(fname ? fname : DEVTTY, -1, false, false);

and
contrib/pg_upgrade/option.c
log_opts.debug_fd = fopen(DEVTTY, "w");

By the way, is there any reason to use stderr for the prompt output rather
than stdout?

Regards,
Alexander

On 16.03.2012 23:13, Alvaro Herrera wrote:


Were you able to come up with some way to make this work?

Attachments:

sprompt.diff (text/x-patch)
--- original/src/port/sprompt.c	2012-02-24 02:53:36.000000000 +0400
+++ fix/src/port/sprompt.c	2012-03-17 18:52:15.010559900 +0400
@@ -60,14 +60,13 @@
 	 * Do not try to collapse these into one "w+" mode file. Doesn't work on
 	 * some platforms (eg, HPUX 10.20).
 	 */
+#ifdef WIN32
+	termin = stdin;
+	termout = stderr;
+#else
 	termin = fopen(DEVTTY, "r");
 	termout = fopen(DEVTTY, "w");
-	if (!termin || !termout
-#ifdef WIN32
-	/* See DEVTTY comment for msys */
-		|| (getenv("OSTYPE") && strcmp(getenv("OSTYPE"), "msys") == 0)
-#endif
-		)
+	if (!termin || !termout)
 	{
 		if (termin)
 			fclose(termin);
@@ -76,6 +75,7 @@
 		termin = stdin;
 		termout = stderr;
 	}
+#endif
 
 #ifdef HAVE_TERMIOS_H
 	if (!echo)
#4 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Alexander LAW (#3)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Excerpts from Alexander LAW's message of Sun Mar 18 06:04:51 -0300 2012:

I see two ways to resolve the issue.
First is to use CharToOemBuff when writing a string to the "con" and
OemToCharBuff when reading an input from it.
The other is to always use stderr/stdin for Win32 as it was done for
msys before. I think it's more straightforward.

Using the console directly instead of stdin/out/err is more appropriate when
asking for passwords and reading them back, because you can redirect the
rest of the output to/from files or pipes without the prompt
interfering with that. This also explains why stderr is used instead of
stdout.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5 Alexander LAW
exclusion@gmail.com
In reply to: Alvaro Herrera (#4)
1 attachment(s)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Thanks, I've understood your point.
Please look at the patch. It implements the first way, and it makes psql
work too.

Regards,
Alexander

On 20.03.2012 00:05, Alvaro Herrera wrote:



Using the console directly instead of stdin/out/err is more appropriate when
asking for passwords and reading them back, because you can redirect the
rest of the output to/from files or pipes without the prompt
interfering with that. This also explains why stderr is used instead of
stdout.

Attachments:

sprompt.diff (text/x-patch)
--- a/src/port/sprompt.c	2012-02-24 02:53:36.000000000 +0400
+++ b/src/port/sprompt.c	2012-03-20 23:34:22.181979200 +0400
@@ -41,6 +41,7 @@
 	char	   *destination;
 	FILE	   *termin,
 			   *termout;
+	char	   *_prompt;
 
 #ifdef HAVE_TERMIOS_H
 	struct termios t_orig,
@@ -49,6 +50,7 @@
 #ifdef WIN32
 	HANDLE		t = NULL;
 	LPDWORD		t_orig = NULL;
+	char	   *oem_prompt = NULL;
 #endif
 #endif
 
@@ -104,7 +106,20 @@
 
 	if (prompt)
 	{
-		fputs(_(prompt), termout);
+		_prompt = _(prompt);
+#ifdef WIN32
+		if ((termout != stderr) && (termout != stdout))
+		{
+			oem_prompt = (char*) malloc(strlen(_prompt) + 1);
+			CharToOemBuff(_prompt, oem_prompt, strlen(_prompt) + 1);
+			_prompt = oem_prompt;
+		}
+#endif
+		fputs(_prompt, termout);
+#ifdef WIN32
+		if (oem_prompt)
+			free(oem_prompt);
+#endif
 		fflush(termout);
 	}
 
@@ -130,6 +145,13 @@
 		/* remove trailing newline */
 		destination[length - 1] = '\0';
 
+#ifdef WIN32
+	if (termin != stdin)
+	{
+		OemToCharBuff(destination, destination, strlen(destination) + 1);
+	}
+#endif
+
 #ifdef HAVE_TERMIOS_H
 	if (!echo)
 	{
#6 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Alexander LAW (#5)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Excerpts from Alexander LAW's message of Tue Mar 20 16:50:14 -0300 2012:

Thanks, I've understood your point.
Please look at the patch. It implements the first way, and it makes psql
work too.

Great, thanks. Hopefully somebody with Windows-compile abilities will
have a look at this.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7 Alexander Law
exclusion@gmail.com
In reply to: Alvaro Herrera (#4)
Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

Hello,

The dump file itself is correct. The issue is only with the non-ASCII
object names in pg_dump messages.
The message text (which is non-ASCII too) is displayed consistently in
the right encoding (i.e. the OS encoding, thanks to libintl/gettext), but
the encoding of db object names depends on the dump encoding, and thus
they become unreadable when a different encoding is used.
The same can be reproduced on Linux (where the console encoding is UTF-8)
when doing a dump with Windows-1251 or Latin1 (for western European
languages).

Thanks,
Alexander

The following bug has been logged on the website:

Bug reference: 6742
Logged by: Alexander LAW
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 9.1.4
Operating system: Windows
Description:

When I try to dump a database with UTF-8 encoding on Windows, I get unreadable
object names.
Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). In the
left window all the pg_dump messages are displayed correctly (except for the
password prompt (bug #6510)), but the non-ASCII object name is gibberish. In
the right window (where the dump is done with the Windows-1251 encoding (the
OS encoding for the Russian locale)) everything is right.

Did you check the dump file using an editor that can handle UTF-8?
The Windows console is not known for properly handling that encoding.

Thomas

#8 Alexander Law
exclusion@gmail.com
In reply to: Alexander Law (#7)
Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

Hello!

May I propose a solution and step up?

I've read the discussion of bug #5800 and here are my 2 cents.
To make things clear, let me give an example.
I am a PostgreSQL hosting provider and I let my customers create any databases they wish.
I have clients all over the world (so they can create databases with different encodings).

The question is: what do I (as admin) want to see in my postgresql log, containing errors from all the databases?
IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second, it should be as useful and complete as possible.

Now I see the following solutions.
A. We have different logfiles for each database, with different encodings.
Then all our logs will be readable, but we have to look at them one by one, and that's inconvenient at the least.
Moreover, our log reader should understand what encoding to use for each file.

B. We have one logfile with the operating system encoding.
The first downside is that the logs can be different for different OSes.
The second is that Windows has a non-Unicode system encoding.
Such an encoding can't represent all the national characters, so at best I will get ??? in the log.

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some conversion function.

I think that the last solution is the solution. What is your opinion?

In fact the problem exists even with a simple installation on Windows when you use a non-English locale.
So the solution would be useful for many of us.

Best regards,
Alexander

On 05/23/2012 09:15 AM, yi huang wrote:

I'm using postgresql 9.1.3 from debian squeeze-backports with
zh_CN.UTF-8 locale, i find my main log (which is
"/var/log/postgresql/postgresql-9.1-main.log") contains "???" which
indicate some sort of charset encoding problem.

It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the
system locale, and the PostgreSQL backends log in whatever encoding
their database is in. They all write to the same log file, producing a
log file full of mixed encoding data that'll choke many text editors.

If you force your editor to re-interpret the file according to the
encoding your database(s) are in, this may help.

In the future it's possible that this may be fixed by logging output to
different files on a per-database basis or by converting the text
encoding of log messages, but no agreement has been reached on the
correct approach and nobody has stepped up to implement it.

--
Craig Ringer

#9 Alexander Law
exclusion@gmail.com
In reply to: Alexander Law (#7)
Re: main log encoding problem

Hello!

May I propose a solution and step up?

I've read the discussion of bug #5800 and here are my 2 cents.
To make things clear, let me give an example.
I am a PostgreSQL hosting provider and I let my customers create any
databases they wish.
I have clients all over the world (so they can create databases with
different encodings).

The question is: what do I (as admin) want to see in my postgresql log,
containing errors from all the databases?
IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second,
it should be as useful and complete as possible.

Now I see the following solutions.
A. We have different logfiles for each database, with different encodings.
Then all our logs will be readable, but we have to look at them one by
one, and that's inconvenient at the least.
Moreover, our log reader should understand what encoding to use for each
file.

B. We have one logfile with the operating system encoding.
The first downside is that the logs can be different for different OSes.
The second is that Windows has a non-Unicode system encoding.
Such an encoding can't represent all the national characters, so at
best I will get ??? in the log.

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some
conversion function.

I think that the last solution is the solution. What is your opinion?

In fact the problem exists even with a simple installation on Windows
when you use a non-English locale.
So the solution would be useful for many of us.

Best regards,
Alexander

P.S. Sorry for the wrong subject in my previous message sent to
pgsql-general.

On 05/23/2012 09:15 AM, yi huang wrote:

I'm using postgresql 9.1.3 from debian squeeze-backports with
zh_CN.UTF-8 locale, i find my main log (which is
"/var/log/postgresql/postgresql-9.1-main.log") contains "???" which
indicate some sort of charset encoding problem.

It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the
system locale, and the PostgreSQL backends log in whatever encoding
their database is in. They all write to the same log file, producing a
log file full of mixed encoding data that'll choke many text editors.

If you force your editor to re-interpret the file according to the
encoding your database(s) are in, this may help.

In the future it's possible that this may be fixed by logging output to
different files on a per-database basis or by converting the text
encoding of log messages, but no agreement has been reached on the
correct approach and nobody has stepped up to implement it.

--
Craig Ringer

#10 Tatsuo Ishii
ishii@postgresql.org
In reply to: Alexander Law (#9)
Re: main log encoding problem

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some
conversion function.

I think that the last solution is the solution. What is your opinion?

I am thinking about a variant of C.

The problem with C is that converting from another encoding to UTF-8 is
not cheap, because it requires huge conversion tables. This may be a
serious problem on a busy server. Also it is possible that some
information is lost in this conversion, because there's no
guarantee that there is a one-to-one mapping between UTF-8 and other
encodings. Another problem with UTF-8 is that you have to choose *one*
locale when using your editor. This may or may not affect the handling of
strings in your editor.

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also no information loss happens in this
conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
information about the language.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#10)
Re: main log encoding problem

Tatsuo Ishii <ishii@postgresql.org> writes:

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also no information loss happens in this
conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
information about the language.

Um ... but ...

(1) nothing whatsoever can read MULE, except emacs and xemacs.

(2) there is more than one version of MULE (emacs versus xemacs,
not to mention any possible cross-version discrepancies).

(3) from a log volume standpoint, this could be pretty disastrous.

I'm not for a write-only solution, which is pretty much what this
would be.

regards, tom lane

#12 Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#11)
Re: main log encoding problem

Tatsuo Ishii <ishii@postgresql.org> writes:

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also no information loss happens in this
conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
information about the language.

Um ... but ...

(1) nothing whatsoever can read MULE, except emacs and xemacs.

(2) there is more than one version of MULE (emacs versus xemacs,
not to mention any possible cross-version discrepancies).

(3) from a log volume standpoint, this could be pretty disastrous.

I'm not for a write-only solution, which is pretty much what this
would be.

I'm not sure how long xemacs will survive (the last stable release of
xemacs was in 2009). Anyway, I'm not too worried about your
points, since it's easy to convert back from mule-internal-encoded
log files to the original mixed-encoding log file. No information
will be lost. Even converting to UTF-8 should be possible. My point
is, once the log file is converted to UTF-8, there's no way to convert
back to the original encoding log file.

Probably we should treat mule-internal-encoded log files as an internal
format, and have a utility which does the conversion from mule-internal to
UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#13 Craig Ringer
ringerc@ringerc.id.au
In reply to: Alexander Law (#9)
Re: main log encoding problem

On 07/18/2012 11:16 PM, Alexander Law wrote:

Hello!

May I propose a solution and step up?

I've read the discussion of bug #5800 and here are my 2 cents.
To make things clear, let me give an example.
I am a PostgreSQL hosting provider and I let my customers create
any databases they wish.
I have clients all over the world (so they can create databases with
different encodings).

The question is: what do I (as admin) want to see in my postgresql log,
containing errors from all the databases?
IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second,
it should be as useful and complete as possible.

Now I see the following solutions.
A. We have different logfiles for each database, with different encodings.
Then all our logs will be readable, but we have to look at them one by
one, and that's inconvenient at the least.
Moreover, our log reader should understand what encoding to use for
each file.

B. We have one logfile with the operating system encoding.
The first downside is that the logs can be different for different OSes.
The second is that Windows has a non-Unicode system encoding.
Such an encoding can't represent all the national characters, so
at best I will get ??? in the log.

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some
conversion function.

I think that the last solution is the solution. What is your opinion?

Implementing any of these isn't trivial - especially making sure
messages emitted to stderr from things like segfaults and dynamic linker
messages are always correct. Ensuring that the logging collector knows
when setlocale() has been called to change the encoding and translation
of system messages, handling the different logging output methods, etc -
it's going to be fiddly.

I have some performance concerns about the transcoding required for (b)
or (c), but realistically it's already the norm to convert all the data
sent to and from clients. Conversion for logging should not be a
significant additional burden. Conversion can be short-circuited out
when source and destination encodings are the same for the common case
of logging in utf-8 or to a dedicated file.

I suspect the eventual choice will be "all of the above":

- Default to (b) or (c), both have pros and cons. I favour (c) with a
UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are all
in the system locale.

- Allow (a) for people who have many different DBs in many different
encodings, do high volume logging, and want to avoid conversion
overhead. Let them deal with the mess, just provide an additional % code
for the encoding so they can name their per-DB log files to indicate the
encoding.

The main issue is just that code needs to be prototyped, cleaned up, and
submitted. So far nobody's cared enough to design it, build it, and get
it through patch review. I've just foolishly volunteered myself to work
on an automated crash-test system for virtual plug-pull testing, so I'm
not stepping up.

--
Craig Ringer

#14 Alexander Law
exclusion@gmail.com
In reply to: Tatsuo Ishii (#10)
Re: main log encoding problem

Hello,

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some
conversion function.

I think that the last solution is the solution. What is your opinion?

I am thinking about a variant of C.

The problem with C is that converting from another encoding to UTF-8 is
not cheap, because it requires huge conversion tables. This may be a
serious problem on a busy server. Also it is possible that some
information is lost in this conversion, because there's no
guarantee that there is a one-to-one mapping between UTF-8 and other
encodings. Another problem with UTF-8 is that you have to choose *one*
locale when using your editor. This may or may not affect the handling of
strings in your editor.

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also no information loss happens in this
conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
information about the language.
--

I believe that postgres has such conversion functions anyway. And they are
used for data conversion when we have clients (and databases) with
different encodings. So if they can be used for data, why not use
them for the relatively small amount of log messages?
And regarding the mule internal encoding - reading about Mule at
http://www.emacswiki.org/emacs/UnicodeEncoding I found:
"In future (probably Emacs 22), Mule will use an internal encoding which
is a UTF-8 encoding of a superset of Unicode."
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please
provide some examples of those that can result in lossy conversion?
Choosing UTF-8 in a viewer/editor is no big deal either. Most of them
detect UTF-8 automagically, and for the others a BOM can be added.

Best regards,
Alexander

#15 Alexander Law
exclusion@gmail.com
In reply to: Craig Ringer (#13)
Re: main log encoding problem

Hello,

Implementing any of these isn't trivial - especially making sure
messages emitted to stderr from things like segfaults and dynamic
linker messages are always correct. Ensuring that the logging
collector knows when setlocale() has been called to change the
encoding and translation of system messages, handling the different
logging output methods, etc - it's going to be fiddly.

I have some performance concerns about the transcoding required for
(b) or (c), but realistically it's already the norm to convert all the
data sent to and from clients. Conversion for logging should not be a
significant additional burden. Conversion can be short-circuited out
when source and destination encodings are the same for the common case
of logging in utf-8 or to a dedicated file.

The initial issue was that the log file contains messages in different
encodings. So transcoding is performed already, but it's not consistent,
and in my opinion this is the main problem.

I suspect the eventual choice will be "all of the above":

- Default to (b) or (c), both have pros and cons. I favour (c) with a
UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are
all in the system locale.

As I understand it, UTF-8 is the default encoding for databases. And even
when a database is in the system encoding, translated postgres messages
still come in UTF-8 and will go through the UTF-8 -> system locale
conversion within gettext.

- Allow (a) for people who have many different DBs in many different
encodings, do high volume logging, and want to avoid conversion
overhead. Let them deal with the mess, just provide an additional %
code for the encoding so they can name their per-DB log files to
indicate the encoding.

I think that the (a) solution can be an evolution of the logging mechanism
if there is a need for it.

The main issue is just that code needs to be prototyped, cleaned up,
and submitted. So far nobody's cared enough to design it, build it,
and get it through patch review. I've just foolishly volunteered
myself to work on an automated crash-test system for virtual plug-pull
testing, so I'm not stepping up.

I see your point, and I can prepare a prototype if the proposed (c)
solution seems reasonable enough and can be accepted.

Best regards,
Alexander

#16 Tatsuo Ishii
ishii@postgresql.org
In reply to: Alexander Law (#14)
Re: main log encoding problem

I am thinking about a variant of C.

The problem with C is that converting from another encoding to UTF-8 is
not cheap, because it requires huge conversion tables. This may be a
serious problem on a busy server. Also it is possible that some
information is lost in this conversion, because there's no
guarantee that there is a one-to-one mapping between UTF-8 and other
encodings. Another problem with UTF-8 is that you have to choose *one*
locale when using your editor. This may or may not affect the handling of
strings in your editor.

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to the mule-internal encoding is cheap, because no conversion
table is required. Also no information loss happens in this
conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
information about the language.
--

I believe that postgres has such conversion functions anyway. And they are
used for data conversion when we have clients (and databases) with
different encodings. So if they can be used for data, why not use
them for the relatively small amount of log messages?

Frontend/backend encoding conversion only happens when they are
different, while conversion for logs *always* happens. A busy database
could produce tons of logs (it is not unusual to log all SQL statements
for auditing purposes).

And regarding the mule internal encoding - reading about Mule at
http://www.emacswiki.org/emacs/UnicodeEncoding I found:
"In future (probably Emacs 22), Mule will use an internal encoding
which is a UTF-8 encoding of a superset of Unicode."
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please
provide some examples of those that can result in lossy conversion?

You can google "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find such an example. In this case PostgreSQL just throws
an error. For frontend/backend encoding conversion this is fine. But
what should we do for logs? Apparently we cannot throw an error there.

"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is: if kanji A in China,
B in Japan and C in Korea look "similar", unify A, B and C into D. This
is a great space saving:-) The price of this is the inability to do a
round-trip conversion. You can convert A, B or C to D, but you cannot
convert D back to A/B/C.

BTW, I'm not wedded to the mule-internal encoding. What we need here is a
"super" encoding which can include any existing encoding without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#17 Tatsuo Ishii
ishii@postgresql.org
In reply to: Alexander Law (#15)
Re: main log encoding problem

Hello,

Implementing any of these isn't trivial - especially making sure
messages emitted to stderr from things like segfaults and dynamic
linker messages are always correct. Ensuring that the logging
collector knows when setlocale() has been called to change the
encoding and translation of system messages, handling the different
logging output methods, etc - it's going to be fiddly.

I have some performance concerns about the transcoding required for
(b) or (c), but realistically it's already the norm to convert all the
data sent to and from clients. Conversion for logging should not be a
significant additional burden. Conversion can be short-circuited out
when source and destination encodings are the same for the common case
of logging in utf-8 or to a dedicated file.

The initial issue was that the log file contains messages in different
encodings. So transcoding is performed already, but it's not

This is not true. Transcoding happens only when PostgreSQL is built
with --enable-nls option (default is no nls).

consistent and in my opinion this is the main problem.

I suspect the eventual choice will be "all of the above":

- Default to (b) or (c), both have pros and cons. I favour (c) with a
- UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are
- all in the system locale.

As I understand it, UTF-8 is the default encoding for databases. And
even when a database is in the system encoding, translated postgres
messages still come in UTF-8 and will go through a UTF-8 -> system
locale conversion within gettext.

Again, this is not always true.


- Allow (a) for people who have many different DBs in many different
- encodings, do high volume logging, and want to avoid conversion
- overhead. Let them deal with the mess, just provide an additional %
- code for the encoding so they can name their per-DB log files to
- indicate the encoding.

I think that solution (a) can be an evolution of the logging
mechanism if there is a need for it.

The main issue is just that code needs to be prototyped, cleaned up,
and submitted. So far nobody's cared enough to design it, build it,
and get it through patch review. I've just foolishly volunteered
myself to work on an automated crash-test system for virtual plug-pull
testing, so I'm not stepping up.

I see your point, and I can prepare a prototype if the proposed (c)
solution seems reasonable enough and can be accepted.

Best regards,
Alexander

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#18Alexander Law
exclusion@gmail.com
In reply to: Tatsuo Ishii (#17)
Re: main log encoding problem

The initial issue was that the log file contains messages in different
encodings. So transcoding is performed already, but it's not

This is not true. Transcoding happens only when PostgreSQL is built
with --enable-nls option (default is no nls).

I'll restate the initial issue as I see it.
I have Windows and I'm installing PostgreSQL for Windows (latest
version, downloaded from EnterpriseDB). Then I create a database with
default settings (with UTF-8 encoding), do something wrong in my DB and
get such a log file with the two different encodings (UTF-8 and
Windows-1251 (ANSI)) and with localized postgres messages.

#19Alexander Law
exclusion@gmail.com
In reply to: Tatsuo Ishii (#16)
Re: main log encoding problem

And regarding mule internal encoding - reading about Mule
http://www.emacswiki.org/emacs/UnicodeEncoding I found:
/In future (probably Emacs 22), Mule will use an internal encoding
which is a UTF-8 encoding of a superset of Unicode. /
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please
provide some examples of those that can result in lossy conversion?

You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find an example. In this case PostgreSQL just throws an
error. For frontend/backend encoding conversion this is fine. But
what should we do for logs? Apparently we cannot throw an error here.

"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is: if kanji A in China,
B in Japan, and C in Korea look "similar", unify A, B and C into D. This
is a great space saving:-) The price of this is the inability to do a
round-trip conversion: you can convert A, B or C to D, but you cannot
convert D back to A, B or C.

BTW, I'm not wedded to the mule-internal encoding. What we need here is
a "super" encoding which could include all existing encodings without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.

Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by
default) to postgresql.conf, and use this encoding consistently within
the logging collector.
If this encoding is not available then fall back to 7-bit ASCII.
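The parameter proposed here would amount to something like the following in postgresql.conf. Note that log_encoding is a hypothetical setting sketched for illustration only; it is not an actual GUC:

```
# Hypothetical parameter from the proposal above -- not an actual GUC.
# Messages written by the logging collector would be transcoded to this
# encoding; if a message cannot be converted, the collector would fall
# back to UTF-8 (or to 7-bit ASCII).
log_encoding = 'UTF-8'        # proposed default
#log_encoding = ''            # would disable the transcoding entirely
```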

#20Tatsuo Ishii
ishii@postgresql.org
In reply to: Alexander Law (#19)
Re: main log encoding problem

You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find an example. In this case PostgreSQL just throws an
error. For frontend/backend encoding conversion this is fine. But
what should we do for logs? Apparently we cannot throw an error here.

"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is: if kanji A in China,
B in Japan, and C in Korea look "similar", unify A, B and C into D. This
is a great space saving:-) The price of this is the inability to do a
round-trip conversion: you can convert A, B or C to D, but you cannot
convert D back to A, B or C.

BTW, I'm not wedded to the mule-internal encoding. What we need here is
a "super" encoding which could include all existing encodings without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.

Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by
default) to postgresql.conf, and use this encoding consistently
within the logging collector.
If this encoding is not available then fall back to 7-bit ASCII.

What do you mean by "not available"?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#21Alexander Law
exclusion@gmail.com
In reply to: Tatsuo Ishii (#20)
Re: main log encoding problem

Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by
default) to postgresql.conf, and use this encoding consistently
within the logging collector.
If this encoding is not available then fall back to 7-bit ASCII.

What do you mean by "not available"?

Sorry, that was an inaccurate phrase. I meant "if the conversion to
this encoding is not available". For example, when we have a database
in EUC_JP and log_encoding set to Latin1. I think that we can even fall
back to UTF-8, as we can convert all encodings to it (with some
exceptions that you noted).

#22Tatsuo Ishii
ishii@postgresql.org
In reply to: Alexander Law (#21)
Re: main log encoding problem

Sorry, that was an inaccurate phrase. I meant "if the conversion to
this encoding is not available". For example, when we have a database
in EUC_JP and log_encoding set to Latin1. I think that we can even
fall back to UTF-8, as we can convert all encodings to it (with some
exceptions that you noted).

So, what you wanted to say here is:

"If the conversion to this encoding is not available then fall back to
UTF-8"

Am I correct?

Also is it possible to completely disable the feature?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#23Alban Hertroys
haramrae@gmail.com
In reply to: Alexander Law (#21)
Re: [GENERAL] main log encoding problem

On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote:

Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by
default) to postgresql.conf, and use this encoding consistently
within the logging collector.
If this encoding is not available then fall back to 7-bit ASCII.

What do you mean by "not available"?

Sorry, that was an inaccurate phrase. I meant "if the conversion to this
encoding is not available". For example, when we have a database in EUC_JP
and log_encoding set to Latin1. I think that we can even fall back to UTF-8,
as we can convert all encodings to it (with some exceptions that you noted).

I like Craig's idea of adding the client encoding to the log lines. A
possible problem with that (I'm not an encoding expert) is that a log
line like that will contain data about the database server meta-data
(log time, client encoding, etc) in the database default encoding and
database data (the logged query and user-supplied values) in the
client encoding. One option would be to use the client encoding for
the entire log line, but would that result in legible meta-data in
every encoding?

It appears that the primarly here is that SQL statements and
user-supplied data are being logged, while the log-file is a text file
in a fixed encoding.
Perhaps another solution would be to add the ability to log certain
types of information (not the core database server log info, of
course!) to a database/table so that each record can be stored in its
own encoding?
That way the transcoding doesn't have to take place until someone is
reading the log, you'd know what to transcode the data to (namely the
client_encoding of the reading session) and there isn't any issue of
transcoding errors while logging statements.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#24Alban Hertroys
haramrae@gmail.com
In reply to: Alban Hertroys (#23)
Re: [GENERAL] main log encoding problem

Yikes, messed up my grammar a bit I see!

On 19 July 2012 10:58, Alban Hertroys <haramrae@gmail.com> wrote:

I like Craig's idea of adding the client encoding to the log lines. A
possible problem with that (I'm not an encoding expert) is that a log
line like that will contain data about the database server meta-data
(log time, client encoding, etc) in the database default encoding and

...will contain meta-data about the database server (log time...

It appears that the primarly here is that SQL statements and

It appears the primary issue here...

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#25Alexander Law
exclusion@gmail.com
In reply to: Tatsuo Ishii (#22)
Re: main log encoding problem

Sorry, that was an inaccurate phrase. I meant "if the conversion to
this encoding is not available". For example, when we have a database
in EUC_JP and log_encoding set to Latin1. I think that we can even
fall back to UTF-8, as we can convert all encodings to it (with some
exceptions that you noted).

So, what you wanted to say here is:

"If the conversion to this encoding is not available then fall back to
UTF-8"

Am I correct?

Also is it possible to completely disable the feature?

Yes, you're correct. I think it could be disabled by setting
log_encoding = '', but if the parameter is missing then the feature
should be enabled (with UTF-8).

#26Alexander Law
exclusion@gmail.com
In reply to: Alban Hertroys (#23)
Re: [GENERAL] main log encoding problem

I like Craig's idea of adding the client encoding to the log lines. A
possible problem with that (I'm not an encoding expert) is that a log
line like that will contain data about the database server meta-data
(log time, client encoding, etc) in the database default encoding and
database data (the logged query and user-supplied values) in the
client encoding. One option would be to use the client encoding for
the entire log line, but would that result in legible meta-data in
every encoding?

I think then we would get non-human-readable logs. We would need yet
another tool to open and convert the log (and to strip the excessive
encoding specification from each line).

It appears that the primarly here is that SQL statements and
user-supplied data are being logged, while the log-file is a text file
in a fixed encoding.

Yes, and in my opinion there is nothing unusual about it. XML/HTML
are examples of text files with a fixed encoding that can contain
multi-language strings. UTF-8 is the default encoding for XML, and when
it's not good enough (as Tatsuo noted), you can still switch to another.

Perhaps another solution would be to add the ability to log certain
types of information (not the core database server log info, of
course!) to a database/table so that each record can be stored in its
own encoding?
That way the transcoding doesn't have to take place until someone is
reading the log, you'd know what to transcode the data to (namely the
client_encoding of the reading session) and there isn't any issue of
transcoding errors while logging statements.

I don't think that would be the simplest solution to the existing problem.
It can be another branch of evolution, but it doesn't answer the
question: what encoding to use for the core database server log?

#27Alban Hertroys
haramrae@gmail.com
In reply to: Alexander Law (#26)
Re: [GENERAL] main log encoding problem

On 19 July 2012 13:50, Alexander Law <exclusion@gmail.com> wrote:

I like Craig's idea of adding the client encoding to the log lines. A
possible problem with that (I'm not an encoding expert) is that a log
line like that will contain data about the database server meta-data
(log time, client encoding, etc) in the database default encoding and
database data (the logged query and user-supplied values) in the
client encoding. One option would be to use the client encoding for
the entire log line, but would that result in legible meta-data in
every encoding?

I think then we would get non-human-readable logs. We would need yet another
tool to open and convert the log (and to strip the excessive encoding
specification from each line).

Only the parts that contain user-supplied data in very different
encodings would not be "human readable", similar to what we already
have.

It appears that the primarly here is that SQL statements and
user-supplied data are being logged, while the log-file is a text file
in a fixed encoding.

Yes, and in my opinion there is nothing unusual about it. XML/HTML are
examples of text files with a fixed encoding that can contain multi-language
strings. UTF-8 is the default encoding for XML, and when it's not good
enough (as Tatsuo noted), you can still switch to another.

Yes, but in those examples it is acceptable that the application fails
to write the output. That, and the output needs to be converted to
various different client encodings (namely that of the visitor's
browser) anyway, so it does not really add any additional overhead.

This doesn't hold true for database server log files. Ideally, writing
those has to be reliable (how are you going to catch errors
otherwise?) and should not impact the performance of the database
server in a significant way (the less the better). The end result will
probably be somewhere in the middle.

Perhaps another solution would be to add the ability to log certain
types of information (not the core database server log info, of
course!) to a database/table so that each record can be stored in its
own encoding?
That way the transcoding doesn't have to take place until someone is
reading the log, you'd know what to transcode the data to (namely the
client_encoding of the reading session) and there isn't any issue of
transcoding errors while logging statements.

I don't think that would be the simplest solution to the existing problem.
It can be another branch of evolution, but it doesn't answer the question:
what encoding to use for the core database server log?

It makes that problem much easier. If you need the "human-readable"
logs, you can write those to a different log (namely one in the
database). The result is that the server can use pretty much any
encoding (or a mix of multiple!) to write its log files.

You'll need a query to read the human-readable logs of course, but
since they're in the database, all the tools you need are already
available to you.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#28Craig Ringer
ringerc@ringerc.id.au
In reply to: Tatsuo Ishii (#16)
Re: main log encoding problem

On 07/19/2012 03:24 PM, Tatsuo Ishii wrote:

BTW, I'm not wedded to the mule-internal encoding. What we need here is
a "super" encoding which could include all existing encodings without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.

Good point re unified chars. That was always a bad idea, and that's just
one of the issues it causes.

I think these difficult encodings are where logging to a dedicated file
per database is useful.

I'm not convinced that a weird and uncommon encoding is the answer. I
guess as an alternative for people for whom it's useful if it's low cost
in terms of complexity/maintenance/etc...

--
Craig Ringer

#29Craig Ringer
ringerc@ringerc.id.au
In reply to: Alban Hertroys (#23)
Re: [GENERAL] main log encoding problem

On 07/19/2012 04:58 PM, Alban Hertroys wrote:

On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote:

Ok, maybe the time of a real universal encoding has not yet come. Then
maybe we should just add a new parameter "log_encoding" (UTF-8 by
default) to postgresql.conf, and use this encoding consistently
within the logging collector.
If this encoding is not available then fall back to 7-bit ASCII.

What do you mean by "not available"?

Sorry, that was an inaccurate phrase. I meant "if the conversion to this
encoding is not available". For example, when we have a database in EUC_JP
and log_encoding set to Latin1. I think that we can even fall back to UTF-8,
as we can convert all encodings to it (with some exceptions that you noted).

I like Craig's idea of adding the client encoding to the log lines.

Nonono! Log *file* *names* when one-file-per-database is in use.

Encoding as a log line prefix is a terrible idea for all sorts of reasons.
--
Craig Ringer

#30Alexander Law
exclusion@gmail.com
In reply to: Alexander Law (#7)
Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

Hello,
I would like to fix this bug, but it looks like it would not be a
one-line patch.
Looking at the pg_dump code I see that the object names come through the
following chain:
1. pg_dump executes 'SELECT c.tableoid, c.oid, c.relname, ... ' and gets
the object_name with the encoding chosen for db connection/dump.
2. it invokes the write_msg function or the like:
write_msg(NULL, "finding the columns and types of table \"%s\"\n",
tbinfo->dobj.name);
3. vwrite_msg localizes the text message, but not the argument(s):
vfprintf(stderr, _(fmt), ap);
Here gettext (_) internally translates fmt to the OS encoding (if it
differs from UTF-8, the encoding of the localized strings).

And I can see only a few solutions to the problem:
1. To convert the object name on the back-end, i.e. to modify all the
similar SELECTs like this:
'SELECT c.tableoid, c.oid, c.relname, convert_to(c.relname,
'OS_ENCODING') AS locrelname, ...'
and then do write_msg(NULL, "finding the columns and types of table
\"%s\"\n", tbinfo->dobj.local_name);
The downside of this approach is that it requires rewriting all the
SELECTs for all the objects. And it doesn't help us write out any
other text from the backend, such as localized backend errors.

2. To set up another connection to the backend with the OS encoding, and
to get all the object names through it. That looks insane too. And we
have the same problem with the localized backend errors coming in on the
"main" connection.

3. To make a convert_to_os_encoding(text, encoding) function for the
frontend utilities. Unfortunately the frontend can't use the internal
PostgreSQL conversion functions, and modifying them for use through
libpq looks unfeasible.
So the only way to implement such a function is to use another encoding
conversion framework (library).
And my question is: is it possible to include libiconv (add this
dependency) in the frontend utilities code?

4. To force users to use the OS encoding as the database encoding. Or
not to use non-ASCII characters in db object names and to disable NLS
on Windows completely. That doesn't look like a solution at all.

BTW, this is not the only instance of the issue. For example, when I
try to use vacuumdb, I get completely unreadable messages:
http://oi48.tinypic.com/1c8j9.jpg
(blue marks what is in Russian or English; all the other text is gibberish).

Best regards,
Alexander

18.07.2012 12:51, Alexander Law wrote:


Hello,

The dump file itself is correct. The issue is only with the non-ASCII
object names in pg_dump messages.
The message text (which is non-ASCII too) is displayed consistently in
the right encoding (i.e. in the OS encoding thanks to libintl/gettext),
but the encoding of db object names depends on the dump encoding, and
thus they become unreadable when a different encoding is used.
The same can be reproduced in Linux (where console encoding is UTF-8)
when doing dump with Windows-1251 or Latin1 (for western european
languages).

Thanks,
Alexander

The following bug has been logged on the website:

Bug reference: 6742
Logged by: Alexander LAW
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 9.1.4
Operating system: Windows
Description:

When I try to dump database with UTF-8 encoding in Windows, I get unreadable
object names.
Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). On the
left window all the pg_dump messages are displayed correctly (except for the
password prompt (bug #6510)), but the non-ASCII object name is gibberish. On
the right window (where dump is done with the Windows 1251 encoding (OS
Encoding for Russian locale)) everything is right.

Did you check the dump file using an editor that can handle UTF-8?
The Windows console is not known for properly handling that encoding.

Thomas

#31Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Law (#30)
Re: Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

On Wed, Jul 25, 2012 at 7:54 AM, Alexander Law <exclusion@gmail.com> wrote:

Hello,
I would like to fix this bug, but it looks like it would not be a
one-line patch.
Looking at the pg_dump code I see that the object names come through the
following chain:
1. pg_dump executes 'SELECT c.tableoid, c.oid, c.relname, ... ' and gets the
object_name with the encoding chosen for db connection/dump.
2. it invokes the write_msg function or the like:

write_msg(NULL, "finding the columns and types of table \"%s\"\n",
tbinfo->dobj.name);
3. vwrite_msg localizes the text message, but not the argument(s):
vfprintf(stderr, _(fmt), ap);
Here gettext (_) internally translates fmt to the OS encoding (if it
differs from UTF-8, the encoding of the localized strings).

And I can see only a few solutions to the problem:
1. To convert the object name on the back-end, i.e. to modify all the
similar SELECTs like this:
'SELECT c.tableoid, c.oid, c.relname, convert_to(c.relname, 'OS_ENCODING')
AS locrelname, ...'
and then do write_msg(NULL, "finding the columns and types of table
\"%s\"\n", tbinfo->dobj.local_name);
The downside of this approach is that it requires rewriting all the SELECTs
for all the objects. And it doesn't help us write out any other text from
the backend, such as localized backend errors.

2. To set up another connection to the backend with the OS encoding, and to
get all the object names through it. That looks insane too. And we have the
same problem with the localized backend errors coming in on the "main"
connection.

3. To make a convert_to_os_encoding(text, encoding) function for the
frontend utilities. Unfortunately the frontend can't use the internal
PostgreSQL conversion functions, and modifying them for use through libpq
looks unfeasible.
So the only way to implement such a function is to use another encoding
conversion framework (library).
And my question is: is it possible to include libiconv (add this
dependency) in the frontend utilities code?

4. To force users to use the OS encoding as the database encoding. Or not
to use non-ASCII characters in db object names and to disable NLS on
Windows completely. That doesn't look like a solution at all.

I think if you're going to try to do something about this, #1 is
probably the best option.

It does sound like a lot of work, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Noah Misch
noah@leadboat.com
In reply to: Alexander LAW (#5)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Hi Alexander,

I was able to reproduce the problem based on your description and test case,
and your change does resolve it for me.

On Tue, Mar 20, 2012 at 11:50:14PM +0400, Alexander LAW wrote:

Thanks, I've understood your point.
Please look at the patch. It implements the first way and it makes psql
work too.

20.03.2012 00:05, Alvaro Herrera wrote:

Excerpts from Alexander LAW's message of dom mar 18 06:04:51 -0300 2012:

I see two ways to resolve the issue.
First is to use CharToOemBuff when writing a string to the "con" and
OemToCharBuff when reading an input from it.
The other is to always use stderr/stdin for Win32 as it was done for
msys before. I think it's more straightforward.

Using console directly instead of stdin/out/err is more appropriate when
asking for passwords and reading them back, because you can redirect the
rest of the output to/from files or pipes, without the prompt
interfering with that. This also explains why stderr is used instead of
stdout.

The console output code page will usually match the OEM code page, but this is
not guaranteed. For example, one can change it with chcp.exe before starting
psql. The conversion should be to the actual console output code page. After
"chcp 869", notice how printing to stdout yields question marks while your
code yields unrelated characters.

It would be nicer still to find a way to make the output routines treat this
explicitly-opened console like stdout to a console. I could not find any
documentation around this. Digging into the CRT source code, I see that the
automatic code page conversion happens in write(). One of the tests write() uses
to determine whether the destination is a console is to call GetConsoleMode()
on the HANDLE underlying the CRT file descriptor. If that call fails, write()
assumes the target is not a console. GetConsoleMode() requires GENERIC_READ
access on its subject HANDLE, but the HANDLE resulting from our fopen("con",
"w") has only GENERIC_WRITE. Therefore, write() wrongly concludes that it's
writing to a non-console. fopen("con", "w+") fails, but fopen("CONOUT$",
"w+") seems to give the outcome we need. write() recognizes that it's writing
to a console and applies the code page conversion. Let's use that.

This gave me occasion to look at the special case for MSYS that you mentioned.
I observe the same behavior when running a native psql in a Cygwin xterm;
writes to the console succeed but do not appear anywhere. Instead of guessing
at console visibility based on an environment variable witnessing a particular
platform, let's check IsWindowVisible(GetConsoleWindow()).

What do you think of taking that approach?

Thanks,
nm

#33Alexander Law
exclusion@gmail.com
In reply to: Noah Misch (#32)
1 attachment(s)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Hi Noah,

Thank you for your review.
I agree with you, CONOUT$ way is much simpler. Please look at the patch.
Regarding msys - yes, that check was not correct.
In fact you can use "con" with msys, if you run sh.exe, not a graphical
terminal.
So the issue with con is not related to msys, but to some terminal
implementations.
Namely, I see that con is not supported by rxvt, mintty or xterm (from
the x.cygwin project).
(rxvt was the default terminal for msys 1.0.10, so I think this
behavior was considered an msys feature because of it)
Your solution to use IsWindowVisible(GetConsoleWindow()) works for these
terminals (I've made a simple test and it returns false for all of them),
but this check will not work for telnet (a console app running through
telnet can use con/conout).
Maybe this should be considered a distinct bug requiring another patch?
(I see no ideal solution for it yet. Probably it's possible to detect
these terminals rather than "ostype", though that would not be generic
either.)
And there is another issue with the console charset. When writing a
string to the console, the CRT converts it to the console encoding, but
when reading input back it doesn't. So it seems there should be a
conversion from the console code page to the ACP, and then probably to
UTF-8, to make the postgres utilities support national characters in
passwords or usernames (with createuser --interactive).
I think that can be fixed as another bug too.

Best regards,
Alexander

10.10.2012 15:05, Noah Misch wrote:


Hi Alexander,

The console output code page will usually match the OEM code page, but this is
not guaranteed. For example, one can change it with chcp.exe before starting
psql. The conversion should be to the actual console output code page. After
"chcp 869", notice how printing to stdout yields question marks while your
code yields unrelated characters.

It would be nicer still to find a way to make the output routines treat this
explicitly-opened console like stdout to a console. I could not find any
documentation around this. Digging into the CRT source code, I see that the
automatic code page conversion happens in write(). One of the tests write() uses
to determine whether the destination is a console is to call GetConsoleMode()
on the HANDLE underlying the CRT file descriptor. If that call fails, write()
assumes the target is not a console. GetConsoleMode() requires GENERIC_READ
access on its subject HANDLE, but the HANDLE resulting from our fopen("con",
"w") has only GENERIC_WRITE. Therefore, write() wrongly concludes that it's
writing to a non-console. fopen("con", "w+") fails, but fopen("CONOUT$",
"w+") seems to give the outcome we need. write() recognizes that it's writing
to a console and applies the code page conversion. Let's use that.

This gave me occasion to look at the special case for MSYS that you mentioned.
I observe the same behavior when running a native psql in a Cygwin xterm;
writes to the console succeed but do not appear anywhere. Instead of guessing
at console visibility based on an environment variable witnessing a particular
platform, let's check IsWindowVisible(GetConsoleWindow()).

What do you think of taking that approach?

Thanks,
nm

Attachments:

sprompt.diff (text/x-patch)
--- a/src/port/sprompt.c
+++ b/src/port/sprompt.c
@@ -60,8 +60,13 @@ simple_prompt(const char *prompt, int maxlen, bool echo)
 	 * Do not try to collapse these into one "w+" mode file. Doesn't work on
 	 * some platforms (eg, HPUX 10.20).
 	 */
+#ifdef WIN32
+	termin = fopen("CONIN$", "r");
+	termout = fopen("CONOUT$", "w+");
+#else
 	termin = fopen(DEVTTY, "r");
 	termout = fopen(DEVTTY, "w");
+#endif
 	if (!termin || !termout
 #ifdef WIN32
 	/* See DEVTTY comment for msys */
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Law (#33)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Alexander Law <exclusion@gmail.com> writes:

+#ifdef WIN32
+	termin = fopen("CONIN$", "r");
+	termout = fopen("CONOUT$", "w+");
+#else
termin = fopen(DEVTTY, "r");
termout = fopen(DEVTTY, "w");
+#endif
if (!termin || !termout

My immediate reaction to this patch is "that's a horrible kluge, why
shouldn't we change the definition of DEVTTY instead?" Is there a
similar issue in other places where we use DEVTTY?

Also, why did you change the termout output mode, is that important
or just randomness?

regards, tom lane

#35Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#33)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On Sun, Oct 14, 2012 at 10:35:04AM +0400, Alexander Law wrote:

I agree with you, CONOUT$ way is much simpler. Please look at the patch.

See comments below.

Regarding msys - yes, that check was not correct.
In fact you can use "con" with msys if you run sh.exe rather than a graphical
terminal, so the issue with con is not related to msys but to certain
terminal implementations.
Namely, I see that con is not supported by rxvt, mintty and xterm (from the
x.cygwin project).
(rxvt was the default terminal for msys 1.0.10, so I think this behavior was
attributed to msys because of that.)
Your solution to use IsWindowVisible(GetConsoleWindow()) works for these
terminals (I've made a simple test and it returns false for all of them),
but this check will not work for telnet (a console app running through
telnet can use con/conout).

Thanks for testing those environments. I can reproduce the distinctive
behavior when a Windows telnet client connects to a Windows telnet server.
When I connect to a Windows telnet server from a GNU/Linux system, I get the
normal invisible-console behavior.

I also get the invisible-console behavior in PowerShell ISE.

Maybe this should be treated as a distinct bug requiring another patch?
(I see no ideal solution for it yet. It's probably possible to detect these
terminals rather than testing "ostype", though that would not be generic
either.)

Using stdin/stderr when we could have used the console is a mild loss; use
cases involving redirected output will need to account for the abnormality.
Interacting with a user-invisible console is a large loss; prompts will hang
indefinitely. Therefore, the test should err on the side of stdin/stderr.

Since any change here seems to have its own trade-offs, yes, let's leave it
for a separate patch.

And there is another issue with the console charset. When writing a string
to a console, the CRT converts it to the console encoding, but when reading
input back it doesn't. So it seems there should be a conversion from
GetConsoleCP() to GetACP(), and then probably to UTF-8, to make the postgres
utilities support national characters in passwords or usernames (with
createuser --interactive).

Yes, that also deserves attention. I do not know whether converting to UTF-8
is correct. Given a username <foo> containing non-ASCII characters, you
should be able to input <foo> the same way for both "psql -U <foo>" and the
createuser prompt. We should also be thoughtful about backward compatibility.

I think it can be fixed as another bug too.

Agreed.

--- a/src/port/sprompt.c
+++ b/src/port/sprompt.c
@@ -60,8 +60,13 @@ simple_prompt(const char *prompt, int maxlen, bool echo)
* Do not try to collapse these into one "w+" mode file. Doesn't work on
* some platforms (eg, HPUX 10.20).
*/
+#ifdef WIN32
+	termin = fopen("CONIN$", "r");
+	termout = fopen("CONOUT$", "w+");

This definitely needs a block comment explaining the behaviors that led us to
select this particular implementation.

+#else
termin = fopen(DEVTTY, "r");
termout = fopen(DEVTTY, "w");

This thread has illustrated that the DEVTTY abstraction does not suffice. I
think we should remove it entirely. Remove it from port.h; use literal
"/dev/tty" here; re-add it as a local #define near the one remaining use, with
an XXX comment indicating that the usage is broken.

If it would help, I can prepare a version with the comment changes and
refactoring I have in mind.

+#endif
if (!termin || !termout
#ifdef WIN32
/* See DEVTTY comment for msys */

Thanks,
nm

#36Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#34)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On Sun, Oct 14, 2012 at 12:10:42PM -0400, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

+#ifdef WIN32
+	termin = fopen("CONIN$", "r");
+	termout = fopen("CONOUT$", "w+");
+#else
termin = fopen(DEVTTY, "r");
termout = fopen(DEVTTY, "w");
+#endif
if (!termin || !termout

My immediate reaction to this patch is "that's a horrible kluge, why
shouldn't we change the definition of DEVTTY instead?"

You could make DEVTTY_IN, DEVTTY_IN_MODE, DEVTTY_OUT and DEVTTY_OUT_MODE to
capture all the differences. That doesn't strike me as an improvement, and no
other function would use them at present. As I explained in my reply to
Alexander, we should instead remove DEVTTY.

Is there a
similar issue in other places where we use DEVTTY?

Yes. However, the other use of DEVTTY arises only with readline support, not
typical of native Windows builds.

Also, why did you change the termout output mode, is that important
or just randomness?

It's essential:
http://archives.postgresql.org/message-id/20121010110555.GA21405@tornado.leadboat.com

nm

#37Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#35)
1 attachment(s)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On Mon, Oct 15, 2012 at 05:41:36AM -0400, Noah Misch wrote:

--- a/src/port/sprompt.c
+++ b/src/port/sprompt.c
@@ -60,8 +60,13 @@ simple_prompt(const char *prompt, int maxlen, bool echo)
* Do not try to collapse these into one "w+" mode file. Doesn't work on
* some platforms (eg, HPUX 10.20).
*/
+#ifdef WIN32
+	termin = fopen("CONIN$", "r");
+	termout = fopen("CONOUT$", "w+");

This definitely needs a block comment explaining the behaviors that led us to
select this particular implementation.

+#else
termin = fopen(DEVTTY, "r");
termout = fopen(DEVTTY, "w");

This thread has illustrated that the DEVTTY abstraction does not suffice. I
think we should remove it entirely. Remove it from port.h; use literal
"/dev/tty" here; re-add it as a local #define near the one remaining use, with
an XXX comment indicating that the usage is broken.

If it would help, I can prepare a version with the comment changes and
refactoring I have in mind.

Following an off-list ack from Alexander, here is that version. No functional
differences from Alexander's latest version, and I have verified that it still
fixes the original test case. I'm marking this Ready for Committer.

To test this on an English (United States) copy of Windows 7, I made two
configuration changes in the "Region and Language" control panel. On the
"Administrative" tab, choose "Change system locale..." and select Russian
(Russia). After the reboot, choose "Russian (Russia)" on the "Format" tab.
(Neither of these changes will affect the display language of most Windows UI
components.) Finally, run "initdb -W testdatadir". Before the patch, the
password prompt contained some line-drawing characters and other garbage.
Afterward, it matches the string in src/bin/initdb/po/ru.po.

Thanks,
nm

Attachments:

simple_prompt-nonascii-v3.patch (text/plain; charset=us-ascii)
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
***************
*** 1043,1048 **** exec_command(const char *cmd,
--- 1043,1059 ----
  		char	   *fname = psql_scan_slash_option(scan_state,
  												   OT_NORMAL, NULL, true);
  
+ #if defined(WIN32) && !defined(__CYGWIN__)
+ 
+ 		/*
+ 		 * XXX This does not work for all terminal environments or for output
+ 		 * containing non-ASCII characters; see comments in simple_prompt().
+ 		 */
+ #define DEVTTY	"con"
+ #else
+ #define DEVTTY	"/dev/tty"
+ #endif
+ 
  		expand_tilde(&fname);
  		/* This scrolls off the screen when using /dev/tty */
  		success = saveHistory(fname ? fname : DEVTTY, -1, false, false);
*** a/src/include/port.h
--- b/src/include/port.h
***************
*** 110,120 **** extern BOOL AddUserToTokenDacl(HANDLE hToken);
  
  #if defined(WIN32) && !defined(__CYGWIN__)
  #define DEVNULL "nul"
- /* "con" does not work from the Msys 1.0.10 console (part of MinGW). */
- #define DEVTTY	"con"
  #else
  #define DEVNULL "/dev/null"
- #define DEVTTY "/dev/tty"
  #endif
  
  /*
--- 110,117 ----
*** a/src/port/sprompt.c
--- b/src/port/sprompt.c
***************
*** 56,70 **** simple_prompt(const char *prompt, int maxlen, bool echo)
  	if (!destination)
  		return NULL;
  
  	/*
  	 * Do not try to collapse these into one "w+" mode file. Doesn't work on
  	 * some platforms (eg, HPUX 10.20).
  	 */
! 	termin = fopen(DEVTTY, "r");
! 	termout = fopen(DEVTTY, "w");
  	if (!termin || !termout
  #ifdef WIN32
! 	/* See DEVTTY comment for msys */
  		|| (getenv("OSTYPE") && strcmp(getenv("OSTYPE"), "msys") == 0)
  #endif
  		)
--- 56,97 ----
  	if (!destination)
  		return NULL;
  
+ #ifdef WIN32
+ 
+ 	/*
+ 	 * A Windows console has an "input code page" and an "output code page";
+ 	 * these usually match each other, but they rarely match the "Windows ANSI
+ 	 * code page" defined at system boot and expected of "char *" arguments to
+ 	 * Windows API functions.  The Microsoft CRT write() implementation
+ 	 * automatically converts text between these code pages when writing to a
+ 	 * console.  To identify such file descriptors, it calls GetConsoleMode()
+ 	 * on the underlying HANDLE, which in turn requires GENERIC_READ access on
+ 	 * the HANDLE.  Opening termout in mode "w+" allows that detection to
+ 	 * succeed.  Otherwise, write() would not recognize the descriptor as a
+ 	 * console, and non-ASCII characters would display incorrectly.
+ 	 *
+ 	 * XXX fgets() still receives text in the console's input code page.  This
+ 	 * makes non-ASCII credentials unportable.
+ 	 */
+ 	termin = fopen("CONIN$", "r");
+ 	termout = fopen("CONOUT$", "w+");
+ #else
+ 
  	/*
  	 * Do not try to collapse these into one "w+" mode file. Doesn't work on
  	 * some platforms (eg, HPUX 10.20).
  	 */
! 	termin = fopen("/dev/tty", "r");
! 	termout = fopen("/dev/tty", "w");
! #endif
  	if (!termin || !termout
  #ifdef WIN32
! 	/*
! 	 * Direct console I/O does not work from the MSYS 1.0.10 console.  Writes
! 	 * reach nowhere user-visible; reads block indefinitely.  XXX This affects
! 	 * most Windows terminal environments, including rxvt, mintty, Cygwin
! 	 * xterm, Cygwin sshd, and PowerShell ISE.  Switch to a more-generic test.
! 	 */
  		|| (getenv("OSTYPE") && strcmp(getenv("OSTYPE"), "msys") == 0)
  #endif
  		)
#38Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Noah Misch (#37)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Noah Misch escribió:

Following an off-list ack from Alexander, here is that version. No functional
differences from Alexander's latest version, and I have verified that it still
fixes the original test case. I'm marking this Ready for Committer.

This seems good to me, but I'm not comfortable committing Windows stuff.
Andrew, Magnus, are you able to handle this?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#39Alexander Law
exclusion@gmail.com
In reply to: Alvaro Herrera (#38)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.
I would like to fix other bugs related to postgres localization, but I
am not sure yet how to do it.

Thanks in advance,
Alexander

18.10.2012 19:46, Alvaro Herrera wrote:

Noah Misch escribió:

Following an off-list ack from Alexander, here is that version. No functional
differences from Alexander's latest version, and I have verified that it still
fixes the original test case. I'm marking this Ready for Committer.

This seems good to me, but I'm not comfortable committing Windows stuff.
Andrew, Magnus, are you able to handle this?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Law (#39)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.
(FWIW, I have no objection to the patch as given, but I am unqualified
to evaluate how sane it is on Windows.)

regards, tom lane


#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#40)
Re: BUG #6510: A simple prompt is displayed using wrong charset

Tom Lane escribió:

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.
(FWIW, I have no objection to the patch as given, but I am unqualified
to evaluate how sane it is on Windows.)

I think part of the problem is that we no longer have many Windows-savvy
committers willing to spend time on Windows-specific patches; there are,
of course, people with enough knowledge, but they don't always
necessarily have much interest. Note that I added Magnus and Andrew to
this thread because they are known to have contributed Windows-specific
improvements, but they have yet to followup on this thread *at all*.

I remember back when Windows support was added, one of the arguments in
its favor was "it's going to attract lots of new developers". Yeah, right.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#42Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#41)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/23/2013 01:08 PM, Alvaro Herrera wrote:

Tom Lane escribió:

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.
(FWIW, I have no objection to the patch as given, but I am unqualified
to evaluate how sane it is on Windows.)

I think part of the problem is that we no longer have many Windows-savvy
committers willing to spend time on Windows-specific patches; there are,
of course, people with enough knowledge, but they don't always
necessarily have much interest. Note that I added Magnus and Andrew to
this thread because they are known to have contributed Windows-specific
improvements, but they have yet to followup on this thread *at all*.

I remember back when Windows support was added, one of the arguments in
its favor was "it's going to attract lots of new developers". Yeah, right.

I'm about to take a look at this one.

Note BTW that Craig Ringer has recently done some excellent work on
Windows, and there are several other active non-committers (e.g. Noah)
who comment on Windows too.

IIRC I never expected us to get a huge influx of developers from the
Windows world. What I did expect was a lot of new users, and I think we
have in fact got that.

cheers

andrew


#43Craig Ringer
craig@2ndQuadrant.com
In reply to: Tom Lane (#40)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/24/2013 01:34 AM, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.

I'm no committer, but I can work with Windows and know text encoding
issues fairly well. I'll take a look, though I can't do it immediately
as I have some other priority work. Should be able to in the next 24h.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#44Craig Ringer
craig@2ndQuadrant.com
In reply to: Alexander Law (#39)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.
I would like to fix other bugs related to postgres localization, but I
am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#45Noah Misch
noah@leadboat.com
In reply to: Craig Ringer (#43)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On Thu, Jan 24, 2013 at 03:45:36PM +0800, Craig Ringer wrote:

On 01/24/2013 01:34 AM, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.

I'm no committer, but I can work with Windows and know text encoding
issues fairly well. I'll take a look, though I can't do it immediately
as I have some other priority work. Should be able to in the next 24h.

Feel free, but it's already Ready for Committer. Also, Andrew has said he
plans to pick it up. If in doubt, I think this patch is relatively solid in
terms of non-committer review.


#46Craig Ringer
craig@2ndQuadrant.com
In reply to: Noah Misch (#45)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/24/2013 07:57 PM, Noah Misch wrote:

On Thu, Jan 24, 2013 at 03:45:36PM +0800, Craig Ringer wrote:

On 01/24/2013 01:34 AM, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.

It's waiting on some Windows-savvy committer to pick it up, IMO.

I'm no committer, but I can work with Windows and know text encoding
issues fairly well. I'll take a look, though I can't do it immediately
as I have some other priority work. Should be able to in the next 24h.

Feel free, but it's already Ready for Committer. Also, Andrew has said he
plans to pick it up. If in doubt, I think this patch is relatively solid in
terms of non-committer review.

I'm happy with that; I'll look elsewhere. Thanks.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#47Andrew Dunstan
andrew@dunslane.net
In reply to: Craig Ringer (#44)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/24/2013 03:42 AM, Craig Ringer wrote:

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902) committed.
I would like to fix other bugs related to postgres localization, but
I am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

Yeah.

I'm happy enough with this patch. ISTM it's really a bug and should be
backpatched, no?

cheers

andrew


#48Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#47)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote:

On 01/24/2013 03:42 AM, Craig Ringer wrote:

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902)
committed.
I would like to fix other bugs related to postgres localization, but
I am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

Yeah.

I'm happy enough with this patch. ISTM it's really a bug and should be
backpatched, no?

It is a bug, yes. I'm neutral on whether to backpatch.


#49Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#48)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/24/2013 11:19 AM, Noah Misch wrote:

On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote:

On 01/24/2013 03:42 AM, Craig Ringer wrote:

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902)
committed.
I would like to fix other bugs related to postgres localization, but
I am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

Yeah.

I'm happy enough with this patch. ISTM it's really a bug and should be
backpatched, no?

It is a bug, yes. I'm neutral on whether to backpatch.

Well, that's what I did. :-)

cheers

andrew


#50Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#49)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 1/24/13 4:04 PM, Andrew Dunstan wrote:

On 01/24/2013 11:19 AM, Noah Misch wrote:

On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote:

On 01/24/2013 03:42 AM, Craig Ringer wrote:

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902)
committed.
I would like to fix other bugs related to postgres localization, but
I am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

Yeah.

I'm happy enough with this patch. ISTM it's really a bug and should be
backpatched, no?

It is a bug, yes. I'm neutral on whether to backpatch.

Well, that's what I did. :-)

The 9.0 and 9.1 branches are now failing to build.


#51Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#50)
Re: BUG #6510: A simple prompt is displayed using wrong charset

On 01/25/2013 10:26 AM, Peter Eisentraut wrote:

On 1/24/13 4:04 PM, Andrew Dunstan wrote:

On 01/24/2013 11:19 AM, Noah Misch wrote:

On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote:

On 01/24/2013 03:42 AM, Craig Ringer wrote:

On 01/24/2013 01:06 AM, Alexander Law wrote:

Hello,
Please let me know if I can do something to get the bug fix
(https://commitfest.postgresql.org/action/patch_view?id=902)
committed.
I would like to fix other bugs related to postgres localization, but
I am not sure yet how to do it.

For anyone looking for the history, the 1st post on this topic is here:

/messages/by-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org

Yeah.

I'm happy enough with this patch. ISTM it's really a bug and should be
backpatched, no?

It is a bug, yes. I'm neutral on whether to backpatch.

Well, that's what I did. :-)

The 9.0 and 9.1 branches are now failing to build.

Yeah, sorry, working on it.

cheers

andrew


#52Alexander Law
exclusion@gmail.com
In reply to: Andrew Dunstan (#49)
1 attachment(s)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Hello,
Thanks for fixing bug #6510!
Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

Best regards,
Alexander

Attachments:

0001-Fix-postmaster-messages-encoding.patch (text/x-patch)
From 1e2d5f712744d4731b665724703c0da4971ea41e Mon Sep 17 00:00:00 2001
From: Alexander Lakhin <exclusion@gmail.com>
Date: Mon, 28 Jan 2013 08:19:34 +0400
Subject: Fix postmaster messages encoding

---
 src/backend/main/main.c |    6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 src/backend/main/main.c

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 1173bda..b79a483
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -89,6 +89,12 @@ main(int argc, char *argv[])
 	pgwin32_install_crashdump_handler();
 #endif
 
+    /*
+    * Use the platform encoding until the process connects to a database
+    * and sets the appropriate encoding.
+    */
+	SetDatabaseEncoding(GetPlatformEncoding());
+
 	/*
 	 * Set up locale information from environment.	Note that LC_CTYPE and
 	 * LC_COLLATE will be overridden later from pg_control if we are in an
-- 
1.7.10.4



#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Law (#52)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

That patch looks entirely unsafe to me. Neither of those functions
should be expected to be able to run when none of our standard
infrastructure (palloc, elog) is up yet.

Possibly it would be safe to do this somewhere around where we do
GUC initialization.

regards, tom lane


#54Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#53)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

That patch looks entirely unsafe to me. Neither of those functions
should be expected to be able to run when none of our standard
infrastructure (palloc, elog) is up yet.

Possibly it would be safe to do this somewhere around where we do
GUC initialization.

Even then, I wouldn't be surprised to find problematic consequences beyond
error display. What if all the databases are EUC_JP, the platform encoding is
KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the
postmaster not rely on its use of SQL_ASCII to allow those values?

I would look at fixing this by making the error output machinery smarter in
this area before changing the postmaster's notion of server_encoding.


#55Alexander Law
exclusion@gmail.com
In reply to: Noah Misch (#54)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

30.01.2013 05:51, Noah Misch wrote:

On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

That patch looks entirely unsafe to me. Neither of those functions
should be expected to be able to run when none of our standard
infrastructure (palloc, elog) is up yet.

Possibly it would be safe to do this somewhere around where we do
GUC initialization.

Looking at elog.c:write_console, bootstrap.c:AuxiliaryProcessMain and
mcxt.c:MemoryContextInit, I would place this call
(SetDatabaseEncoding(GetPlatformEncoding())) in MemoryContextInit.
(The pgwin32_toUTF16 conversion branch is not taken while
CurrentMemoryContext is null.)

But I see some calls to ereport before MemoryContextInit. Is that OK, or
should MemoryContext initialization be done earlier?
For example, main.c:main -> pgwin32_signal_initialize -> ereport

And there is another issue with elog.c:write_stderr:
if pgwin32_is_service(), the process writes the message to the Windows
event log (write_eventlog), trying to convert it to UTF16. But it doesn't
check the MemoryContext before the call to pgwin32_toUTF16 (as
write_console does), so we can get a crash in the following way:
main.c:check_root -> if (pgwin32_is_admin()) write_stderr -> if
(pgwin32_is_service()) write_eventlog -> if (GetDatabaseEncoding()
!= GetPlatformEncoding()) pgwin32_toUTF16 -> crash

So placing SetDatabaseEncoding(GetPlatformEncoding()) before the
check_root call could be a solution for the issue.

Even then, I wouldn't be surprised to find problematic consequences beyond
error display. What if all the databases are EUC_JP, the platform encoding is
KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the
postmaster not rely on its use of SQL_ASCII to allow those values?

I would look at fixing this by making the error output machinery smarter in
this area before changing the postmaster's notion of server_encoding.

Maybe I'm still missing something, but I thought that
postinit.c/CheckMyDatabase would switch the message encoding to EUC_JP
via pg_bind_textdomain_codeset, so there would be no issues with it.
But until then KOI8 should be used.
Regarding postgresql.conf: as it has no explicit encoding specification,
it should be interpreted as being in the platform encoding. So in your
example it should contain KOI8, not EUC_JP, characters.

Thanks,
Alexander


#56Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#55)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Wed, Jan 30, 2013 at 10:00:01AM +0400, Alexander Law wrote:

30.01.2013 05:51, Noah Misch wrote:

On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

Even then, I wouldn't be surprised to find problematic consequences beyond
error display. What if all the databases are EUC_JP, the platform encoding is
KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the
postmaster not rely on its use of SQL_ASCII to allow those values?

I would look at fixing this by making the error output machinery smarter in
this area before changing the postmaster's notion of server_encoding.

With your proposed change, the problem will resurface in an actual SQL_ASCII
database. At the problem's root is write_console()'s assumption that messages
are in the database encoding. pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry. That
excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to
behave differently in such cases.

Maybe I'm still missing something, but I thought that
postinit.c/CheckMyDatabase will switch the encoding of messages to
EUC_JP via pg_bind_textdomain_codeset, so there will be no issues with it.
But until then, KOI8 should be used.
Regarding postgresql.conf, as it has no explicit encoding specification,
it should be interpreted as having the platform encoding. So in your
example it should contain KOI8, not EUC_JP, characters.

Following some actual testing, I see that we treat postgresql.conf values as
byte sequences; any reinterpretation as encoded text happens later. Hence,
contrary to my earlier suspicion, your patch does not make that situation
worse. The present situation is bad; among other things, current_setting() is
a vector for injecting invalid text data. But unconditionally validating
postgresql.conf values in the platform encoding would not be an improvement.
Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish
to put KOI8R strings in a GUC, say search_path. That's possible today; if we
required that postgresql.conf conform to the platform encoding and no other,
it would become impossible. This area warrants improvement, but doing so will
entail careful design.

Thanks,
nm


#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#56)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Noah Misch <noah@leadboat.com> writes:

Following some actual testing, I see that we treat postgresql.conf values as
byte sequences; any reinterpretation as encoded text happens later. Hence,
contrary to my earlier suspicion, your patch does not make that situation
worse. The present situation is bad; among other things, current_setting() is
a vector for injecting invalid text data. But unconditionally validating
postgresql.conf values in the platform encoding would not be an improvement.
Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish
to put KOI8R strings in a GUC, say search_path. That's possible today; if we
required that postgresql.conf conform to the platform encoding and no other,
it would become impossible. This area warrants improvement, but doing so will
entail careful design.

The key problem, ISTM, is that it's not at all clear what encoding to
expect the incoming data to be in. I'm concerned about trying to fix
that by assuming it's in some "platform encoding" --- for one thing,
while that might be a well-defined concept on Windows, I don't believe
it is anywhere else.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

There would still be a lot of practical problems to solve, like what to
do if we fail to convert some string into the database encoding. But at
least the problems would be somewhat well-defined.

While we're thinking about this, it'd be nice to fix our handling (or
rather lack of handling) of encoding considerations for database names,
user names, and passwords. I could imagine adding some sort of encoding
marker to connection request packets, which could fix the don't-know-
the-encoding problem as far as incoming data is concerned. But how
shall we deal with storing the strings in shared catalogs, which have to
be readable from multiple databases possibly of different encodings?

regards, tom lane


#58Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#57)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

Following some actual testing, I see that we treat postgresql.conf values as
byte sequences; any reinterpretation as encoded text happens later. Hence,
contrary to my earlier suspicion, your patch does not make that situation
worse. The present situation is bad; among other things, current_setting() is
a vector for injecting invalid text data. But unconditionally validating
postgresql.conf values in the platform encoding would not be an improvement.
Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish
to put KOI8R strings in a GUC, say search_path. That's possible today; if we
required that postgresql.conf conform to the platform encoding and no other,
it would become impossible. This area warrants improvement, but doing so will
entail careful design.

The key problem, ISTM, is that it's not at all clear what encoding to
expect the incoming data to be in. I'm concerned about trying to fix
that by assuming it's in some "platform encoding" --- for one thing,
while that might be a well-defined concept on Windows, I don't believe
it is anywhere else.

GetPlatformEncoding() imposes a sufficiently-portable definition. I just
don't think that definition leads to a value that can be presumed desirable
and adequate for postgresql.conf.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

There would still be a lot of practical problems to solve, like what to
do if we fail to convert some string into the database encoding. But at
least the problems would be somewhat well-defined.

Agreed. That's a promising direction.

While we're thinking about this, it'd be nice to fix our handling (or
rather lack of handling) of encoding considerations for database names,
user names, and passwords. I could imagine adding some sort of encoding
marker to connection request packets, which could fix the don't-know-
the-encoding problem as far as incoming data is concerned.

That deserves a TODO entry under Wire Protocol Changes to avoid losing it.

But how
shall we deal with storing the strings in shared catalogs, which have to
be readable from multiple databases possibly of different encodings?

I suppose we would pick an encoding sufficient for all values we intend to
support (UTF8? MULE_INTERNAL?), then store the data in that encoding using
either bytea or a new type, say "omnitext".

Thanks,
nm


#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#58)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Noah Misch <noah@leadboat.com> writes:

On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote:

The key problem, ISTM, is that it's not at all clear what encoding to
expect the incoming data to be in. I'm concerned about trying to fix
that by assuming it's in some "platform encoding" --- for one thing,
while that might be a well-defined concept on Windows, I don't believe
it is anywhere else.

GetPlatformEncoding() imposes a sufficiently-portable definition. I just
don't think that definition leads to a value that can be presumed desirable
and adequate for postgresql.conf.

Nah, GetPlatformEncoding is utterly useless for this purpose --- restart
the server with some other environment, and bad things will happen.

regards, tom lane


#60Greg Stark
stark@mit.edu
In reply to: Tom Lane (#57)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-

But that said, I'm not sure saying the whole file is in one encoding is
the right approach. Paths are actually binary strings; any encoding is
purely for display purposes anyway. Log line formats could be treated
similarly if we choose.

Hostnames would need to be in a particular encoding if we're to
generate punycode, and I think that encoding would have to be UTF8.
Converting from some other encoding would be error-prone and
unnecessarily complex.

What parts of postgresql.conf are actually encoded strings that need
to be (and can be) manipulated as encoded strings?

--
greg


#61Noah Misch
noah@leadboat.com
In reply to: Greg Stark (#60)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote:

But that said, I'm not sure saying the whole file is in one encoding is
the right approach. Paths are actually binary strings; any encoding is
purely for display purposes anyway.

For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs
accept paths in the Windows ANSI code page and convert to UTF16 internally.
Nonetheless, good point.

What parts of postgresql.conf are actually encoded strings that need
to be (and can be) manipulated as encoded strings?

Mainly the ones that refer to arbitrary database objects. At least these:

default_tablespace
default_text_search_config
search_path
temp_tablespaces


#62Alexander Law
exclusion@gmail.com
In reply to: Noah Misch (#56)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Hello,

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

With your proposed change, the problem will resurface in an actual SQL_ASCII
database. At the problem's root is write_console()'s assumption that messages
are in the database encoding. pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry. That
excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to
behave differently in such cases.

Thank you for the notice. So it seems that the "DatabaseEncoding" variable
alone can't represent both the database encoding (for communication with a
client) and the encoding of the current process's messages (for logging) at
once. There should be another variable, something like
CurrentProcessEncoding, that will be set to the OS encoding at start and can
be changed to the encoding of a connected database (if
bind_textdomain_codeset succeeded).

On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote:

But that said, I'm not sure saying the whole file is in one encoding is
the right approach. Paths are actually binary strings; any encoding is
purely for display purposes anyway.

For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs
accept paths in the Windows ANSI code page and convert to UTF16 internally.
Nonetheless, good point.

Yes, and if postgresql.conf is not going to be UTF16-encoded, it seems
natural to use the ANSI code page on Windows to write such paths in it.
So the paths should be written in the OS encoding, which is accepted by OS
functions such as fopen. (This is what we have now.)
And it seems too complicated to have different encodings in one file. Or
maybe path parameters should be separated from the others, for which the OS
encoding is undesirable.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-

Explicit encoding specifications such as these (or even <?xml
version="1.0" encoding="utf-8"?>) can be useful, but what encoding should we
assume without one? For XML (without a BOM) it's UTF-8; for Emacs it
depends on its language environment.
If postgresql.conf doesn't have to be portable (as XML does), then IMO the
OS encoding is the right choice for it.

Best regards,
Alexander

#63Peter Eisentraut
peter_e@gmx.net
In reply to: Greg Stark (#60)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On 2/11/13 10:22 PM, Greg Stark wrote:

On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-

Yes, or more generally perhaps what Python does:
http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations

(In Python 2, the default is ASCII, in Python 3, the default is UTF8.)

But that said I'm not sure saying the whole file is in an encoding is
the right approach. Paths are actually binary strings. any encoding is
purely for display purposes anyways. Log line formats could be treated
similarly if we choose.

Well, at some point someone is going to open a configuration file in an
editor, and the editor will usually only be able to deal with one
encoding per file. So we need to make that work, even if, technically,
different considerations apply to different settings.


#64Craig Ringer
craig@2ndquadrant.com
In reply to: Peter Eisentraut (#63)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On 02/15/2013 12:45 AM, Peter Eisentraut wrote:

On 2/11/13 10:22 PM, Greg Stark wrote:

On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8' # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-

Yes, or more generally perhaps what Python does:
http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations

(In Python 2, the default is ASCII, in Python 3, the default is UTF8.)

Note that Python also respects a BOM in a UTF-8 file, treating the BOM as
flagging the file as UTF-8.

"In addition, if the first bytes of the file are the UTF-8 byte-order
mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8."

IMO we should do the same. If there's no explicit encoding declaration,
treat it as UTF-8 if there's a BOM and as the platform's local character
encoding otherwise.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#65Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#62)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Thu, Feb 14, 2013 at 11:11:13AM +0400, Alexander Law wrote:

Hello,

Alexander Law <exclusion@gmail.com> writes:

Please look at the following l10n bug:
/messages/by-id/502A26F1.6010109@gmail.com
and the proposed patch.

With your proposed change, the problem will resurface in an actual SQL_ASCII
database. At the problem's root is write_console()'s assumption that messages
are in the database encoding. pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry. That
excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to
behave differently in such cases.

Thank you for the notice. So it seems that the "DatabaseEncoding" variable
alone can't represent both the database encoding (for communication with a
client) and the encoding of the current process's messages (for logging) at
once. There should be another variable, something like
CurrentProcessEncoding, that will be set to the OS encoding at start and can
be changed to the encoding of a connected database (if
bind_textdomain_codeset succeeded).

I'd call it MessageEncoding unless it corresponds with similar rigor to a
broader concept.

On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote:

But that said, I'm not sure saying the whole file is in one encoding is
the right approach. Paths are actually binary strings; any encoding is
purely for display purposes anyway.

For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs
accept paths in the Windows ANSI code page and convert to UTF16 internally.
Nonetheless, good point.

Yes, and if postgresql.conf is not going to be UTF16-encoded, it seems
natural to use the ANSI code page on Windows to write such paths in it.
So the paths should be written in the OS encoding, which is accepted by OS
functions such as fopen. (This is what we have now.)

To the contrary, we would do better to use _wfopen() after converting from the
encoding at hand to UTF16. We should have the goal of removing our dependence
on the Windows ANSI code page, not tightening our bonds thereto. As long as
PostgreSQL uses fopen() on Windows, it will remain possible to create a file
that PostgreSQL cannot open. Making the full transition is probably a big
job, and we don't need to get there in one patch. Try, however, to avoid
patch designs that increase the distance left to cover.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#66Alexander Law
exclusion@gmail.com
In reply to: Noah Misch (#65)
1 attachment(s)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Hello,

15.02.2013 02:59, Noah Misch wrote:

With your proposed change, the problem will resurface in an actual SQL_ASCII
database. At the problem's root is write_console()'s assumption that messages
are in the database encoding. pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry. That
excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to
behave differently in such cases.

Thank you for the notice. So it seems that the "DatabaseEncoding" variable
alone can't represent both the database encoding (for communication with a
client) and the encoding of the current process's messages (for logging) at
once. There should be another variable, something like
CurrentProcessEncoding, that will be set to the OS encoding at start and can
be changed to the encoding of a connected database (if
bind_textdomain_codeset succeeded).

I'd call it MessageEncoding unless it corresponds with similar rigor to a
broader concept.

Please look at the next version of the patch.

Thanks,
Alexander

Attachments:

0001-Fix-postmaster-messages-encoding.patch (text/x-patch) [Download]
From 5bce21326d48761c6f86be8797432a69b2533dcd Mon Sep 17 00:00:00 2001
From: Alexander Lakhin <exclusion@gmail.com>
Date: Wed, 20 Feb 2013 15:34:05 +0400
Subject: Fix postmaster messages encoding

---
 src/backend/main/main.c        |    2 ++
 src/backend/utils/error/elog.c |    4 ++--
 src/backend/utils/mb/mbutils.c |   24 ++++++++++++++++++++++--
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 1173bda..ed4067e 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -100,6 +100,8 @@ main(int argc, char *argv[])
 
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres"));
 
+	SetMessageEncoding(GetPlatformEncoding());
+
 #ifdef WIN32
 
 	/*
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3a211bf..40f20f3 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -1868,7 +1868,7 @@ write_eventlog(int level, const char *line, int len)
 	 * Also verify that we are not on our way into error recursion trouble due
 	 * to error messages thrown deep inside pgwin32_toUTF16().
 	 */
-	if (GetDatabaseEncoding() != GetPlatformEncoding() &&
+	if (GetMessageEncoding() != GetPlatformEncoding() &&
 		!in_error_recursion_trouble())
 	{
 		utf16 = pgwin32_toUTF16(line, len, NULL);
@@ -1915,7 +1915,7 @@ write_console(const char *line, int len)
 	 * through to writing unconverted if we have not yet set up
 	 * CurrentMemoryContext.
 	 */
-	if (GetDatabaseEncoding() != GetPlatformEncoding() &&
+	if (GetMessageEncoding() != GetPlatformEncoding() &&
 		!in_error_recursion_trouble() &&
 		!redirection_done &&
 		CurrentMemoryContext != NULL)
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 287ff80..8b51b78 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -57,6 +57,7 @@ static FmgrInfo *ToClientConvProc = NULL;
  */
 static pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
 static pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
 static pg_enc2name *PlatformEncoding = NULL;
 
 /*
@@ -881,6 +882,16 @@ SetDatabaseEncoding(int encoding)
 	Assert(DatabaseEncoding->encoding == encoding);
 }
 
+void
+SetMessageEncoding(int encoding)
+{
+	if (!PG_VALID_BE_ENCODING(encoding))
+		elog(ERROR, "invalid message encoding: %d", encoding);
+
+	MessageEncoding = &pg_enc2name_tbl[encoding];
+	Assert(MessageEncoding->encoding == encoding);
+}
+
 /*
  * Bind gettext to the codeset equivalent with the database encoding.
  */
@@ -915,6 +926,8 @@ pg_bind_textdomain_codeset(const char *domainname)
 			if (bind_textdomain_codeset(domainname,
 										pg_enc2gettext_tbl[i].name) == NULL)
 				elog(LOG, "bind_textdomain_codeset failed");
+			else
+				SetMessageEncoding(encoding);
 			break;
 		}
 	}
@@ -964,6 +977,13 @@ GetPlatformEncoding(void)
 	return PlatformEncoding->encoding;
 }
 
+int
+GetMessageEncoding(void)
+{
+	Assert(MessageEncoding);
+	return MessageEncoding->encoding;
+}
+
 #ifdef WIN32
 
 /*
@@ -977,7 +997,7 @@ pgwin32_toUTF16(const char *str, int len, int *utf16len)
 	int			dstlen;
 	UINT		codepage;
 
-	codepage = pg_enc2name_tbl[GetDatabaseEncoding()].codepage;
+	codepage = pg_enc2name_tbl[GetMessageEncoding()].codepage;
 
 	/*
 	 * Use MultiByteToWideChar directly if there is a corresponding codepage,
@@ -994,7 +1014,7 @@ pgwin32_toUTF16(const char *str, int len, int *utf16len)
 		char	   *utf8;
 
 		utf8 = (char *) pg_do_encoding_conversion((unsigned char *) str,
-										len, GetDatabaseEncoding(), PG_UTF8);
+										len, GetMessageEncoding(), PG_UTF8);
 		if (utf8 != str)
 			len = strlen(utf8);
 
-- 
1.7.10.4

#67Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#66)
1 attachment(s)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Wed, Feb 20, 2013 at 04:00:00PM +0400, Alexander Law wrote:

15.02.2013 02:59, Noah Misch wrote:

With your proposed change, the problem will resurface in an actual SQL_ASCII
database. At the problem's root is write_console()'s assumption that messages
are in the database encoding. pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry. That
excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to
behave differently in such cases.

Thank you for the notice. So it seems that the "DatabaseEncoding" variable
alone can't represent both the database encoding (for communication with a
client) and the encoding of the current process's messages (for logging) at
once. There should be another variable, something like
CurrentProcessEncoding, that will be set to the OS encoding at start and can
be changed to the encoding of a connected database (if
bind_textdomain_codeset succeeded).

I'd call it MessageEncoding unless it corresponds with similar rigor to a
broader concept.

Please look at the next version of the patch.

I studied this patch. I like the APIs it adds, but the implementation details
required attention.

Your patch assumes the message encoding will be a backend encoding, but this
need not be so; locale "Japanese (Japan)" uses CP932 aka SJIS. I've also
added the non-backend encodings to pg_enc2gettext_tbl; I'd welcome sanity
checks from folks who know those encodings well.

write_console() still garbled the payload in all cases where it used write()
to a console instead of WriteConsoleW(). You can observe this by configuring
Windows for Russian, running initdb with default locale settings, and running
"select 1/0" in template1. See comments for the technical details; I fixed
this by removing the logic to preemptively skip WriteConsoleW() for
encoding-related reasons. (Over in write_eventlog(), I am suspicious of the
benefits we get from avoiding ReportEventW() where possible. I have not
removed that, though.)

--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -100,6 +100,8 @@ main(int argc, char *argv[])

set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres"));

+ SetMessageEncoding(GetPlatformEncoding());

SetMessageEncoding() and GetPlatformEncoding() can both elog()/ereport(), but
this is too early in the life of a PostgreSQL process for that.

The goal of this line of code is to capture gettext's default encoding, which
is platform-specific. On non-Windows platforms, it's the encoding implied by
LC_CTYPE. On Windows, it's the Windows ANSI code page. GetPlatformEncoding()
always gives the encoding implied by LC_CTYPE, which is therefore not correct
on Windows. You can observe broken output by setting your locale (in control
panel "Region and Language" -> Formats -> Format) to something unrelated to
your Windows ANSI code page (Region and Language -> Administrative -> Change
system locale...). Let's make PostgreSQL independent of gettext's
Windows-specific default by *always* calling bind_textdomain_codeset() on
Windows. In the postmaster and other non-database-attached processes, as well
as in SQL_ASCII databases, we would pass the encoding implied by LC_CTYPE.
Besides removing a discrepancy in PostgreSQL behavior between Windows and
non-Windows platforms, this has the great practical advantage of reducing
PostgreSQL's dependence on the deprecated Windows ANSI code page, which can
only be changed on a system-wide basis, by an administrator, after a reboot.

For message creation purposes, the encoding implied by LC_CTYPE on Windows is
somewhat arbitrary. Microsoft has deprecated them all in favor of UTF16, and
some locales (e.g. "Hindi (India)") have no legacy code page. I can see
adding a postmaster_encoding GUC to be used in place of consulting LC_CTYPE.
I haven't done that now; I just mention it for future reference.

On a !ENABLE_NLS build, messages are ASCII with database-encoding strings
sometimes interpolated. Therefore, such builds should keep the message
encoding equal to the database encoding.

The MessageEncoding was not maintained accurately on non-Windows systems. For
example, connecting to an lc_ctype=LATIN1, encoding=LATIN1 database does not
require a bind_textdomain_codeset() call on such systems, so we did not update
the message encoding. This was academic for the time being, because
MessageEncoding is only consulted on Windows.

The attached revision fixes all above points. Would you look it over? The
area was painfully light on comments, so I added some. I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding. GetPlatformEncoding() became unused,
so I removed it. (If we have cause to reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)

What should we do for the back branches, if anything? Fixes for garbled
display on consoles and event logs are fair to back-patch, but users might be
accustomed to the present situation for SQL_ASCII databases. Given the low
incidence of complaints and the workaround of using logging_collector, I am
inclined to put the whole thing in master only.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

Attachments:

w32-gettext-encoding-v4.patch (text/plain; charset=us-ascii) [Download]
*** a/src/backend/main/main.c
--- b/src/backend/main/main.c
***************
*** 265,270 **** startup_hacks(const char *progname)
--- 265,274 ----
  /*
   * Help display should match the options accepted by PostmasterMain()
   * and PostgresMain().
+  *
+  * XXX On Windows, non-ASCII localizations of these messages only display
+  * correctly if the console output code page covers the necessary characters.
+  * Messages emitted in write_console() do not exhibit this problem.
   */
  static void
  help(const char *progname)
*** a/src/backend/utils/adt/pg_locale.c
--- b/src/backend/utils/adt/pg_locale.c
***************
*** 131,144 **** static char *IsoLocaleName(const char *);		/* MSVC specific */
  /*
   * pg_perm_setlocale
   *
!  * This is identical to the libc function setlocale(), with the addition
!  * that if the operation is successful, the corresponding LC_XXX environment
!  * variable is set to match.  By setting the environment variable, we ensure
!  * that any subsequent use of setlocale(..., "") will preserve the settings
!  * made through this routine.  Of course, LC_ALL must also be unset to fully
!  * ensure that, but that has to be done elsewhere after all the individual
!  * LC_XXX variables have been set correctly.  (Thank you Perl for making this
!  * kluge necessary.)
   */
  char *
  pg_perm_setlocale(int category, const char *locale)
--- 131,146 ----
  /*
   * pg_perm_setlocale
   *
!  * This wraps the libc function setlocale(), with two additions.  First, when
!  * changing LC_CTYPE, update gettext's encoding for the current message
!  * domain.  GNU gettext automatically tracks LC_CTYPE on most platforms, but
!  * not on Windows.  Second, if the operation is successful, the corresponding
!  * LC_XXX environment variable is set to match.  By setting the environment
!  * variable, we ensure that any subsequent use of setlocale(..., "") will
!  * preserve the settings made through this routine.  Of course, LC_ALL must
!  * also be unset to fully ensure that, but that has to be done elsewhere after
!  * all the individual LC_XXX variables have been set correctly.  (Thank you
!  * Perl for making this kluge necessary.)
   */
  char *
  pg_perm_setlocale(int category, const char *locale)
***************
*** 172,177 **** pg_perm_setlocale(int category, const char *locale)
--- 174,195 ----
  	if (result == NULL)
  		return result;			/* fall out immediately on failure */
  
+ 	/*
+ 	 * Use the right encoding in translated messages.  Under ENABLE_NLS, let
+ 	 * pg_bind_textdomain_codeset() figure it out.  Under !ENABLE_NLS, message
+ 	 * format strings are ASCII, but database-encoding strings may enter the
+ 	 * message via %s.  This makes the overall message encoding equal to the
+ 	 * database encoding.
+ 	 */
+ 	if (category == LC_CTYPE)
+ 	{
+ #ifdef ENABLE_NLS
+ 		SetMessageEncoding(pg_bind_textdomain_codeset(textdomain(NULL)));
+ #else
+ 		SetMessageEncoding(GetDatabaseEncoding());
+ #endif
+ 	}
+ 
  	switch (category)
  	{
  		case LC_COLLATE:
*** a/src/backend/utils/error/elog.c
--- b/src/backend/utils/error/elog.c
***************
*** 1814,1819 **** write_syslog(int level, const char *line)
--- 1814,1835 ----
  
  #ifdef WIN32
  /*
+  * Get the PostgreSQL equivalent of the Windows ANSI code page.  "ANSI" system
+  * interfaces (e.g. CreateFileA()) expect string arguments in this encoding.
+  * Every process in a given system will find the same value at all times.
+  */
+ static int
+ GetACPEncoding(void)
+ {
+ 	static int	encoding = -2;
+ 
+ 	if (encoding == -2)
+ 		encoding = pg_codepage_to_encoding(GetACP());
+ 
+ 	return encoding;
+ }
+ 
+ /*
   * Write a message line to the windows event log
   */
  static void
***************
*** 1858,1873 **** write_eventlog(int level, const char *line, int len)
  	}
  
  	/*
! 	 * Convert message to UTF16 text and write it with ReportEventW, but
! 	 * fall-back into ReportEventA if conversion failed.
  	 *
  	 * Also verify that we are not on our way into error recursion trouble due
! 	 * to error messages thrown deep inside pgwin32_toUTF16().
  	 */
! 	if (GetDatabaseEncoding() != GetPlatformEncoding() &&
! 		!in_error_recursion_trouble())
  	{
! 		utf16 = pgwin32_toUTF16(line, len, NULL);
  		if (utf16)
  		{
  			ReportEventW(evtHandle,
--- 1874,1891 ----
  	}
  
  	/*
! 	 * If message character encoding matches the encoding expected by
! 	 * ReportEventA(), call it to avoid the hazards of conversion.  Otherwise,
! 	 * try to convert the message to UTF16 and write it with ReportEventW().
! 	 * Fall back on ReportEventA() if conversion failed.
  	 *
  	 * Also verify that we are not on our way into error recursion trouble due
! 	 * to error messages thrown deep inside pgwin32_message_to_UTF16().
  	 */
! 	if (!in_error_recursion_trouble() &&
! 		GetMessageEncoding() != GetACPEncoding())
  	{
! 		utf16 = pgwin32_message_to_UTF16(line, len, NULL);
  		if (utf16)
  		{
  			ReportEventW(evtHandle,
***************
*** 1879,1884 **** write_eventlog(int level, const char *line, int len)
--- 1897,1903 ----
  						 0,
  						 (LPCWSTR *) &utf16,
  						 NULL);
+ 			/* XXX Try ReportEventA() when ReportEventW() fails? */
  
  			pfree(utf16);
  			return;
***************
*** 1904,1925 **** write_console(const char *line, int len)
  #ifdef WIN32
  
  	/*
! 	 * WriteConsoleW() will fail if stdout is redirected, so just fall through
  	 * to writing unconverted to the logfile in this case.
  	 *
  	 * Since we palloc the structure required for conversion, also fall
  	 * through to writing unconverted if we have not yet set up
  	 * CurrentMemoryContext.
  	 */
! 	if (GetDatabaseEncoding() != GetPlatformEncoding() &&
! 		!in_error_recursion_trouble() &&
  		!redirection_done &&
  		CurrentMemoryContext != NULL)
  	{
  		WCHAR	   *utf16;
  		int			utf16len;
  
! 		utf16 = pgwin32_toUTF16(line, len, &utf16len);
  		if (utf16 != NULL)
  		{
  			HANDLE		stdHandle;
--- 1923,1952 ----
  #ifdef WIN32
  
  	/*
! 	 * Try to convert the message to UTF16 and write it with WriteConsoleW().
! 	 * Fall back on write() if anything fails.
! 	 *
! 	 * In contrast to write_eventlog(), don't skip straight to write() based
! 	 * on the applicable encodings.  Unlike WriteConsoleW(), write() depends
! 	 * on the suitability of the console output code page.  Since we put
! 	 * stderr into binary mode in SubPostmasterMain(), write() skips the
! 	 * necessary translation anyway.
! 	 *
! 	 * WriteConsoleW() will fail if stderr is redirected, so just fall through
  	 * to writing unconverted to the logfile in this case.
  	 *
  	 * Since we palloc the structure required for conversion, also fall
  	 * through to writing unconverted if we have not yet set up
  	 * CurrentMemoryContext.
  	 */
! 	if (!in_error_recursion_trouble() &&
  		!redirection_done &&
  		CurrentMemoryContext != NULL)
  	{
  		WCHAR	   *utf16;
  		int			utf16len;
  
! 		utf16 = pgwin32_message_to_UTF16(line, len, &utf16len);
  		if (utf16 != NULL)
  		{
  			HANDLE		stdHandle;
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 357,367 **** CheckMyDatabase(const char *name, bool am_superuser)
  	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_OVERRIDE);
  	SetConfigOption("lc_ctype", ctype, PGC_INTERNAL, PGC_S_OVERRIDE);
  
- 	/* Use the right encoding in translated messages */
- #ifdef ENABLE_NLS
- 	pg_bind_textdomain_codeset(textdomain(NULL));
- #endif
- 
  	ReleaseSysCache(tup);
  }
  
--- 357,362 ----
*** a/src/backend/utils/mb/encnames.c
--- b/src/backend/utils/mb/encnames.c
***************
*** 352,361 **** pg_enc2name pg_enc2name_tbl[] =
--- 352,364 ----
  
  /* ----------
   * These are encoding names for gettext.
+  *
+  * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
   * ----------
   */
  pg_enc2gettext pg_enc2gettext_tbl[] =
  {
+ 	{PG_SQL_ASCII, "US-ASCII"},
  	{PG_UTF8, "UTF-8"},
  	{PG_LATIN1, "LATIN1"},
  	{PG_LATIN2, "LATIN2"},
***************
*** 389,394 **** pg_enc2gettext pg_enc2gettext_tbl[] =
--- 392,404 ----
  	{PG_EUC_KR, "EUC-KR"},
  	{PG_EUC_TW, "EUC-TW"},
  	{PG_EUC_JIS_2004, "EUC-JP"},
+ 	{PG_SJIS, "SHIFT-JIS"},
+ 	{PG_BIG5, "BIG5"},
+ 	{PG_GBK, "GBK"},
+ 	{PG_UHC, "UHC"},
+ 	{PG_GB18030, "GB18030"},
+ 	{PG_JOHAB, "JOHAB"},
+ 	{PG_SHIFT_JIS_2004, "SHIFT_JISX0213"},
  	{0, NULL}
  };
  
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 53,63 **** static FmgrInfo *ToServerConvProc = NULL;
  static FmgrInfo *ToClientConvProc = NULL;
  
  /*
!  * These variables track the currently selected FE and BE encodings.
   */
  static pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
  static pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
! static pg_enc2name *PlatformEncoding = NULL;
  
  /*
   * During backend startup we can't set client encoding because we (a)
--- 53,63 ----
  static FmgrInfo *ToClientConvProc = NULL;
  
  /*
!  * These variables track the currently-selected encodings.
   */
  static pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
  static pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
! static pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
  
  /*
   * During backend startup we can't set client encoding because we (a)
***************
*** 881,926 **** SetDatabaseEncoding(int encoding)
  	Assert(DatabaseEncoding->encoding == encoding);
  }
  
- /*
-  * Bind gettext to the codeset equivalent with the database encoding.
-  */
  void
! pg_bind_textdomain_codeset(const char *domainname)
  {
! #if defined(ENABLE_NLS)
! 	int			encoding = GetDatabaseEncoding();
! 	int			i;
  
! 	/*
! 	 * gettext() uses the codeset specified by LC_CTYPE by default, so if that
! 	 * matches the database encoding we don't need to do anything. In CREATE
! 	 * DATABASE, we enforce or trust that the locale's codeset matches
! 	 * database encoding, except for the C locale. In C locale, we bind
! 	 * gettext() explicitly to the right codeset.
! 	 *
! 	 * On Windows, though, gettext() tends to get confused so we always bind
! 	 * it.
! 	 */
! #ifndef WIN32
! 	const char *ctype = setlocale(LC_CTYPE, NULL);
  
! 	if (pg_strcasecmp(ctype, "C") != 0 && pg_strcasecmp(ctype, "POSIX") != 0)
! 		return;
! #endif
  
  	for (i = 0; pg_enc2gettext_tbl[i].name != NULL; i++)
  	{
  		if (pg_enc2gettext_tbl[i].encoding == encoding)
  		{
  			if (bind_textdomain_codeset(domainname,
! 										pg_enc2gettext_tbl[i].name) == NULL)
  				elog(LOG, "bind_textdomain_codeset failed");
  			break;
  		}
  	}
  #endif
  }
  
  int
  GetDatabaseEncoding(void)
  {
--- 881,982 ----
  	Assert(DatabaseEncoding->encoding == encoding);
  }
  
  void
! SetMessageEncoding(int encoding)
  {
! 	/* Some calls happen before we can elog()! */
! 	Assert(PG_VALID_ENCODING(encoding));
  
! 	MessageEncoding = &pg_enc2name_tbl[encoding];
! 	Assert(MessageEncoding->encoding == encoding);
! }
  
! #ifdef ENABLE_NLS
! /*
!  * Make one bind_textdomain_codeset() call, translating a pg_enc to a gettext
!  * codeset.  Fails for MULE_INTERNAL, an encoding unknown to gettext; can also
!  * fail for gettext-internal causes like out-of-memory.
!  */
! static bool
! raw_pg_bind_textdomain_codeset(const char *domainname, int encoding)
! {
! 	bool		elog_ok = (CurrentMemoryContext != NULL);
! 	int			i;
  
  	for (i = 0; pg_enc2gettext_tbl[i].name != NULL; i++)
  	{
  		if (pg_enc2gettext_tbl[i].encoding == encoding)
  		{
  			if (bind_textdomain_codeset(domainname,
! 										pg_enc2gettext_tbl[i].name) != NULL)
! 				return true;
! 
! 			if (elog_ok)
  				elog(LOG, "bind_textdomain_codeset failed");
+ 			else
+ 				write_stderr("bind_textdomain_codeset failed");
+ 
  			break;
  		}
  	}
+ 
+ 	return false;
+ }
+ 
+ /*
+  * Bind a gettext message domain to the codeset corresponding to the database
+  * encoding.  For SQL_ASCII, instead bind to the codeset implied by LC_CTYPE.
+  * Return the MessageEncoding implied by the new settings.
+  *
+  * On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
+  * When that matches the database encoding, we don't need to do anything.  In
+  * CREATE DATABASE, we enforce or trust that the locale's codeset matches the
+  * database encoding, except for the C locale.  (On Windows, we also permit a
+  * discrepancy under the UTF8 encoding.)  For the C locale, explicitly bind
+  * gettext to the right codeset.
+  *
+  * On Windows, gettext defaults to the Windows ANSI code page.  This is a
+  * convenient departure for software that passes the strings to Windows ANSI
+  * APIs, but we don't do that.  Compel gettext to use database encoding or,
+  * failing that, the LC_CTYPE encoding as it would on other platforms.
+  *
+  * This function is called before elog() and palloc() are usable.
+  */
+ int
+ pg_bind_textdomain_codeset(const char *domainname)
+ {
+ 	bool		elog_ok = (CurrentMemoryContext != NULL);
+ 	int			encoding = GetDatabaseEncoding();
+ 	int			new_msgenc;
+ 
+ #ifndef WIN32
+ 	const char *ctype = setlocale(LC_CTYPE, NULL);
+ 
+ 	if (pg_strcasecmp(ctype, "C") == 0 || pg_strcasecmp(ctype, "POSIX") == 0)
  #endif
+ 		if (encoding != PG_SQL_ASCII &&
+ 			raw_pg_bind_textdomain_codeset(domainname, encoding))
+ 			return encoding;
+ 
+ 	new_msgenc = pg_get_encoding_from_locale(NULL, elog_ok);
+ 	if (new_msgenc < 0)
+ 		new_msgenc = PG_SQL_ASCII;
+ 
+ #ifdef WIN32
+ 	if (!raw_pg_bind_textdomain_codeset(domainname, new_msgenc))
+ 		/* On failure, the old message encoding remains valid. */
+ 		return GetMessageEncoding();
+ #endif
+ 
+ 	return new_msgenc;
  }
+ #endif
  
+ /*
+  * The database encoding, also called the server encoding, represents the
+  * encoding of data stored in text-like data types.  Affected types include
+  * cstring, text, varchar, name, xml, and json.
+  */
  int
  GetDatabaseEncoding(void)
  {
***************
*** 949,967 **** pg_client_encoding(PG_FUNCTION_ARGS)
  	return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name));
  }
  
  int
! GetPlatformEncoding(void)
  {
! 	if (PlatformEncoding == NULL)
! 	{
! 		/* try to determine encoding of server's environment locale */
! 		int			encoding = pg_get_encoding_from_locale("", true);
! 
! 		if (encoding < 0)
! 			encoding = PG_SQL_ASCII;
! 		PlatformEncoding = &pg_enc2name_tbl[encoding];
! 	}
! 	return PlatformEncoding->encoding;
  }
  
  #ifdef WIN32
--- 1005,1021 ----
  	return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name));
  }
  
+ /*
+  * gettext() returns messages in this encoding.  This often matches the
+  * database encoding, but it differs for SQL_ASCII databases, for processes
+  * not attached to a database, and under a database encoding lacking iconv
+  * support (MULE_INTERNAL).
+  */
  int
! GetMessageEncoding(void)
  {
! 	Assert(MessageEncoding);
! 	return MessageEncoding->encoding;
  }
  
  #ifdef WIN32
***************
*** 971,983 **** GetPlatformEncoding(void)
   * is also passed to utf16len if not null. Returns NULL iff failed.
   */
  WCHAR *
! pgwin32_toUTF16(const char *str, int len, int *utf16len)
  {
  	WCHAR	   *utf16;
  	int			dstlen;
  	UINT		codepage;
  
! 	codepage = pg_enc2name_tbl[GetDatabaseEncoding()].codepage;
  
  	/*
  	 * Use MultiByteToWideChar directly if there is a corresponding codepage,
--- 1025,1037 ----
   * is also passed to utf16len if not null. Returns NULL iff failed.
   */
  WCHAR *
! pgwin32_message_to_UTF16(const char *str, int len, int *utf16len)
  {
  	WCHAR	   *utf16;
  	int			dstlen;
  	UINT		codepage;
  
! 	codepage = pg_enc2name_tbl[GetMessageEncoding()].codepage;
  
  	/*
  	 * Use MultiByteToWideChar directly if there is a corresponding codepage,
***************
*** 994,1000 **** pgwin32_toUTF16(const char *str, int len, int *utf16len)
  		char	   *utf8;
  
  		utf8 = (char *) pg_do_encoding_conversion((unsigned char *) str,
! 										len, GetDatabaseEncoding(), PG_UTF8);
  		if (utf8 != str)
  			len = strlen(utf8);
  
--- 1048,1054 ----
  		char	   *utf8;
  
  		utf8 = (char *) pg_do_encoding_conversion((unsigned char *) str,
! 										 len, GetMessageEncoding(), PG_UTF8);
  		if (utf8 != str)
  			len = strlen(utf8);
  
*** a/src/include/mb/pg_wchar.h
--- b/src/include/mb/pg_wchar.h
***************
*** 481,488 **** extern const char *pg_get_client_encoding_name(void);
  extern void SetDatabaseEncoding(int encoding);
  extern int	GetDatabaseEncoding(void);
  extern const char *GetDatabaseEncodingName(void);
! extern int	GetPlatformEncoding(void);
! extern void pg_bind_textdomain_codeset(const char *domainname);
  
  extern int	pg_valid_client_encoding(const char *name);
  extern int	pg_valid_server_encoding(const char *name);
--- 481,492 ----
  extern void SetDatabaseEncoding(int encoding);
  extern int	GetDatabaseEncoding(void);
  extern const char *GetDatabaseEncodingName(void);
! extern void SetMessageEncoding(int encoding);
! extern int	GetMessageEncoding(void);
! 
! #ifdef ENABLE_NLS
! extern int	pg_bind_textdomain_codeset(const char *domainname);
! #endif
  
  extern int	pg_valid_client_encoding(const char *name);
  extern int	pg_valid_server_encoding(const char *name);
***************
*** 542,548 **** extern void mic2latin_with_table(const unsigned char *mic, unsigned char *p,
  extern bool pg_utf8_islegal(const unsigned char *source, int length);
  
  #ifdef WIN32
! extern WCHAR *pgwin32_toUTF16(const char *str, int len, int *utf16len);
  #endif
  
  #endif   /* PG_WCHAR_H */
--- 546,552 ----
  extern bool pg_utf8_islegal(const unsigned char *source, int length);
  
  #ifdef WIN32
! extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len);
  #endif
  
  #endif   /* PG_WCHAR_H */
*** a/src/include/port.h
--- b/src/include/port.h
***************
*** 452,457 **** extern void qsort_arg(void *base, size_t nel, size_t elsize,
--- 452,461 ----
  /* port/chklocale.c */
  extern int	pg_get_encoding_from_locale(const char *ctype, bool write_message);
  
+ #if defined(WIN32) && !defined(FRONTEND)
+ extern int	pg_codepage_to_encoding(UINT cp);
+ #endif
+ 
  /* port/inet_net_ntop.c */
  extern char *inet_net_ntop(int af, const void *src, int bits,
  			  char *dst, size_t size);
*** a/src/port/chklocale.c
--- b/src/port/chklocale.c
***************
*** 235,240 **** win32_langinfo(const char *ctype)
--- 235,266 ----
  
  	return r;
  }
+ 
+ #ifndef FRONTEND
+ /*
+  * Given a Windows code page identifier, find the corresponding PostgreSQL
+  * encoding.  Issue a warning and return -1 if none found.
+  */
+ int
+ pg_codepage_to_encoding(UINT cp)
+ {
+ 	char		sys[16];
+ 	int			i;
+ 
+ 	sprintf(sys, "CP%u", cp);
+ 
+ 	/* Check the table */
+ 	for (i = 0; encoding_match_list[i].system_enc_name; i++)
+ 		if (pg_strcasecmp(sys, encoding_match_list[i].system_enc_name) == 0)
+ 			return encoding_match_list[i].pg_enc_code;
+ 
+ 	ereport(WARNING,
+ 			(errmsg("could not determine encoding for codeset \"%s\"", sys),
+ 		   errdetail("Please report this to <pgsql-bugs@postgresql.org>.")));
+ 
+ 	return -1;
+ }
+ #endif
  #endif   /* WIN32 */
  
  #if (defined(HAVE_LANGINFO_H) && defined(CODESET)) || defined(WIN32)
***************
*** 248,253 **** win32_langinfo(const char *ctype)
--- 274,282 ----
   *
   * If the result is PG_SQL_ASCII, callers should treat it as being compatible
   * with any desired encoding.
+  *
+  * If running in the backend and write_message is false, this function must
+  * cope with the possibility that elog() and palloc() are not yet usable.
   */
  int
  pg_get_encoding_from_locale(const char *ctype, bool write_message)
#68Alexander Law
exclusion@gmail.com
In reply to: Noah Misch (#67)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

Hello Noah,

Thanks for your work; your patch is definitely better. I agree that this
approach is much more generic.
23.06.2013 20:53, Noah Misch wrote:

The attached revision fixes all above points. Would you look it over? The
area was painfully light on comments, so I added some. I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding. GetPlatformEncoding() became unused,
so I removed it. (If we have cause to reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)

Yes, the patch works for me. I have just a little question about
pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8
or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))?

What should we do for the back branches, if anything? Fixes for garbled
display on consoles and event logs are fair to back-patch, but users might be
accustomed to the present situation for SQL_ASCII databases. Given the low
incidence of complaints and the workaround of using logging_collector, I am
inclined to put the whole thing in master only.

I thought that the change could be a first step toward PostgreSQL log
encoding normalization. Today the log may contain messages in
different encodings (we had a long discussion a year ago:
/messages/by-id/5007C399.6000405@gmail.com).
The new function GetMessageEncoding now makes it possible to convert all the
messages consistently. If a future log encoding fix is considered
important enough to back-patch, then this patch should be back-patched too.
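(An illustration, not part of the patch: the mixed-encoding log problem can be
reproduced with any two single-byte Cyrillic encodings. The sketch below is
hypothetical Python, standing in for two backends that write the same word to
one log file in different encodings.)

```python
# Illustrative sketch (not PostgreSQL code): a log that mixes message
# encodings cannot be displayed correctly under any single code page.
word = "Ошибка"                      # Russian for "error"
line_koi8 = word.encode("koi8_r")    # one process logs in KOI8-R
line_1251 = word.encode("cp1251")    # another logs in Windows-1251

# The byte sequences disagree, so a viewer that picks one code page
# necessarily garbles the line written in the other encoding.
assert line_koi8 != line_1251
print(line_koi8.decode("cp1251"))    # mojibake, not "Ошибка"
```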

Thanks again!

Best regards,
Alexander

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#68)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote:

23.06.2013 20:53, Noah Misch wrote:

The attached revision fixes all above points. Would you look it over? The
area was painfully light on comments, so I added some. I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding. GetPlatformEncoding() became unused,
so I removed it. (If we have cause to reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)

Yes, the patch works for me. I have just a little question about
pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8
or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))?

Good question. SQL_ASCII is an awkward database encoding; it actually
indicates ignorance about the significance of byte sequences in text. As a
result, nothing we do here is likely to be delightful. I caused an error
containing some LATIN2 bytes (SELECT * FROM "śmiać się") in an SQL_ASCII /
LC_CTYPE=C database, and the result was decent, considering: the event log
shows a question-mark-in-rhombus symbol for each of the non-ASCII bytes.
Using CP20127 turned it into "6miaf sij". I don't have a great idea for
improving on the existing hack.
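(Side note, purely illustrative: the "6miaf sij" result is exactly what you get
if each LATIN2 byte has its high bit cleared on the way to 7-bit US-ASCII. A
hypothetical Python sketch, not PostgreSQL code:)

```python
# "śmiać się" in LATIN2 (ISO 8859-2): ś=0xB6, ć=0xE6, ę=0xEA.
latin2 = "śmiać się".encode("iso8859_2")

# Clearing the high bit of every byte, as a 7-bit conversion would,
# maps 0xB6 -> '6', 0xE6 -> 'f', 0xEA -> 'j'.
print(bytes(b & 0x7F for b in latin2).decode("ascii"))  # 6miaf sij
```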

What should we do for the back branches, if anything? Fixes for garbled
display on consoles and event logs are fair to back-patch, but users might be
accustomed to the present situation for SQL_ASCII databases. Given the low
incidence of complaints and the workaround of using logging_collector, I am
inclined to put the whole thing in master only.

I thought that the change could be a first step toward PostgreSQL log
encoding normalization. Today the log may contain messages in
different encodings (we had a long discussion a year ago:
/messages/by-id/5007C399.6000405@gmail.com).
The new function GetMessageEncoding now makes it possible to convert all the
messages consistently. If a future log encoding fix is considered
important enough to back-patch, then this patch should be back-patched too.

I doubt that would prove to be back-patch material.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#70Noah Misch
noah@leadboat.com
In reply to: Alexander Law (#68)
Re: BUG #7493: Postmaster messages unreadable in a Windows console

On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote:

Thanks for your work; your patch is definitely better. I agree that this
approach is much more generic.

Committed.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
