[bug fix] multibyte messages are displayed incorrectly on the client

Started by MauMauabout 12 years ago19 messages

maumau307@gmail.com

about 12 years ago

1 attachment(s)

Hello,

The attached patch fixes incorrect message output on the client side. I
guess this problem can happen with any major release. Could you review
this?

[Problem]
When the client's locale differs from the server's message locale, the
messages generated on the server are converted appropriately and sent to the
client. For example, if the server runs on Linux with lc_messages =
'ja_JP.UTF-8' in postgresql.conf, and you run psql on Windows where the
system locale is SJIS, Japanese messages are converted from UTF-8 to SJIS on
the server and sent to psql. psql can display those SJIS messages
correctly. This is no problem.

However, this desirable behavior holds true only after the database session
is established. The error messages during session establishment are
displayed incorrectly, and you cannot recognize the message contents. For
example, run psql -d postgres -U non-existent-username. The displayed
message is unrecognizable.

[Cause]
While the session is being established, the server cannot use the client
encoding for message conversion yet, because it cannot access system
catalogs to retrieve conversion functions. So, the server sends messages to
the client without conversion. In the above example, the server sends
Japanese UTF-8 messages to psql, which expects those messages in SJIS.

[Fix]
Disable message localization during session startup. In other words,
messages are output in English until the database session is established.

Regards
MauMau

Attachments:

no_localize_message_in_startup.patchapplication/octet-stream; name=no_localize_message_in_startup.patchDownload

diff -rpcd a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
*** a/src/backend/postmaster/postmaster.c	2013-12-02 09:17:05.000000000 +0900
--- b/src/backend/postmaster/postmaster.c	2013-12-13 16:14:51.000000000 +0900
*************** BackendInitialize(Port *port)
*** 3970,3975 ****
--- 3970,3980 ----
  	enable_timeout_after(STARTUP_PACKET_TIMEOUT, AuthenticationTimeout * 1000);
  
  	/*
+ 	 * Output messages in English because client encoding is not known yet.
+ 	 */
+ 	disable_message_localization();
+ 
+ 	/*
  	 * Receive the startup packet (which might turn out to be a cancel request
  	 * packet).
  	 */
diff -rpcd a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
*** a/src/backend/utils/error/elog.c	2013-12-02 09:17:05.000000000 +0900
--- b/src/backend/utils/error/elog.c	2013-12-13 16:15:00.000000000 +0900
*************** static int	errordata_stack_depth = -1; /
*** 147,152 ****
--- 147,154 ----
  
  static int	recursion_depth = 0;	/* to detect actual recursion */
  
+ static bool localize_message = true;
+ 
  /* buffers for formatted timestamps that might be used by both
   * log_line_prefix and csv logs.
   */
*************** in_error_recursion_trouble(void)
*** 197,202 ****
--- 199,216 ----
  	return (recursion_depth > 2);
  }
  
+ void
+ enable_message_localization(void)
+ {
+ 	localize_message = true;
+ }
+ 
+ void
+ disable_message_localization(void)
+ {
+ 	localize_message = false;
+ }
+ 
  /*
   * One of those fallback steps is to stop trying to localize the error
   * message, since there's a significant probability that that's exactly
*************** static inline const char *
*** 206,212 ****
  err_gettext(const char *str)
  {
  #ifdef ENABLE_NLS
! 	if (in_error_recursion_trouble())
  		return str;
  	else
  		return gettext(str);
--- 220,226 ----
  err_gettext(const char *str)
  {
  #ifdef ENABLE_NLS
! 	if (!localize_message || in_error_recursion_trouble())
  		return str;
  	else
  		return gettext(str);
*************** errcode_for_socket_access(void)
*** 703,709 ****
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (translateit && !in_error_recursion_trouble()) \
  			fmt = dgettext((domain), fmt);				  \
  		/* Expand %m in format string */ \
  		fmtbuf = expand_fmt_string(fmt, edata); \
--- 717,723 ----
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (translateit && localize_message && !in_error_recursion_trouble()) \
  			fmt = dgettext((domain), fmt);				  \
  		/* Expand %m in format string */ \
  		fmtbuf = expand_fmt_string(fmt, edata); \
*************** errcode_for_socket_access(void)
*** 744,750 ****
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (!in_error_recursion_trouble()) \
  			fmt = dngettext((domain), fmt_singular, fmt_plural, n); \
  		else \
  			fmt = (n == 1 ? fmt_singular : fmt_plural); \
--- 758,764 ----
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (localize_message && !in_error_recursion_trouble()) \
  			fmt = dngettext((domain), fmt_singular, fmt_plural, n); \
  		else \
  			fmt = (n == 1 ? fmt_singular : fmt_plural); \
diff -rpcd a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
*** a/src/backend/utils/init/postinit.c	2013-12-02 09:17:05.000000000 +0900
--- b/src/backend/utils/init/postinit.c	2013-12-13 16:15:10.000000000 +0900
*************** InitPostgres(const char *in_dbname, Oid 
*** 737,742 ****
--- 737,745 ----
  		/* initialize client encoding */
  		InitializeClientEncoding();
  
+ 		/* now that client encoding is known, localize later messages */
+ 		enable_message_localization();
+ 
  		/* report this backend in the PgBackendStatus array */
  		pgstat_bestart();
  
*************** InitPostgres(const char *in_dbname, Oid 
*** 914,919 ****
--- 917,925 ----
  	/* initialize client encoding */
  	InitializeClientEncoding();
  
+ 	/* now that client encoding is known, localize later messages */
+ 	enable_message_localization();
+ 
  	/* report this backend in the PgBackendStatus array */
  	if (!bootstrap)
  		pgstat_bestart();
diff -rpcd a/src/include/utils/elog.h b/src/include/utils/elog.h
*** a/src/include/utils/elog.h	2013-12-02 09:17:04.000000000 +0900
--- b/src/include/utils/elog.h	2013-12-13 16:14:29.000000000 +0900
*************** extern char *Log_destination_string;
*** 440,445 ****
--- 440,447 ----
  extern void DebugFileOpen(void);
  extern char *unpack_sql_state(int sql_state);
  extern bool in_error_recursion_trouble(void);
+ extern void enable_message_localization(void);
+ extern void disable_message_localization(void);
  
  #ifdef HAVE_SYSLOG
  extern void set_syslog_parameters(const char *ident, int facility);

Bruce Momjian

bruce@momjian.us

about 12 years ago

In reply to: MauMau (#1)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:

[Cause]
While the session is being established, the server cannot use the
client encoding for message conversion yet, because it cannot access
system catalogs to retrieve conversion functions. So, the server
sends messages to the client without conversion. In the above
example, the server sends Japanese UTF-8 messages to psql, which
expects those messages in SJIS.

[Fix]
Disable message localization during session startup. In other
words, messages are output in English until the database session is
established.

I think the question is whether the server encoding or English are
likely to be better for the average client. My bet is that the server
encoding is more likely correct.

However, you are right that English/ASCII at least will always be
viewable, while there are many server/client combinations that will
produce unreadable characters.

I would be interested to hear other people's experience with this.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Noah Misch

noah@leadboat.com

about 12 years ago

In reply to: Bruce Momjian (#2)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:

On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:

[Cause]
While the session is being established, the server cannot use the
client encoding for message conversion yet, because it cannot access
system catalogs to retrieve conversion functions. So, the server
sends messages to the client without conversion. In the above
example, the server sends Japanese UTF-8 messages to psql, which
expects those messages in SJIS.

Better to attack that directly. Arrange to apply any client_encoding named in
the startup packet earlier, before authentication. This relates to the TODO
item "Let the client indicate character encoding of database names, user
names, and passwords". (I expect such an endeavor to be tricky.)

[Fix]
Disable message localization during session startup. In other
words, messages are output in English until the database session is
established.

I think the question is whether the server encoding or English are
likely to be better for the average client. My bet is that the server
encoding is more likely correct.

However, you are right that English/ASCII at least will always be
viewable, while there are many server/client combinations that will
produce unreadable characters.

I would be interested to hear other people's experience with this.

I don't have a sufficient sense of multilingualism among our users to know
whether English/ASCII messages would be more useful, on average, than
localized messages in the server encoding. Forcing English/ASCII does worsen
behavior in the frequent situation where client encoding will match server
encoding. I lean toward retaining the status quo of delivering localized
messages in the server encoding.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alvaro Herrera

alvherre@2ndquadrant.com

about 12 years ago

In reply to: Noah Misch (#3)

Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client

Noah Misch escribiï¿½:

On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:

On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:

[Fix]
Disable message localization during session startup. In other
words, messages are output in English until the database session is
established.

I think the question is whether the server encoding or English are
likely to be better for the average client. My bet is that the server
encoding is more likely correct.

However, you are right that English/ASCII at least will always be
viewable, while there are many server/client combinations that will
produce unreadable characters.

I would be interested to hear other people's experience with this.

I don't have a sufficient sense of multilingualism among our users to know
whether English/ASCII messages would be more useful, on average, than
localized messages in the server encoding. Forcing English/ASCII does worsen
behavior in the frequent situation where client encoding will match server
encoding. I lean toward retaining the status quo of delivering localized
messages in the server encoding.

The problem is that if there's an encoding mismatch, the message might
be impossible to figure out. If the message is in english, at least it
can be searched for in the web, or something -- the user might even find
a page in which the english error string appears, with a native language
explanation.

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

MauMau

maumau307@gmail.com

about 12 years ago

In reply to: Noah Misch (#3)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "Noah Misch" <noah@leadboat.com>

Better to attack that directly. Arrange to apply any client_encoding
named in
the startup packet earlier, before authentication. This relates to the
TODO
item "Let the client indicate character encoding of database names, user
names, and passwords". (I expect such an endeavor to be tricky.)

Unfortunately, character set conversion is not possible until the database
session is established, since it requires system catalog access. Please the
comment in src/backend/utils/mb/mbutils.c:

* During backend startup we can't set client encoding because we (a)
* can't look up the conversion functions, and (b) may not know the database
* encoding yet either. So SetClientEncoding() just accepts anything and
* remembers it for InitializeClientEncoding() to apply later.

I guess that's why Tom-san suggested the same solution as my patch (as a
compromise) in the below thread, which is also a TODO item:

Re: encoding of PostgreSQL messages
/messages/by-id/19896.1234107496@sss.pgh.pa.us

From: "Alvaro Herrera" <alvherre@2ndquadrant.com>

The problem is that if there's an encoding mismatch, the message might
be impossible to figure out. If the message is in english, at least it
can be searched for in the web, or something -- the user might even find
a page in which the english error string appears, with a native language
explanation.

I feel like this, too. Being readable in English is better than being
unrecognizable.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Noah Misch

noah@leadboat.com

about 12 years ago

In reply to: MauMau (#5)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Sun, Dec 22, 2013 at 07:51:55PM +0900, MauMau wrote:

From: "Noah Misch" <noah@leadboat.com>

Better to attack that directly. Arrange to apply any
client_encoding named in
the startup packet earlier, before authentication. This relates
to the TODO
item "Let the client indicate character encoding of database names, user
names, and passwords". (I expect such an endeavor to be tricky.)

Unfortunately, character set conversion is not possible until the
database session is established, since it requires system catalog
access. Please the comment in src/backend/utils/mb/mbutils.c:

* During backend startup we can't set client encoding because we (a)
* can't look up the conversion functions, and (b) may not know the database
* encoding yet either. So SetClientEncoding() just accepts anything and
* remembers it for InitializeClientEncoding() to apply later.

Yes, changing that is the tricky part.

I guess that's why Tom-san suggested the same solution as my patch
(as a compromise) in the below thread, which is also a TODO item:

Re: encoding of PostgreSQL messages
/messages/by-id/19896.1234107496@sss.pgh.pa.us

That's fair for the necessarily-earliest messages, like 'invalid value for
parameter "client_encoding"' and messages pertaining to the physical structure
of the startup packet. The client's encoding expectation is unknowable. An
error that mentions "client_encoding" will hopefully put users on the right
track regardless of how we translate and encode the surrounding words. The
other affected messages are quite technical, making a casual user unlikely to
fix or even see them. Not so for authentication messages, so I'm wary of
forcing use of ASCII that late in the handshake.

Note that choosing to use ASCII need not imply wholly declining to translate.
If the build uses GNU libiconv, gettext can emit ASCII approximations for
translations that conform to a Latin-derived alphabet, falling back to no
translation where the alphabet differs too much. pg_perm_setlocale(LC_CTYPE,
"C") requests such behavior. (The inferior iconv //TRANSLIT implementation of
GNU libc will convert non-ASCII characters to question marks, though.)

From: "Alvaro Herrera" <alvherre@2ndquadrant.com>

The problem is that if there's an encoding mismatch, the message might
be impossible to figure out. If the message is in english, at least it
can be searched for in the web, or something -- the user might even find
a page in which the english error string appears, with a native language
explanation.

I feel like this, too. Being readable in English is better than
being unrecognizable.

I agree that English consistently beats mojibake. I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password. The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo. If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy. If the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

MauMau

maumau307@gmail.com

about 12 years ago

In reply to: Noah Misch (#6)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "Noah Misch" <noah@leadboat.com>

I agree that English consistently beats mojibake. I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password. The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo. If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy. If
the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

I agree with you. It would be more friendly to users if more messages are
localized.

Then, as a happy medium, how about disabling message localization only if
the client encoding differs from the server one? That is, compare the
client_encoding value in the startup packet with the result of
GetPlatformEncoding(). If they don't match, call
disable_message_localization().

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Bruce Momjian

bruce@momjian.us

about 12 years ago

In reply to: MauMau (#7)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:

From: "Noah Misch" <noah@leadboat.com>

I agree that English consistently beats mojibake. I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password. The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo. If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's
strategy. If the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

I agree with you. It would be more friendly to users if more
messages are localized.

Then, as a happy medium, how about disabling message localization
only if the client encoding differs from the server one? That is,
compare the client_encoding value in the startup packet with the
result of GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I think the problem is we don't know the client and server encodings
at that time.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

MauMau

maumau307@gmail.com

about 12 years ago

In reply to: Bruce Momjian (#8)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "Bruce Momjian" <bruce@momjian.us>

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:

Then, as a happy medium, how about disabling message localization
only if the client encoding differs from the server one? That is,
compare the client_encoding value in the startup packet with the
result of GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I think the problem is we don't know the client and server encodings
at that time.

I suppose we know (or at least believe) those encodings during backend
startup:

* client encoding - the client_encoding parameter passed in the startup
packet, or if that's not present, client_encoding GUC value.

* server encoding - the encoding of strings gettext() returns. That is what
GetPlatformEncoding() returns.

Am I correct?

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Robert Haas

robertmhaas@gmail.com

about 12 years ago

In reply to: MauMau (#9)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Tue, Jan 7, 2014 at 8:56 AM, MauMau <maumau307@gmail.com> wrote:

I suppose we know (or at least believe) those encodings during backend
startup:

* client encoding - the client_encoding parameter passed in the startup
packet, or if that's not present, client_encoding GUC value.

* server encoding - the encoding of strings gettext() returns. That is what
GetPlatformEncoding() returns.

Suppose the startup packet itself is malformed. How will you report the error?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

MauMau

maumau307@gmail.com

about 12 years ago

In reply to: Robert Haas (#10)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "Robert Haas" <robertmhaas@gmail.com>

Suppose the startup packet itself is malformed. How will you report the
error?

I think we have no choice but to report the error in English, because we
don't know what the client wants.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

MauMau

maumau307@gmail.com

about 12 years ago

In reply to: Robert Haas (#10)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "Robert Haas" <robertmhaas@gmail.com>

Suppose the startup packet itself is malformed. How will you report the
error?

I think we have no choice but to report the error in English, because we
don't know what the client wants.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Noah Misch

noah@leadboat.com

about 12 years ago

In reply to: MauMau (#9)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Tue, Jan 07, 2014 at 10:56:28PM +0900, MauMau wrote:

From: "Bruce Momjian" <bruce@momjian.us>

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:

Then, as a happy medium, how about disabling message localization
only if the client encoding differs from the server one? That is,
compare the client_encoding value in the startup packet with the
result of GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I like this proposal. Thanks.

I think the problem is we don't know the client and server encodings
at that time.

I suppose we know (or at least believe) those encodings during
backend startup:

* client encoding - the client_encoding parameter passed in the
startup packet, or if that's not present, client_encoding GUC value.

* server encoding - the encoding of strings gettext() returns. That
is what GetPlatformEncoding() returns.

Agreed. You would need to poke into the relevant part of the startup packet
much earlier than we do today, but that's tractable. Note that
GetPlatformEncoding() is gone; use GetMessageEncoding().

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Noah Misch (#13)

Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client

Noah Misch <noah@leadboat.com> writes:

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:

Then, as a happy medium, how about disabling message localization
only if the client encoding differs from the server one? That is,
compare the client_encoding value in the startup packet with the
result of GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I like this proposal. Thanks.
...
Agreed. You would need to poke into the relevant part of the startup packet
much earlier than we do today, but that's tractable.

There's still the problem of what to do before we have a complete startup
packet, or if the packet is defective enough to not contain a recognizable
client encoding.

Perhaps more to the point, what it sounds like this is doing is creating
a third behavioral state, in between what prevails when we're first
reading the packet and what prevails after we've finally adopted the
requested client encoding. I'm less than convinced that's a good thing.

I'm also rather unexcited by the idea of introducing redundant and/or
ad-hoc code to parse the startup packet. That sounds like a recipe for
bugs, some of which might even rise to security issues, considering it
would happen before client authentication.

I think if we're going to do anything like this at all, it'd be best
just to disable localization from postmaster fork up till we've gotten
a client encoding out of the packet in the normal course of events.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Noah Misch

noah@leadboat.com

about 12 years ago

In reply to: Tom Lane (#14)

Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client

On Fri, Jan 10, 2014 at 08:03:00PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:

Then, as a happy medium, how about disabling message localization
only if the client encoding differs from the server one? That is,
compare the client_encoding value in the startup packet with the
result of GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I like this proposal. Thanks.
...
Agreed. You would need to poke into the relevant part of the startup packet
much earlier than we do today, but that's tractable.

There's still the problem of what to do before we have a complete startup
packet, or if the packet is defective enough to not contain a recognizable
client encoding.

MauMau proposed using untranslated messages until we're past that point. I
like that answer fine, because routine mistakes from ordinary users will not
elicit the errors in question. The most interesting message in that group
might be 'invalid value for parameter "client_encoding"', and I think the
presence of the term "client_encoding" will be a sufficient clue regardless of
how we translate and encode the surrounding words.

Perhaps more to the point, what it sounds like this is doing is creating
a third behavioral state, in between what prevails when we're first
reading the packet and what prevails after we've finally adopted the
requested client encoding. I'm less than convinced that's a good thing.

I'm also rather unexcited by the idea of introducing redundant and/or
ad-hoc code to parse the startup packet. That sounds like a recipe for
bugs, some of which might even rise to security issues, considering it
would happen before client authentication.

Valid worries.

I think if we're going to do anything like this at all, it'd be best
just to disable localization from postmaster fork up till we've gotten
a client encoding out of the packet in the normal course of events.

That was MauMau's original proposal. I opined upthread that it would be
better to change nothing than to do that.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

MauMau

maumau307@gmail.com

almost 12 years ago

In reply to: MauMau (#7)

1 attachment(s)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

From: "MauMau" <maumau307@gmail.com>

From: "Noah Misch" <noah@leadboat.com>

I agree that English consistently beats mojibake. I question whether
that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password. The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo. If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy.
If the
figure is only 60%, I'd vote for holding out for a more-extensive fix
that
allows us to encoding-convert localized authentication failure messages.

I agree with you. It would be more friendly to users if more messages are
localized.

Then, as a happy medium, how about disabling message localization only if
the client encoding differs from the server one? That is, compare the
client_encoding value in the startup packet with the result of
GetPlatformEncoding(). If they don't match, call
disable_message_localization().

I did this with the attached patch. I added some code in
BackendInitialize(). I'll update the CommitFest entry in a few days.

Regards
MauMau

Attachments:

no_localize_message_in_startup_v2.patchapplication/octet-stream; name=no_localize_message_in_startup_v2.patchDownload

diff -rpcd a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
*** a/src/backend/postmaster/postmaster.c	2014-01-05 05:17:12.000000000 +0900
--- b/src/backend/postmaster/postmaster.c	2014-01-20 14:22:05.000000000 +0900
***************
*** 100,105 ****
--- 100,106 ----
  #include "libpq/ip.h"
  #include "libpq/libpq.h"
  #include "libpq/pqsignal.h"
+ #include "mb/pg_wchar.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
*************** ProcessStartupPacket(Port *port, bool SS
*** 1742,1747 ****
--- 1743,1750 ----
  	void	   *buf;
  	ProtocolVersion proto;
  	MemoryContext oldcontext;
+ 	const char *client_encoding = NULL;
+ 	int			enc;
  
  	if (pq_getbytes((char *) &len, 4) == EOF)
  	{
*************** retry1:
*** 1912,1917 ****
--- 1915,1922 ----
  				port->guc_options = lappend(port->guc_options,
  											pstrdup(valptr));
  			}
+ 			if (strcmp(nameptr, "client_encoding") == 0)
+ 				client_encoding = valptr;
  			offset = valoffset + strlen(valptr) + 1;
  		}
  
*************** retry1:
*** 1946,1951 ****
--- 1951,1966 ----
  		port->guc_options = NIL;
  	}
  
+ 	/*
+ 	 * If the client encoding is the same as the server, localize and send
+ 	 * messages to the client in that encoding.
+ 	 */
+ 	if (client_encoding == NULL || client_encoding[0] == '\0')
+ 		client_encoding = GetConfigOption("client_encoding", true, false);
+ 	enc = pg_valid_client_encoding(client_encoding);
+ 	if (enc == pg_get_encoding_from_locale("", true))
+ 		enable_message_localization();
+ 
  	/* Check a user name was given. */
  	if (port->user_name == NULL || port->user_name[0] == '\0')
  		ereport(FATAL,
*************** BackendInitialize(Port *port)
*** 3970,3975 ****
--- 3985,3995 ----
  	enable_timeout_after(STARTUP_PACKET_TIMEOUT, AuthenticationTimeout * 1000);
  
  	/*
+ 	 * Output messages in English because client encoding is not known yet.
+ 	 */
+ 	disable_message_localization();
+ 
+ 	/*
  	 * Receive the startup packet (which might turn out to be a cancel request
  	 * packet).
  	 */
diff -rpcd a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
*** a/src/backend/utils/error/elog.c	2014-01-05 05:17:10.000000000 +0900
--- b/src/backend/utils/error/elog.c	2014-01-20 11:53:18.000000000 +0900
*************** static int	errordata_stack_depth = -1; /
*** 147,152 ****
--- 147,154 ----
  
  static int	recursion_depth = 0;	/* to detect actual recursion */
  
+ static bool localize_message = true;
+ 
  /* buffers for formatted timestamps that might be used by both
   * log_line_prefix and csv logs.
   */
*************** in_error_recursion_trouble(void)
*** 197,202 ****
--- 199,216 ----
  	return (recursion_depth > 2);
  }
  
+ void
+ enable_message_localization(void)
+ {
+ 	localize_message = true;
+ }
+ 
+ void
+ disable_message_localization(void)
+ {
+ 	localize_message = false;
+ }
+ 
  /*
   * One of those fallback steps is to stop trying to localize the error
   * message, since there's a significant probability that that's exactly
*************** static inline const char *
*** 206,212 ****
  err_gettext(const char *str)
  {
  #ifdef ENABLE_NLS
! 	if (in_error_recursion_trouble())
  		return str;
  	else
  		return gettext(str);
--- 220,226 ----
  err_gettext(const char *str)
  {
  #ifdef ENABLE_NLS
! 	if (!localize_message || in_error_recursion_trouble())
  		return str;
  	else
  		return gettext(str);
*************** errcode_for_socket_access(void)
*** 704,710 ****
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (translateit && !in_error_recursion_trouble()) \
  			fmt = dgettext((domain), fmt);				  \
  		/* Expand %m in format string */ \
  		fmtbuf = expand_fmt_string(fmt, edata); \
--- 718,724 ----
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (translateit && localize_message && !in_error_recursion_trouble()) \
  			fmt = dgettext((domain), fmt);				  \
  		/* Expand %m in format string */ \
  		fmtbuf = expand_fmt_string(fmt, edata); \
*************** errcode_for_socket_access(void)
*** 745,751 ****
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (!in_error_recursion_trouble()) \
  			fmt = dngettext((domain), fmt_singular, fmt_plural, n); \
  		else \
  			fmt = (n == 1 ? fmt_singular : fmt_plural); \
--- 759,765 ----
  		char		   *fmtbuf; \
  		StringInfoData	buf; \
  		/* Internationalize the error format string */ \
! 		if (localize_message && !in_error_recursion_trouble()) \
  			fmt = dngettext((domain), fmt_singular, fmt_plural, n); \
  		else \
  			fmt = (n == 1 ? fmt_singular : fmt_plural); \
diff -rpcd a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
*** a/src/backend/utils/init/postinit.c	2014-01-05 05:17:10.000000000 +0900
--- b/src/backend/utils/init/postinit.c	2014-01-20 11:53:18.000000000 +0900
*************** InitPostgres(const char *in_dbname, Oid 
*** 736,741 ****
--- 736,744 ----
  		/* initialize client encoding */
  		InitializeClientEncoding();
  
+ 		/* now that client encoding is known, localize later messages */
+ 		enable_message_localization();
+ 
  		/* report this backend in the PgBackendStatus array */
  		pgstat_bestart();
  
*************** InitPostgres(const char *in_dbname, Oid 
*** 913,918 ****
--- 916,924 ----
  	/* initialize client encoding */
  	InitializeClientEncoding();
  
+ 	/* now that client encoding is known, localize later messages */
+ 	enable_message_localization();
+ 
  	/* report this backend in the PgBackendStatus array */
  	if (!bootstrap)
  		pgstat_bestart();
diff -rpcd a/src/include/utils/elog.h b/src/include/utils/elog.h
*** a/src/include/utils/elog.h	2014-01-05 05:17:05.000000000 +0900
--- b/src/include/utils/elog.h	2014-01-20 11:53:18.000000000 +0900
*************** extern char *Log_destination_string;
*** 440,445 ****
--- 440,447 ----
  extern void DebugFileOpen(void);
  extern char *unpack_sql_state(int sql_state);
  extern bool in_error_recursion_trouble(void);
+ extern void enable_message_localization(void);
+ extern void disable_message_localization(void);
  
  #ifdef HAVE_SYSLOG
  extern void set_syslog_parameters(const char *ident, int facility);

#17

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: MauMau (#16)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

"MauMau" <maumau307@gmail.com> writes:

Then, as a happy medium, how about disabling message localization only if
the client encoding differs from the server one? That is, compare the
client_encoding value in the startup packet with the result of
GetPlatformEncoding(). If they don't match, call
disable_message_localization().

AFAICT this is not what was agreed to in this thread. It puts far too
much credence in the server-side default for client_encoding, which up to
now has never been thought to be very interesting; indeed I doubt most
people bother to set it at all. The reason that this issue is even on
the table is that that default is too likely to be wrong, no?

Also, whatever possessed you to use pg_get_encoding_from_locale to
identify the server's encoding? That's expensive and seems fairly
unlikely to yield the right answer. I don't remember offhand where we
keep the postmaster's idea of what encoding messages should be in, but I'm
fairly sure it's stored explicitly somewhere. Or if it isn't, we can for
sure do better than recalculating it during every connection attempt.

Having said all that, though, I'm unconvinced that this cure isn't worse
than the disease. Somebody claimed upthread that no very interesting
messages would be delocalized by a change like this, but that's complete
nonsense: in particular, *every* message associated with client
authentication will be sent in English if we go down this path. Given
the nearly complete lack of complaints in the many years that this code
has worked like this, I'm betting that most people will find a change
like this to be a net reduction in friendliness.

Given the changes here to extract client_encoding from the startup packet
ASAP, I wonder whether the right thing isn't just to set the client
encoding immediately when we do that. Most application libraries pass
client encoding in the startup packet anyway (libpq certainly does).

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Tom Lane (#17)

Re: [bug fix] multibyte messages are displayed incorrectly on the client

On 04/05/2014 07:56 AM, Tom Lane wrote:

"MauMau" <maumau307@gmail.com> writes:

Then, as a happy medium, how about disabling message localization only if
the client encoding differs from the server one? That is, compare the
client_encoding value in the startup packet with the result of
GetPlatformEncoding(). If they don't match, call
disable_message_localization().

AFAICT this is not what was agreed to in this thread. It puts far too
much credence in the server-side default for client_encoding, which up to
now has never been thought to be very interesting; indeed I doubt most
people bother to set it at all. The reason that this issue is even on
the table is that that default is too likely to be wrong, no?

Also, whatever possessed you to use pg_get_encoding_from_locale to
identify the server's encoding? That's expensive and seems fairly
unlikely to yield the right answer. I don't remember offhand where we
keep the postmaster's idea of what encoding messages should be in, but I'm
fairly sure it's stored explicitly somewhere. Or if it isn't, we can for
sure do better than recalculating it during every connection attempt.

Having said all that, though, I'm unconvinced that this cure isn't worse
than the disease. Somebody claimed upthread that no very interesting
messages would be delocalized by a change like this, but that's complete
nonsense: in particular, *every* message associated with client
authentication will be sent in English if we go down this path. Given
the nearly complete lack of complaints in the many years that this code
has worked like this, I'm betting that most people will find a change
like this to be a net reduction in friendliness.

Given the changes here to extract client_encoding from the startup packet
ASAP, I wonder whether the right thing isn't just to set the client
encoding immediately when we do that. Most application libraries pass
client encoding in the startup packet anyway (libpq certainly does).

Based on Tom's comments above, I'm marking this as returned with
feedback in the commitfest. I agree that setting client_encoding as
early as possible seems like the right thing to do.

Earlier in this thread, MauMau pointed out that we can't do encoding
conversions until we have connected to the database because you need to
read pg_conversion for that. That's because we support creating custom
conversions with CREATE CONVERSION. Frankly, I don't think anyone cares
about that feature. If we just dropped the CREATE/DROP CONVERSION
feature altogether and hard-coded the conversions we have, there would
be close to zero complaints. Even if you want to extend something around
encodings and conversions, the CREATE CONVERSION interface is clunky.
Firstly, conversions are per-database, and even schema-qualified, which
just seems like an extra complication. You'll most likely want to modify
the conversion across the whole system. Secondly, rather than define a
new conversion between encodings, you'll likely want to define a whole
new encoding with conversions to/from existing encodings, but you can't
do that anyway without hacking the source code.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Heikki Linnakangas (#18)

Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Earlier in this thread, MauMau pointed out that we can't do encoding
conversions until we have connected to the database because you need to
read pg_conversion for that. That's because we support creating custom
conversions with CREATE CONVERSION. Frankly, I don't think anyone cares
about that feature. If we just dropped the CREATE/DROP CONVERSION
feature altogether and hard-coded the conversions we have, there would
be close to zero complaints. Even if you want to extend something around
encodings and conversions, the CREATE CONVERSION interface is clunky.
Firstly, conversions are per-database, and even schema-qualified, which
just seems like an extra complication. You'll most likely want to modify
the conversion across the whole system. Secondly, rather than define a
new conversion between encodings, you'll likely want to define a whole
new encoding with conversions to/from existing encodings, but you can't
do that anyway without hacking the source code.

There's certainly something to be said for that position. If there were
any prospect of extensions defining new encodings someday, I'd argue for
keeping CREATE CONVERSION. But the performance headaches would be
substantial, and there aren't new encodings coming down the pike often
enough to justify the work involved, so I don't see us ever doing CREATE
ENCODING; and that means that CREATE CONVERSION is of little value.

I'd kind of like to see this go just because having catalog accesses
involved in encoding conversion setup is messy and fragile.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers