BUG #2685: Wrong charset of server messages on client [PATCH]

Started by Sergiy Vyshnevetskiyover 19 years ago10 messagesbugs
Jump to latest
#1Sergiy Vyshnevetskiy
serg@vostok.net

The following bug has been logged online:

Bug reference: 2685
Logged by: Sergiy Vyshnevetskiy
Email address: serg@vostok.net
PostgreSQL version: 8.1
Operating system: FreeBSD-6 stable
Description: Wrong charset of server messages on client [PATCH]
Details:

DESCRIPTION:

PostgreSQL backend uses gettext() to localize its messages. The charset of
localized messages is determined by LC_CTYPE by default.

Then the message is processed through sprintf-like mechanism (with database
data as possible arguments) and fed to send_message_to_frontend(), that
converts data from _database_charset_(!) to client charset.

If LC_CTYPE is not the same as (at least binary compatible to) database
charset, then client gets garbage characters in server messages. If database
charset is UTF-8, then cluster may recusively generate "invalid byte
sequence for encoding" errors till it fills up
errordata[ERRORDATA_STACK_SIZE], then it panics.

SOLUTION:

Convert server messages to database charset.

PATCH:

--- src/backend/utils/mb/mbutils.c.o0 Tue Oct 10 11:51:13 2006              
+++ src/backend/utils/mb/mbutils.c  Tue Oct 10 11:49:22 2006                

@@ -615,6 +615,7 @@

DatabaseEncoding = &pg_enc2name_tbl[encoding];

Assert(DatabaseEncoding->encoding == encoding);

#ifdef USE_ICU

+
bind_textdomain_codeset("postgres",(&pg_enc2iananame_tbl[encoding])->name);

ucnv_setDefaultName((&pg_enc2iananame_tbl[encoding])->name);

#endif

}

This, however, uncovers another bug: PostgreSQL dumps the messages into
stderr/syslog as-is, without converting database data from database charset
to charset from LC_MESSAGES. After this patch it will do so with message
text too. The fix should be trivial - set up a conversion from database
charset to server charset. I will post a patch for it later.

NOTE:

I used pg_enc2iananame_tbl instead of pg_enc2name_tbl, because gettext
doesn't accept many

Possible TODO:
Change PostgreSQL charset names to IANA-standard names.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sergiy Vyshnevetskiy (#1)
Re: BUG #2685: Wrong charset of server messages on client [PATCH]

"Sergiy Vyshnevetskiy" <serg@vostok.net> writes:

Convert server messages to database charset.

This has been discussed before:
http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php

The magic pg_enc2iananame_tbl[] you reference in your patch does not exist,
and if it did exist it wouldn't work on all platforms, since encoding
names aren't sufficiently well standardized :-(

This, however, uncovers another bug: PostgreSQL dumps the messages into
stderr/syslog as-is, without converting database data from database charset
to charset from LC_MESSAGES.

I'm quite unconvinced that that's a bug. If we tried to do a conversion
here, it would be trivial to set up denials of service for logging ---
just include a character in a comment in your SQL command that cannot be
converted to the LC_MESSAGES character set.

regards, tom lane

#3Sergiy Vyshnevetskiy
serg@vostok.net
In reply to: Tom Lane (#2)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Tom Lane wrote:

"Sergiy Vyshnevetskiy" <serg@vostok.net> writes:

Convert server messages to database charset.

This has been discussed before:
http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php

The magic pg_enc2iananame_tbl[] you reference in your patch does not exist,
and if it did exist it wouldn't work on all platforms, since encoding
names aren't sufficiently well standardized :-(

It's not magic, it's from ICU patch. Want me to send you a copy? :)

This, however, uncovers another bug: PostgreSQL dumps the messages into
stderr/syslog as-is, without converting database data from database charset
to charset from LC_MESSAGES.

I'm quite unconvinced that that's a bug. If we tried to do a conversion
here, it would be trivial to set up denials of service for logging ---
just include a character in a comment in your SQL command that cannot be
converted to the LC_MESSAGES character set.

They have to be printed as escape sequences. I think that dumping raw
string data in log without converting them to printable form can be used
to mess up log viewer at least. (At most this can be a security breach.)

Having row multibyte characters mixed with characters in LC_CTYPE in the
log makes it less useful. Syslog would mangle them further to a complete
unrecognition.

#4Sergiy Vyshnevetskiy
serg@vostok.net
In reply to: Sergiy Vyshnevetskiy (#3)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

On Tue, 10 Oct 2006, Tom Lane wrote:

"Sergiy Vyshnevetskiy" <serg@vostok.net> writes:

Convert server messages to database charset.

This has been discussed before:
http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php

The magic pg_enc2iananame_tbl[] you reference in your patch does not
exist,
and if it did exist it wouldn't work on all platforms, since encoding
names aren't sufficiently well standardized :-(

It's not magic, it's from ICU patch. Want me to send you a copy? :)

Sorry. I thought it was more well-known. Just looked into gentoo portage -
they don't know about it eigther.

The patch is here:

http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz

This is the current list of encodings, according to iana:

http://www.iana.org/assignments/character-sets

#5Sergiy Vyshnevetskiy
serg@vostok.net
In reply to: Sergiy Vyshnevetskiy (#4)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

On Tue, 10 Oct 2006, Tom Lane wrote:

"Sergiy Vyshnevetskiy" <serg@vostok.net> writes:

Convert server messages to database charset.

This has been discussed before:
http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php

The magic pg_enc2iananame_tbl[] you reference in your patch does not
exist,
and if it did exist it wouldn't work on all platforms, since encoding
names aren't sufficiently well standardized :-(

It's not magic, it's from ICU patch. Want me to send you a copy? :)

Sorry. I thought it was more well-known. Just looked into gentoo portage -
they don't know about it eigther.

The patch is here:

http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz

This is the current list of encodings, according to iana:

http://www.iana.org/assignments/character-sets

ICU homepage is

http://icu.sourceforge.net/

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sergiy Vyshnevetskiy (#4)
Re: BUG #2685: Wrong charset of server messages on client [PATCH]

Sergiy Vyshnevetskiy <serg@vostok.net> writes:

It's not magic, it's from ICU patch. Want me to send you a copy? :)

You're missing my point, which is that non-ICU locale support doesn't
necessarily recognize the same encoding names. We would have done this
years ago if we had a solution to that problem.

regards, tom lane

#7Sergiy Vyshnevetskiy
serg@vostok.net
In reply to: Tom Lane (#6)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Tom Lane wrote:

Sergiy Vyshnevetskiy <serg@vostok.net> writes:

It's not magic, it's from ICU patch. Want me to send you a copy? :)

You're missing my point, which is that non-ICU locale support doesn't
necessarily recognize the same encoding names. We would have done this
years ago if we had a solution to that problem.

We should use IANA-standard names. If it fails - it does nothing.
Anybody porting PostgreSQL to new platform can go over the list and make a
patch for their port.

#8Sergiy Vyshnevetskiy
serg@vostok.net
In reply to: Sergiy Vyshnevetskiy (#7)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

On Tue, 10 Oct 2006, Tom Lane wrote:

Sergiy Vyshnevetskiy <serg@vostok.net> writes:

It's not magic, it's from ICU patch. Want me to send you a copy? :)

You're missing my point, which is that non-ICU locale support doesn't
necessarily recognize the same encoding names. We would have done this
years ago if we had a solution to that problem.

We should use IANA-standard names. If it fails - it does nothing.
Anybody porting PostgreSQL to new platform can go over the list and make a
patch for their port.

Here is a new and improved patch, that closes security hole as well. To
prevent DOS attack we lock LC_MESSAGES as C for any database encoding that
we are unable to bind to our textdomain.

Attachments:

enc.difftext/plain; charset=US-ASCII; name=enc.diffDownload+135-0
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sergiy Vyshnevetskiy (#8)
Re: BUG #2685: Wrong charset of server messages on client

Sergiy Vyshnevetskiy <serg@vostok.net> writes:

Here is a new and improved patch, that closes security hole as well.

We really can't consider a patch like this, because not only does it
ignore the problem of multiple spellings of encoding names, but it
actually breaks existing functionality on platforms with a variant
spelling of the name. I think a minimum requirement ought to be that
it work with any of the spellings recognized by initdb.

regards, tom lane

In reply to: Tom Lane (#9)
Re: BUG #2685: Wrong charset of server messages on client

On Tue, 10 Oct 2006, Tom Lane wrote:

Sergiy Vyshnevetskiy <serg@vostok.net> writes:

Here is a new and improved patch, that closes security hole as well.

We really can't consider a patch like this, because not only does it
ignore the problem of multiple spellings of encoding names, but it
actually breaks existing functionality on platforms with a variant
spelling of the name. I think a minimum requirement ought to be that
it work with any of the spellings recognized by initdb.

Alright, that was too strict. But when server uses messages in
LC_CTYPE encoding with data in database encoding and pushes this mix
through database-to-client charset conversion - that's a bug. PostgreSQL
bug. And "UTF-8 panic" is it's direct result.

As a stop-gap I included a version of patch that breaks nothing. But it
will fix the "wrong encoding" bug and "UTF-8 panic" only on those OS who
recognize the supplied spelling. Linux and FreeBSD are among them.

Cycling through possible spellings in SetDatabaseEncoding() is suboptimal.
The time and place to do it is somewhere in the configure script. There we
can fill pg_enc2localname_tbl with results of testing possible charset
names.

We can also just leave the patch as it is, because more and more OS learn
more and more different charset name spellings every new version. Why
waste too mush power chasing a horce that runs _to_you_? :)

Attachments:

enc.difftext/plain; charset=US-ASCII; name=enc.diffDownload+116-0