TM format can mix encodings in to_char()
Hackers,
I will use as an example the code in the regression test
'collate.linux.utf8'.
There you can find:
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
to_char
-------------
01 NIS 2010
(1 row)
The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9
(LATIN5),
while the test runs in UTF8. So the following code will raise an error:
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75
The problem seems to be in the code touched in the attached patch.
Regards,
Juan Jose Santamaria Flecha
Attachments:
tm-format-mixes-encodings.patch (application/octet-stream, +22 −15)
Hello.
At Fri, 12 Apr 2019 18:45:51 +0200, Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote in <CAC+AXB22So5aZm2vZe+MChYXec7gWfr-n-SK-iO091R0P_1Tew@mail.gmail.com>
Hackers,
I will use as an example the code in the regression test
'collate.linux.utf8'.
There you can find:
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
to_char
-------------
01 NIS 2010
(1 row)
The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9
(LATIN5),
while the test runs in UTF8. So the following code will raise an error:
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75
The same case is handled for lc_numeric. lc_time ought to be
treated the same way.
The problem seems to be in the code touched in the attached patch.
It seems basically correct, but cache_single_time does an extra
strdup when pg_any_to_server did a conversion. Maybe it would be
better like this:
oldcxt = MemoryContextSwitchTo(TopMemoryContext);
ptr = pg_any_to_server(buf, strlen(buf), encoding);
if (ptr == buf)
{
/* Conversion didn't pstrdup, so we must */
ptr = pstrdup(buf);
}
MemoryContextSwitchTo(oldcxt);
- int i;
+ int i,
+ encoding;
That convention is not strictly kept, but (I believe) we don't define
multiple variables in a single declaration.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5),
while the test runs in UTF8. So the following code will raise an error:
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75
Ugh.
The problem seems to be in the code touched in the attached patch.
Hmm. I'd always imagined that the way that libc works is that LC_CTYPE
determines the encoding (codeset) it's using across the board, so that
functions like strftime would deliver data in that encoding. That's
mainly based on the observation that nl_langinfo(CODESET) is specified
to depend on LC_CTYPE, and it would be monumentally stupid for any libc
functions to be operating according to a codeset that there's no way to
discover.
However, your example shows that at least glibc is indeed
monumentally stupid about this :-(.
But ... perhaps other implementations are not so silly? I went
looking into the POSIX spec to see if it says anything about this,
and discovered (in Base Definitions section 7, Locale):
If different character sets are used by the locale categories, the
results achieved by an application utilizing these categories are
undefined. Likewise, if different codesets are used for the data being
processed by interfaces whose behavior is dependent on the current
locale, or the codeset is different from the codeset assumed when the
locale was created, the result is also undefined.
"Undefined" is a term of art here: it means the library can misbehave
arbitrarily badly, up to and including abort() or halt-and-catch-fire.
We do *not* want to be invoking undefined behavior, even if particular
implementations seem to behave sanely. Your proposed patch isn't
getting us out of that, and what it is doing instead is embedding an
assumption that the implementation handles this in a particular way.
So what I'm thinking really needs to be done here is to force it to work
according to the LC_CTYPE-determines-the-codeset-for-everything model.
Note that that model is embedded into PG in quite a few ways besides the
one at stake here; for instance, pg_perm_setlocale thinks it should make
gettext track the LC_CTYPE encoding, not anything else.
If we're willing to assume a lot about how locale names are spelled,
we could imagine fixing this in cache_locale_time by having it strip
any encoding spec from the given LC_TIME string and then adding on the
codeset name from nl_langinfo(CODESET). Not sure about how well
that'd play on Windows, though. We'd also need to adjust check_locale
so that it does the same dance.
BTW, it seems very likely that we have similar issues with LC_MONETARY
and LC_NUMERIC in PGLC_localeconv(). There's an interesting Windows-only
hack in there now that seems to be addressing more or less the same issue;
I wonder whether that would be rendered unnecessary if we approached it
like this?
I'm also wondering why we have not noticed any comparable problem with
LC_MESSAGES or LC_COLLATE. It's not so surprising that we haven't
understood this hazard before with LC_TIME/LC_MONETARY/LC_NUMERIC given
their limited usage in PG, but the same can't be said of LC_MESSAGES or
LC_COLLATE.
regards, tom lane
I wrote:
Hmm. I'd always imagined that the way that libc works is that LC_CTYPE
determines the encoding (codeset) it's using across the board, so that
functions like strftime would deliver data in that encoding.
[ and much more based on that ]
After further study of the code, the situation seems less dire than
I feared yesterday. In the first place, we disallow settings of
LC_COLLATE and LC_CTYPE that don't match the database encoding, see
tests in dbcommands.c's check_encoding_locale_matches() and in initdb.
So that core functionality will be consistent in any case.
Also, I see that PGLC_localeconv() is effectively doing exactly what
you suggested for strings that are encoded according to LC_MONETARY
and LC_NUMERIC:
encoding = pg_get_encoding_from_locale(locale_monetary, true);
db_encoding_convert(encoding, &worklconv.int_curr_symbol);
db_encoding_convert(encoding, &worklconv.currency_symbol);
...
This is a little bit off, now that I look at it, because it's
failing to account for the possibility of getting -1 from
pg_get_encoding_from_locale. It should probably do what
pg_bind_textdomain_codeset does:
if (encoding < 0)
encoding = PG_SQL_ASCII;
since passing PG_SQL_ASCII to the conversion will have the effect of
validating the data without any actual conversion.
I remain wary of this idea because it's depending on something that's
undefined per POSIX, but apparently it's working well enough for
LC_MONETARY and LC_NUMERIC, so we can probably get away with it for
LC_TIME as well. Anyway the current code clearly does not work on
glibc, and I also verified that there's a problem on FreeBSD, so
this patch should make things better.
Also, experimentation suggests that LC_MESSAGES actually does work
the way I thought this stuff works, ie, its implied codeset isn't
really used. (I think this only matters for strerror(), since we
force the issue for gettext, but glibc's strerror() is clearly not
paying attention to that.) Sigh, who needs consistency?
regards, tom lane
I wrote:
This is a little bit off, now that I look at it, because it's
failing to account for the possibility of getting -1 from
pg_get_encoding_from_locale. It should probably do what
pg_bind_textdomain_codeset does:
if (encoding < 0)
encoding = PG_SQL_ASCII;
Actually, the more I looked at the existing code, the less happy I got
with its error handling. cache_locale_time() is an absolute failure
in that respect, because it makes no attempt at all to avoid throwing
errors while LC_TIME or LC_CTYPE is set to a transient value. We
could maybe tolerate LC_TIME being weird, but continuing with a value
of LC_CTYPE that doesn't match the database setting would almost
certainly be disastrous.
PGLC_localeconv had at least thought about the issue, but it ends up
wimping out:
/*
* Report it if we failed to restore anything. Perhaps this should be
* FATAL, rather than continuing with bad locale settings?
*/
if (trouble)
elog(WARNING, "failed to restore old locale");
And it's also oddly willing to keep going even if it couldn't get the
original setlocale settings to begin with.
I think that this code was written with only LC_MONETARY and LC_NUMERIC
in mind, for which there's at least some small argument that continuing
with unwanted values wouldn't be fatal (though IIRC, LC_NUMERIC would
still change the behavior of float8out). Since we added code to also
change LC_CTYPE on Windows, I think that continuing on after a restore
failure would be disastrous. And we've not heard any field reports
of this warning anyway, so there's no reason to think we have to support
the case in practice.
Hence, the attached revised patch changes the code to do elog(FATAL)
if it can't restore any locale settings to their previous values,
and it fixes cache_locale_time to not do anything risky while it's
got transient locale settings in place.
I propose to apply and back-patch this; the code's basically the
same in all supported branches.
regards, tom lane
Attachments:
fix-encoding-and-error-recovery-in-cache-locale-time.patch (text/x-diff, +161 −84)
I wrote:
[ fix-encoding-and-error-recovery-in-cache-locale-time.patch ]
On closer inspection, I'm pretty sure either version of this patch
will break things on Windows, because that platform already had code
to convert the result of wcsftime() to the database encoding; we
were adding code to do a second conversion, which will not go well.
The attached revised patch deletes the no-longer-necessary
platform-specific recoding stanza, in favor of having cache_locale_time
know that it's getting UTF8 rather than something else. I also
updated a bunch of the related comments.
I don't have any way to test this on Windows, so could somebody
do that? Manually running the Turkish test cases ought to be enough.
regards, tom lane
Attachments:
fix-encoding-and-error-recovery-in-cache-locale-time-2.patch (text/x-diff, +201 −128)
On 4/21/19 12:28 AM, Tom Lane wrote:
I wrote:
[ fix-encoding-and-error-recovery-in-cache-locale-time.patch ]
On closer inspection, I'm pretty sure either version of this patch
will break things on Windows, because that platform already had code
to convert the result of wcsftime() to the database encoding; we
were adding code to do a second conversion, which will not go well.The attached revised patch deletes the no-longer-necessary
platform-specific recoding stanza, in favor of having cache_locale_time
know that it's getting UTF8 rather than something else. I also
updated a bunch of the related comments.
I don't have any way to test this on Windows, so could somebody
do that? Manually running the Turkish test cases ought to be enough.
How does one do that? Just set a Turkish locale?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
On 4/21/19 12:28 AM, Tom Lane wrote:
I don't have any way to test this on Windows, so could somebody
do that? Manually running the Turkish test cases ought to be enough.
How does one do that? Just set a Turkish locale?
Try variants of the original test case. For instance, in a UTF8
database,
regression=# show server_encoding ;
server_encoding
-----------------
UTF8
(1 row)
regression=# SET lc_time TO 'tr_TR.iso88599';
SET
regression=# SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
to_char
--------------
01 ŞUB 2010
(1 row)
Unpatched, I get an error about invalid data. Now, this is in
a Linux machine, and you'll have to adapt it for Windows --- at
least change the LC_TIME setting. But the idea is to get out some
non-ASCII strings from an LC_TIME setting that names an encoding
different from the database's.
(I suspect you'll find that the existing code works fine on
Windows, it's only the first version(s) of this patch that fail.)
regards, tom lane
On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
How does one do that? Just set a Turkish locale?
tr_TR is, in a sense, special among locales:
http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
The Turkish dotless i has apparently been implicated in all kinds of
bugs in quite a variety of contexts.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
How does one do that? Just set a Turkish locale?
tr_TR is, in a sense, special among locales:
http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
The Turkish dotless i has apparently been implicated in all kinds of
bugs in quite a variety of contexts.
Yeah, we've had our share of those :-(. But the dotless i is not the
problem here --- it happens to not trigger an encoding conversion
issue, it seems. Amusingly, the existing test case for lc_time = tr_TR
in collate.linux.utf8.sql is specifically coded to check what happens
with dotted/dotless i, and yet it manages to not trip over this problem.
(I suspect the reason is that what comes out of strftime is "Nis" which
is ASCII, and the non-ASCII characters only arise from subsequent case
conversion within PG.)
regards, tom lane
Actually, I tried to show my findings with the tr_TR regression test, but
you can reproduce the same issue with other locales and non-ASCII
characters, as Tom has pointed out.
For example:
de_DE ISO-8859-1: March ('März')
es_ES ISO-8859-1: Wednesday ('miércoles')
fr_FR ISO-8859-1: February ('février')
Regards,
Juan José Santamaría Flecha
On 4/21/19 10:21 AM, Tom Lane wrote:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
On 4/21/19 12:28 AM, Tom Lane wrote:
I don't have any way to test this on Windows, so could somebody
do that? Manually running the Turkish test cases ought to be enough.
How does one do that? Just set a Turkish locale?
Try variants of the original test case. For instance, in a UTF8
database,
regression=# show server_encoding ;
server_encoding
-----------------
UTF8
(1 row)
regression=# SET lc_time TO 'tr_TR.iso88599';
SET
regression=# SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
to_char
--------------
01 ŞUB 2010
(1 row)
Unpatched, I get an error about invalid data. Now, this is in
a Linux machine, and you'll have to adapt it for Windows --- at
least change the LC_TIME setting. But the idea is to get out some
non-ASCII strings from an LC_TIME setting that names an encoding
different from the database's.
(I suspect you'll find that the existing code works fine on
Windows, it's only the first version(s) of this patch that fail.)
Test above works as expected with the patch, see attached. This is from
jacana.
LMK if you want more tests run before I blow the test instance away
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
testme.out (text/plain)
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
Test above works as expected with the patch, see attached. This is from
jacana.
Great, thanks for checking!
LMK if you want more tests run before I blow the test instance away
Can't think of anything else.
It'd be nice if we could cover stuff like this in the regression tests,
but I'm not sure how, seeing that the locale names are platform-dependent
and the overall behavior will also depend on the database encoding ...
regards, tom lane
It looks as if no work is left for this patch, so maybe updating the author to Tom Lane (I'm just a reporter at this point, which is fine) and the status to "ready for committer" would better reflect its current state. Does anyone think otherwise?
Regards,
Juan José Santamaría Flecha
Juanjo Santamaria Flecha <juanjo.santamaria@gmail.com> writes:
It looks as if no work is left for this patch, so maybe updating the author to Tom Lane (I'm just a repoter at this point, which it's fine) and the status to ready for committer would better reflect its current status. Does anyone think otherwise?
Yeah, this was dealt with in 7ad1cd31b et al. I didn't realize there
was a CF entry for it, or I would have closed it then. I've done so now.
regards, tom lane