psql weird behaviour with charset encodings

Started by hernan gonzalez · over 15 years ago · 18 messages
#1 hernan gonzalez
hgonzalez@gmail.com

(Disclaimer: I've been using PostgreSQL for quite a long time, and I
usually deal with non-ASCII LATIN-9 characters,
but that has never been a problem, until now)

My issue summarized: when psql is invoked by a user whose locale
differs from that of the database, the tabular output
is wrong for some text fields. The weird thing is that those text
fields are not just garbled, but empty. And weirder still:
this does not happen in the expanded output format (\x). Apparently
it's not a terminal problem (everything looks right after \x),
nor a client_encoding issue (idem). So...?

Details follow.
My scenario: Fedora 12, Postgresql 8.4.3 compiled from source.

Database encoding (global) LATIN9.
User postgres locale: LANG=en_US.iso885915,
User root locale LANG=en_US.UTF-8

When I connect from postgres user, all is right.
When I connect from root, it's not right... except with \x
Example (here the last_name field has one non-ASCII character, N WITH TILDE):

========================================================================

[root@myserver ~]# su - postgres
[postgres@myserver ~]$ psql db
psql (8.4.3)
db=# \set
...
ENCODING = 'LATIN9'
db=# select first_name,last_name,birth_date from persons where rid= 143;
 first_name | last_name | birth_date
------------+-----------+------------
 Guillermo  | Calcaño   | 1996-09-30
(1 row)
db=# \x
db=# select first_name,last_name,birth_date from persons where rid= 143;
-[ RECORD 1 ]------------
first_name | Guillermo
last_name  | Calcaño
birth_date | 1996-09-30

[root@myserver ~]# /usr/local/pgsql/bin/psql -U postgres db
psql (8.4.3)
db=# \set
...
ENCODING = 'LATIN9'
db=# select first_name,last_name,birth_date from persons where rid= 143;
 first_name | last_name | birth_date
------------+-----------+------------
 Guillermo  |           | 1996-09-30
(1 row)
db=# \x
db=# select first_name,last_name,birth_date from persons where rid= 143;
-[ RECORD 1 ]------------
first_name | Guillermo
last_name  | Calcaño
birth_date | 1996-09-30

==================================================================

It looks as if psql, in tabular form, needs to compute the length of
the field to output, and for this uses the user locale (not
the client_encoding, mind you, but the locale of the user that is
running the psql process). In case of a mismatch,
it cannot decode the string and compute the length, so... it assumes
length = 0 (?)
Is this the expected behaviour? Has this behaviour changed recently?

--
Hernán J. González
http://hjg.com.ar/

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hernan gonzalez (#1)
Re: psql weird behaviour with charset encodings

hernan gonzalez <hgonzalez@gmail.com> writes:

My scenario: Fedora 12, Postgresql 8.4.3 compiled from source.

Database encoding (global) LATIN9.
User postgres locale: LANG=en_US.iso885915,
User root locale LANG=en_US.UTF-8

When I connect from postgres user, all is right.
When I connect from root, it's not right... except with \x

What's client_encoding set to in the two cases? If it's not utf8,
does changing it to that improve matters? Alternatively, see what
xterm (or whatever terminal window you're using) thinks the encoding
is, and change it to match psql's client_encoding.

regards, tom lane

#3 hernan gonzalez
hgonzalez@gmail.com
In reply to: Tom Lane (#2)
Re: psql weird behaviour with charset encodings

It's surely not an xterm problem; I see the characters fine with just the
\x formatting. I can also check by redirecting the output to a file.

My original client_encoding seems to be LATIN9 in both cases,
according to the \set output.

If I change it (for the root user) to UTF8 with " SET CLIENT_ENCODING
TO 'UTF8'; " the conversion from LATIN9 to UTF8 indeed takes place
(and I see the characters ok, by switching my terminal to UTF8).

(BTW: I understood from the docs that " \set ENCODING 'UTF8' " is
equivalent, but this does nothing in my case.)

But I actually don't want that. I want psql to not try any charset
conversion, just give me the raw text as stored in the db. It seems
to do this when \x is set. But apparently it needs to compute the length
of the strings, and for this it uses the string routines from
its own locale.

I'm uncertain yet if this is the expected behaviour.

What's client_encoding set to in the two cases?  If it's not utf8,
does changing it to that improve matters?  Alternatively, see what
xterm (or whatever terminal window you're using) thinks the encoding
is, and change it to match psql's client_encoding.

                       regards, tom lane

--
Hernán J. González
http://hjg.com.ar/

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hernan gonzalez (#3)
Re: psql weird behaviour with charset encodings

hernan gonzalez <hgonzalez@gmail.com> writes:

But I actually don't want that. I want psql to not try any charset
conversion, just give me the raw text as stored in the db.

Well, that's what it's doing (given the default setting with
client_encoding equal to server_encoding), and then xterm is
misdisplaying the text because xterm thinks it's utf8. I'm
not very clear on why the \x case seems to display correctly
anyway, but you really need the terminal's encoding to agree
with client_encoding in order to get non-ASCII characters to
work well.

regards, tom lane

#5 hernan gonzalez
hgonzalez@gmail.com
In reply to: Tom Lane (#4)
Re: psql weird behaviour with charset encodings

Mmm no: \x displays correctly for me because it sends
the raw text (in LATIN9) and I have set my terminal to LATIN9 (ISO-8859-15).

And it's not that "xterm is misdisplaying" the text; it's just that psql
is outputting
an EMPTY (zero-length) string for that field.
(I can even send the output to a file with \o and check it in hexadecimal,
to re-verify that it's not a terminal problem - it's not.)

The issue is that psql (apparently) tries to convert to UTF8
(even when it plans to output the raw text - LATIN9 in this case)
just to compute the length of the field, to build the table.
And for this computation it (apparently) relies on the string
routines of its own locale, instead of the DB or client encoding.

That's why no problem arises with the \x switch.

I believe this is wrong, because when the client does not specify
an encoding, no conversion should be attempted.

Hernán

Well, that's what it's doing (given the default setting with
client_encoding equal to server_encoding), and then xterm is
misdisplaying the text because xterm thinks it's utf8.  I'm
not very clear on why the \x case seems to display correctly
anyway, but you really need the terminal's encoding to agree
with client_encoding in order to get non-ASCII characters to
work well.

                       regards, tom lane

--
Hernán J. González
http://hjg.com.ar/

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hernan gonzalez (#5)
Re: psql weird behaviour with charset encodings

hernan gonzalez <hgonzalez@gmail.com> writes:

The issue is that psql (apparently) tries to convert to UTF8
(even when it plans to output the raw text - LATIN9 in this case)
just to compute the length of the field, to build the table.
And for this computation it (apparently) relies on the string
routines of its own locale, instead of the DB or client encoding.

I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations. But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

fprintf(fout, "%.*s", bytes_to_output,
this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print. However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding. I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero. I think we're going to have to
change the code to not rely on %.*s here. Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte. (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

regards, tom lane

#7 Noname
hgonzalez@gmail.com
In reply to: Tom Lane (#6)
Re: psql weird behaviour with charset encodings

However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.

Well, I myself have trouble believing that :-)
This would be nasty... Are you sure?

I couldn't reproduce it.
I made a quick test, passing a UTF-8 encoded string
(5 bytes corresponding to 4 unicode chars: "niño"),
and my glibc (same Fedora 12) seems to count bytes,
as it should.

#include <stdio.h>

int main(void)
{
	char s[] = "ni\xc3\xb1o";
	printf("|%.*s|\n", 5, s);
	return 0;
}

This, compiled with gcc 4.4.3 and run with my root locale (utf8),
did not pad a blank, i.e. it worked as expected.

Hernán

#8 hernan gonzalez
hgonzalez@gmail.com
In reply to: Noname (#7)
Re: psql weird behaviour with charset encodings

Sorry about an error in my previous example (I mixed up width and precision).
But the conclusion is the same - it works on bytes:

#include <stdio.h>

int main(void)
{
	char s[] = "ni\xc3\xb1o";	/* 5 bytes, 4 utf8 chars */
	printf("|%*s|\n", 6, s);	/* this should pad a blank */
	printf("|%.*s|\n", 4, s);	/* this should eat a char */
	return 0;
}

[root@myserv tmp]# ./a.out | od -t cx1
0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a

Hernán


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hernan gonzalez (#8)
Re: psql weird behaviour with charset encodings

hernan gonzalez <hgonzalez@gmail.com> writes:

Sorry about an error in my previous example (I mixed up width and precision).
But the conclusion is the same - it works on bytes:

This example works like that because it's running in C locale always.
Try something like this:

#include<stdio.h>
#include<locale.h>

int main () {
char s[] = "ni\xc3qo"; /* 5 bytes , not valid utf8 */

setlocale(LC_ALL, "");
printf("|%.*s|\n",3,s);
return 0;
}

I get different (and undesirable) effects depending on LANG.

regards, tom lane

#10 hernan gonzalez
hgonzalez@gmail.com
In reply to: Tom Lane (#9)
Re: psql weird behaviour with charset encodings

Wow, you are right, this is bizarre...

And it's not that glibc intends to compute the length in unicode chars;
it actually counts bytes (plain C chars) - as it should - for computing
field widths...
But, for some strange reason, when there is a width calculation involved
it tries to parse the char[] using the locale encoding (when there's no point
in doing so!), and if that fails it silently truncates the printf output.
So this seems more like a glibc bug to me than an interpretation issue (bytes vs chars).
I posted some details in stackoverflow:
http://stackoverflow.com/questions/2792567/printf-field-width-bytes-or-chars

BTW, I understand that postgresql uses locale semantics in the server code.
But is this really necessary/appropriate on the client (psql) side?
Couldn't we stick with the C locale here?

--
Hernán J. González
http://hjg.com.ar/

#11 Noname
hgonzalez@gmail.com
In reply to: hernan gonzalez (#10)
Re: [GENERAL] psql weird behaviour with charset encodings

Well, I finally found some related - rather old - issues in Bugzilla (glibc):

http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

ISO C99 requires for %.*s to only write complete characters that fit below
the precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1 characters as shown in the input file you provided, some of the
strings are not valid UTF-8 strings, therefore sprintf fails with -1
because of the encoding error. That's not a bug in glibc.

It's clear, though it's also rather ugly from a specification point of
view (we must count raw bytes for the precision field, but also must
decode the utf8 chars to find character boundaries). I guess we must
live with that.

Hernán J. González

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#11)
Re: [GENERAL] psql weird behaviour with charset encodings

hgonzalez@gmail.com writes:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

ISO C99 requires for %.*s to only write complete characters that fit below
the precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1 characters as shown in the input file you provided, some of the
strings are not valid UTF-8 strings, therefore sprintf fails with -1
because of the encoding error. That's not a bug in glibc.

Yeah, that was about the position I thought they'd take.

So the bottom line here is that we're best off to avoid %.*s because
it may fail if the string contains data that isn't validly encoded
according to libc's idea of the prevailing encoding. I think that
means the patch I committed earlier is still a good idea, but the
comments need a bit of adjustment. Will fix.

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hernan gonzalez (#10)
Re: psql weird behaviour with charset encodings

hernan gonzalez <hgonzalez@gmail.com> writes:

BTW, I understand that postgresql uses locale semantics in the server code.
But is this really necessary/appropriate on the client (psql) side?
Couldn't we stick with the C locale here?

As far as that goes, I think we have to turn on that machinery in order
to have gettext() work (ie, to have localized error messages).

regards, tom lane

#14 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#12)
1 attachment(s)
Re: [GENERAL] psql weird behaviour with charset encodings

On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote:

hgonzalez@gmail.com writes:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

ISO C99 requires for %.*s to only write complete characters that fit below
the precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1 characters as shown in the input file you provided, some of the
strings are not valid UTF-8 strings, therefore sprintf fails with -1
because of the encoding error. That's not a bug in glibc.

Yeah, that was about the position I thought they'd take.

GNU libc eventually revisited that conclusion and fixed the bug through commit
715a900c9085907fa749589bf738b192b1a2bda5. RHEL 7.1 is fixed, but RHEL 6.6 and
RHEL 5.11 are still affected; the bug will be relevant for another 8+ years.

So the bottom line here is that we're best off to avoid %.*s because
it may fail if the string contains data that isn't validly encoded
according to libc's idea of the prevailing encoding.

Yep. Immediate precisions like %.10s trigger the bug as effectively as %.*s,
so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also affected.
Switching to strlcpy(), as attached, fixes the bug while simplifying the code.
The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"' when
the name of a file in the data directory is not a valid multibyte string.

Commit 6dd9584 introduced a new use of %.*s, to pg_upgrade. It works reliably
for now, because it always runs in the C locale. pg_upgrade never calls
set_pglocale_pgservice() or otherwise sets its permanent locale. It would be
natural for us to fix that someday, at which point non-ASCII database names
would perturb this status output.

It would be good to purge the code of precisions on "s" conversion specifiers,
then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan
to do it myself, but it would be a nice little defensive change.

Attachments:

tar-namechars-v2.patch (text/plain; charset=us-ascii)
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 0e4bd12..f46c5fc 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -17,6 +17,12 @@ command_fails(
 	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
 	'pg_basebackup fails because of hba');
 
+# Some Windows ANSI code pages may reject this filename, in which case we
+# quietly proceed without this bit of test coverage.
+open BADCHARS, ">>$tempdir/pgdata/FOO\xe0\xe0\xe0BAR";
+print BADCHARS "test backup of file with non-UTF8 name\n";
+close BADCHARS;
+
 open HBA, ">>$tempdir/pgdata/pg_hba.conf";
 print HBA "local replication all trust\n";
 print HBA "host replication all 127.0.0.1/32 trust\n";
diff --git a/src/port/tar.c b/src/port/tar.c
index 4721df3..72fd4e1 100644
--- a/src/port/tar.c
+++ b/src/port/tar.c
@@ -68,7 +68,7 @@ tarCreateHeader(char *h, const char *filename, const char *linktarget,
 	memset(h, 0, 512);			/* assume tar header size */
 
 	/* Name 100 */
-	sprintf(&h[0], "%.99s", filename);
+	strlcpy(&h[0], filename, 100);
 	if (linktarget != NULL || S_ISDIR(mode))
 	{
 		/*
@@ -110,7 +110,7 @@ tarCreateHeader(char *h, const char *filename, const char *linktarget,
 		/* Type - Symbolic link */
 		sprintf(&h[156], "2");
 		/* Link Name 100 */
-		sprintf(&h[157], "%.99s", linktarget);
+		strlcpy(&h[157], linktarget, 100);
 	}
 	else if (S_ISDIR(mode))
 		/* Type - directory */
@@ -127,11 +127,11 @@ tarCreateHeader(char *h, const char *filename, const char *linktarget,
 
 	/* User 32 */
 	/* XXX: Do we need to care about setting correct username? */
-	sprintf(&h[265], "%.31s", "postgres");
+	strlcpy(&h[265], "postgres", 32);
 
 	/* Group 32 */
 	/* XXX: Do we need to care about setting correct group name? */
-	sprintf(&h[297], "%.31s", "postgres");
+	strlcpy(&h[297], "postgres", 32);
 
 	/* Major Dev 8 */
 	sprintf(&h[329], "%07o ", 0);
#15 Michael Paquier
michael.paquier@gmail.com
In reply to: Noah Misch (#14)
Re: [GENERAL] psql weird behaviour with charset encodings

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote:

On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote:

hgonzalez@gmail.com writes:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

ISO C99 requires for %.*s to only write complete characters that fit below
the precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1 characters as shown in the input file you provided, some of the
strings are not valid UTF-8 strings, therefore sprintf fails with -1
because of the encoding error. That's not a bug in glibc.

Yeah, that was about the position I thought they'd take.

GNU libc eventually revisited that conclusion and fixed the bug through
commit 715a900c9085907fa749589bf738b192b1a2bda5. RHEL 7.1 is fixed, but
RHEL 6.6 and RHEL 5.11 are still affected; the bug will be relevant for
another 8+ years.

So the bottom line here is that we're best off to avoid %.*s because
it may fail if the string contains data that isn't validly encoded
according to libc's idea of the prevailing encoding.

Yep. Immediate precisions like %.10s trigger the bug as effectively as
%.*s, so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also
affected. Switching to strlcpy(), as attached, fixes the bug while
simplifying the code. The bug symptom is error 'pg_basebackup: unrecognized
link indicator "0"' when the name of a file in the data directory is not a
valid multibyte string.

Commit 6dd9584 introduced a new use of %.*s, to pg_upgrade. It works
reliably for now, because it always runs in the C locale. pg_upgrade never
calls set_pglocale_pgservice() or otherwise sets its permanent locale. It
would be natural for us to fix that someday, at which point non-ASCII
database names would perturb this status output.

I found the following places that need attention, on top of the 4
in tar.c:
src/backend/nodes/read.c: elog(ERROR, "unrecognized integer: \"%.*s\"",
src/backend/nodes/read.c: elog(ERROR, "unrecognized OID: \"%.*s\"",
src/backend/nodes/read.c: elog(ERROR, "unrecognized token: \"%.*s\"", tok_len, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized integer: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized boolop \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "badly formatted node string \"%.32s\"...", token);
src/backend/tsearch/wparser_def.c: * Use of %.*s here is a bit risky since it can misbehave if the data is
src/backend/tsearch/wparser_def.c: fprintf(stderr, "parsing \"%.*s\"\n", len, str);
src/backend/tsearch/wparser_def.c: /* See note above about %.*s */
src/backend/tsearch/wparser_def.c: fprintf(stderr, "parsing copy of \"%.*s\"\n", prs->lenstr, prs->str);
src/backend/utils/adt/datetime.c: * Note: the uses of %.*s in this function would be risky if the
src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c: /* %.*s is safe since all our tokens are ASCII */
src/backend/utils/adt/datetime.c: elog(LOG, "token too long in %s table: \"%.*s\"",
src/interfaces/ecpg/ecpglib/error.c: /* %.*s is safe here as long as sqlstate is all-ASCII */
src/interfaces/ecpg/ecpglib/error.c: ecpg_log("raising sqlstate %.*s (sqlcode %ld): %s\n",
src/interfaces/ecpg/pgtypeslib/dt_common.c: * Note: the uses of %.*s in this function would be risky if the
src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/bin/pg_basebackup/pg_basebackup.c: ngettext("%*s/%s kB (%d%%), %d/%d tablespace (%s%-*.*s)",
src/bin/pg_basebackup/pg_basebackup.c: "%*s/%s kB (%d%%), %d/%d tablespaces (%s%-*.*s)",
src/bin/pg_upgrade/util.c: printf("    %s%-*.*s\r",

It would be good to purge the code of precisions on "s" conversion
specifiers, then Assert(!pointflag) in fmtstr() to catch new
introductions. I won't plan to do it myself, but it would be a nice
little defensive change.

This sounds like a good protection idea, but as it impacts existing backend
code relying on sprintf's port version, we should only do the assertion in
HEAD in my opinion, and mention it in the release notes of the next major
version at the time a patch in this area is applied. I guess that we had
better backpatch the fixes for places using %.*s on back-branches, though.
--
Michael

#16 Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#15)
1 attachment(s)
Re: [GENERAL] psql weird behaviour with charset encodings

On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote:

It would be good to purge the code of precisions on "s" conversion
specifiers, then Assert(!pointflag) in fmtstr() to catch new
introductions. I won't plan to do it myself, but it would be a nice
little defensive change.

This sounds like a good protection idea, but as it impacts existing
backend code relying on sprintf's port version we should only do the
assertion in HEAD in my opinion, and mention it in the release notes of the
next major version at the time a patch in this area is applied. I guess
that we had better backpatch the places using %.*s though on back-branches.

Attached is a patch purging a bunch of places of %.*s; this will
make the code more robust when facing non-ASCII strings. I left the
pg_upgrade and pg_basebackup code paths alone, as moving away from this
specifier there reduces code readability. We may want to fix them too if
file path names can have non-ASCII characters, but it seems less critical.
Thoughts?
--
Michael

Attachments:

20150603_string_prec_non_ascii.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/nodes/read.c b/src/backend/nodes/read.c
index 0dabfa7..910c124 100644
--- a/src/backend/nodes/read.c
+++ b/src/backend/nodes/read.c
@@ -326,8 +326,18 @@ nodeRead(char *token, int tok_len)
 							break;
 						val = (int) strtol(token, &endptr, 10);
 						if (endptr != token + tok_len)
-							elog(ERROR, "unrecognized integer: \"%.*s\"",
-								 tok_len, token);
+						{
+							/*
+							 * Cannot use %.*s here because some platforms
+							 * interpret the precision of %s in characters
+							 * rather than bytes.
+							 */
+							char *buf = (char *) palloc(tok_len + 1);
+							memcpy(buf, token, tok_len);
+							buf[tok_len] = '\0';
+							elog(ERROR, "unrecognized integer: \"%s\"",
+								 buf);
+						}
 						l = lappend_int(l, val);
 					}
 				}
@@ -346,8 +356,17 @@ nodeRead(char *token, int tok_len)
 							break;
 						val = (Oid) strtoul(token, &endptr, 10);
 						if (endptr != token + tok_len)
-							elog(ERROR, "unrecognized OID: \"%.*s\"",
-								 tok_len, token);
+						{
+							/*
+							 * Cannot use %.*s here because some platforms
+							 * interpret the precision of %s in characters
+							 * rather than bytes.
+							 */
+							char *buf = (char *) palloc(tok_len + 1);
+							memcpy(buf, token, tok_len);
+							buf[tok_len] = '\0';
+							elog(ERROR, "unrecognized OID: \"%s\"", buf);
+						}
 						l = lappend_oid(l, val);
 					}
 				}
@@ -380,7 +399,14 @@ nodeRead(char *token, int tok_len)
 			}
 			else
 			{
-				elog(ERROR, "unrecognized token: \"%.*s\"", tok_len, token);
+				/*
+				 * Cannot use %.*s here because some platforms interpret
+				 * the precision of %s in characters rather than bytes.
+				 */
+				char *buf = (char *) palloc(tok_len + 1);
+				memcpy(buf, token, tok_len);
+				buf[tok_len] = '\0';
+				elog(ERROR, "unrecognized token: \"%s\"", buf);
 				result = NULL;	/* keep compiler happy */
 			}
 			break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f5a40fb..444b54d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -142,6 +142,13 @@
 #define nullable_string(token,length)  \
 	((length) == 0 ? NULL : debackslash(token, length))
 
+#define error_token(message, token, len)	\
+	do {									\
+		char *buf = palloc(len + 1);		\
+		memcpy(buf, token, len);			\
+		buf[len] = '\0';					\
+		elog(ERROR, message, buf);			\
+	} while (0)
 
 static Datum readDatum(bool typbyval);
 
@@ -159,13 +166,13 @@ _readBitmapset(void)
 	if (token == NULL)
 		elog(ERROR, "incomplete Bitmapset structure");
 	if (length != 1 || token[0] != '(')
-		elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
+		error_token("unrecognized token: \"%s\"", token, length);
 
 	token = pg_strtok(&length);
 	if (token == NULL)
 		elog(ERROR, "incomplete Bitmapset structure");
 	if (length != 1 || token[0] != 'b')
-		elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
+		error_token("unrecognized token: \"%s\"", token, length);
 
 	for (;;)
 	{
@@ -179,7 +186,7 @@ _readBitmapset(void)
 			break;
 		val = (int) strtol(token, &endptr, 10);
 		if (endptr != token + length)
-			elog(ERROR, "unrecognized integer: \"%.*s\"", length, token);
+			error_token("unrecognized integer: \"%s\"", token, length);
 		result = bms_add_member(result, val);
 	}
 
@@ -803,7 +810,7 @@ _readBoolExpr(void)
 	else if (strncmp(token, "not", 3) == 0)
 		local_node->boolop = NOT_EXPR;
 	else
-		elog(ERROR, "unrecognized boolop \"%.*s\"", length, token);
+		error_token("unrecognized boolop: \"%s\"", token, length);
 
 	READ_NODE_FIELD(args);
 	READ_LOCATION_FIELD(location);
@@ -1534,7 +1541,10 @@ parseNodeString(void)
 		return_value = _readDeclareCursorStmt();
 	else
 	{
-		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
+		char buf[33];
+		memcpy(buf, token, 32);
+		buf[32] = '\0';
+		elog(ERROR, "badly formatted node string \"%s\"...", buf);
 		return_value = NULL;	/* keep compiler quiet */
 	}
 
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 18ff9e2..b8fafd1 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -329,13 +329,17 @@ TParserInit(char *str, int len)
 	prs->state->state = TPS_Base;
 
 #ifdef WPARSER_TRACE
-
-	/*
-	 * Use of %.*s here is a bit risky since it can misbehave if the data is
-	 * not in what libc thinks is the prevailing encoding.  However, since
-	 * this is just a debugging aid, we choose to live with that.
-	 */
-	fprintf(stderr, "parsing \"%.*s\"\n", len, str);
+	{
+		/*
+		 * %.*s is not used here to avoid hazards with libc's prevailing encoding
+		 * where precision can be counted as bytes or as characters.
+		 */
+		char *buf = (char *) palloc(prs->lenstr + 1);
+		memcpy(buf, prs->str, prs->lenstr);
+		buf[prs->lenstr] = '\0';
+		fprintf(stderr, "parsing \"%s\"\n", buf);
+		pfree(buf);
+	}
 #endif
 
 	return prs;
@@ -374,8 +378,14 @@ TParserCopyInit(const TParser *orig)
 	prs->state->state = TPS_Base;
 
 #ifdef WPARSER_TRACE
-	/* See note above about %.*s */
-	fprintf(stderr, "parsing copy of \"%.*s\"\n", prs->lenstr, prs->str);
+	{
+		char *buf = (char *) palloc(prs->lenstr + 1);
+		memcpy(buf, prs->str, prs->lenstr);
+		buf[prs->lenstr] = '\0';
+		/* See note above about not using %.*s */
+		fprintf(stderr, "parsing copy of \"%s\"\n", buf);
+		pfree(buf);
+	}
 #endif
 
 	return prs;
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 2a44b6e..0146e9f 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4005,15 +4005,21 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, bool print_tz, int tz, const char
 			AppendTimestampSeconds(str + strlen(str), tm, fsec);
 
 			/*
-			 * Note: the uses of %.*s in this function would be risky if the
-			 * timezone names ever contain non-ASCII characters.  However, all
-			 * TZ abbreviations in the Olson database are plain ASCII.
+			 * Note: the use of %.*s in this function would be risky if the
+			 * timezone names ever contain non-ASCII characters.  All TZ
+			 * abbreviations in the Olson database are plain ASCII, but we
+			 * avoid %.*s here anyway.
 			 */
 
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 					EncodeTimezone(str, tz, style);
 			}
@@ -4036,7 +4042,12 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, bool print_tz, int tz, const char
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 					EncodeTimezone(str, tz, style);
 			}
@@ -4070,7 +4081,12 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, bool print_tz, int tz, const char
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 				{
 					/*
@@ -4368,10 +4384,11 @@ CheckDateTokenTable(const char *tablename, const datetkn *base, int nel)
 		/* check for token strings that don't fit */
 		if (strlen(base[i].token) > TOKMAXLEN)
 		{
-			/* %.*s is safe since all our tokens are ASCII */
-			elog(LOG, "token too long in %s table: \"%.*s\"",
-				 tablename,
-				 TOKMAXLEN + 1, base[i].token);
+			char buf[TOKMAXLEN + 1];
+			memcpy(buf, base[i].token, TOKMAXLEN);
+			buf[TOKMAXLEN] = '\0';
+			elog(LOG, "token too long in %s table: \"%s\"",
+				 tablename, buf);
 			ok = false;
 			break;				/* don't risk applying strcmp */
 		}
diff --git a/src/interfaces/ecpg/ecpglib/error.c b/src/interfaces/ecpg/ecpglib/error.c
index 715bedb..a40152f 100644
--- a/src/interfaces/ecpg/ecpglib/error.c
+++ b/src/interfaces/ecpg/ecpglib/error.c
@@ -257,8 +257,8 @@ ecpg_raise_backend(int line, PGresult *result, PGconn *conn, int compat)
 		sqlca->sqlcode = ECPG_PGSQL;
 
 	/* %.*s is safe here as long as sqlstate is all-ASCII */
-	ecpg_log("raising sqlstate %.*s (sqlcode %ld): %s\n",
-			 (int) sizeof(sqlca->sqlstate), sqlca->sqlstate, sqlca->sqlcode, sqlca->sqlerrm.sqlerrmc);
+	ecpg_log("raising sqlstate %s (sqlcode %ld): %s\n",
+			 sqlstate, sqlca->sqlcode, sqlca->sqlerrm.sqlerrmc);
 
 	/* free all memory we have allocated for the user */
 	ECPGfree_auto_mem();
diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c
index 7fe2982..b554201 100644
--- a/src/interfaces/ecpg/pgtypeslib/dt_common.c
+++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c
@@ -851,16 +851,15 @@ EncodeDateTime(struct tm * tm, fsec_t fsec, bool print_tz, int tz, const char *t
 			if (tm->tm_year <= 0)
 				sprintf(str + strlen(str), " BC");
 
-			/*
-			 * Note: the uses of %.*s in this function would be risky if the
-			 * timezone names ever contain non-ASCII characters.  However, all
-			 * TZ abbreviations in the Olson database are plain ASCII.
-			 */
-
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 				{
 					hour = -(tz / SECS_PER_HOUR);
@@ -909,7 +908,12 @@ EncodeDateTime(struct tm * tm, fsec_t fsec, bool print_tz, int tz, const char *t
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 				{
 					hour = -(tz / SECS_PER_HOUR);
@@ -968,7 +972,12 @@ EncodeDateTime(struct tm * tm, fsec_t fsec, bool print_tz, int tz, const char *t
 			if (print_tz)
 			{
 				if (tzn)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
+				{
+					char buf[MAXTZLEN + 1];
+					memcpy(buf, tzn, MAXTZLEN);
+					buf[MAXTZLEN] = '\0';
+					sprintf(str + strlen(str), " %s", buf);
+				}
 				else
 				{
 					/*
#17Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#16)
Re: [GENERAL] psql weird behaviour with charset encodings

On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote:

On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote:

It would be good to purge the code of precisions on "s" conversion specifiers,
then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan
to do it myself, but it would be a nice little defensive change.

This sounds like a good protection idea, but as it impacts existing
backend code relying on sprintf's port version, we should only do the
assertion in HEAD in my opinion, and mention it in the release notes of the
next major version when a patch in this area is applied. I guess

Adding the assertion would be master-only. We don't necessarily release-note
C API changes.

that we had better backpatch the places using .*s though on back-branches.

I would tend to back-patch only the ones that cause interesting bugs. For
example, we should not reach the read.c elog() calls anyway, so it's not a big
deal if the GNU libc bug makes them a bit less helpful in back branches.
(Thanks for the list of code sites; it was more complete than anything I had.)
So far, only tar.c looks harmed enough to back-patch.

Attached is a patch purging a bunch of places of %.*s; this will
make the code more solid when facing non-ASCII strings. I left pg_upgrade
and pg_basebackup code paths alone, as moving away from this format
specifier reduces the code readability there. We may want to fix them
though in case file path names contain non-ASCII characters, but it seems less critical.

To add the assertion, we must of course fix all uses.

Having seen the patch I requested, I don't like the result as much as I
expected to like it. The patched code is definitely harder to read and write:

@@ -1534,7 +1541,10 @@ parseNodeString(void)
 		return_value = _readDeclareCursorStmt();
 	else
 	{
-		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
+		char buf[33];
+		memcpy(buf, token, 32);
+		buf[33] = '\0';
+		elog(ERROR, "badly formatted node string \"%s\"...", buf);
 		return_value = NULL;	/* keep compiler quiet */
 	}

(Apropos, that terminator belongs in buf[32], not buf[33].)

Perhaps we're better off setting aside the whole idea, or forcing use of
snprintf.c on configurations having the bug?

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Noah Misch (#17)
Re: [GENERAL] psql weird behaviour with charset encodings

On Thu, Jun 18, 2015 at 9:47 AM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote:

On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote:

It would be good to purge the code of precisions on "s" conversion specifiers,
then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan
to do it myself, but it would be a nice little defensive change.

This sounds like a good protection idea, but as it impacts existing
backend code relying on sprintf's port version, we should only do the
assertion in HEAD in my opinion, and mention it in the release notes of the
next major version when a patch in this area is applied. I guess

Adding the assertion would be master-only. We don't necessarily release-note
C API changes.

Cool. So we are on the same page.

that we had better backpatch the places using .*s though on back-branches.

I would tend to back-patch only the ones that cause interesting bugs. For
example, we should not reach the read.c elog() calls anyway, so it's not a big
deal if the GNU libc bug makes them a bit less helpful in back branches.
(Thanks for the list of code sites; it was more complete than anything I had.)
So far, only tar.c looks harmed enough to back-patch.

Attached is a patch purging a bunch of places of %.*s; this will
make the code more solid when facing non-ASCII strings. I left pg_upgrade
and pg_basebackup code paths alone, as moving away from this format
specifier reduces the code readability there. We may want to fix them
though in case file path names contain non-ASCII characters, but it seems less critical.

To add the assertion, we must of course fix all uses.

Sure.

Having seen the patch I requested, I don't like the result as much as I
expected to like it. The patched code is definitely harder to read and write:

@@ -1534,7 +1541,10 @@ parseNodeString(void)
return_value = _readDeclareCursorStmt();
else
{
-             elog(ERROR, "badly formatted node string \"%.32s\"...", token);
+             char buf[33];
+             memcpy(buf, token, 32);
+             buf[33] = '\0';
+             elog(ERROR, "badly formatted node string \"%s\"...", buf);
return_value = NULL;    /* keep compiler quiet */
}

We could spread what the first patch did in readfuncs.c by having some
more macros doing the duplicated work. Not that it would improve the
code readability of those macros..

(Apropos, that terminator belongs in buf[32], not buf[33].)

Indeed.

Perhaps we're better off setting aside the whole idea,

At least on OS X (10.8), I am seeing that no more than the number of
bytes defined by the precision is written, so it looks like we are
safe there. So yes, thinking long-term this looks like the better
alternative. And I am wondering about the potential breakage this
could have for third-party Postgres tools using src/port's snprintf.

or forcing use of snprintf.c on configurations having the bug?

I am less sure about that. It doesn't seem worth it, given that the
tendency is to evaluate the precision in bytes and not in characters.
--
Michael
