BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
The following bug has been logged online:
Bug reference: 3697
Logged by: Marc Mamin
Email address: m.mamin@intershop.de
PostgreSQL version: 8.2.4
Operating system: SuSE Linux 9.1 (i586)
Description: utf8 issue: can not reimport a table that was
successfully exported.
Details:
Hello,
I'm not sure this is a bug; the problem might be related to a client
encoding issue.
My Database is on a Linux server which I connect to using putty from
Windows.
both server and client are set to UTF8:
client_encoding | UTF8
backslash_quote | safe_encoding
server_encoding | UTF8
I stumbled on this issue while trying to import a "malicious" user agent
string...
I didn't check if all characters are valid UTF8...
My concern is about database recovery.
I'm using pg_dump to regulary export my users, bu according to the example
below,
it seems that my dumps may be worthless !
May be you should consider not to publish this before a fix exist
as this is a serious issue which could eventually be exploited
to damage existing instances (for the case this is really a bug)...
regards,
Marc Mamin
steps to repeat:
CREATE TABLE utf8_test(s varchar);
CREATE OR REPLACE FUNCTION f_utf8_test( st VARCHAR) RETURNS INT AS $$
DECLARE
quotedline varchar = quote_literal($1);
BEGIN
INSERT INTO utf8_test ( s ) VALUES ( quotedline);
RETURN 0;
END;
$$ LANGUAGE plpgsql;
select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
\xE3\xEE\xF1\xF3\xE4
xE4\xE6 \xCD\xC1 \xD0\xC1")');
-- here the same statement, but with all backslashed duplicated for the case
when the string was modified when posting this issue:
select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\\xE0\\xF0\\xF1\\xF2\\xE2\\xE5\\xED\\xED\\xFB\\xE9 \\xE2\\xFB\\xF1\\xF8\\9
\\xE3\\xEE\\xF1\\xF3\\xE4
xE4\\xE6 \\xCD\\xC1 \\xD0\\xC1")');
WARNING: nonstandard use of escape in a string literal
LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
COPY utf8_test TO '/tmp/utf8_test.txt';
COPY 1
COPY utf8_test FROM '/tmp/utf8_test.txt';
ERROR: invalid byte sequence for encoding "UTF8": 0xd3ce
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY utf8_test, line 1
The same isuue occure when using pg_dump:
pg_dump -i -v -p 5433 -Uisdb2 -tutf8_test > /tmp/utf8_dump
pg_dump: server version: 8.2.4; pg_dump version: 8.2.1
pg_dump: proceeding despite version mismatch
psql -f"/tmp/utf8_dump"
=>
psql:/tmp/utf8_dump:40: ERROR: invalid byte sequence for encoding "UTF8":
0xd3ce
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY utf8_test, line 1
"Marc Mamin" <m.mamin@intershop.de> writes:
I didn't check if all characters are valid UTF8...
They aren't ...
select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
\xE3\xEE\xF1\xF3\xE4
xE4\xE6 \xCD\xC1 \xD0\xC1")');
In 8.3 that will throw an error:
utf8=# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
utf8'# \xE3\xEE\xF1\xF3\xE4
utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")');
WARNING: nonstandard use of escape in a string literal
LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR: invalid byte sequence for encoding "UTF8": 0xe0f0f1
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
utf8=#
However, since this behavior isn't backwards-compatible, there's not
much appetite for back-patching it.
I don't think this is a security issue --- if you don't quote
backslashes in untrusted input you'll have problems far worse
than this one.
regards, tom lane
Thank you for your quick response,
if you don't quote backslashes in untrusted input you'll have problems
far worse than this one
I do it now but not since by db is live...
So I probably have some invalid caraters in.
Is this an issue that must be fixed before I can upgrade to 8.3 ?
Is there a recommendation how to clean these data (I know where to
search for them)
Thanks,
Marc Mamin
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 25, 2007 6:08 PM
To: Marc Mamin
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #3697: utf8 issue: can not reimport a table that
was successfully exported.
"Marc Mamin" <m.mamin@intershop.de> writes:
I didn't check if all characters are valid UTF8...
They aren't ...
select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
\xE3\xEE\xF1\xF3\xE4
xE4\xE6 \xCD\xC1 \xD0\xC1")');
In 8.3 that will throw an error:
utf8=# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
utf8'# \xE3\xEE\xF1\xF3\xE4 utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")');
WARNING: nonstandard use of escape in a string literal LINE 1: select
f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR: invalid byte sequence for encoding "UTF8": 0xe0f0f1
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
utf8=#
However, since this behavior isn't backwards-compatible, there's not
much appetite for back-patching it.
I don't think this is a security issue --- if you don't quote
backslashes in untrusted input you'll have problems far worse than this
one.
regards, tom lane
"Marc Mamin" <M.Mamin@intershop.de> writes:
Is there a recommendation how to clean these data (I know where to
search for them)
Usually people do it by running iconv with the delete-bad-data option
on a pg_dump file.
regards, tom lane