Bug in UTF8-Validation Code?
Hi,
I have a problem with a database: I can dump the database to a file, but
restoration fails. This happens with 8.1.4.
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> is accepted, even though it is not valid UTF8.
pg_dump testdb -f testdb.dump -Fc
pg_restore -d testdb testdb.dump => fails with an error:
ERROR: invalid byte sequence for encoding "UTF8": 0xa4
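The byte-level cause can be checked outside PostgreSQL. As a quick illustration (Python used here only as a convenient strict decoder), 0xa4 is a lone UTF-8 continuation byte, so any strict decoder rejects it exactly like the server does on restore:

```python
# 0xa4 can only ever appear as a continuation byte inside a multibyte
# UTF-8 sequence; on its own it is invalid, which is what the server
# complains about when the dump is restored.
raw = b"\xa4"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # "invalid start byte"
```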
The data itself comes from a CSV file, which is imported with \copy without
proper quoting (so I have to fix this anyway), but I still think this is an
error, since it makes restoration very complicated in such cases...
Or am I doing something completely wrong here?
Best regards,
Mario Weilguni
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> is accepted, even though it is not valid UTF8.
This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."
Yours,
Laurenz Albe
On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote:
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> is accepted, even though it is not valid UTF8.
This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."
In that case, pg_dump is doing the wrong thing here and should escape the
output. IMO this cannot be called "working as expected" when it renders
database dumps worthless without any warning at dump time.
pg_dump should output \244 itself in that case.
Best regards,
Mario Weilguni
Mario Weilguni wrote:
On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote:
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> is accepted, even though it is not valid UTF8.
This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."
In that case, pg_dump is doing the wrong thing here and should escape the
output. IMO this cannot be called "working as expected" when it renders
database dumps worthless without any warning at dump time.
pg_dump should output \244 itself in that case.
The sentence quoted from the docs is perhaps less than a model of
clarity. I would take it to mean that no client-encoding ->
server-encoding translation will take place. Does it really mean that
the server will happily accept any escaped byte sequence, whether or not
it is valid for the server encoding? If so that seems ... odd.
cheers
andrew
On Tuesday, 13 March 2007 15:12, Andrew Dunstan wrote:
The sentence quoted from the docs is perhaps less than a model of
clarity. I would take it to mean that no client-encoding ->
server-encoding translation will take place. Does it really mean that
the server will happily accept any escaped byte sequence, whether or not
it is valid for the server encoding? If so that seems ... odd.
Yes, \octal sequences are accepted even if invalid. The problem is, pg_dump
will happily dump those sequences as is, so in that case a raw 0xa4 byte is
emitted, and the dump cannot be restored with pg_restore.
A dangerous feature IMO, and one that will make a lot of DB admins very
unhappy if they have to check every day whether their precious database dumps
can still be restored in case of an emergency.
Best regards,
Mario Weilguni
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> is accepted, even though it is not valid UTF8.
This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."
In that case, pg_dump is doing the wrong thing here and should escape the
output. IMO this cannot be called "working as expected" when it renders
database dumps worthless without any warning at dump time.
pg_dump should output \244 itself in that case.
True. Here is a test case on 8.2.3
(OS, database and client all use UTF8):
test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q
laurenz:~> pg_dump -d -t test -f test.sql test
Here is an excerpt from 'od -c test.sql':
0001040 e n z \n - - \n \n I N S E R T I
0001060 N T O t e s t V A L U E S
0001100 ( ' c o r r e c t : 303 244 ' ) ;
0001120 \n I N S E R T I N T O t e s
0001140 t V A L U E S ( ' i n c o r
0001160 r e c t : 244 ' ) ; \n \n \n - - \n
The invalid character (octal 244) is in the INSERT statement!
This makes psql gag:
test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR: invalid byte sequence for encoding "UTF8": 0xa4
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
A fix could be either that the server checks escape sequences for validity
or that pg_dump outputs invalid bytes as escape sequences.
Or pg_dump could stop with an error.
I think that the cleanest way would be the first.
Yours,
Laurenz Albe
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
that we haven't done that.
cheers
andrew
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
that we haven't done that.
Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?
Joshua D. Drake
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
that we haven't done that.
Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having to use
iconv. Did I miss something?
Regards,
Mario Weilguni
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
that we haven't done that.
I would also say that it's a bug that escape sequences can get characters
into the database that are not valid in the specified encoding. If you
compare the encoding to table constraints, there is no way to simply
"escape" a constraint check.
This seems to violate the principle of consistency in ACID. Additionally,
if you include pg_dump into ACID, it also violates durability, since it
cannot restore what it wrote itself.
Is there anything in the SQL spec that asks for such a behaviour? I guess not.
A DBA will usually not even learn about this issue until they are presented
with a failing restore.
Best Regards,
Michael Paesold
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having to use
iconv. Did I miss something?
http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
"Some users are having problems loading UTF-8 data into 8.1.X. This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
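For what it's worth, the -c flag makes iconv silently drop every byte it cannot convert. A rough sketch of the same cleanup step in Python (illustration only, not part of the release notes):

```python
# Decode the dump while discarding undecodable bytes, then re-encode.
# This mirrors `iconv -c -f UTF-8 -t UTF-8`: lossy, but the result is
# guaranteed to be valid UTF-8.
dirty = b"incorrect: \xa4 correct: \xc3\xa4"
clean = dirty.decode("utf-8", errors="ignore").encode("utf-8")
print(clean)  # the stray 0xa4 is gone; the valid a-umlaut (0xc3 0xa4) survives
```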
--
Michael Fuhr
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote:
Is there anything in the SQL spec that asks for such a behaviour? I guess
not.
I think that the octal escapes are a holdover from the single-byte days where
they were simply a way to enter characters that are difficult to find on a
keyboard. In today's multi-encoding world, it would make more sense if there
were an escape sequence for a *codepoint* which is then converted to the
actual encoding (if possible and valid) in the server. The meaning of
codepoint is, however, character set dependent as well.
The SQL standard supports escape sequences for Unicode codepoints, which I
think would be a very useful feature (try entering a UTF-8 character
bytewise ...), but it's a bit weird to implement and it's not clear how to
handle character sets other than Unicode.
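To make the distinction concrete: a codepoint escape names a character, and the byte sequence only falls out of the server encoding afterwards. Sketched in Python (illustration only):

```python
# U+00E4 ('a' with diaeresis) is one codepoint, but its byte
# representation depends entirely on the target encoding.
cp = 0x00E4
print(chr(cp).encode("utf-8"))    # b'\xc3\xa4' under UTF-8
print(chr(cp).encode("latin-1"))  # b'\xe4'     under LATIN1
# An octal escape like '\244' names a raw byte instead, which is why
# it can produce sequences that are invalid in the server encoding.
```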
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote:
Andrew Dunstan wrote:
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
that we haven't done that.
I would also say that it's a bug that escape sequences can get characters
into the database that are not valid in the specified encoding. If you
compare the encoding to table constraints, there is no way to simply
"escape" a constraint check.
This seems to violate the principle of consistency in ACID. Additionally,
if you include pg_dump into ACID, it also violates durability, since it
cannot restore what it wrote itself.
Is there anything in the SQL spec that asks for such a behaviour? I guess
not.
A DBA will usually not even learn about this issue until they are presented
with a failing restore.
Is there anything I can do to help with this problem? Maybe implementing a
new GUC variable that turns off accepting wrongly encoded sequences (so DBAs
can still turn it on if they really depend on it)?
For me,
Best regards,
Mario Weilguni
Mario Weilguni wrote:
Is there anything I can do to help with this problem? Maybe implementing a
new GUC variable that turns off accepting wrongly encoded sequences (so DBAs
can still turn it on if they really depend on it)?
I think that this should be done away with unconditionally.
Or does anybody have a good point for allowing corrupt data
in text columns?
Maybe it is the way it is now because nobody could be bothered
to add the appropriate checks...
Yours,
Laurenz Albe
Albe Laurenz wrote:
Mario Weilguni wrote:
Is there anything I can do to help with this problem? Maybe implementing a
new GUC variable that turns off accepting wrongly encoded sequences (so DBAs
can still turn it on if they really depend on it)?
I think that this should be done away with unconditionally.
Or does anybody have a good point for allowing corrupt data
in text columns?
Maybe it is the way it is now because nobody could be bothered
to add the appropriate checks...
I agree. It's more or less an integrity violation, IMNSHO.
cheers
andrew
On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote:
Hi,
I have a problem with a database: I can dump the database to a file, but
restoration fails. This happens with 8.1.4.
I reported the same problem a while back:
http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php
Some people think it's a bug, some people don't. It is technically
documented behavior, but I don't think the documentation is clear
enough. I think it is a bug that should be fixed, and here's another
message in the thread that expresses my opinion:
http://archives.postgresql.org/pgsql-bugs/2006-11/msg00033.php
If you look at that email, it includes some examples of surprising
behaviors caused by that bug, particularly with bytea.
In some applications (for which it's impractical to change the source
code), I actually use a CHECK constraint (which raises an exception on
invalid utf8 data) on every text column so that some binary data doesn't
break my slony replication.
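The predicate such a constraint has to evaluate is simply "do these bytes decode as UTF-8". A sketch of the check (Python, illustration only; the actual constraint would of course call a server-side function):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A value entered via a raw octal escape fails the check:
assert is_valid_utf8("correct: \u00e4".encode("utf-8"))
assert not is_valid_utf8(b"incorrect: \xa4")
```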
I'd like to see this fixed.
Regards,
Jeff Davis
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having to use
iconv. Did I miss something?
http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
"Some users are having problems loading UTF-8 data into 8.1.X. This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
If the above quote were actually true, then Mario wouldn't be having a
problem. Instead, it's half-true: Invalid byte sequences are rejected in
some situations and accepted in others. If postgresql consistently
rejected or consistently accepted invalid byte sequences, that would not
cause problems with COPY (meaning problems with pg_dump, slony, etc.).
Regards,
Jeff Davis
Jeff Davis wrote:
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having to use
iconv. Did I miss something?
http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
"Some users are having problems loading UTF-8 data into 8.1.X. This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
If the above quote were actually true, then Mario wouldn't be having a
problem. Instead, it's half-true: Invalid byte sequences are rejected in
some situations and accepted in others. If postgresql consistently
rejected or consistently accepted invalid byte sequences, that would not
cause problems with COPY (meaning problems with pg_dump, slony, etc.).
How can we fix this? Frankly, the statement in the docs warning about
making sure that escaped sequences are valid in the server encoding is a
cop-out. We don't accept invalid data elsewhere, and this should be no
different IMNSHO. I don't see why this should be any different from,
say, date or numeric data. For years people have sneered at MySQL
because it accepted dates like Feb 31st, and rightly so. But this seems
to me to be like our own version of the same problem.
Last year Jeff suggested adding something like:
pg_verifymbstr(string,strlen(string),0);
to each relevant input routine. Would that be an acceptable solution? If
not, what would be?
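For context, the check pg_verifymbstr performs for UTF-8 is essentially structural: each lead byte announces a sequence length and must be followed by that many continuation bytes. A simplified sketch of that walk (Python, illustration only; it deliberately omits the overlong- and surrogate-range refinements a real validator needs):

```python
def verify_mbstr(buf: bytes) -> bool:
    """Structural UTF-8 check: lead byte announces the length,
    followed by 0x80..0xBF continuation bytes (simplified)."""
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:
            length = 1            # plain ASCII
        elif 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            return False          # stray continuation or illegal lead byte
        if i + length > n:
            return False          # truncated sequence at end of buffer
        for j in range(1, length):
            if not 0x80 <= buf[i + j] <= 0xBF:
                return False      # bad continuation byte
        i += length
    return True
```

On the data from this thread, a dump line containing a lone 0xa4 fails this check, while the two-byte sequence 0xc3 0xa4 passes.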
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Last year Jeff suggested adding something like:
pg_verifymbstr(string,strlen(string),0);
to each relevant input routine. Would that be an acceptable solution?
The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the data's already been validated. I'm not sure
how to fix that, but I think you'll get some push-back if you double
the encoding verification work in COPY for nothing.
Given that we are moving away from backslash-enabled literals, I'm
not as convinced as some that this must be fixed...
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Last year Jeff suggested adding something like:
pg_verifymbstr(string,strlen(string),0);
to each relevant input routine. Would that be an acceptable solution?
The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the data's already been validated. I'm not sure
how to fix that, but I think you'll get some push-back if you double
the encoding verification work in COPY for nothing.
Given that we are moving away from backslash-enabled literals, I'm
not as convinced as some that this must be fixed...
They will still be available in E'\nn' form, won't they?
One thought I had was that it might make sense to have a flag that would
inhibit the check, that could be set (and reset) by routines that check
for themselves, such as COPY IN. Then bulk load performance should not
be hit much.
cheers
andrew