Invalid unicode in COPY problem

Started by Madison Kelly almost 21 years ago, 13 messages (hackers, general)
#1 Madison Kelly
linux@alteeve.com
hackers, general

Hi all,

I've been chasing down a bug, and from what I have learned it may be
because of how PostgreSQL (8.0.2 on Fedora Core 4 test 2) handles
invalid Unicode. I've been given some ideas on how to try to catch
invalid Unicode, but they seem expensive, so I am hoping there is a
PostgreSQL way to deal with this problem.

I've run into a problem where a bulk postgres "COPY..." statement is
dying because one of the lines contains a file name with an invalid
Unicode character. In Nautilus this file shows '(invalid encoding)', and
the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column
file_name: "Femme Fatal\uffff.url"'.

Looking at the file from the shell (bash) shows what appears to be
whitespace, but when I copy/paste the file name I get the '\uffff' you
see above.

I could, with the help of the TLUG people, use a regex to match the
invalid character and skip the file, but that is not ideal. The reason is
that this is for my backup program, and invalid Unicode or not, the
contents of the file may still be important, so I would prefer to have
the name in the database so that the file is later copied. I can copy and
move the file in the shell, so the file isn't, in and of itself, corrupt.
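(A Python illustration, not from the original thread: the detection pass described above can be done without a regex, by checking whether the name's on-disk bytes round-trip as strict UTF-8. Python's surrogateescape handler, PEP 383, preserves undecodable bytes as lone surrogates, which strict UTF-8 encoding then rejects.)

```python
def is_valid_utf8_name(name: str) -> bool:
    """True if the name round-trips as strict UTF-8.

    Names read from a POSIX filesystem can carry bytes that are not
    valid UTF-8; the surrogateescape error handler preserves them as
    lone surrogates, which a strict UTF-8 encode rejects.
    """
    try:
        name.encode("utf-8", "strict")
        return True
    except UnicodeEncodeError:
        return False

# A name containing a raw 0xFF byte, as it would come off the disk:
bad = b"Femme Fatal\xff.url".decode("utf-8", "surrogateescape")
good = "Femme Fatale.url"
print(is_valid_utf8_name(bad), is_valid_utf8_name(good))  # False True
```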

So then, is there a way I can tell PostgreSQL to accept the invalid
Unicode name? Here is a copy of my schema:

tle-bu=> \d file_info_2
               Table "public.file_info_2"
        Column        |         Type         |                Modifiers
----------------------+----------------------+------------------------------------------
 file_group_name      | text                 |
 file_group_uid       | bigint               | not null
 file_mod_time        | bigint               | not null
 file_name            | text                 | not null
 file_parent_dir      | text                 | not null
 file_perm            | text                 | not null
 file_size            | bigint               | not null
 file_type            | character varying(2) | not null default 'f'::character varying
 file_user_name       | text                 |
 file_user_uid        | bigint               | not null
 file_backup          | boolean              | not null default true
 file_display         | boolean              | not null default false
 file_restore_display | boolean              | not null default false
 file_restore         | boolean              | not null default false
Indexes:
    "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)

'file_name' and 'file_parent_dir' are the columns that could have
entries with invalid Unicode characters. Maybe I could/should use
something other than 'text'? These columns could contain anything that a
file or directory name could be.

Thanks!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Madison Kelly (#1)
Re: Invalid unicode in COPY problem

We have developed patches which relax the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) A new postgresql.conf item, "mbstr_check", is added.
2) If mbstr_check = 0, invalid characters are not accepted
(same as current PostgreSQL behavior). This is the default.
3) If mbstr_check = 1, invalid characters are accepted with a
WARNING.
4) If mbstr_check = 2, invalid characters are accepted without any
warnings.
5) We have checked the PostgreSQL source code to see whether accepting
invalid characters causes any trouble. We found one place that needs a
fix, and the fix is included in the patches.
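(In postgresql.conf terms, the proposed knob described above would be set like this; note it comes from this patch only and is not a parameter in stock PostgreSQL:)

```
# Proposed by the patch discussed here; not a stock PostgreSQL parameter.
mbstr_check = 1   # 0 = reject (default), 1 = accept with WARNING, 2 = accept silently
```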

Madison,
If you are interested in the patches, I could send them to you.

Hackers,
Do you think functionality like the above is worth adding to
PostgreSQL?
--
Tatsuo Ishii


#3 Madison Kelly
linux@alteeve.com
In reply to: Tatsuo Ishii (#2)
Re: Invalid unicode in COPY problem

Thank you, I would!

What versions have you tested the patch against? I am sorry, but I am
not too familiar with applying patches to the main program; is
there documentation on how to apply the patch? Is there a way to roll
the patch back/remove it? Would I be able to script the installation of
the patch (I would expect so)?

The reason for the last question is that I expect (hope) many people
will use it, and if it works well I want to make it as easy as possible
for a user to simply select or unselect the patch. If I can script
the install and removal of this patch then I can do just that, and that
would be wonderful.

Thank you again!

どうも ありがとう ございます! ("Thank you very much!" I hope that is
right; my Japanese is still elementary. :) )

Madison


#4 John Hansen
john@geeknet.com.au
In reply to: Madison Kelly (#3)
Re: [HACKERS] Invalid unicode in COPY problem

Tatsuo Ishii wrote:


We have developed patches which relaxes the character
validation so that PostgreSQL accepts invalid characters. It
works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

... John

#5 Madison Kelly
linux@alteeve.com
In reply to: John Hansen (#4)
Re: [HACKERS] Invalid unicode in COPY problem


Under most circumstances I would agree with you completely. In my
case, though, I have to choose between risking the loss of a user's data
and attempting to store the file name in some manner that will return the
same name used by the file system.

The user (or one of his/her users, in the case of an admin) may be
completely unaware that the file name is invalid Unicode. The
file itself may still be quite valid and contain information
worth backing up. I could notify the user/admin that the name is not
valid, but there is no way I could rely on the name being changed. Given
the choices, I would prefer to attempt to store/use the file name with
the invalid Unicode character rather than simply ignore the file.

Is there a way to store the name in raw binary? If so, would this not
be safe, because to PostgreSQL it should no longer matter what the data
is or represents, right? Maybe there is a third option I am not yet
considering?

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#6 John Hansen
john@geeknet.com.au
In reply to: Madison Kelly (#5)
Re: [HACKERS] Invalid unicode in COPY problem

Madison Kelly wrote:

Is there a way to store the name in raw binary? If so, would this not
be safe because to postgresql it should no longer matter what data is or
represents, right? Maybe there is a third option I am not yet considering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.
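(A Python sketch, not from the thread, of why the bytea route sidesteps the problem: the column stores the name's bytes verbatim, so no encoding validation is involved. Driver specifics vary, but a bytes value commonly maps to bytea.)

```python
# Keep the name as bytes end to end; a bytea column would store these
# bytes verbatim, so no server-side encoding validation is involved.
raw_name = b"Femme Fatal\xff.url"   # the problem name from this thread

stored = bytes(raw_name)            # what a bytea column would hold
assert stored == raw_name           # byte-for-byte round trip

# Decoding is deferred to display time, where a lossy strategy is fine
# and never touches the stored value:
print(stored.decode("utf-8", "replace"))  # Femme Fatal<U+FFFD>.url
```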

... John

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#4)
Re: [HACKERS] Invalid unicode in COPY problem

"John Hansen" <john@geeknet.com.au> writes:

Tatsuo Ishii wrote:

We have developed patches which relaxes the character
validation so that PostgreSQL accepts invalid characters.

That is just plain 100% wrong!!

That was my first reaction too. Why would this be a good idea?
If someone does want an encoding-agnostic database, they can
set it as SQL_ASCII.

regards, tom lane

#8 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: John Hansen (#4)
Re: [HACKERS] Invalid unicode in COPY problem

Tatsuo Ishii wrote:


We have developed patches which relaxes the character
validation so that PostgreSQL accepts invalid characters. It
works like this:

That is just plain 100% wrong!!

Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

Actually, I myself thought as you do now; later I found that it was
not such a good idea. People already have invalidly encoded data in their
precious databases, and they have a very hard time migrating to newer
versions of PostgreSQL because of the encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either address, phone, or email. Unfortunately, some of the address
fields hold wrongly encoded data. In this case I will use phone or email
to reach them.

Now I need to upgrade to a newer PostgreSQL within one day. I know I
have to fix the wrongly encoded fields, but it will take more than a
day. So I would like to import the data first, then fix the wrongly
encoded fields on the running database, since I can reach members by
phone or email even with a wrongly encoded address field...

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii

#9 Tino Wildenhain
tino@wildenhain.de
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Invalid unicode in COPY problem

On Sunday, 2005-05-08 at 14:30 +0900, Tatsuo Ishii wrote:
...

Actually I myself thought as you are before. Later I found that it was
not so good idea. People already have invalid encoded data in their
precious database and have very hard time to migrate to newer version
of PostgreSQL because of encoding validation.

...

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

Well, if you are in such a hurry, you had better not migrate. Postgres
prides itself on validating input and holding no invalid data. So if
you have invalid data, better to fix it.

I saw this kind of situation in the real world and that's why we
developed the patches.

Why not develop a helper for contrib that re-encodes the
database instead?

Regards
Tino

#10 John Hansen
john@geeknet.com.au
In reply to: Tino Wildenhain (#9)
Re: [HACKERS] Invalid unicode in COPY problem

Tatsuo Ishii wrote:

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

Actually, it would be very simple: create function isvalidutf8(text) in
your preferred language (C source is available from unicode.org), and
create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8, for instance) into
UTF-8. Then:

UPDATE t1 SET field = converttoutf8(field) WHERE NOT isvalidutf8(field);

Now sit back and relax while your invalid data is converted to UTF-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.
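(The two helper functions suggested above can be sketched in Python; the names mirror the suggestion, and Latin-1 stands in for whatever the wrongly used encoding actually was. An illustration, not the thread's code.)

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False

def convert_to_utf8(raw: bytes) -> bytes:
    """Re-encode non-UTF-8 bytes as UTF-8, assuming Latin-1 input.

    Latin-1 is only a stand-in; the actual wrong encoding depends
    on the data and must be determined case by case.
    """
    if is_valid_utf8(raw):
        return raw
    return raw.decode("latin-1").encode("utf-8")

print(convert_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
print(convert_to_utf8(b"ok"))       # b'ok'
```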


#11 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Madison Kelly (#5)
Re: [HACKERS] Invalid unicode in COPY problem

Is there a way to store the name in raw binary? If so, would this not
be safe because to postgresql it should no longer matter what data is or
represents, right? Maybe there is a third option I am not yet concidering?

In the backup, rename the file and add another file,
<file-name-with-invalids-set-to-?>.README,
which explains the issue, details the steps taken (e.g.
renaming), and offers some sort of raw binary value of the
original file name.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#12 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Invalid unicode in COPY problem

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either adress, phone or email. Unfortunately some of address field
have wrong encoded data. In this case I will use phone or email to
reach them.

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

You could add a bytea field in the migration process which
holds the original data, and put some "sloppily corrected" data
into the real field for the time being. Then, once corrected,
set the additional bytea field to NULL. This is pretty much the
concept of staging tables for data import.
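(A minimal Python sketch, not from the thread, of that staging split: each incoming value becomes a lossily corrected text plus, when needed, an untouched raw copy for the staging bytea column.)

```python
def stage(raw: bytes):
    """Return (corrected_text, original_or_none) for one field value.

    Invalid bytes are replaced with U+FFFD in the text column, while
    the untouched original goes into a staging bytea column; once the
    text is fixed by hand, the bytea copy can be set to NULL (None).
    """
    try:
        return raw.decode("utf-8", "strict"), None   # valid: no staging copy needed
    except UnicodeDecodeError:
        return raw.decode("utf-8", "replace"), raw   # keep original for later repair

assert stage(b"ok") == ("ok", None)
assert stage(b"bad\xffbyte") == ("bad\ufffdbyte", b"bad\xffbyte")
```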

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#13 Oliver Jowett
oliver@opencloud.com
In reply to: Madison Kelly (#5)
Re: [HACKERS] Invalid unicode in COPY problem

Madison Kelly wrote:

Is there a way to store the name in raw binary?

Yes: bytea.

-O