\COPY to accept non UTF-8 chars in CHAR columns

Started by Matthias Apitz, March 2020 · 9 messages · general

#1 Matthias Apitz
guru@unixarea.de

Hello,

We're moving databases from Sybase/Oracle to PostgreSQL 11.4 by
unloading them with our own tool into a CSV-like format and loading the
rows with \COPY. As one can imagine, in old databases that have been in
use for a long time you will find all sorts of dust that should not be
there, for example ISO-8859-1 chars in CHAR columns that should be
UTF-8. Needless to say, this data should not be there; it is just a
fact, partly caused by foreign data loaded into the Sybase/Oracle
database. Sometimes our own Java- or Perl-written software breaks a
longer CHAR column into two sequential rows (with an INT column to note
the order), but makes the break in the middle of a UTF-8 multi-byte
sequence. Clumsy. :-(

In short, is there a way to let \COPY accept such broken ISO bytes,
just complaining about them but not stopping the insert of the row?
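
(For illustration, this is the kind of failure we hit; the table and
file names here are made up, but the error is PostgreSQL's standard
complaint about a stray ISO-8859-1 byte such as 0xfc:)

psql -d ourdb -c "\copy titles from 'titles.csv' with (format csv)"
ERROR:  invalid byte sequence for encoding "UTF8": 0xfc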

Thanks

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May 9: Спасибо, освободители! — Thank you very much, Russian liberators!

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthias Apitz (#1)
Re: \COPY to accept non UTF-8 chars in CHAR columns

Matthias Apitz <guru@unixarea.de> writes:

In short, is there a way to let \COPY accept such broken ISO bytes,
just complaining about them but not stopping the insert of the row?

No. We don't particularly believe in the utility of invalid data.

If you don't actually care about what encoding your data is in,
you could use SQL_ASCII as the database "encoding" and thereby
disable all UTF8-specific behavior. Otherwise, maybe this conversion
is a good time to clean up the mess?
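
(A SQL_ASCII database has to be cloned from template0, since template1's
encoding will usually differ; a minimal sketch, with an illustrative
database name:)

createdb --template=template0 --encoding=SQL_ASCII --locale=C ourdb

Note that SQL_ASCII is less an encoding than the absence of one: the
server stores and returns the bytes verbatim and performs no encoding
validation, so any cleanup is deferred to the clients.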

regards, tom lane

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: \COPY to accept non UTF-8 chars in CHAR columns

On Sat, Mar 28, 2020 at 4:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Matthias Apitz <guru@unixarea.de> writes:

In short, is there a way to let \COPY accept such broken ISO bytes,
just complaining about them but not stopping the insert of the row?

No. We don't particularly believe in the utility of invalid data.

If you don't actually care about what encoding your data is in,
you could use SQL_ASCII as the database "encoding" and thereby
disable all UTF8-specific behavior. Otherwise, maybe this conversion
is a good time to clean up the mess?

Something like this approach might be useful for fixing the CSV file:

https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

I haven't tested that program but it looks like the right sort of
approach; I remember writing similar logic to untangle the strange
mixtures of Latin 1, Windows 1252, and UTF-8 that late 90s browsers
used to send. That sort of approach can't fix every theoretical
problem (some valid Latin1 sequences are also valid UTF-8 sequences)
but it's doable with text in European languages.

#4 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Thomas Munro (#3)
Re: \COPY to accept non UTF-8 chars in CHAR columns

"Thomas" == Thomas Munro <thomas.munro@gmail.com> writes:

Thomas> Something like this approach might be useful for fixing the CSV file:

Thomas> https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

Or:

perl -MEncode -pe '
use bytes;
# Strict UTF-8 decode of a run of high bytes; each byte that fails is
# handed to the fallback and reinterpreted as windows-1252.
sub c { decode("UTF-8",shift,sub { decode("windows-1252", chr(shift)) }); }
# Repair each run of bytes >= 0x80; pure ASCII passes through untouched.
s/([\x80-\xFF]+)/encode("UTF-8",c($1))/eg' <infile >outfile
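
(The subtlety is that decode()'s third argument, normally a CHECK
constant, can be a code reference: the strict "UTF-8" decode consumes
everything that is well-formed, and the callback fires only for the
bytes that are not, which are then reinterpreted as windows-1252. As
Thomas noted above, a run of latin-1 bytes that happens to form valid
UTF-8 will still be taken as UTF-8; no byte-level heuristic can tell
those apart.)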

--
Andrew (irc:RhodiumToad)

#5 Rory Campbell-Lange
rory@campbell-lange.net
In reply to: Andrew Gierth (#4)
Re: \COPY to accept non UTF-8 chars in CHAR columns

On 27/03/20, Andrew Gierth (andrew@tao11.riddles.org.uk) wrote:

"Thomas" == Thomas Munro <thomas.munro@gmail.com> writes:

Thomas> Something like this approach might be useful for fixing the CSV file:

Thomas> https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

Or:

perl -MEncode -pe '
use bytes;
sub c { decode("UTF-8",shift,sub { decode("windows-1252", chr(shift)) }); }
s/([\x80-\xFF]+)/encode("UTF-8",c($1))/eg' <infile >outfile

Or:

iconv -f WINDOWS-1252 -t UTF-8 -c < tempfile2 > tempfile3

#6 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Rory Campbell-Lange (#5)
Re: \COPY to accept non UTF-8 chars in CHAR columns

"Rory" == Rory Campbell-Lange <rory@campbell-lange.net> writes:

Rory> Or:

Rory> iconv -f WINDOWS-1252 -t UTF-8 -c < tempfile2 > tempfile3

No. That's just a conversion of win1252 to utf8 without regard for any
UTF8 that might already be present in the input. Any such input will end
up double-encoded, requiring further work to fix.

--
Andrew (irc:RhodiumToad)

#7 Rory Campbell-Lange
rory@campbell-lange.net
In reply to: Andrew Gierth (#6)
Re: \COPY to accept non UTF-8 chars in CHAR columns

On 27/03/20, Andrew Gierth (andrew@tao11.riddles.org.uk) wrote:

"Rory" == Rory Campbell-Lange <rory@campbell-lange.net> writes:

Rory> Or:

Rory> iconv -f WINDOWS-1252 -t UTF-8 -c < tempfile2 > tempfile3

No. That's just a conversion of win1252 to utf8 without regard for any
UTF8 that might already be present in the input. Any such input will end
up double-encoded, requiring further work to fix.

My apologies. I missed the subtlety of the substitution in your perl code.

#8 Matthias Apitz
guru@unixarea.de
In reply to: Thomas Munro (#3)
Re: \COPY to accept non UTF-8 chars in CHAR columns

On Saturday, March 28, 2020 at 09:40:30 a.m. +1300, Thomas Munro wrote:

On Sat, Mar 28, 2020 at 4:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Matthias Apitz <guru@unixarea.de> writes:

In short, is there a way to let \COPY accept such broken ISO bytes,
just complaining about them but not stopping the insert of the row?

No. We don't particularly believe in the utility of invalid data.

If you don't actually care about what encoding your data is in,
you could use SQL_ASCII as the database "encoding" and thereby
disable all UTF8-specific behavior. Otherwise, maybe this conversion
is a good time to clean up the mess?

Something like this approach might be useful for fixing the CSV file:

https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

I haven't tested that program but it looks like the right sort of
approach; I remember writing similar logic to untangle the strange
mixtures of Latin 1, Windows 1252, and UTF-8 that late 90s browsers
used to send. That sort of approach can't fix every theoretical
problem (some valid Latin1 sequences are also valid UTF-8 sequences)
but it's doable with text in European languages.

Thomas,

This is a very good find, thanks for it.

I played around a bit with the original code and tested some
modifications to fit our problem better... It works fine:

- it translates any ISO char, for example 0xfc (German umlaut ü in
ISO-8859-1), into the correct UTF-8 encoding 0xc3 0xbc:

perl -e 'print pack("H*", "4040fc4040")' | ./convert2properUTF8 | hexdump -C
00000000 40 40 c3 bc 40 40 |@@..@@|
00000006

- it handles the situation where 0xc3 0xbc (German umlaut ü in UTF-8
encoding) was broken across two sequential rows, the first ending in
0xc3 and the second starting with 0xbc; this gives:

perl -e 'print pack("H*", "c3")' | ./convert2properUTF8 | hexdump -C
00000000 c3 83 |..|
00000002
perl -e 'print pack("H*", "bc40")' | ./convert2properUTF8 | hexdump -C
00000000 c2 bc 40 |..@|
00000003

i.e. 0xc3 is translated to 0xc3 0x83 and the second half, the 0xbc, to
0xc2 0xbc (0xc3 taken alone is 'Ã' in ISO-8859-1 and 0xbc is '¼', so
each byte gets upgraded on its own). Both translations have nothing to
do with the original split 0xc3 0xbc, and perhaps in this case it would
be better to emit a filler byte 0x40 ('@') for each of the bytes which
formed the 0xc3 0xbc.

But we will discuss this here and align the code to our use cases.

Thanks again

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

#9 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Matthias Apitz (#8)
Re: \COPY to accept non UTF-8 chars in CHAR columns

"Matthias" == Matthias Apitz <guru@unixarea.de> writes:

Matthias> i.e. 0xc3 is translated to 0xc3 0x83 and the second half,
Matthias> the 0xbc, to 0xc2 0xbc. Both translations have nothing to do
Matthias> with the original split 0xc3 0xbc, and perhaps in this case
Matthias> it would be better to emit a filler byte 0x40 ('@') for each
Matthias> of the bytes which formed the 0xc3 0xbc.

If the only malformed sequences are there as a result of splitting up
valid sequences, then you could do something like convert all invalid
sequences to (sequences of) noncharacters, then once the data is
imported, fix it up by adjusting how the data is split and regenerating
the correct sequence (assuming your application allows this).

For example you could encode an arbitrary byte xy as a sequence of two
codepoints U+FDDx U+FDEy (the range FDD0-FDEF are all defined as
noncharacters).
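
(A minimal sketch of that escaping step, in the same spirit as the
one-liner upthread; file names are illustrative, and the post-import
fix-up that re-joins the split rows and rebuilds the original bytes
from the nibbles is left to the application:)

perl -MEncode -pe '
use bytes;
# Map each invalid byte 0xXY to the noncharacter pair U+FDDX U+FDEY;
# the two nibbles can later be recovered to rebuild the original byte.
sub esc { my $b = shift; chr(0xFDD0 + ($b >> 4)) . chr(0xFDE0 + ($b & 0x0F)) }
s/([\x80-\xFF]+)/encode("utf8", decode("UTF-8", $1, \&esc))/eg' <infile >outfile

Valid UTF-8 runs decode and re-encode unchanged; only the bytes the
strict decode rejects are turned into noncharacter pairs, which \COPY
will accept as well-formed UTF-8.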

--
Andrew (irc:RhodiumToad)