Invalid byte sequence when importing Oracle BLOB
Hi,
The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8
The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.
The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.
Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
Is there anything I can change on the Postgresql side to make these records
import correctly?
--
Angular momentum makes the world go 'round.
On 2021-04-26 06:49:18 -0500, Ron wrote:
The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On 4/26/21 7:32 AM, Peter J. Holzer wrote:
On 2021-04-26 06:49:18 -0500, Ron wrote:
The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbfDecoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?
I thought it was weird, too, but COPY has to read text, no?
Anyway, here are the first 8 lines (beware line wrapping) of the input file:
SET client_encoding TO 'UTF8';
SET synchronous_commit TO off;
SET search_path = strans,public;
TRUNCATE TABLE mv_response_attachment_old;
COPY mv_response_attachment_old
(response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...
It looks like a bog-standard COPY command, with which I've imported a lot of
data.
--
Angular momentum makes the world go 'round.
On 2021-04-26 07:45:26 -0500, Ron wrote:
On 4/26/21 7:32 AM, Peter J. Holzer wrote:
On 2021-04-26 06:49:18 -0500, Ron wrote:
The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbfDecoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?I thought it was weird, too, but COPY has to read text, no?
Yes, but data for a bytea column would normally be encoded in hex or
something like that ...
COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...
... Yes, like this. There are only hex digits (plus \ and x) in the
column, nothing which would require decoding UTF-8.
My guess is that the error is actually in the data for another column.
I'd try to identify the broken records and check whether they contain
some other strange content.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
El día lunes, abril 26, 2021 a las 06:49:18a. m. -0500, Ron escribió:
Hi,
The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
This error can't be caused by a BLOB or bytea column. Only char or vchar
columns can cause (and did cause) such errors in our Oracle/Sybase to
PostgreSQL migrations.
matthias
--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
On 4/26/21 7:58 AM, Peter J. Holzer wrote:
On 2021-04-26 07:45:26 -0500, Ron wrote:
On 4/26/21 7:32 AM, Peter J. Holzer wrote:
On 2021-04-26 06:49:18 -0500, Ron wrote:
The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbfDecoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?I thought it was weird, too, but COPY has to read text, no?
Yes, but data for a bytea column would normally be encoded in hex or
something like that ...COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...... Yes, like this. There are only hex digits (plus \ and x) in the
column, nothing which would require decoding UTF-8.My guess is that the error is actually in the data for another column.
I'd try to identify the broken records and check whether they contain
some other strange content.
That's a good idea. Thanks.
--
Angular momentum makes the world go 'round.
On Mon, 2021-04-26 at 06:49 -0500, Ron wrote:
The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbfIs there anything I can change on the Postgresql side to make these records
import correctly?
The "bytea" column is probably a red herring - this error message should
be caused by a "text" or "varchar" or other string data column.
It is surprisingly easy to enter currupt strings into Oracle - just set
client encoding to the same value as server encoding, and it won't check
the integrity of your strings.
If that is your problem, you must identify and fix the data in Oracle.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com