Invalid byte sequence when importing Oracle BLOB

Started by Ronalmost 5 years ago7 messagesgeneral
Jump to latest
#1Ron
ronljohnsonjr@gmail.com

Hi,

The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Is there anything I can change on the Postgresql side to make these records
import correctly?

--
Angular momentum makes the world go 'round.

#2Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Ron (#1)
Re: Invalid byte sequence when importing Oracle BLOB

On 2021-04-26 06:49:18 -0500, Ron wrote:

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#3Ron
ronljohnsonjr@gmail.com
In reply to: Peter J. Holzer (#2)
Re: Invalid byte sequence when importing Oracle BLOB

On 4/26/21 7:32 AM, Peter J. Holzer wrote:

On 2021-04-26 06:49:18 -0500, Ron wrote:

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?

I thought it was weird, too, but COPY has to read text, no?

Anyway, here are the first 8 lines (beware line wrapping) of the input file:

SET client_encoding TO 'UTF8';
SET synchronous_commit TO off;
SET search_path = strans,public;

TRUNCATE TABLE mv_response_attachment_old;

COPY mv_response_attachment_old
(response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...

It looks like a bog-standard COPY command, with which I've imported a lot of
data.

--
Angular momentum makes the world go 'round.

#4Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Ron (#3)
Re: Invalid byte sequence when importing Oracle BLOB

On 2021-04-26 07:45:26 -0500, Ron wrote:

On 4/26/21 7:32 AM, Peter J. Holzer wrote:

On 2021-04-26 06:49:18 -0500, Ron wrote:

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?

I thought it was weird, too, but COPY has to read text, no?

Yes, but data for a bytea column would normally be encoded in hex or
something like that ...

COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...

... Yes, like this. There are only hex digits (plus \ and x) in the
column, nothing which would require decoding UTF-8.

My guess is that the error is actually in the data for another column.

I'd try to identify the broken records and check whether they contain
some other strange content.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#5Matthias Apitz
guru@unixarea.de
In reply to: Ron (#1)
Re: Invalid byte sequence when importing Oracle BLOB

El día lunes, abril 26, 2021 a las 06:49:18a. m. -0500, Ron escribió:

Hi,

The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

This error can't be caused by a BLOB or bytea column. Only char or vchar
columns can cause (and did cause) such errors in our Oracle/Sybase to
PostgreSQL migrations.

matthias
--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

#6Ron
ronljohnsonjr@gmail.com
In reply to: Peter J. Holzer (#4)
Re: Invalid byte sequence when importing Oracle BLOB

On 4/26/21 7:58 AM, Peter J. Holzer wrote:

On 2021-04-26 07:45:26 -0500, Ron wrote:

On 4/26/21 7:32 AM, Peter J. Holzer wrote:

On 2021-04-26 06:49:18 -0500, Ron wrote:

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?

I thought it was weird, too, but COPY has to read text, no?

Yes, but data for a bytea column would normally be encoded in hex or
something like that ...

COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...

... Yes, like this. There are only hex digits (plus \ and x) in the
column, nothing which would require decoding UTF-8.

My guess is that the error is actually in the data for another column.

I'd try to identify the broken records and check whether they contain
some other strange content.

That's a good idea.  Thanks.

--
Angular momentum makes the world go 'round.

#7Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Ron (#1)
Re: Invalid byte sequence when importing Oracle BLOB

On Mon, 2021-04-26 at 06:49 -0500, Ron wrote:

The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Is there anything I can change on the Postgresql side to make these records
import correctly?

The "bytea" column is probably a red herring - this error message should
be caused by a "text" or "varchar" or other string data column.

It is surprisingly easy to enter currupt strings into Oracle - just set
client encoding to the same value as server encoding, and it won't check
the integrity of your strings.

If that is your problem, you must identify and fix the data in Oracle.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com