BUG #5944: COPY FROM doesn't work with international characters

Started by Nathan M. Davalosabout 15 years ago10 messagesbugs
Jump to latest
#1Nathan M. Davalos
n.davalos@sharedmarketing.com

The following bug has been logged online:

Bug reference: 5944
Logged by: Nathan Davalos
Email address: n.davalos@sharedmarketing.com
PostgreSQL version: 9
Operating system: Windows XP/2003
Description: COPY FROM doesn't work with international characters
Details:

I'm trying to load a tab delimited text file with COPY FROM under version
9.

create table tmpintermediate (
AcNumber character varying(20),
AcName character varying(50));

SET CLIENT_ENCODING TO 'WIN1251';
copy tmpintermediate from 'thefile.txt';

Sample contents of thefile:
230002 Alto Desempeño, S.A. De C.V.

When using WIN1251 or WIN1252 I get nothing in the second field, it just
ignores the data. Same thing for LATIN-1.

When using UTF8 for client encoding I get this message:
ERROR: invalid byte sequence for encoding "UTF8": 0xf16f2c20
CONTEXT: COPY tmpintermediate , line 1

Not sure what the problem is, I assumed encoding the database as UTF8 would
just take any character. Inserting the data with an insert statement works
perfectly normal. Is it something I'm doing wrong with COPY FROM?

The database is UTF8, in 8.3 we were using WIN1252 for the database
encoding, but when we migrated to 8.4 everything was forced to UTF8
regardless of what the locale was specified during install using pg_dumpall
to migrate the data, so I just left things as UTF8. Not sure how to proceed
with this one. We create all our databases using a customized template
datatabase (we leave the ones that install with postgresql intact and
untouched).

The copy commands are being issued by pgadmin3 if that makes any difference.
That and we use MicroOlap's PosgresDAC components for Delphi.

#2John R Pierce
pierce@hogranch.com
In reply to: Nathan M. Davalos (#1)
Re: BUG #5944: COPY FROM doesn't work with international characters

On 03/23/11 4:32 PM, Nathan Davalos wrote:

...
SET CLIENT_ENCODING TO 'WIN1251';
copy tmpintermediate from 'thefile.txt';

Sample contents of thefile:
230002 Alto Desempeño, S.A. De C.V.

When using WIN1251 or WIN1252 I get nothing in the second field, it just
ignores the data. Same thing for LATIN-1.

When using UTF8 for client encoding I get this message:
ERROR: invalid byte sequence for encoding "UTF8": 0xf16f2c20
CONTEXT: COPY tmpintermediate , line 1

what is the byte (binary) encoding of the file? in hex,

ñ in win1251 == (no such character. win1251 is cyrillic)
ñ in win1252 == F1
ñ in UTF-8 == C3 B1

#3Nathan M. Davalos
n.davalos@sharedmarketing.com
In reply to: John R Pierce (#2)
Re: BUG #5944: COPY FROM doesn't work with international characters

32333030303209416C746F20446573656D7065F16F2C20532E412E20446520432E562E0D0A
The character in question is F1

-----Original Message-----
From: John R Pierce [mailto:pierce@hogranch.com]
Sent: Wednesday, March 23, 2011 6:49 PM
To: Nathan M. Davalos
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #5944: COPY FROM doesn't work with international characters

On 03/23/11 4:32 PM, Nathan Davalos wrote:

...
SET CLIENT_ENCODING TO 'WIN1251';
copy tmpintermediate from 'thefile.txt';

Sample contents of thefile:
230002 Alto Desempeño, S.A. De C.V.

When using WIN1251 or WIN1252 I get nothing in the second field, it just
ignores the data. Same thing for LATIN-1.

When using UTF8 for client encoding I get this message:
ERROR: invalid byte sequence for encoding "UTF8": 0xf16f2c20
CONTEXT: COPY tmpintermediate , line 1

what is the byte (binary) encoding of the file? in hex,

ñ in win1251 == (no such character. win1251 is cyrillic)
ñ in win1252 == F1
ñ in UTF-8 == C3 B1

#4John R Pierce
pierce@hogranch.com
In reply to: Nathan M. Davalos (#1)
Re: BUG #5944: COPY FROM doesn't work with international characters

On 03/23/11 5:42 PM, Nathan M. Davalos wrote:

Just in case the reply didn't go through the character in question is F1.

I replied to the message using pgsql-bugs@postgresql.org, but I'm a wee bit new to the bug reporting stuff using the mailing list.

The hex of the file in total is
32.33.30.30.30.32.09.41.6C.74.6F.20.44.65.73.65.6D.70.65.F1.6F.2C.20.53.2E.41.2E.20.44.65.20.43.2E.56.2E.0D.0A

k, thats certainly win-1252. I see a tab after the 230002, then the
rest of it is all one field, ending in a <CR><LF> sequence.

The table you're copying this data to, what fields does it have?

#5Nathan M. Davalos
n.davalos@sharedmarketing.com
In reply to: John R Pierce (#4)
Re: BUG #5944: COPY FROM doesn't work with international characters

AcNumber character varying(20),
AcName character varying(50)

The database itself is:
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'English_United States.1252'
LC_CTYPE = 'English_United States.1252'

-----Original Message-----
From: John R Pierce [mailto:pierce@hogranch.com]
Sent: Wednesday, March 23, 2011 7:54 PM
To: Nathan M. Davalos; PostgreSQL Bugs
Subject: Re: [BUGS] BUG #5944: COPY FROM doesn't work with international characters

On 03/23/11 5:42 PM, Nathan M. Davalos wrote:

Just in case the reply didn't go through the character in question is F1.

I replied to the message using pgsql-bugs@postgresql.org, but I'm a wee bit new to the bug reporting stuff using the mailing list.

The hex of the file in total is
32.33.30.30.30.32.09.41.6C.74.6F.20.44.65.73.65.6D.70.65.F1.6F.2C.20.53.2E.41.2E.20.44.65.20.43.2E.56.2E.0D.0A

k, thats certainly win-1252. I see a tab after the 230002, then the
rest of it is all one field, ending in a <CR><LF> sequence.

The table you're copying this data to, what fields does it have?

#6Nathan M. Davalos
n.davalos@sharedmarketing.com
In reply to: Nathan M. Davalos (#5)
Re: BUG #5944: COPY FROM doesn't work with international characters

I just tried it from the psql console and it seems to work, which isn't really an option for us to use since we need to be able to let end users load data via our custom designed database application. None of them would understand or know how to use the console. Currently there's no pressing need to be able to do this for us, so we're just loading the data with a series of insert statements instead of COPY FROM.

testdb=# SET CLIENT_ENCODING TO 'WIN1252';
SET
testdb=# create table tmpintermediate (acnumber character varying(20),acname character varying(50)
);
CREATE TABLE
testdb=# copy tmpintermediate from 'c:\\temp\\thefile.txt';
WARNING: nonstandard use of \\ in a string literal
LINE 1: copy tmpintermediate from 'c:\\temp\\thefile.txt';
^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
COPY 1
testdb=# select * from tmpintermediate;
acnumber | acname
----------+------------------------------
230002 | Alto Desempe±o, S.A. De C.V.
(1 row)

The output is wrong in the select statement from the console, but appears correctly using pgadmin3. I think it's a problem with one if the redistributable libraries instead of postgresql itself. When loading through pgadmin3 or elsewhere the acname field just ends up blank using WIN1252, the COPY FROM will just ignore whatever data is supposed to be loaded into that field when there is an international character.

-----Original Message-----
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Nathan M. Davalos
Sent: Wednesday, March 23, 2011 7:59 PM
To: PostgreSQL Bugs
Subject: Re: [BUGS] BUG #5944: COPY FROM doesn't work with international characters

AcNumber character varying(20),
AcName character varying(50)

The database itself is:
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'English_United States.1252'
LC_CTYPE = 'English_United States.1252'

-----Original Message-----
From: John R Pierce [mailto:pierce@hogranch.com]
Sent: Wednesday, March 23, 2011 7:54 PM
To: Nathan M. Davalos; PostgreSQL Bugs
Subject: Re: [BUGS] BUG #5944: COPY FROM doesn't work with international characters

On 03/23/11 5:42 PM, Nathan M. Davalos wrote:

Just in case the reply didn't go through the character in question is F1.

I replied to the message using pgsql-bugs@postgresql.org, but I'm a wee bit new to the bug reporting stuff using the mailing list.

The hex of the file in total is
32.33.30.30.30.32.09.41.6C.74.6F.20.44.65.73.65.6D.70.65.F1.6F.2C.20.53.2E.41.2E.20.44.65.20.43.2E.56.2E.0D.0A

k, thats certainly win-1252. I see a tab after the 230002, then the
rest of it is all one field, ending in a <CR><LF> sequence.

The table you're copying this data to, what fields does it have?

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7Josh Berkus
josh@agliodbs.com
In reply to: Nathan M. Davalos (#6)
Re: BUG #5944: COPY FROM doesn't work with international characters

I have another example of this issue with WIN1252. This line in a copy
file:

659446 828 1 /6�\bH@^W^Za$H�\b�@\\/ <p><p></p><p>No
valid or unique HTTP objects found in XML response.</p></p>

Into this table:

Table "public.ep_tests"
Column | Type | Modifiers
---------+--------------+-----------
id | bigint | not null
v_id | integer | not null
status | character(1) | not null
vkey | text |
details | text |

Results in this error while loading the data:

the following error is encounted by the \copy: ERROR: missing data for
column "details"
CONTEXT: COPY ep_tests, line 1028752: "659446 828 1 /6�\bH@↨"

The dump file was produced by using 9.0.3's pg_dump in text mode to dump
an 8.2 database, then using 9.0.3's psql to load the file. Both servers
are UTF8, locale WIN1252.

So it looks like we're not successfully escaping characters on WIN1252.
The characters in question are also latin characters.

We've reproduced this on a clean install.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#8Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#7)
Re: BUG #5944: COPY FROM doesn't work with international characters

Josh Berkus wrote:

I have another example of this issue with WIN1252. This line in a copy
file:

659446 828 1 /6???\bH@^W^Za$H???\b???@\\/ <p><p></p><p>No
valid or unique HTTP objects found in XML response.</p></p>

Into this table:

Table "public.ep_tests"
Column | Type | Modifiers
---------+--------------+-----------
id | bigint | not null
v_id | integer | not null
status | character(1) | not null
vkey | text |
details | text |

Results in this error while loading the data:

the following error is encounted by the \copy: ERROR: missing data for
column "details"
CONTEXT: COPY ep_tests, line 1028752: "659446 828 1 /6???\bH@?"

The dump file was produced by using 9.0.3's pg_dump in text mode to dump
an 8.2 database, then using 9.0.3's psql to load the file. Both servers
are UTF8, locale WIN1252.

So it looks like we're not successfully escaping characters on WIN1252.
The characters in question are also latin characters.

We've reproduced this on a clean install.

Has this been fixed?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#9Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#8)
Re: BUG #5944: COPY FROM doesn't work with international characters

On Tue, Sep 6, 2011 at 12:48 PM, Bruce Momjian <bruce@momjian.us> wrote:

Josh Berkus wrote:

I have another example of this issue with WIN1252.  This line in a copy
file:

659446  828     1       /6???\bH@^W^Za$H???\b???@\\/    <p><p></p><p>No
valid or unique HTTP objects found in XML response.</p></p>

Into this table:

     Table "public.ep_tests"
Column  |     Type     | Modifiers
---------+--------------+-----------
id      | bigint       | not null
v_id    | integer      | not null
status  | character(1) | not null
vkey    | text         |
details | text         |

Results in this error while loading the data:

the following error is encounted by the \copy: ERROR:  missing data for
column "details"
CONTEXT:  COPY ep_tests, line 1028752: "659446  828     1       /6???\bH@?"

The dump file was produced by using 9.0.3's pg_dump in text mode to dump
an 8.2 database, then using 9.0.3's psql to load the file.  Both servers
are UTF8, locale WIN1252.

So it looks like we're not successfully escaping characters on WIN1252.
 The characters in question are also latin characters.

We've reproduced this on a clean install.

Has this been fixed?

I don't think so. It's not really clear to me what the issue is. Is
there some confusion between the encoding of the file and the encoding
in use in the client session? It's odd that we would have a bug that
only affects win1252.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#9)
Re: BUG #5944: COPY FROM doesn't work with international characters

On 10/19/11 5:41 AM, Robert Haas wrote:

I don't think so. It's not really clear to me what the issue is. Is
there some confusion between the encoding of the file and the encoding
in use in the client session? It's odd that we would have a bug that
only affects win1252.

I think it's quite possible that this is something broken in the win1252
encoding itself. I've seen a lot of reports online for errors from
other software. However, we need to at least find a workaround for
users if we can't fix it ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com