Maintaining accents with "COPY" ?

Started by Laura Smithalmost 3 years ago6 messagesgeneral
Jump to latest
#1Laura Smith
n5d9xq3ti233xiyif2vp@protonmail.ch

Hi

I'm currently doing a CSV export using COPY:

COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;

This works great apart from accents are not preserved in the output, for example é gets converted to random characters, e.g. √© or similar.

How can I preserve accents ?

Thanks !

Laura

#2Erik Wienhold
ewie@ewie.name
In reply to: Laura Smith (#1)
Re: Maintaining accents with "COPY" ?

On 25/05/2023 09:14 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:

I'm currently doing a CSV export using COPY:

COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;

This works great apart from accents are not preserved in the output, for
example é gets converted to random characters, e.g. √© or similar.

How can I preserve accents ?

Looks like an encoding issue and a mismatch between database encoding and client
encoding. You can check both with:

SHOW server_encoding;
SHOW client_encoding;

Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):

SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';

--
Erik

#3Laura Smith
n5d9xq3ti233xiyif2vp@protonmail.ch
In reply to: Erik Wienhold (#2)
Re: Maintaining accents with "COPY" ?

Looks like an encoding issue and a mismatch between database encoding and client
encoding. You can check both with:

SHOW server_encoding;
SHOW client_encoding;

Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):

SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';

--
Erik

Hi Erik,

Looks like you could well be right about encoding:

postgres=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)

postgres=# SHOW client_encoding;
client_encoding
-----------------
SQL_ASCII
(1 row)

I will try your suggestion...

#4Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Laura Smith (#1)
Re: Maintaining accents with "COPY" ?

On 2023-05-25 07:14:40 +0000, Laura Smith wrote:

I'm currently doing a CSV export using COPY:

COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;

This works great apart from accents are not preserved in the output,
for example é gets converted to random characters, e.g. √© or similar.

How do you check the output?

If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.

How can I preserve accents ?

They probably already are preserved.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#5Erik Wienhold
ewie@ewie.name
In reply to: Peter J. Holzer (#4)
Re: Maintaining accents with "COPY" ?

On 25/05/2023 13:26 CEST Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2023-05-25 07:14:40 +0000, Laura Smith wrote:

I'm currently doing a CSV export using COPY:

COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;

This works great apart from accents are not preserved in the output,
for example é gets converted to random characters, e.g. √© or similar.

How do you check the output?

If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.

How can I preserve accents ?

They probably already are preserved.

You're right. The bytes are probably interpreted as Mac OS Roman:

$ echo é | iconv -f macintosh
é
$ echo -n é | xxd
00000000: c3a9

--
Erik

#6Erik Wienhold
ewie@ewie.name
In reply to: Laura Smith (#3)
Re: Maintaining accents with "COPY" ?

On 25/05/2023 12:08 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:

Looks like an encoding issue and a mismatch between database encoding and
client encoding. You can check both with:

SHOW server_encoding;
SHOW client_encoding;

Then either set the client encoding or use COPY's encoding option to match
the database encoding (I assume utf8 in this example):

SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';

Hi Erik,

Looks like you could well be right about encoding:

postgres=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)

postgres=# SHOW client_encoding;
client_encoding
-----------------
SQL_ASCII
(1 row)

I will try your suggestion...

The client encoding is not the problem here. Using SQL_ASCII effectively uses
the server encoding. SQL_ASCII basically means uninterpreted bytes/characters.

From https://www.postgresql.org/docs/15/multibyte.html#id-1.6.11.5.7:

"If the client character set is defined as SQL_ASCII, encoding conversion is
disabled, regardless of the server's character set. (However, if the server's
character set is not SQL_ASCII, the server will still check that incoming data
is valid for that encoding; so the net effect is as though the client character
set were the same as the server's.) Just as for the server, use of SQL_ASCII is
unwise unless you are working with all-ASCII data."

--
Erik