Maintaining accents with "COPY" ?
Hi
I'm currently doing a CSV export using COPY:
COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
This works great apart from accents are not preserved in the output, for example é gets converted to random characters, e.g. √© or similar.
How can I preserve accents ?
Thanks !
Laura
On 25/05/2023 09:14 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:
I'm currently doing a CSV export using COPY:
COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
This works great apart from accents are not preserved in the output, for
example é gets converted to random characters, e.g. √© or similar.How can I preserve accents ?
Looks like an encoding issue and a mismatch between database encoding and client
encoding. You can check both with:
SHOW server_encoding;
SHOW client_encoding;
Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):
SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';
--
Erik
Looks like an encoding issue and a mismatch between database encoding and client
encoding. You can check both with:SHOW server_encoding;
SHOW client_encoding;Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';--
Erik
Hi Erik,
Looks like you could well be right about encoding:
postgres=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)
postgres=# SHOW client_encoding;
client_encoding
-----------------
SQL_ASCII
(1 row)
I will try your suggestion...
On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
I'm currently doing a CSV export using COPY:
COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
This works great apart from accents are not preserved in the output,
for example é gets converted to random characters, e.g. √© or similar.
How do you check the output?
If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.
How can I preserve accents ?
They probably already are preserved.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On 25/05/2023 13:26 CEST Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
I'm currently doing a CSV export using COPY:
COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
This works great apart from accents are not preserved in the output,
for example é gets converted to random characters, e.g. √© or similar.How do you check the output?
If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.How can I preserve accents ?
They probably already are preserved.
You're right. The bytes are probably interpreted as Mac OS Roman:
$ echo é | iconv -f macintosh
é
$ echo -n é | xxd
00000000: c3a9
--
Erik
On 25/05/2023 12:08 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:
Looks like an encoding issue and a mismatch between database encoding and
client encoding. You can check both with:SHOW server_encoding;
SHOW client_encoding;Then either set the client encoding or use COPY's encoding option to match
the database encoding (I assume utf8 in this example):SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';Hi Erik,
Looks like you could well be right about encoding:
postgres=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)postgres=# SHOW client_encoding;
client_encoding
-----------------
SQL_ASCII
(1 row)I will try your suggestion...
The client encoding is not the problem here. Using SQL_ASCII effectively uses
the server encoding. SQL_ASCII basically means uninterpreted bytes/characters.
From https://www.postgresql.org/docs/15/multibyte.html#id-1.6.11.5.7:
"If the client character set is defined as SQL_ASCII, encoding conversion is
disabled, regardless of the server's character set. (However, if the server's
character set is not SQL_ASCII, the server will still check that incoming data
is valid for that encoding; so the net effect is as though the client character
set were the same as the server's.) Just as for the server, use of SQL_ASCII is
unwise unless you are working with all-ASCII data."
--
Erik