Issue with loading unicode characters with copy command
Hi,
I have a UTF8 database and simple table with two columns (integer and
varchar). Created a csv file with some multibyte characters and trying to
perform load operation using the copy command.
Database info:
Postgresql database details:
Name | Owner | Encoding | Collate | Ctype
| Access privileges
-----------+----------+----------+--------------------+--------------------+-----------------------
postgres | postgres | UTF8 | English_India.1252 | English_India.1252 |
(Note: I also tried with collate utf8 and no luck)
postgres=# set client_encoding='UTF8';
SET
Table:
create table public.test ( PKCOL integer not null, STR1 character
varying(64) null, primary key( PKCOL ))
csv contents:
1|"àáâãäåæçèéêëìíîï"
After data loading, actual data is becoming
à áâãäåæçèéêëìÃîï
hex of this is - c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af
The hex values are indeed the UTF-8 encodings of the characters in your
expected string, and the presence of `C2` before each character is
indicative of how UTF-8 represents certain characters.
In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`, etc.)
are represented as two bytes. The first byte `C2` or `C3` indicates that
this is a two-byte character, and the second byte specifies the character.
For example:
- `à` is represented as `C3 A0`
- `á` is `C3 A1`
- `â` is `C3 A2`, and so on.
In this case, the `C2` byte is getting interpreted as a separate character
and that is the likely reason that an `Â` (which corresponds to `C2`) is
seen before each intended character. Looks like UTF-8 encoded data is
mistakenly interpreted as Latin-1 (ISO-8859-1) or Windows-1252, where each
byte is treated as a separate character.
Please advise. Thank you very much.
Regards,
Kiran
On 1/12/24 07:23, Kiran K V wrote:
Hi,
I have a UTF8 database and simple table with two columns (integer and
varchar). Created a csv file with some multibyte characters and trying
to perform load operation using the copy command.
The multibyte characters come from what character set?
__ __
Database info:____
Postgresql database details:____
Name | Owner | Encoding | Collate |
Ctype | Access privileges____-----------+----------+----------+--------------------+--------------------+-----------------------____
postgres | postgres | UTF8 | English_India.1252 |
English_India.1252 |______ __
(Note: I also tried with collate utf8 and no luck)
postgres=# set client_encoding='UTF8';____
SET____
__ __
Table:____
create table public.test ( PKCOL integer not null, STR1 character
varying(64) null, primary key( PKCOL )) ________
csv contents:____
1|"àáâãäåæçèéêëìíîï"____
__ __
After data loading, actual data is becoming____
à áâãäåæçèéêëìÃîï____
hex of this is -
c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af______ __
The hex values are indeed the UTF-8 encodings of the characters in your
expected string, and the presence of `C2` before each character is
indicative of how UTF-8 represents certain characters.____In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`,
etc.) are represented as two bytes. The first byte `C2` or `C3`
indicates that this is a two-byte character, and the second byte
specifies the character. For example:____- `à` is represented as `C3 A0`____
- `á` is `C3 A1`____
- `â` is `C3 A2`, and so on.____
In this case, the `C2` byte is getting interpreted as a separate
character and that is the likely reason that an `Â` (which corresponds
to `C2`) is seen before each intended character. Looks like UTF-8
encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or
Windows-1252, where each byte is treated as a separate character.Please advise. Thank you very much.
Regards,
Kiran
--
Adrian Klaver
adrian.klaver@aklaver.com
Its UTF-8. Also verified the load file and its utf-8.
Regards,
Kiran
On Fri, Jan 12, 2024 at 10:48 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Show quoted text
On 1/12/24 07:23, Kiran K V wrote:
Hi,
I have a UTF8 database and simple table with two columns (integer and
varchar). Created a csv file with some multibyte characters and trying
to perform load operation using the copy command.The multibyte characters come from what character set?
__ __
Database info:____
Postgresql database details:____
Name | Owner | Encoding | Collate |
Ctype | Access privileges____-----------+----------+----------+--------------------+--------------------+-----------------------____
postgres | postgres | UTF8 | English_India.1252 |
English_India.1252 |______ __
(Note: I also tried with collate utf8 and no luck)
postgres=# set client_encoding='UTF8';____
SET____
__ __
Table:____
create table public.test ( PKCOL integer not null, STR1 character
varying(64) null, primary key( PKCOL )) ________
csv contents:____
1|"àáâãäåæçèéêëìíîï"____
__ __
After data loading, actual data is becoming____
à áâãäåæçèéêëìÃîï____
hex of this is -
c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af______ __
The hex values are indeed the UTF-8 encodings of the characters in your
expected string, and the presence of `C2` before each character is
indicative of how UTF-8 represents certain characters.____In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`,
etc.) are represented as two bytes. The first byte `C2` or `C3`
indicates that this is a two-byte character, and the second byte
specifies the character. For example:____- `à` is represented as `C3 A0`____
- `á` is `C3 A1`____
- `â` is `C3 A2`, and so on.____
In this case, the `C2` byte is getting interpreted as a separate
character and that is the likely reason that an `Â` (which corresponds
to `C2`) is seen before each intended character. Looks like UTF-8
encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or
Windows-1252, where each byte is treated as a separate character.Please advise. Thank you very much.
Regards,
Kiran
--
Adrian Klaver
adrian.klaver@aklaver.com