Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

Started by PostgreSQL Bugs List, about 23 years ago, 4 messages (hackers, bugs)
#1 PostgreSQL Bugs List
pgsql-bugs@postgresql.org

Michael Enke (michael.enke@wincor-nixdorf.com) reports a bug with a severity of 2
The lower the number the more severe it is.

Short Description
Server-Encoding from EUC_TW to UTF-8 doesn't work

Long Description
System: SuSE Linux 8.1, kernel 2.4.19, glibc 2.2.5/glibc-locale 2.2.5
the same error on RedHat 7.3, kernel 2.4.20, glibc2.2.5
postgresql version 7.3.2
description: I loaded Chinese (TW) characters, encoded as UTF-8, into a
database which has UTF-8 encoding, using "copy table from 'original'" in psql. OK.
Then I exited psql and exported PGCLIENTENCODING=EUC_TW.
I started psql again and ran "copy table to 'file.EUC_TW'". OK.
If I convert this file to UTF-8 with iconv -f EUC-TW -t UTF-8 file.EUC_TW > file.UTF-8,
then file.UTF-8 looks exactly the same as the original.
That means PostgreSQL converts from UTF-8 to EUC_TW correctly.
Now I load the exported file 'file.EUC_TW' back into the DB with
"copy table2 from 'file.EUC_TW'", still in the same psql session, so
PGCLIENTENCODING is the same as for "copy to".
Now I get an error telling me: "copy: line 1, LocalToUtf: could not convert (0xe5b5) EUC_TW to UTF-8" ... and the characters are missing in table2.

Sample Code
UTF-8:
00000000: e795 b6e6 97a5 0ae5 959f e58b 95e4 b8ad
00000010: 2ce4 bd86 e69c 89e9 8caf e8aa a40a

EUC_TW as exported from PostgreSQL and not imported:
00000000: e5b5 c5ca 0ada f6d9 afc4 e32c c8fe c8b4
00000010: f2e3 eba8 0a

No file was uploaded with this report

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: PostgreSQL Bugs List (#1)
Re: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't

It turned out to be a bug in PostgreSQL's encoding conversion engine.
It failed to find the proper entry in an encoding conversion table
because of an integer overflow problem. Since only the maps for EUC_TW
have such huge code point values (for example 0x8eaee7aa), I believe
the conversion failure occurs only with this particular encoding. The
included patches should solve the problem (they are against
PostgreSQL 7.3.2).

BTW, I'm surprised to find this bug, since it has been there since
the 7.2 days.

I'm going to commit the fix to both current and 7.3-stable trees.
--
Tatsuo Ishii

*** src/backend/utils/mb/conv.c.orig	2003-04-12 10:03:25.000000000 +0900
--- src/backend/utils/mb/conv.c	2003-04-12 10:16:04.000000000 +0900
***************
*** 313,319 ****

      v1 = *(unsigned int *) p1;
      v2 = ((pg_utf_to_local *) p2)->utf;
!     return (v1 - v2);
  }

  /*
--- 313,319 ----

      v1 = *(unsigned int *) p1;
      v2 = ((pg_utf_to_local *) p2)->utf;
!     return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*
***************
*** 328,334 ****

      v1 = *(unsigned int *) p1;
      v2 = ((pg_local_to_utf *) p2)->code;
!     return (v1 - v2);
  }

  /*
--- 328,334 ----

      v1 = *(unsigned int *) p1;
      v2 = ((pg_local_to_utf *) p2)->code;
!     return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*

#3 Michael Enke
michael.enke@wincor-nixdorf.com
In reply to: PostgreSQL Bugs List (#1)
Re: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

I also tried BIG5-encoded data (Traditional Chinese for Taiwan) and got warnings:
WARNING: copy: line 4586, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
...
Is this also solved with this fix?

Michael

#4 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Enke (#3)
Re: [HACKERS] Bug #943: Server-Encoding from EUC_TW to

Michael Enke wrote:

> I also tried BIG5-encoded data (Traditional Chinese for Taiwan) and got warnings:
> WARNING: copy: line 4586, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> ...
> Is this also solved with this fix?

No. In your case it seems 0xf9d7 is not valid BIG5 data, since
there's no corresponding Unicode data for it.
--
Tatsuo Ishii