A Patch for MIC to EUC_TW code converting in mb support

Started by Chih-Chang Hsieh, about 25 years ago, 15 messages

cch@cc.kmu.edu.tw

about 25 years ago

1 attachment(s)

============================================================================

POSTGRESQL BUG REPORT: MIC to EUC_TW code converting in mb support
============================================================================

System Configuration
---------------------
Architecture (example: Intel Pentium) :x86
Operating System (example: Linux 2.0.26 ELF) :Linux 2.2.x and FreeBSD 3.5R
PostgreSQL version (example: PostgreSQL-7.0) :PostgreSQL-7.0.2
Compiler used (example: gcc 2.8.0) :egcs-2.91.66, gcc 2.7.3

A FULL description of the problem:
------------------------------------------------
In PostgreSQL's mb (multi-byte) support, there is a bug in the code conversion
from MIC to EUC_TW. The original mic2euc_tw() in conv.c converts CNS 11643-1992
Plane 2 characters into 2-byte EUC_TW sequences, but characters in
CNS 11643-1992 Plane 2 should be converted into 4-byte EUC_TW sequences instead.

A way to repeat the problem:
----------------------------------------------------------------------
When you initdb with -E EUC_TW and set PGCLIENTENCODING to BIG5,
you will find all the characters in CNS 11643-1992 Plane 2 are
incorrectly stored or output.

This problem might be fixed by the patch in the attachment.

Attachments:

mic2euc_tw-patch (text/plain; charset=big5)

*** conv.c	Wed Nov  8 22:44:21 2000
--- conv.c.orig	Sat May 20 21:12:26 2000
***************
*** 906,920 ****
  	{
  		len -= pg_mic_mblen(mic++);
  
! 		if (c1 == LC_CNS11643_1)
  		{
- 			*p++ = *mic++;
- 			*p++ = *mic++;
- 		}
- 		else if (c1 == LC_CNS11643_2)
- 		{
- 			*p++ = SS2;
- 			*p++ = 0xa2;
  			*p++ = *mic++;
  			*p++ = *mic++;
  		}
--- 906,913 ----
  	{
  		len -= pg_mic_mblen(mic++);
  
! 		if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2)
  		{
  			*p++ = *mic++;
  			*p++ = *mic++;
  		}
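As an illustration of the behavior the report asks for, here is a minimal C sketch, not the actual conv.c code. The function name mic_char_to_euc_tw is hypothetical, and the numeric values of SS2, LC_CNS11643_1 and LC_CNS11643_2 are assumptions chosen for the example. It converts a single MIC character and emits 2 bytes for Plane 1 and 4 bytes (SS2, 0xa2, then the two code bytes) for Plane 2.

```c
#include <stddef.h>

/* Assumed values, for illustration only. */
#define SS2            0x8e  /* EUC single-shift 2 byte */
#define LC_CNS11643_1  0x95  /* assumed MIC leading byte for Plane 1 */
#define LC_CNS11643_2  0x96  /* assumed MIC leading byte for Plane 2 */

/*
 * Convert one MIC character (leading byte followed by 2 code bytes)
 * to EUC_TW. Returns the number of bytes written to out, or 0 if the
 * leading byte is not one this sketch handles.
 */
static size_t
mic_char_to_euc_tw(const unsigned char *mic, unsigned char *out)
{
    unsigned char *p = out;

    if (mic[0] == LC_CNS11643_1)
    {
        /* Plane 1: plain 2-byte EUC_TW sequence */
        *p++ = mic[1];
        *p++ = mic[2];
    }
    else if (mic[0] == LC_CNS11643_2)
    {
        /* Plane 2: SS2, plane byte 0xa2, then the 2 code bytes */
        *p++ = SS2;
        *p++ = 0xa2;
        *p++ = mic[1];
        *p++ = mic[2];
    }
    return (size_t) (p - out);
}
```

Calling it on a Plane 2 character such as {LC_CNS11643_2, 0xa1, 0xa1} with a 4-byte output buffer should return 4, which is the 4-byte form described in the report.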

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Chih-Chang Hsieh (#1)

Re: A Patch for MIC to EUC_TW code converting in mb support

Thanks for pointing it out. Your fix seems correct.

BTW I have found another bug in the EUC_TW support, at line 917 in conv.c:

*p++ = c1 - LC_CNS11643_3 + 0xa3;

this should be:

*p++ = *mic++ - LC_CNS11643_3 + 0xa3;

Otherwise, CNS 11643-1992 Plane 3 and above won't work. Could you test
it with characters from CNS 11643-1992 Plane 3 or above?

If they are ok, I will fix the current source and make a patch for
7.0.3 (I guess it's too late to back-patch the 7.0 tree).
--
Tatsuo Ishii
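
To make the Plane 3 fix concrete, here is a small C sketch. It assumes that MIC stores Planes 3 to 7 as a prefix byte followed by a charset byte and two code bytes; the names LCPRV2_B and LC_CNS11643_7, all numeric values, and the function mic_plane3plus_to_euc_tw are assumptions for illustration and are not taken from the thread. Under that assumption, c1 holds the prefix byte, so the plane byte has to be computed from the next byte instead.

```c
#include <stddef.h>

/* Assumed values, for illustration only. */
#define SS2            0x8e  /* EUC single-shift 2 byte */
#define LCPRV2_B       0x9d  /* assumed MIC prefix for private 2-byte charsets */
#define LC_CNS11643_3  0xf6  /* assumed charset byte for Plane 3 */
#define LC_CNS11643_7  0xfa  /* assumed charset byte for Plane 7 */

/*
 * Convert one MIC character of the assumed form
 *   prefix, charset byte, code byte 1, code byte 2
 * to the 4-byte EUC_TW form
 *   SS2, plane byte (0xa3..0xa7), code byte 1, code byte 2.
 * Returns 4 on success, 0 if the input is not a Plane 3..7 character.
 */
static size_t
mic_plane3plus_to_euc_tw(const unsigned char *mic, unsigned char *out)
{
    unsigned char c1 = mic[0];       /* prefix byte, not the plane */
    unsigned char charset = mic[1];  /* this byte identifies the plane */

    if (c1 != LCPRV2_B || charset < LC_CNS11643_3 || charset > LC_CNS11643_7)
        return 0;

    out[0] = SS2;
    /* Using c1 here would give 0x9d - 0xf6 + 0xa3 = 0x4a, not a plane byte. */
    out[1] = (unsigned char) (charset - LC_CNS11643_3 + 0xa3);
    out[2] = mic[2];
    out[3] = mic[3];
    return 4;
}
```

With these assumed values, a Plane 3 input yields the plane byte 0xa3 and a Plane 5 input yields 0xa5.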

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Chih-Chang Hsieh (#1)

Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport

[Cced to hackers list]

BTW I have found another bug in the EUC_TW support, at line 917 in conv.c:

*p++ = c1 - LC_CNS11643_3 + 0xa3;

this should be:

*p++ = *mic++ - LC_CNS11643_3 + 0xa3;

Otherwise, CNS 11643-1992 Plane 3 and above won't work. Could you test
it with characters from CNS 11643-1992 Plane 3 or above?

Thanks for your very quick reply!!

You are welcome.

I think you are right, but I have not tested it, because the original Big5
encoding does not contain characters in CNS 11643-1992 Plane 3.
But I will have a chance to test it: we are seeking support for Big5E
(an extended Big5 encoding) in PostgreSQL, though most people who use
PostgreSQL in Taiwan only care about the Big5 encoding.

Would you like to answer some mb related questions for me? I am a newbie :P

1.) The 2nd byte of a Big5 character can overlap with ASCII,
such as '\' (this makes it very hard for many programs to work with Big5).

As long as the frontend side knows the current client-side encoding is
Big5, this should be no problem, at least for libpq. It examines the
first byte of each Big5 character. If it is greater than 0x7f, then it
must be a double-byte Hanji, so libpq reads 2 bytes in this case, no
matter whether the second byte is '\'.
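
The scanning rule described above can be sketched in C as follows. This is only an illustration of the idea, not libpq's code: both function names are hypothetical. The encoding-aware scanner skips two bytes whenever the first byte is greater than 0x7f, so a trailing 0x5c inside a Big5 character is never mistaken for an ASCII backslash, while the naive scanner counts it.

```c
#include <stddef.h>

/*
 * Count backslashes that are real ASCII characters in a Big5 byte string.
 * A byte greater than 0x7f starts a double-byte character, so it and the
 * following byte are consumed together.
 */
static size_t
count_ascii_backslashes_big5(const unsigned char *s, size_t len)
{
    size_t i = 0;
    size_t count = 0;

    while (i < len)
    {
        if (s[i] > 0x7f)
        {
            /* double-byte character: skip lead byte and trail byte */
            i += (i + 1 < len) ? 2 : 1;
        }
        else
        {
            if (s[i] == '\\')
                count++;
            i++;
        }
    }
    return count;
}

/* Encoding-unaware scan: treats every 0x5c byte as a backslash. */
static size_t
count_backslashes_naive(const unsigned char *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++)
        if (s[i] == '\\')
            count++;
    return count;
}
```

For the bytes 0xa5 0x5c 'a' '\\', where 0xa5 0x5c is a single Big5 character, the naive scan reports two backslashes and the encoding-aware scan reports one.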

For example: if we initdb -E MULE_INTERNAL first,
SET CLIENT_ENCODING TO 'BIG5', and
INSERT INTO some_table VALUES (..., 'the last byte of some Big5 char is backslash\',...),
then we cannot successfully complete this SQL INSERT -- the psql prompt changes
but psql does not execute it. If we initdb -E EUC_TW, it's OK.
Is this a parsing problem? What's your suggestion for a solution?

Hmm. initdb -E MULE_INTERNAL should work as well. Let me dig into the
problem. It would be nice if you could send me the Big5 data for
testing by private mail.

BTW I would not recommend using "SET CLIENT_ENCODING TO 'BIG5'" to change
the encoding on the fly, since this way the frontend side has no
idea what the client encoding is. 7.0.x overcomes this problem by
introducing the new \encoding command. For 6.5 or earlier I would recommend
using the PGCLIENTENCODING environment variable.

2.) Is using MULE_INTERNAL faster than EUC_TW as the backend encoding when
PostgreSQL processes Big5 data? (Judging from the mb sources, it seems that
BIG5->big52mic()->mic2euc_tw()->EUC_TW needs 2 code conversion steps, but
BIG5->big52mic()->MULE_INTERNAL only needs one.)

Yes. But the difference would be very small. The expensive part is a
table look-up in big52mic.

BTW 7.1 will support automatic encoding conversion between Unicode
(UTF-8) and Big5 (or EUC_TW). Try the snapshot if you like.

3.) Does PostgreSQL's ODBC driver support mb?

I don't think so. For Japanese (EUC_JP/SJIS) Kataoka has made patches
to enable MB support in ODBC. It should not be very difficult to
support EUC_TW/Big5, though I don't know for sure.
--
Tatsuo Ishii

Import Notes

Reply to msg id not found: 3A0FCEFA.42F0BABF@cc.kmu.edu.tw

Chih-Chang Hsieh

cch@cc.kmu.edu.tw

about 25 years ago

In reply to: Chih-Chang Hsieh (#1)

Re: [PATCHES] A Patch for MIC to EUC_TW code converting inmbsupport

Tatsuo Ishii wrote:

For example: if we initdb -E MULE_INTERNAL first,
SET CLIENT_ENCODING TO 'BIG5', and
INSERT INTO some_table VALUES (..., 'the last byte of some Big5 char is backslash\',...),
then we cannot successfully complete this SQL INSERT -- the psql prompt changes

Hmm. initdb -E MULE_INTERNAL should work as well. Let me dig into the
problem. It would be nice if you could send me the Big5 data for
testing by private mail.

BTW I would not recommend using "SET CLIENT_ENCODING TO 'BIG5'" to change
the encoding on the fly, since this way the frontend side has no
idea what the client encoding is. 7.0.x overcomes this problem by
introducing the new \encoding command. For 6.5 or earlier I would recommend
using the PGCLIENTENCODING environment variable.

You are right! When I do \encoding BIG5, it works.
But it seems that "\encoding" can only be issued at
psql's command prompt or done with
PQsetClientEncoding() in libpq.

If our input application is written in PHP (4.0.2),
how do we notify PostgreSQL that the frontend encoding
is 'BIG5'? (pg_exec("\encoding BIG5") failed.)
PostgreSQL 7.1 will support automatic code conversion from
BIG5 to UTF-8. Does it mean that we do not have to
announce the client encoding as long as the backend is UTF-8?

I have also tried to set the environment variable
PGCLIENTENCODING to 'BIG5'. But when I execute
psql and then issue \encoding, it shows 'SQL_ASCII' in 7.0.2.
Is this environment variable useless in 7.0.x and later?

Thank you so much for helping a newbie!
--
Chih-Chang Hsieh

Bruce Momjian

pgman@candle.pha.pa.us

about 25 years ago

In reply to: Tatsuo Ishii (#3)

Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport

Can someone tell me where we are on this? Tatsuo, I think you said you
wanted to apply this fix.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Chih-Chang Hsieh (#4)

1 attachment(s)

Re: [PATCHES] A Patch for MIC to EUC_TW code converting inmbsupport

BTW I would not recommend using "SET CLIENT_ENCODING TO 'BIG5'" to change
the encoding on the fly, since this way the frontend side has no
idea what the client encoding is. 7.0.x overcomes this problem by
introducing the new \encoding command. For 6.5 or earlier I would recommend
using the PGCLIENTENCODING environment variable.

You are right! When I do \encoding BIG5, it works.
But it seems that "\encoding" can only be issued at
psql's command prompt or done with
PQsetClientEncoding() in libpq.

Yes.

If our input application is written in PHP (4.0.2),
how do we notify PostgreSQL that the frontend encoding
is 'BIG5'? (pg_exec("\encoding BIG5") failed.)

I know there are some patches for supporting \encoding in PHP. Do you
want to get them?

PostgreSQL 7.1 will support automatic code conversion from
BIG5 to UTF-8. Does it mean that we do not have to
announce the client encoding as long as the backend is UTF-8?

No. You still need to declare that the frontend-side encoding is BIG5.

I have also tried to set the environment variable
PGCLIENTENCODING to 'BIG5'. But when I execute
psql and then issue \encoding, it shows 'SQL_ASCII' in 7.0.2.
Is this environment variable useless in 7.0.x and later?

After checking the source, I found that the capability to recognize the
PGCLIENTENCODING environment variable has been broken since 7.0.
Attached are the patches for 7.0.3. I am going to fix current as well.

Attachments:

libpq.patch (text/plain; charset=us-ascii)

*** postgresql-7.0.3/src/interfaces/libpq/fe-connect.c~	Mon May 22 06:19:53 2000
--- postgresql-7.0.3/src/interfaces/libpq/fe-connect.c	Fri Nov 17 10:37:23 2000
***************
*** 1505,1514 ****
  			{
  				const char *env;
  
- 				/* query server encoding */
  				env = getenv(envname);
  				if (!env || *env == '\0')
  				{
  					if (!PQsendQuery(conn,
  									 "select getdatabaseencoding()"))
  						goto error_return;
--- 1505,1515 ----
  			{
  				const char *env;
  
  				env = getenv(envname);
  				if (!env || *env == '\0')
  				{
+ 					/* query server encoding if PGCLIENTENCODING
+ 					   is not specified */
  					if (!PQsendQuery(conn,
  									 "select getdatabaseencoding()"))
  						goto error_return;
***************
*** 1516,1521 ****
--- 1517,1535 ----
  					conn->setenv_state = SETENV_STATE_ENCODINGS_WAIT;
  					return PGRES_POLLING_READING;
  				}
+ 				else
+ 				{
+ 					/* otherwise set client encoding in pg_conn struct */
+ 					int encoding = pg_char_to_encoding(env);
+ 					if (encoding < 0)
+ 					{
+ 						strcpy(conn->errorMessage.data,
+ 							   "PGCLIENTENCODING has no valid encoding name.\n");
+ 						goto error_return;
+ 					}
+ 					conn->client_encoding = encoding;
+ 				}
+ 					
  			}
  
  		case SETENV_STATE_ENCODINGS_WAIT:
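
The decision the patch restores can be sketched on its own in C. This is a simplified illustration, not the libpq code: the encoding table, the return codes, and the function names encoding_from_name and resolve_client_encoding are all hypothetical. An unset or empty value means "ask the server for its encoding", a recognized name selects that encoding, and an unrecognized name is an error.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding table; the index doubles as the encoding id. */
static const char *const known_encodings[] = {
    "SQL_ASCII", "EUC_TW", "BIG5", "MULE_INTERNAL"
};

#define ENC_INVALID     (-1)  /* name given but not recognized */
#define ENC_ASK_SERVER  (-2)  /* nothing given: query the server instead */

/* Look up an encoding name (exact, case-sensitive match in this sketch). */
static int
encoding_from_name(const char *name)
{
    size_t n = sizeof(known_encodings) / sizeof(known_encodings[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(name, known_encodings[i]) == 0)
            return (int) i;
    return ENC_INVALID;
}

/*
 * Decide the client encoding from the value of an environment variable.
 * env_value is what getenv() returned, so it may be NULL.
 */
static int
resolve_client_encoding(const char *env_value)
{
    if (env_value == NULL || *env_value == '\0')
        return ENC_ASK_SERVER;
    return encoding_from_name(env_value);
}
```

A caller would pass in getenv("PGCLIENTENCODING") and, on ENC_ASK_SERVER, fall back to querying the server as the original code path does.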

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Bruce Momjian (#5)

Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport

Can someone tell me where we are on this? Tatsuo, I think you said you
wanted to apply this fix.

I wanted to apply the fix after Chih-Chang Hsieh tested it out, but he
said he couldn't because no test data was available for it. However, he
and I now agree that the fix seems correct, and I am going to apply it,
probably by tomorrow.


Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Tatsuo Ishii (#6)

1 attachment(s)

Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting inmbsupport

If our input application is written in PHP (4.0.2),
how do we notify PostgreSQL that the frontend encoding
is 'BIG5'? (pg_exec("\encoding BIG5") failed.)

I know there are some patches for supporting \encoding in PHP. Do you
want to get them?

Sorry for the delay. Here are the patches I promised against PHP
3.0.15 or later.

To set the client encoding to BIG5:

pg_setclientencoding($cid, "BIG5");

($cid is the connection id)

To get the current client encoding:

pg_clientencoding($cid);

Note that these functions are already included in the latest PHP4.
--
Tatsuo Ishii

Chih-Chang Hsieh

cch@cc.kmu.edu.tw

about 25 years ago

In reply to: Tatsuo Ishii (#3)

1 attachment(s)

About PQsetClientEncoding(),"SET NAMES",and "SET CLIENT_ENCODING"

Tatsuo Ishii wrote:

Here are the patches I promised against PHP
3.0.15 or later.

To set the client encoding to BIG5:

pg_setclientencoding($cid, "BIG5");

($cid is the connection id)

To get the current client encoding:

pg_clientencoding($cid);

Note that these functions are already included in the latest PHP4.

Thank you!
Your README.mb has been translated into Chinese (Big5 encoding); the
translation is in the attachment.
Would someone like to review it?
After translating it, I still have one question (sorry, I have not read the
source code):
What is the difference among "libpq -- PQsetClientEncoding()",
"SQL command -- SET NAMES", and
"SQL command -- SET CLIENT_ENCODING"?
For example: If we use PHP (>4.0.2), which way is correct or mostly correct?

1. pg_setclientencoding($cid, "BIG5")
2. pg_exec("SET NAMES 'BIG5'")
3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")

--
Chih-Chang Hsieh

#10

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Chih-Chang Hsieh (#9)

Re: About PQsetClientEncoding(),"SET NAMES",and "SET CLIENT_ENCODING"

For example: If we use PHP (>4.0.2), which way is correct or mostly correct?

1. pg_setclientencoding($cid, "BIG5")
2. pg_exec("SET NAMES 'BIG5'")
3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")

2 and 3 are actually identical: both tell the backend "Your client's
encoding is BIG5, so you need to convert EUC_TW to BIG5 before sending
data to the client. Also you need to convert BIG5 to EUC_TW when
receiving data from it."

1 does the same thing as 2 or 3 AND sets the internal encoding of
libpq, which is linked into PHP, to BIG5. This is necessary in case
the second byte of a BIG5 character is "\" or one of the other control
characters.
--
Tatsuo Ishii

#11

Bruce Momjian

pgman@candle.pha.pa.us

about 25 years ago

In reply to: Chih-Chang Hsieh (#9)

Re: [HACKERS] About PQsetClientEncoding(), "SET NAMES", and "SET CLIENT_ENCODING"

We have merged README.mb into the official SGML docs, so the file is
gone. Not sure what setup we have for SGML docs in different languages.

Sorry.

#12

Tatsuo Ishii

t-ishii@sra.co.jp

about 25 years ago

In reply to: Bruce Momjian (#11)

Re: [HACKERS] About PQsetClientEncoding(),"SET NAMES",and "SET CLIENT_ENCODING"

We have merged README.mb into the official SGML docs, so the file is
gone. Not sure what setup we have for SGML docs in different languages.

Even if the Big5 version of README.mb could be included in our SGML
docs, I don't think the SGML tools could process Big5 without problems
due to the nature of the encoding (probably OK for EUC-TW or
UTF-8). What about placing it as doc/README.mb.big5 or whatever?
--
Tatsuo Ishii

#13

Chih-Chang Hsieh

cch@cc.kmu.edu.tw

about 25 years ago

In reply to: Tatsuo Ishii (#12)

Re: [HACKERS] About PQsetClientEncoding(),"SET NAMES",and "SET CLIENT_ENCODING"

On Sat, 30 Dec 2000, Tatsuo Ishii wrote:

Even if the Big5 version of README.mb could be included in our SGML
docs, I don't think the SGML tools could process Big5 without problems
due to the nature of the encoding (probably OK for EUC-TW or
UTF-8). What about placing it as doc/README.mb.big5 or whatever?

IMHO, doc/README.mb.big5 is enough.

--
Chih-Chang Hsieh

#14

Bruce Momjian

pgman@candle.pha.pa.us

almost 25 years ago

In reply to: Tatsuo Ishii (#2)

Re: A Patch for MIC to EUC_TW code converting in mb support

Tatsuo, I assume these are all done in 7.1, right?

#15

Tatsuo Ishii

t-ishii@sra.co.jp

almost 25 years ago

In reply to: Bruce Momjian (#14)

Re: A Patch for MIC to EUC_TW code converting in mb support

Tatsuo, I assume these are all done in 7.1, right?

Yes.
--
Tatsuo Ishii
