PostgreSQL 7.0.1 multi-byte (MB) support README         May 20 2000

                                                Tatsuo Ishii
                                                ishii@postgresql.org
  http://www.sra.co.jp/people/t-ishii/PostgreSQL/

[Translator's notes]
1. Thanks to Mr. Tatsuo Ishii!
2. The bracketed notes are not part of the original text. If you find
   errors in the translation, please contact cch@cc.kmu.edu.tw

0. Introduction

MB support allows PostgreSQL to handle multi-byte character sets such
as EUC (Extended Unix Code), Unicode and Mule internal code. With MB
enabled you can use multi-byte characters in regular expressions
(regexp), LIKE and some other functions. The default encoding system
is chosen when you initialize your PostgreSQL installation with
initdb(1), but it can be overridden when you create a database with
createdb(1) or the CREATE DATABASE SQL command. So you can have
multiple databases with different encoding systems.

MB support also fixes some problems with 8-bit single-byte character
sets (including ISO-8859-1). (I would not say that all problems have
been fixed. I have only confirmed that the regression tests run fine
and that a few French characters can be used with MB enabled. Please
let me know if you find any problem while using 8-bit characters.)

1. How to use

Before compiling PostgreSQL, run configure with the multibyte option:

        % ./configure --enable-multibyte[=encoding_system]

where encoding_system is one of:

        SQL_ASCII               ASCII
        EUC_JP                  Japanese EUC
        EUC_CN                  Chinese EUC
        EUC_KR                  Korean EUC
        EUC_TW                  Taiwan EUC
        UNICODE                 Unicode(UTF-8)
        MULE_INTERNAL           Mule internal
        LATIN1                  ISO 8859-1 English and some European languages
        LATIN2                  ISO 8859-2 English and some European languages
        LATIN3                  ISO 8859-3 English and some European languages
        LATIN4                  ISO 8859-4 English and some European languages
        LATIN5                  ISO 8859-5 English and some European languages
        KOI8                    KOI8-R
        WIN                     Windows CP1251
        ALT                     Windows CP866

Example:

        % ./configure --enable-multibyte=EUC_JP

If the encoding system is omitted, the default is SQL_ASCII.

2. How to set the encoding

The initdb command defines the default encoding for a PostgreSQL
installation. For example:

        % initdb -E EUC_JP

sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
If you prefer longer option strings, you can also use "--encoding"
instead of "-E". If neither -E nor --encoding is given, the encoding
specified at compile time is used as the default.
You can create databases that use different encodings:

        % createdb -E EUC_KR korean

This command creates a database named "korean" that uses the EUC_KR
encoding. Another way to achieve the same result is to use an SQL
command:

        CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The pg_database system catalog has an "encoding" column that records
the encoding of each database. You can list the encoding of every
database with psql -l, or with the \l command inside psql:

        $ psql -l
                    List of databases
           Database    |  Owner  |   Encoding
        ---------------+---------+---------------
         euc_cn        | t-ishii | EUC_CN
         euc_jp        | t-ishii | EUC_JP
         euc_kr        | t-ishii | EUC_KR
         euc_tw        | t-ishii | EUC_TW
         mule_internal | t-ishii | MULE_INTERNAL
         regression    | t-ishii | SQL_ASCII
         template1     | t-ishii | EUC_JP
         test          | t-ishii | EUC_JP
         unicode       | t-ishii | UNICODE
        (9 rows)

3. Automatic encoding translation between frontend and backend

[Translator's note: "frontend" refers broadly to client-side programs.
It may be the psql command-line program, a C program that uses libpq,
a Perl program, or a Windows application that goes through ODBC.
"Backend" means the PostgreSQL database server program.]

PostgreSQL supports automatic encoding translation between the
frontend and the backend for some encodings.

[Translator's note: automatic translation here means that the
encodings declared for the frontend and the backend differ. As long as
PostgreSQL supports translation between those encodings, it translates
the data for you before it is stored or retrieved.]

        encoding of backend     available encoding of frontend
        --------------------------------------------------------------------
        EUC_JP                  EUC_JP, SJIS

        EUC_TW                  EUC_TW, BIG5

        LATIN2                  LATIN2, WIN1250

        LATIN5                  LATIN5, WIN, ALT

        MULE_INTERNAL           EUC_JP, SJIS, EUC_KR, EUC_CN,
                                EUC_TW, BIG5, LATIN1 to LATIN5,
                                WIN, ALT, WIN1250

To enable the automatic encoding translation, you have to tell
PostgreSQL which encoding the frontend uses. There are several ways
to do this:

o Using the \encoding command in psql

The \encoding command lets you change the frontend encoding on the
fly. For example, to change the frontend encoding to SJIS, type:

        \encoding SJIS

o Using libpq [Translator's note: the C API library of the PostgreSQL
  database] functions

The \encoding command of psql actually just calls the
PQsetClientEncoding() function to do its work.
        int PQsetClientEncoding(PGconn *conn, const char *encoding)

where conn is the connection to the backend and encoding is the
encoding you want to use. The function returns 0 if it sets the
encoding successfully, and -1 otherwise. The current encoding of the
connection can be queried with:

        int PQclientEncoding(const PGconn *conn)

Note that this function returns the encoding id (an integer), not the
encoding name string (such as "EUC_JP"). To get the encoding name from
the encoding id, call:

        char *pg_encoding_to_char(int encoding_id)

o Using the PGCLIENTENCODING environment variable

If the PGCLIENTENCODING environment variable is set in the frontend's
environment, the backend performs the automatic encoding translation.

[Translator's note] PostgreSQL 7.0.0 through 7.0.3 have a bug: this
environment variable is not recognized.

o Using the SET CLIENT_ENCODING TO SQL command

The frontend encoding can be set with the following SQL command:

        SET CLIENT_ENCODING TO 'encoding';

You can also use the SQL92 syntax "SET NAMES" for the same purpose:

        SET NAMES 'encoding';

To query the current frontend encoding, use:

        SHOW CLIENT_ENCODING;

To return to the default encoding, use:

        RESET CLIENT_ENCODING;

[Translator's note] When you use the psql command-line program, it is
better not to use this method. Use \encoding instead.

4. About Unicode

Translation between Unicode and other encodings will probably not be
implemented until after version 7.1.

5. What happens if the translation is not possible?

Suppose you chose EUC_JP for the backend and LATIN1 for the frontend
(some Japanese characters cannot be translated into LATIN1). In that
case, any character that cannot be represented in the LATIN1 character
set is transformed into the following form:

        (hexadecimal value)

6. References

These are good sources to start learning about various kinds of
encoding systems.

ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
        Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
        appear in section 3.2.

Unicode: http://www.unicode.org/
        The homepage of UNICODE.

RFC 2044
        UTF-8 is defined here.

7. History

May 20, 2000
        * SJIS UDC (NEC selection IBM kanji) support contributed
          by Eiji Tokuya
        * Changes above will appear in 7.0.1

Mar 22, 2000
        * Add new libpq functions PQsetClientEncoding, PQclientEncoding
        * ./configure --with-mb=EUC_JP now deprecated.
          use ./configure --enable-multibyte=EUC_JP instead
        * Add SQL_ASCII regression test case
        * Add SJIS User Defined Character (UDC) support
        * All of above will appear in 7.0

July 11, 1999
        * Add support for WIN1250 (Windows Czech) as a client encoding
          (contributed by Pavel Behal)
        * fix some compiler warnings (contributed by Tomoaki Nishiyama)

Mar 23, 1999
        * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
          (thanks Oleg Broytmann for testing)
        * Fix problem with MB and locale

Jan 26, 1999
        * Add support for Big5 for frontend encoding
          (you need to create a database with EUC_TW to use Big5)
        * Add regression test case for EUC_TW
          (contributed by Jonah Kuo)

Dec 15, 1998
        * Bugs related to SQL_ASCII support fixed

Nov 5, 1998
        * 6.4 release. In this version, pg_database has "encoding"
          column that represents the database encoding

Jul 22, 1998
        * determine encoding at initdb/createdb rather than compile time
        * support for PGCLIENTENCODING when issuing COPY command
        * support for SQL92 syntax "SET NAMES"
        * support for LATIN2-5
        * add UNICODE regression test case
        * new test suite for MB
        * clean up source files

Jun 5, 1998
        * add support for the encoding translation between the backend
          and the frontend
        * new command SET CLIENT_ENCODING etc. added
        * add support for LATIN1 character set
        * enhance 8-bit cleanness

April 21, 1998 some enhancements/fixes
        * character_length(), position(), substring() are now aware of
          multi-byte characters
        * add octet_length()
        * add --with-mb option to configure
        * new regression tests for EUC_KR (contributed by "Soonmyung.
          Hong")
        * add some test cases to the EUC_JP regression test
        * fix problem in regress/regress.sh in case of System V
        * fix toupper(), tolower() to handle 8bit chars

Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1

Mar 10, 1998 PL2 released
        * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
        * add an English document (this file)
        * fix problems concerning 8-bit single byte characters

Mar 1, 1998 PL1 released

Appendix:

[Here is a good document from Pavel Behal explaining how to use
WIN1250 on Windows/ODBC. Please note that Installation step 1) is not
necessary in 6.5.1 -- Tatsuo]

Version: 0.91 for PgSQL 6.5
Author: Pavel Behal
Revised by: Tatsuo Ishii
Email: behal@opf.slu.cz
Licence: The Same as PostgreSQL

Sorry for my English and C code, I'm not a native speaker :-)

!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Installation:
-------------
1) Change three affected files in source directories
   (I don't have time to create proper patch diffs, I don't know how)
2) Compile with locale enabled and multibyte set to LATIN2
3) Set up your installation properly, and do not forget to create
   locale variables in your profile (environment). Example (may not
   be exactly right):
        LC_ALL=cs_CZ.ISO8859-2
        LC_COLLATE=cs_CZ.ISO8859-2
        LC_CTYPE=cs_CZ.ISO8859-2
        LC_MONETARY=cs_CZ.ISO8859-2
        LC_NUMERIC=cs_CZ.ISO8859-2
        LC_TIME=cs_CZ.ISO8859-2
4) You have to start the postmaster with locales set!
5) Try it with the Czech language, it has to sort correctly
6) Install the ODBC driver for PgSQL into your M$ Windows
7) Set up your data source properly. Include this line in your ODBC
   configuration dialog in the field "Connect Settings:" :
        SET CLIENT_ENCODING = 'WIN1250';
8) Now try it again, but in Windows with ODBC.

Description:
------------
- Depends on proper system locales, tested with RH6.0 and Slackware 3.6,
  with cs_CZ.iso8859-2 locale
- Never try to set up the server multibyte database encoding to WIN1250,
  always use LATIN2 instead.
  There is no WIN1250 locale in Unix
- WIN1250 encoding is usable only for M$ Windows ODBC clients. The
  characters are re-coded on the fly, to be displayed and stored back
  properly

Important:
----------
- it reorders your sort order depending on your LC_... setting, so don't be
  confused by the regression tests, they don't use locale
- "ch" is correctly sorted only in some newer locales (e.g. RH6.0)
- you have to insert money as '162,50' (with comma, in apostrophes!)
- not tested properly
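
As a rough illustration of the on-the-fly recoding mentioned under
"Description", the sketch below converts the six Czech letters whose
byte values differ between WIN1250 and LATIN2 (ISO 8859-2). The
function names and the reduced table are hypothetical and only meant
to show the idea; the backend uses its own complete conversion tables,
which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only: recode one WIN1250 byte to LATIN2 (ISO 8859-2).
 * Hypothetical helper, not part of PostgreSQL. Only the six Czech
 * letters that differ between the two encodings are mapped. Every
 * other byte is returned unchanged, which is NOT correct for the
 * whole WIN1250 character set.
 */
unsigned char
win1250_to_latin2_char(unsigned char c)
{
    switch (c)
    {
        case 0x8A: return 0xA9;     /* capital S with caron */
        case 0x8D: return 0xAB;     /* capital T with caron */
        case 0x8E: return 0xAE;     /* capital Z with caron */
        case 0x9A: return 0xB9;     /* small s with caron */
        case 0x9D: return 0xBB;     /* small t with caron */
        case 0x9E: return 0xBE;     /* small z with caron */
        default:   return c;        /* e.g. 0xE8 (c with caron) is the same in both */
    }
}

/* Recode a buffer of len bytes in place, byte by byte. */
void
win1250_to_latin2(unsigned char *buf, size_t len)
{
    size_t      i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = win1250_to_latin2_char(buf[i]);
}
```

For example, the WIN1250 bytes 0x9A 0xE8 are stored as 0xB9 0xE8 in a
LATIN2 database. The reverse direction (LATIN2 to WIN1250) would use
the same table with the two columns swapped.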