encoding names

Started by Karel Zak over 24 years ago, 40 messages, hackers
#1Karel Zak
zakkr@zf.jcu.cz

Hi,

attached is a patch with:

- new encoding-name code with better performance (binary search
instead of a for() loop, and it avoids some needless searching)

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

- Peter's idea of "encoding name cleaning" is implemented
(characters other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is
the same as 'iso8859_1' or 'iso-8-8-5-9-1' :-)

- the routines for this are shared between FE and BE (no more defining
encoding names separately in FE and BE)

- the PG_ prefix is added to the encoding identifier macros; something
like 'ALT' is pretty dirty in source code, so rather use PG_ALT.

(Note: the patch adds the new file mb/encname.c and removes mb/common.c)
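
[Editorial illustration, not taken from the attached patch.] The name cleaning and binary search described in the list above might look roughly like the sketch below. The names `enc_alias`, `alias_tbl`, `clean_encoding_name` and `encoding_from_name`, as well as the `ENC_*` ids, are hypothetical and chosen only for this example; the real patch uses PG_-prefixed macros and its own table layout.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding ids, for illustration only. */
enum { ENC_LATIN1 = 1, ENC_KOI8R = 2 };

typedef struct
{
    const char *name;       /* cleaned alias: lower case, [a-z0-9] only */
    int         encoding;
} enc_alias;

/* Must stay sorted in strcmp() order, otherwise bsearch() misbehaves. */
static const enc_alias alias_tbl[] = {
    {"iso88591", ENC_LATIN1},
    {"koi8", ENC_KOI8R},
    {"koi8r", ENC_KOI8R},
    {"l1", ENC_LATIN1},
    {"latin1", ENC_LATIN1},
};

/* Copy name into buf, keeping only [A-Za-z0-9] and folding to lower case. */
static char *
clean_encoding_name(const char *name, char *buf, size_t buflen)
{
    size_t      n = 0;

    if (buflen == 0)
        return NULL;
    for (; *name != '\0'; name++)
    {
        char        c = *name;

        if (c >= 'A' && c <= 'Z')
            c = (char) (c - 'A' + 'a');
        else if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')))
            continue;           /* '-', '_', spaces etc. are irrelevant */
        if (n + 1 >= buflen)
            return NULL;        /* name too long for the buffer */
        buf[n++] = c;
    }
    buf[n] = '\0';
    return buf;
}

static int
alias_cmp(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_alias *) elem)->name);
}

/* Returns the encoding id, or -1 if the name is not a known alias. */
int
encoding_from_name(const char *name)
{
    char        buf[64];
    const enc_alias *hit;

    if (clean_encoding_name(name, buf, sizeof(buf)) == NULL)
        return -1;
    hit = bsearch(buf, alias_tbl,
                  sizeof(alias_tbl) / sizeof(alias_tbl[0]),
                  sizeof(alias_tbl[0]), alias_cmp);
    return hit ? hit->encoding : -1;
}
```

With this sketch, 'ISO-8859-1', 'iso8859_1' and 'iso-8-8-5-9-1' all clean to "iso88591" and resolve to the same id. Because the table is searched after cleaning, each alias is stored only once in its cleaned form.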

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Attachments:

mb-08172001.patch.gz (application/x-gzip)
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Karel Zak (#1)
Re: encoding names

Karel Zak writes:

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

ISO 8859 is a pretty well-known term these days.

KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,
actually. Do you know what encoding it stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif

for portability.

For extra credit: A patch to configure and the documentation.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#3Barry Lind
barry@xythos.com
In reply to: Karel Zak (#1)
Re: encoding names

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example "LATIN1" which used to be returned will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.

I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in
the future. Then create a new function that returns the new encoding
names that can be used going forward.
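
[Editorial illustration, not part of the original message.] One way to picture this recommendation is a single table that carries both the old and the new name per encoding, with one lookup function for each. Everything here is a sketch under assumptions: the function names `legacy_encoding_name` and `standard_encoding_name` and the `ENC_*` ids are hypothetical, and the name pairs are taken only from names mentioned in this thread.

```c
#include <stddef.h>

/* Hypothetical encoding ids, for illustration only. */
enum { ENC_SQL_ASCII, ENC_LATIN1, ENC_KOI8R, ENC_COUNT };

typedef struct
{
    const char *legacy_name;    /* what existing clients already expect */
    const char *standard_name;  /* the new "official" name */
} enc_names;

static const enc_names names_tbl[ENC_COUNT] = {
    [ENC_SQL_ASCII] = {"SQL_ASCII", "SQL_ASCII"},
    [ENC_LATIN1]    = {"LATIN1", "ISO_8859_1"},
    [ENC_KOI8R]     = {"KOI8", "KOI8_R"},
};

/* Old behaviour, kept for existing clients such as the JDBC driver. */
const char *
legacy_encoding_name(int enc)
{
    if (enc < 0 || enc >= ENC_COUNT)
        return NULL;
    return names_tbl[enc].legacy_name;
}

/* New function returning the new names, for clients that opt in. */
const char *
standard_encoding_name(int enc)
{
    if (enc < 0 || enc >= ENC_COUNT)
        return NULL;
    return names_tbl[enc].standard_name;
}
```

In this arrangement getdatabaseencoding() would keep calling the legacy lookup, so old clients see "LATIN1" as before, while new code can ask for the standard name explicitly.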

thanks,
--Barry

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Barry Lind (#3)
Re: encoding names

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#5Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Peter Eisentraut (#2)
Re: encoding names

----- Original Message -----
From: Peter Eisentraut <peter_e@gmx.net>
Sent: Friday, August 17, 2001 12:11 PM

Karel Zak writes:

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

I'm not sure about the others, but 'cyrillic' is quite an ambiguous alias,
because it can denote many Slavic languages: Russian, Ukrainian,
Bulgarian and Serbian are a few examples, so I believe it should be excluded
from the list of synonyms.

KOI8 needs to be aliased as koi8r.

... and Karel can you change these so they are consistent with
others:

KOI8_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapKOI8, sizeof(LUmapKOI8) / sizeof(pg_local_to_utf), PG_KOI8, len);
}

to

koi8r_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^
{
local_to_utf(iso, utf, LUmapKOI8R, sizeof(LUmapKOI8R) / sizeof(pg_local_to_utf), PG_KOI8R, len);
                            ^^^^^              ^^^^^                                ^^^^^
}

WIN_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapWIN, sizeof(LUmapWIN) / sizeof(pg_local_to_utf), PG_WIN1251, len);
}

to

win1251_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^^^
{
local_to_utf(iso, utf, LUmapWIN1251, sizeof(LUmapWIN1251) / sizeof(pg_local_to_utf), PG_WIN1251, len);
                            ^^^^^^^              ^^^^^^^                                ^^^^^^^
}

S.

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#4)
Re: encoding names

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility. I agree with Barry's opinion.
--
Tatsuo Ishii

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#6)
Re: encoding names

Tatsuo Ishii writes:

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility.

How so?

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#7)
Re: encoding names

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility.

How so?

Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
--
Tatsuo Ishii

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#8)
Re: encoding names

Tatsuo Ishii writes:

Apparently 7.1 JDBC driver does not understand the value 7.2
getdatabaseencoding() returns.

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

However, the 7.1 JDBC driver is going to be incompatible with a 7.2 server
in a number of other areas as well, so I'm not completely sure whether
it'd be worth the effort.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#9)
Re: encoding names

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
--
Tatsuo Ishii

#11Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#10)
Re: encoding names

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version. I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept. I know we don't need it now, but
when we do need to up the protocol version number, we are stuck because
of the older releases that can't cope with this.
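
[Editorial illustration, not part of the original message.] The exchange proposed above could be sketched as follows: if the client asks for a newer protocol than the server speaks, the server answers with its own version, and the client decides whether it can fall back to that. The types and functions (`proto_version`, `server_negotiate`, `client_accepts_offer`) are hypothetical and do not correspond to actual PostgreSQL code; the sketch also assumes the server can still speak every version older than its current one.

```c
/* Hypothetical types and names, for illustration only. */
typedef struct
{
    int         major;
    int         minor;
} proto_version;

typedef enum { PROTO_ACCEPT, PROTO_COUNTER_OFFER } proto_reply;

/* Returns -1, 0 or 1 as a is older than, equal to or newer than b. */
static int
proto_cmp(proto_version a, proto_version b)
{
    if (a.major != b.major)
        return a.major < b.major ? -1 : 1;
    if (a.minor != b.minor)
        return a.minor < b.minor ? -1 : 1;
    return 0;
}

/*
 * Server side: accept any request not newer than server_max (assumption:
 * all older versions are still supported); otherwise send back our own
 * version in *offer instead of failing outright.
 */
proto_reply
server_negotiate(proto_version requested, proto_version server_max,
                 proto_version *offer)
{
    if (proto_cmp(requested, server_max) <= 0)
        return PROTO_ACCEPT;
    *offer = server_max;
    return PROTO_COUNTER_OFFER;
}

/* Client side: can we fall back to what the server offered? */
int
client_accepts_offer(proto_version offer, proto_version client_min)
{
    return proto_cmp(offer, client_min) >= 0;
}
```

The point of the counter-offer is that an older server never has to understand a newer request; it only has to recognise that the number is too high and state what it does support.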

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#11)
Re: encoding names

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns the official encoding names.
--
Tatsuo Ishii

#13Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#1)
Re: [PATCHES] encoding names

Hi,

attached is a patch with:

- new encoding-name code with better performance (binary search
instead of a for() loop, and it avoids some needless searching)

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

- Peter's idea of "encoding name cleaning" is implemented
(characters other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is
the same as 'iso8859_1' or 'iso-8-8-5-9-1' :-)

- the routines for this are shared between FE and BE (no more defining
encoding names separately in FE and BE)

- the PG_ prefix is added to the encoding identifier macros; something
like 'ALT' is pretty dirty in source code, so rather use PG_ALT.

(Note: the patch adds the new file mb/encname.c and removes mb/common.c)

Karel

Thanks for the patches, but...

1) There is a compiler error if --enable-unicode-conversion is not
enabled

2) The patches break createdb. createdb should raise an error if a
client-only encoding such as SJIS is specified.

3) I don't like the following ugliness. Why not change all of the
SQL_ASCII occurrences in the sources?

/*
* A lot of PG stuff use 'SQL_ASCII' without prefix (dirty...)
*/
#define SQL_ASCII PG_SQL_ASCII

4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
according to Peter's suggestion):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Also, I'm wondering why windows-1251 and not windows_1251? Or
ISO_8859_1 and not ISO-8859-1? There seems to be some confusion about
the usage of "_" and "-".

pg_enc2name pg_enc2name_tbl[] =
{
{ "SQL_ASCII", PG_SQL_ASCII },
{ "EUC_JP", PG_EUC_JP },
{ "EUC_CN", PG_EUC_CN },
{ "EUC_KR", PG_EUC_KR },
{ "EUC_TW", PG_EUC_TW },
{ "UNICODE", PG_UNICODE },
{ "MULE_INTERNAL",PG_MULE_INTERNAL },
{ "ISO_8859_1", PG_LATIN1 },
{ "ISO_8859_2", PG_LATIN2 },
{ "ISO_8859_3", PG_LATIN3 },
{ "ISO_8859_4", PG_LATIN4 },
{ "ISO_8859_5", PG_LATIN5 },
{ "KOI8", PG_KOI8 },
{ "window-1251",PG_WIN1251 },
{ "ALT", PG_ALT },
{ "Shift_JIS", PG_SJIS },
{ "Big5", PG_BIG5 },
{ "window-1250",PG_WIN1251 }
};
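
[Editorial illustration, not part of the original message.] For point 2 above, one possible approach is to record in the encoding table whether an encoding may be used for a database at all, so that createdb can reject client-only encodings. This is only a sketch: the `server_ok` flag, the function `encoding_valid_for_database` and the `ENC_*` ids are hypothetical, and only SJIS is marked client-only because that is the one example named in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding ids, for illustration only. */
enum { ENC_SQL_ASCII, ENC_EUC_JP, ENC_SJIS, ENC_COUNT };

typedef struct
{
    const char *name;
    bool        server_ok;  /* false: usable as a client encoding only */
} enc_info;

static const enc_info enc_tbl[ENC_COUNT] = {
    [ENC_SQL_ASCII] = {"SQL_ASCII", true},
    [ENC_EUC_JP]    = {"EUC_JP", true},
    [ENC_SJIS]      = {"Shift_JIS", false},
};

/* createdb-style check: unknown ids and client-only encodings are refused. */
bool
encoding_valid_for_database(int enc)
{
    if (enc < 0 || enc >= ENC_COUNT)
        return false;
    return enc_tbl[enc].server_ok;
}
```

A caller such as createdb would look up the encoding name first and then raise an error when this check returns false.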

#14Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Karel Zak (#1)
Re: Re: [PATCHES] encoding names

----- Original Message -----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Saturday, August 18, 2001 10:02 PM

ALT -> IBM866

Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used, for example,
when none of KOI8-R, Windows-1251, or IBM866 is available to a Russian-speaking
person to read or write text, messages and so on: we use simple English letters
to write words in Russian so that the pronunciation stays roughly the same. It's
something like russian_latin (as an equivalent to greek_latin in the
http://www.iana.org/assignments/character-sets spec), and writing this
way resembles Polish or Serbian-Latin a bit.

Serguei

#15Karel Zak
zakkr@zf.jcu.cz
In reply to: Peter Eisentraut (#2)
Re: encoding names

On Fri, Aug 17, 2001 at 06:11:00PM +0200, Peter Eisentraut wrote:

Karel Zak writes:

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

IMHO it is not a problem if PG understands more aliases, or is there some
relevant problem with it? :-)

ISO 8859 is a pretty well-known term these days.

KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,

Agree.

actually. Do you know what encoding it stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif

for portability.

OK, but it sounds curious (how does a compiler have a problem with it?)

For extra credit: A patch to configure and the documentation.

:-) it needs time... but yes, I will add it to the next patch version.

Thanks for suggestions.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#16Karel Zak
zakkr@zf.jcu.cz
In reply to: Barry Lind (#3)
Re: encoding names

On Fri, Aug 17, 2001 at 10:37:18AM -0700, Barry Lind wrote:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example "LATIN1" which used to be returned will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.

Hmm.. but I agree with Peter that the correct solution is to rewrite it to
use the standard names.

I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in

^^^^^^^^^^^^^^^^^^^^^
We could end up like the great Microsoft systems... a nice face but terrible
old stuff in the kernel.

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#17Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#12)
Re: encoding names

On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns the official encoding names.

OK, is there any suggestion for the name of this function?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#18Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#13)
Re: [PATCHES] encoding names

On Sun, Aug 19, 2001 at 11:02:57AM +0900, Tatsuo Ishii wrote:

4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
according to Peter's suggestion):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Right.

But we will still need the aliases UNICODE, ALT and KOI8 for backward
compatibility.

Thanks, I will try to fix it all.
Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#19Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Serguei Mokhov (#14)
Re: Re: [PATCHES] encoding names

ALT -> IBM866

Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used, for example,
when none of KOI8-R, Windows-1251, or IBM866 is available to a Russian-speaking
person to read or write text, messages and so on: we use simple English letters
to write words in Russian so that the pronunciation stays roughly the same. It's
something like russian_latin (as an equivalent to greek_latin in the
http://www.iana.org/assignments/character-sets spec), and writing this
way resembles Polish or Serbian-Latin a bit.

Ok. Let's leave ALT as it is.
--
Tatsuo Ishii

#20Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#18)
Re: [PATCHES] encoding names

4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
according to Peter's suggestion):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Right.

But we will still need the aliases UNICODE, ALT and KOI8 for backward
compatibility.

Sure.

Thanks, I will try to fix it all.

Thanks! But it seems we will leave ALT as it is (Serguei's suggestion).
--
Tatsuo Ishii

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#11)
#22Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#13)
#23Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Karel Zak (#1)
#24Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#22)
#25Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#24)
#26Karel Zak
zakkr@zf.jcu.cz
In reply to: Hiroshi Inoue (#23)
#27Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#25)
#28Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#17)
#29Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#28)
#30Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#29)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#30)
#32Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#31)
#33Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#32)
#34Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#34)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#34)
#37Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#36)
#38Karel Zak
zakkr@zf.jcu.cz
In reply to: Peter Eisentraut (#36)
#39Bruce Momjian
bruce@momjian.us
In reply to: Karel Zak (#22)
#40Karel Zak
zakkr@zf.jcu.cz
In reply to: Bruce Momjian (#39)