encoding names
Hi,
the attached patch contains:
- new encoding-name code with better performance (binary search
instead of a for() loop, and some needless searching is avoided)
- synonyms for encoding names are now possible (for example ISO-8859-1,
Latin1, l1)
- Peter's idea about "encoding name cleaning" is implemented
(characters other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is
the same as 'iso8859_1' or 'iso-8-8-5-9-1' :-)
- the routines for this are shared between FE and BE (no more defining
encoding names separately in FE and BE)
- the prefix PG_ is added to the encoding identifier macros; something like 'ALT'
is pretty dirty in source code, so use PG_ALT instead.
(Note: the patch adds a new file mb/encname.c and removes mb/common.c)
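The name cleaning and binary search described in the list above can be sketched roughly as follows. This is only an illustration and not the code from the patch: the ex_ names, the table contents, the encoding ids and the 64-byte buffer are all assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding ids, for illustration only. */
enum { EX_ENC_LATIN1 = 1, EX_ENC_KOI8R = 2 };

/* Hypothetical alias entry; not the structure used in the patch. */
typedef struct
{
    const char *name;       /* cleaned alias: lower case, [a-z0-9] only */
    int         encoding;   /* illustrative encoding id */
} ex_enc_alias;

/* Must stay sorted by name in strcmp() order, because bsearch() needs it. */
static const ex_enc_alias ex_alias_tbl[] = {
    { "iso88591", EX_ENC_LATIN1 },
    { "koi8r",    EX_ENC_KOI8R },
    { "l1",       EX_ENC_LATIN1 },
    { "latin1",   EX_ENC_LATIN1 },
};

/*
 * Copy name into buf, keeping only [A-Za-z0-9] and folding to lower case.
 * Returns 1 on success, 0 if the cleaned name does not fit into buf.
 */
static int
ex_clean_encoding_name(const char *name, char *buf, size_t buflen)
{
    size_t n = 0;

    if (buflen == 0)
        return 0;
    for (; *name; name++)
    {
        char c = *name;

        if (c >= 'A' && c <= 'Z')
            c = (char) (c - 'A' + 'a');
        if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        {
            if (n + 1 >= buflen)
                return 0;
            buf[n++] = c;
        }
    }
    buf[n] = '\0';
    return 1;
}

/* bsearch() comparator: key is a cleaned name, elem is a table entry. */
static int
ex_alias_cmp(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const ex_enc_alias *) elem)->name);
}

/* Returns the encoding id for any spelling of a known alias, or -1. */
int
ex_lookup_encoding(const char *name)
{
    char                buf[64];
    const ex_enc_alias *hit;

    if (!ex_clean_encoding_name(name, buf, sizeof(buf)))
        return -1;
    hit = bsearch(buf, ex_alias_tbl,
                  sizeof(ex_alias_tbl) / sizeof(ex_alias_tbl[0]),
                  sizeof(ex_alias_tbl[0]), ex_alias_cmp);
    return hit ? hit->encoding : -1;
}
```

With this sketch, 'ISO-8859-1', 'iso8859_1' and 'iso-8-8-5-9-1' all clean to "iso88591" and resolve to the same id. Because the cleaned spelling is the search key, the table needs only one entry per alias.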
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
Karel Zak writes:
- synonyms for encoding names are now possible (for example ISO-8859-1,
Latin1, l1)
On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:
cyrillic, cp819, ibm819, isoir100x, l1-4
ISO 8859 is a pretty well-known term these days.
KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,
actually. Do you know what encoding it stands for and could you add that
as an alias?
On the code:
#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif
needs to be written as
#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif
for portability.
For extra credit: A patch to configure and the documentation.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example "LATIN1" which used to be returned will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.
I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in
the future. Then create a new function that returns the new encoding
names that can be used going forward.
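A minimal sketch of that idea, assuming one table that carries both spellings per encoding. The function names, the enum and the table are hypothetical and do not come from the patch. The name pairs are the ones mentioned in this thread, and the "new" spellings were still under discussion.

```c
#include <stddef.h>

/* Hypothetical encoding ids, for illustration only. */
typedef enum
{
    EX_OLDNEW_LATIN1,
    EX_OLDNEW_KOI8,
    EX_OLDNEW_UNICODE,
    EX_OLDNEW_COUNT
} ex_oldnew_encoding;

typedef struct
{
    const char *legacy_name;    /* what existing clients already expect */
    const char *new_name;       /* spelling proposed in this thread */
} ex_oldnew_names;

static const ex_oldnew_names ex_oldnew_tbl[EX_OLDNEW_COUNT] = {
    [EX_OLDNEW_LATIN1]  = { "LATIN1",  "ISO_8859_1" },
    [EX_OLDNEW_KOI8]    = { "KOI8",    "KOI8_R" },
    [EX_OLDNEW_UNICODE] = { "UNICODE", "UTF_8" },
};

/* Old behaviour, kept stable so that existing drivers keep working. */
const char *
ex_legacy_encoding_name(ex_oldnew_encoding enc)
{
    if ((int) enc < 0 || enc >= EX_OLDNEW_COUNT)
        return NULL;
    return ex_oldnew_tbl[enc].legacy_name;
}

/* Hypothetical new function that new clients would switch to. */
const char *
ex_new_encoding_name(ex_oldnew_encoding enc)
{
    if ((int) enc < 0 || enc >= EX_OLDNEW_COUNT)
        return NULL;
    return ex_oldnew_tbl[enc].new_name;
}
```

Keeping both names in one table means the old function never changes its output, while new callers can move to the new names at their own pace.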
thanks,
--Barry
Barry Lind writes:
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.
Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
----- Original Message -----
From: Peter Eisentraut <peter_e@gmx.net>
Sent: Friday, August 17, 2001 12:11 PM
Karel Zak writes:
- synonyms for encoding names are now possible (for example ISO-8859-1,
Latin1, l1)
On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:
cyrillic, cp819, ibm819, isoir100x, l1-4
I'm not sure about the others, but 'cyrillic' is quite an ambiguous alias,
because it can denote many Slavic languages: Russian, Ukrainian,
Bulgarian, and Serbian are a few examples, so I believe it should be excluded
from the list of synonyms.
KOI8 needs to be aliased as koi8r.
... and Karel, can you change these so they are consistent with
the others:
KOI8_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapKOI8, sizeof(LUmapKOI8) / sizeof(pg_local_to_utf), PG_KOI8, len);
}
to
koi8r_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^
{
local_to_utf(iso, utf, LUmapKOI8R, sizeof(LUmapKOI8R) / sizeof(pg_local_to_utf), PG_KOI8R, len);
} ^^^^^ ^^^^^ ^^^^^
WIN_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapWIN, sizeof(LUmapWIN) / sizeof(pg_local_to_utf), PG_WIN1251, len);
}
to
win1251_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^^^
{
local_to_utf(iso, utf, LUmapWIN1251, sizeof(LUmapWIN1251) / sizeof(pg_local_to_utf), PG_WIN1251, len);
^^^^^^^ ^^^^^^^ ^^^^^^^
}
S.
Barry Lind writes:
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.
Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)
This will break the backward compatibility. I agree with Barry's opinion.
--
Tatsuo Ishii
Tatsuo Ishii writes:
Barry Lind writes:
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.
Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)
This will break the backward compatibility.
How so?
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.
Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)
This will break the backward compatibility.
How so?
Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
--
Tatsuo Ishii
Tatsuo Ishii writes:
Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.
However, the 7.1 JDBC driver is going to be incompatible with a 7.2 server
in a number of other areas as well, so I'm not completely sure whether
it'd be worth the effort.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.
I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
--
Tatsuo Ishii
Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.
I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
We are not, as far as I know. I have made my changes without doing
that.
However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version. I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept. I know we don't need it now, but
when we do need to up the protocol version number, we are stuck because
of the older releases that can't cope with this.
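The negotiation suggested here could look something like the sketch below. It is not the FE/BE protocol as implemented: the ex_ names, the reply structure and the plain integer version numbers are assumptions made for the example.

```c
/* Hypothetical reply to a client's startup request. */
typedef struct
{
    int      accepted;  /* 1 if the requested version can be used as-is */
    unsigned version;   /* version to use, or the server's counter-offer */
} ex_proto_reply;

/*
 * Server side: accept the requested version if it is within the supported
 * range; otherwise answer with the newest version the server can speak
 * instead of simply failing.
 */
ex_proto_reply
ex_server_negotiate(unsigned requested, unsigned server_min, unsigned server_max)
{
    ex_proto_reply reply;

    if (requested >= server_min && requested <= server_max)
    {
        reply.accepted = 1;
        reply.version = requested;
    }
    else
    {
        reply.accepted = 0;
        reply.version = server_max;
    }
    return reply;
}

/* Client side: can the client continue with the version the server offered? */
int
ex_client_accepts(unsigned offered, unsigned client_min, unsigned client_max)
{
    return offered >= client_min && offered <= client_max;
}
```

A newer client asking for version 3 from a server that supports 1 to 2 would get a counter-offer of 2, and it continues only if it can still speak version 2.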
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
We are not, as far as I know. I have made my changes without doing
that.
Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns official encoding names.
--
Tatsuo Ishii
Thanks for the patches, but...
1) There is a compiler error if --enable-unicode-conversion is not
enabled.
2) The patches break createdb. createdb should raise an error if
a client-only encoding such as SJIS is specified.
3) I don't like the following ugliness. Why not change all occurrences
of SQL_ASCII in the sources?
/*
* A lot of PG stuff use 'SQL_ASCII' without prefix (dirty...)
*/
#define SQL_ASCII PG_SQL_ASCII
4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
following Peter's suggestion):
ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)
Also, I'm wondering why windows-1251 and not windows_1251, or
ISO_8859_1 and not ISO-8859-1? There seems to be some confusion about
the usage of "_" and "-".
pg_enc2name pg_enc2name_tbl[] =
{
{ "SQL_ASCII", PG_SQL_ASCII },
{ "EUC_JP", PG_EUC_JP },
{ "EUC_CN", PG_EUC_CN },
{ "EUC_KR", PG_EUC_KR },
{ "EUC_TW", PG_EUC_TW },
{ "UNICODE", PG_UNICODE },
{ "MULE_INTERNAL",PG_MULE_INTERNAL },
{ "ISO_8859_1", PG_LATIN1 },
{ "ISO_8859_2", PG_LATIN2 },
{ "ISO_8859_3", PG_LATIN3 },
{ "ISO_8859_4", PG_LATIN4 },
{ "ISO_8859_5", PG_LATIN5 },
{ "KOI8", PG_KOI8 },
{ "window-1251",PG_WIN1251 },
{ "ALT", PG_ALT },
{ "Shift_JIS", PG_SJIS },
{ "Big5", PG_BIG5 },
{ "window-1250",PG_WIN1251 }
};
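The createdb check from point 2 could be sketched as below, assuming each table entry carries a flag saying whether the encoding is usable on the server side. The ex_ names, the flag and the return codes are hypothetical. Only SJIS is named as client-only in this thread, so the other entries are just illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry with a "usable as database encoding" flag. */
typedef struct
{
    const char *name;
    int         server_ok;  /* 0 means client-only */
} ex_enc_info;

static const ex_enc_info ex_enc_tbl[] = {
    { "SQL_ASCII", 1 },
    { "EUC_JP",    1 },
    { "Shift_JIS", 0 },     /* client-only, as noted in point 2 */
};

/*
 * Returns 0 if name may be used as a database encoding, -1 if the name
 * is unknown, and -2 if it is a client-only encoding.
 */
int
ex_check_database_encoding(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(ex_enc_tbl) / sizeof(ex_enc_tbl[0]); i++)
    {
        if (strcmp(ex_enc_tbl[i].name, name) == 0)
            return ex_enc_tbl[i].server_ok ? 0 : -2;
    }
    return -1;
}
```

createdb would then report an error for any non-zero result rather than creating a database with an encoding the backend cannot store.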
----- Original Message -----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Saturday, August 18, 2001 10:02 PM
ALT -> IBM866
Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used, for example,
when none of KOI8-R, Windows-1251, or IBM866 is available to a Russian-speaking
person for reading or writing text: we use plain English letters to write
Russian words so that the pronunciation stays roughly the same. It's
something like russian_latin (as an equivalent to greek_latin in the
http://www.iana.org/assignments/character-sets spec), and writing this
way is a bit reminiscent of Polish or Serbian-Latin.
Serguei
On Fri, Aug 17, 2001 at 06:11:00PM +0200, Peter Eisentraut wrote:
Karel Zak writes:
- synonyms for encoding names are now possible (for example ISO-8859-1,
Latin1, l1)
On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:
cyrillic, cp819, ibm819, isoir100x, l1-4
IMHO it is not a problem if PG understands more aliases, or is there some
relevant problem with that? :-)
ISO 8859 is a pretty well-known term these days.
KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,
Agreed.
actually. Do you know what encoding it stands for and could you add that
as an alias?
On the code:
#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif
needs to be written as
#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif
for portability.
OK, but it sounds curious (how does a compiler have a problem with it?)
For extra credit: A patch to configure and the documentation.
:-) That needs time... but yes, I will add it to the next patch version.
Thanks for the suggestions.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Fri, Aug 17, 2001 at 10:37:18AM -0700, Barry Lind wrote:
This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example "LATIN1" which used to be returned will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.
Hmm... but I agree with Peter that the correct solution is to rewrite it to use
standard names.
I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in
^^^^^^^^^^^^^^^^^^^^^
We could end up like the great Microsoft systems... a nice face but terrible old stuff
in the kernel.
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:
I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
We are not, as far as I know. I have made my changes without doing
that.
Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns official encoding names.
OK, is there a suggestion for the name of this function?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Sun, Aug 19, 2001 at 11:02:57AM +0900, Tatsuo Ishii wrote:
4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
following Peter's suggestion):
ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)
Right.
But we will still need the aliases UNICODE, ALT, KOI8 for backward compatibility.
Thanks, I will try to fix it all.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
ALT -> IBM866
Just a quick comment: ALT is not necessarily IBM866.
Ok. Let's leave ALT as it is.
--
Tatsuo Ishii
4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
following Peter's suggestion):
ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)
Right.
But we will still need the aliases UNICODE, ALT, KOI8 for backward compatibility.
Sure.
Thanks, I will try to fix it all.
Thanks! But it seems we will leave ALT as it is (Serguei's suggestion).
--
Tatsuo Ishii