encoding names

Started by Karel Zak over 24 years ago, 40 messages
#1Karel Zak
zakkr@zf.jcu.cz
1 attachment(s)

Hi,

attached is a patch with:

- new encoding-name code with better performance (binary search
instead of a for() loop, and it avoids some needless searching)

- synonyms for encodings are now possible (for example ISO-8859-1,
Latin1, l1)

- Peter's idea of "encoding name cleaning" is implemented
(characters other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is
the same as 'iso8859_1' or 'iso-8-8-5-9-1' :-)

- the routines for this are shared between FE and BE (no more defining
encoding names separately in FE and BE)

- the prefix PG_ is added to the encoding identifier macros; something
like 'ALT' is pretty dirty in source code, so use PG_ALT instead.

(Note: the patch adds a new file mb/encname.c and removes mb/common.c)
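
A minimal sketch of the name cleaning and binary search described above. This is not the code from the attached patch: the type enc_alias, the table contents, the ENC_* ids and the function names are all assumptions made for illustration. The idea is to strip everything except [A-Za-z0-9], fold to lower case, and then look the result up in an alias table that is kept sorted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding ids, for illustration only. */
enum { ENC_LATIN1 = 1, ENC_KOI8R = 2 };

typedef struct
{
    const char *name;   /* already cleaned: lower case, [a-z0-9] only */
    int         id;
} enc_alias;

/* Must stay sorted by name in strcmp() order, because bsearch() relies on it. */
static const enc_alias alias_tbl[] = {
    {"iso88591", ENC_LATIN1},
    {"koi8r", ENC_KOI8R},
    {"l1", ENC_LATIN1},
    {"latin1", ENC_LATIN1},
};

/*
 * Copy src into dst, keeping only ASCII letters and digits and folding
 * letters to lower case. Returns dst, or NULL if the result does not fit.
 */
static char *
clean_encname(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dstlen > 0);
    for (; *src != '\0'; src++)
    {
        char c = *src;

        if (c >= 'A' && c <= 'Z')
            c = (char) (c - 'A' + 'a');
        else if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')))
            continue;           /* '-', '_', spaces etc. are ignored */
        if (n + 1 >= dstlen)
            return NULL;        /* no room for this char plus the terminator */
        dst[n++] = c;
    }
    dst[n] = '\0';
    return dst;
}

/* bsearch() comparator: key is a cleaned name, elem is a table entry. */
static int
alias_cmp(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_alias *) elem)->name);
}

/* Returns the encoding id for any spelling of a known alias, or -1. */
int
encname_to_id(const char *name)
{
    char             buf[64];
    const enc_alias *hit;

    if (clean_encname(name, buf, sizeof(buf)) == NULL)
        return -1;
    hit = bsearch(buf, alias_tbl,
                  sizeof(alias_tbl) / sizeof(alias_tbl[0]),
                  sizeof(alias_tbl[0]), alias_cmp);
    return hit != NULL ? hit->id : -1;
}
```

With this table, 'ISO-8859-1', 'iso8859_1' and 'iso-8-8-5-9-1' all clean to "iso88591" and resolve to the same id. Because the cleaned names are stored once in a single sorted table, the same file could be compiled into both the frontend and the backend.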

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Attachments:

mb-08172001.patch.gz (application/x-gzip)
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Karel Zak (#1)
Re: encoding names

Karel Zak writes:

- possible is use synonyms for encoding (an example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

ISO 8859 is a pretty well-known term these days.

KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,
actually. Do you know what encoding it stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif

for portability.

For extra credit: A patch to configure and the documentation.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#3Barry Lind
barry@xythos.com
In reply to: Karel Zak (#1)
Re: encoding names

This patch will break the JDBC driver. The JDBC driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to Unicode. This patch changes the return values
of getdatabaseencoding() such that the driver will no longer work. For
example, "LATIN1", which used to be returned, will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that depends on the output of the
getdatabaseencoding() function.

I would recommend that getdatabaseencoding() return the old names for
backward compatibility and that the function then be deprecated and
removed in the future. Then create a new function that returns the new
encoding names and can be used going forward.
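
The recommendation above can be sketched as one table with two lookup functions. Everything here is hypothetical: the enc_id values, the table and the function names are assumptions for illustration, and the only names taken from this message are "LATIN1" (old) and "iso88591" (new). It is a sketch of the compatibility idea, not PostgreSQL's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding ids, for illustration only. */
typedef enum { ENC_SQL_ASCII = 0, ENC_LATIN1, ENC_COUNT } enc_id;

typedef struct
{
    const char *legacy;     /* name existing clients already parse */
    const char *canonical;  /* name a new function would report */
} enc_names;

/* Indexed by enc_id. */
static const enc_names enc_name_tbl[ENC_COUNT] = {
    {"SQL_ASCII", "SQL_ASCII"},
    {"LATIN1", "iso88591"},
};

static int
enc_id_is_valid(int id)
{
    return id >= 0 && id < ENC_COUNT;
}

/* What the old function would keep returning, so old drivers still work. */
const char *
legacy_encoding_name(int id)
{
    return enc_id_is_valid(id) ? enc_name_tbl[id].legacy : NULL;
}

/* What a new, separately named function would return. */
const char *
canonical_encoding_name(int id)
{
    return enc_id_is_valid(id) ? enc_name_tbl[id].canonical : NULL;
}

/* True if a client that only knows the legacy name would be affected. */
int
encoding_name_changed(int id)
{
    assert(enc_id_is_valid(id));
    return strcmp(enc_name_tbl[id].legacy, enc_name_tbl[id].canonical) != 0;
}
```

Keeping both spellings in one row means a later removal of the deprecated function only drops a column, and the two outputs cannot drift apart for the same encoding id.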

thanks,
--Barry


#4Peter Eisentraut
peter_e@gmx.net
In reply to: Barry Lind (#3)
Re: encoding names

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#5Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Peter Eisentraut (#2)
Re: encoding names

----- Original Message -----
From: Peter Eisentraut <peter_e@gmx.net>
Sent: Friday, August 17, 2001 12:11 PM

Karel Zak writes:

- possible is use synonyms for encoding (an example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

I'm not sure about the others, but 'cyrillic' is a quite ambiguous alias,
because it can denote many Slavic languages: Russian, Ukrainian,
Bulgarian and Serbian are a few examples, so I believe it should be excluded
from the list of synonyms.

KOI8 needs to be aliased as koi8r.

... and Karel can you change these so they are consistent with
others:

KOI8_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapKOI8, sizeof(LUmapKOI8) / sizeof(pg_local_to_utf), PG_KOI8, len);
}

to

koi8r_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^
{
local_to_utf(iso, utf, LUmapKOI8R, sizeof(LUmapKOI8R) / sizeof(pg_local_to_utf), PG_KOI8R, len);
^^^^^ ^^^^^ ^^^^^
}

WIN_to_utf(unsigned char *iso, unsigned char *utf, int len)
{
local_to_utf(iso, utf, LUmapWIN, sizeof(LUmapWIN) / sizeof(pg_local_to_utf), PG_WIN1251, len);
}

to

win1251_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^^^
{
local_to_utf(iso, utf, LUmapWIN1251, sizeof(LUmapWIN1251) / sizeof(pg_local_to_utf), PG_WIN1251, len);
^^^^^^^ ^^^^^^^ ^^^^^^^
}

S.

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#4)
Re: encoding names

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility. I agree with Barry's opinion.
--
Tatsuo Ishii

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#6)
Re: encoding names

Tatsuo Ishii writes:

Barry Lind writes:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility.

How so?

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#7)
Re: encoding names

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility.

How so?

Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
--
Tatsuo Ishii

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#8)
Re: encoding names

Tatsuo Ishii writes:

Apparently 7.1 JDBC driver does not understand the value 7.2
getdatabaseencoding() returns.

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

However, the 7.1 JDBC driver is going to be incompatible with a 7.2 server
in a number of other areas as well, so I'm not completely sure whether
it'd be worth the effort.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#9)
Re: encoding names

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
--
Tatsuo Ishii

#11Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#10)
Re: encoding names

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version. I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept. I know we don't need it now, but
when we do need to up the protocol version number, we are stuck because
of the older releases that can't cope with this.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#11)
Re: encoding names

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns the official encoding names.
--
Tatsuo Ishii

#13Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#1)
Re: [PATCHES] encoding names


Thanks for the patches, but...

1) There is a compiler error if --enable-unicode-conversion is not
enabled

2) The patches break createdb. createdb should raise an error if a
client-only encoding such as SJIS is specified.

3) I don't like the following ugliness. Why not change all occurrences
of SQL_ASCII in the sources?

/*
* A lot of PG stuff use 'SQL_ASCII' without prefix (dirty...)
*/
#define SQL_ASCII PG_SQL_ASCII

4) The "official" encoding names are inconsistent. Here are my suggested
changes (referring to http://www.iana.org/assignments/character-sets,
according to Peter's suggestion):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Also, I'm wondering why it is windows-1251 and not windows_1251, or
ISO_8859_1 and not ISO-8859-1. There seems to be some confusion about the
usage of "_" and "-".

pg_enc2name pg_enc2name_tbl[] =
{
{ "SQL_ASCII", PG_SQL_ASCII },
{ "EUC_JP", PG_EUC_JP },
{ "EUC_CN", PG_EUC_CN },
{ "EUC_KR", PG_EUC_KR },
{ "EUC_TW", PG_EUC_TW },
{ "UNICODE", PG_UNICODE },
{ "MULE_INTERNAL",PG_MULE_INTERNAL },
{ "ISO_8859_1", PG_LATIN1 },
{ "ISO_8859_2", PG_LATIN2 },
{ "ISO_8859_3", PG_LATIN3 },
{ "ISO_8859_4", PG_LATIN4 },
{ "ISO_8859_5", PG_LATIN5 },
{ "KOI8", PG_KOI8 },
{ "window-1251",PG_WIN1251 },
{ "ALT", PG_ALT },
{ "Shift_JIS", PG_SJIS },
{ "Big5", PG_BIG5 },
{ "window-1250",PG_WIN1251 }
};

#14Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Karel Zak (#1)
Re: Re: [PATCHES] encoding names

----- Original Message -----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Saturday, August 18, 2001 10:02 PM

ALT -> IBM866

Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used, for example,
when none of KOI8-R, Windows-1251, or IBM866 is available to a Russian-speaking
person to read or write text, messages and so on: we use plain English letters
to write Russian words so that the pronunciation stays roughly the same. It's
something like russian_latin (as an equivalent to greek_latin in the
http://www.iana.org/assignments/character-sets spec), and writing this
way resembles Polish or Serbian-Latin a bit.

Serguei

#15Karel Zak
zakkr@zf.jcu.cz
In reply to: Peter Eisentraut (#2)
Re: encoding names

On Fri, Aug 17, 2001 at 06:11:00PM +0200, Peter Eisentraut wrote:

Karel Zak writes:

- possible is use synonyms for encoding (an example ISO-8859-1,
Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

IMHO it is not a problem if PG understands more aliases, or is there some
relevant problem with that? :-)

ISO 8859 is a pretty well-know term these days.

KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,

Agree.

actually. Do you know what encoding is stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif

for portability.

OK, but it sounds curious (how can a compiler have a problem with it?)

For extra credit: A patch to configure and the documentation.

:-) It needs time... but yes, I will add it to the next patch version.

Thanks for the suggestions.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#16Karel Zak
zakkr@zf.jcu.cz
In reply to: Barry Lind (#3)
Re: encoding names

On Fri, Aug 17, 2001 at 10:37:18AM -0700, Barry Lind wrote:

This patch will break the JDBC driver. The jdbc driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example "LATIN1" which used to be returned will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.

Hmm... but I agree with Peter that the correct solution is to rewrite it
to use standard names.

I would recommend that getdatabaseencoding() return the old names for
backword compatibility and then deprecate this function to be removed in

^^^^^^^^^^^^^^^^^^^^^
We could end up like the great Microsoft systems... a nice face but terrible
old stuff in the kernel.

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#17Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#12)
Re: encoding names

On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
add a new function which returns official encoding names.

OK, is there a suggestion for the name of this function?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#18Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#13)
Re: [PATCHES] encoding names

On Sun, Aug 19, 2001 at 11:02:57AM +0900, Tatsuo Ishii wrote:

4) Encoding "official" names are inconsistent. Here are my suggested
changes (referring http://www.iana.org/assignments/character-sets,
according to Peter's suggestiuon):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Right.

But we will still need the aliases UNICODE, ALT and KOI8 for backward
compatibility.

Thanks, I will try to fix it all.
Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#19Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Serguei Mokhov (#14)
Re: Re: [PATCHES] encoding names

ALT -> IBM866

Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. Is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used for example,
when none of KOI8-R, Windows-1251, or IBM866 are available to a Russian-speaking
person to read/write any text, messages and stuff, we use simple English letters
to write words in Russian so that pronunciation sort of holds the same. It's
something like russian_latin (as an equivalent to greek_latin in the
http://www.iana.org/assignments/character-sets spec), and the writing this
way reminds Polish or Serbian-Latin a bit.

Ok. Let's leave ALT as it is.
--
Tatsuo Ishii

#20Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#18)
Re: [PATCHES] encoding names

4) Encoding "official" names are inconsistent. Here are my suggested
changes (referring http://www.iana.org/assignments/character-sets,
according to Peter's suggestiuon):

ALT -> IBM866
KOI8 -> KOI8_R
UNICODE -> UTF_8 (Peter's suggestion)

Right.

But we will still need aliases UNICODE, ALT, KOI8 for back compatibility.

Sure.

Thanks, I try fix all.

Thanks! But it seems we will leave ALT as it is (Serguei's suggestion).
--
Tatsuo Ishii

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#11)
Re: encoding names

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version. I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept.

Why? A client that wants to do this can retry with a lower version
number upon seeing the "unsupported protocol version" failure. There's
no need to change the postmaster code --- indeed, doing so would negate
the main value of such a feature, namely being able to talk to *old*
postmasters.
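
A minimal sketch of the client-side retry described above, assuming a hypothetical try-connect callback. None of these names (conn_status, connect_fn, connect_with_fallback, fake_old_postmaster) come from libpq. The fake postmaster exists only so the loop can be exercised, and it pretends to support protocol versions up to 2.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CONN_OK, CONN_BAD_PROTOCOL, CONN_FAILED } conn_status;

/* Hypothetical callback: one connection attempt with a given version. */
typedef conn_status (*connect_fn) (int proto_version, void *arg);

/*
 * Try protocol versions from newest down to oldest. Only an
 * "unsupported protocol version" failure triggers a retry; any other
 * failure is reported at once. Returns the version that worked, or -1.
 */
int
connect_with_fallback(connect_fn try_connect, void *arg,
                      int newest, int oldest)
{
    int v;

    assert(try_connect != NULL);
    for (v = newest; v >= oldest; v--)
    {
        conn_status st = try_connect(v, arg);

        if (st == CONN_OK)
            return v;
        if (st != CONN_BAD_PROTOCOL)
            return -1;          /* not a version problem, do not retry */
    }
    return -1;                  /* no common version found */
}

/* Stand-in for an old postmaster that accepts versions 2 and below. */
static conn_status
fake_old_postmaster(int proto_version, void *arg)
{
    (void) arg;
    return proto_version <= 2 ? CONN_OK : CONN_BAD_PROTOCOL;
}
```

The point of keeping the retry in the client is that nothing on the server side has to change, so it also works against servers that were released before the client existed.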

regards, tom lane

#22Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#13)
encoding: ODBC, createdb

I found some other things:

- why is the database encoding for a new DB checked by the 'createdb'
script and not by the CREATE DATABASE statement (I mean client-only
encodings, like BIG5)?

Bug?

- ODBC -- there is some multibyte stuff here too. Why doesn't the ODBC code
use pg_wchar.h, where everything is defined? In odbc/multibyte.h all the
encoding identifiers are defined again.

IMHO we can use the same solution for ODBC as for libpq and compile it
with the encname.c file too.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#23Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Karel Zak (#1)
Re: encoding: ODBC, createdb

Karel Zak wrote:

I found some other things:

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
all encoding identificators.

IMHO we can use for ODBC same solution as for libpq and compile it
with encname.c file too.

ODBC under Windows needs no source/header files in PostgreSQL
other than those in src/interfaces/odbc. It's not preferable for the
psqlodbc driver to be sensitive to other PostgreSQL changes,
because the driver has to be able to talk to multiple versions of
PostgreSQL servers. In fact the current driver can talk to
any server whose version is >= 6.2 (according to one person).
As for pg_wchar.h, I'm not sure whether it could be an exception
and whether we could expect its maintainer to take care of ODBC.
If I were he, I would hate it.

regards,
Hiroshi Inoue

#24Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#22)
Re: encoding: ODBC, createdb

I found some other things:

- why database encoding for new DB check 'createdb' script and not
CREATE DATABASE statement? (means client only encodings, like BIG5)?

Bug?

Oh, that must be a bug. Do you want to take care of it yourself?

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
all encoding identificators.

IMHO we can use for ODBC same solution as for libpq and compile it
with encname.c file too.

Don't know about ODBC. Hiroshi?
--
Tatsuo Ishii

#25Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#24)
Re: encoding: ODBC, createdb

On Tue, Aug 21, 2001 at 10:00:50AM +0900, Tatsuo Ishii wrote:

I found some other things:

- why database encoding for new DB check 'createdb' script and not
CREATE DATABASE statement? (means client only encodings, like BIG5)?

Bug?

Oh, that must be a bug. Do yo want to take care of it by yourself?

I will check and fix it. The 'createdb' script needn't check anything; it
all must be in the backend.
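
A minimal sketch of what a backend-side check could look like, so that every client gets the same answer instead of relying on the 'createdb' script. The table, the enum and the function name are assumptions for illustration, not the code in dbcommands.c. SJIS and BIG5 are marked client-only because this thread names them as such. Names are matched exactly here, and alias handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    ENC_CHECK_OK,           /* usable as a database encoding */
    ENC_CHECK_UNKNOWN,      /* no such encoding name */
    ENC_CHECK_CLIENT_ONLY   /* known, but valid on the client side only */
} enc_check;

typedef struct
{
    const char *name;
    int         server_ok;  /* nonzero if allowed as a database encoding */
} enc_info;

/* Hypothetical table, for illustration only. */
static const enc_info enc_tbl[] = {
    {"EUC_JP", 1},
    {"LATIN1", 1},
    {"SJIS", 0},
    {"BIG5", 0},
};

/* Decide whether CREATE DATABASE may use the named encoding. */
enc_check
check_database_encoding(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(enc_tbl) / sizeof(enc_tbl[0]); i++)
    {
        if (strcmp(name, enc_tbl[i].name) == 0)
            return enc_tbl[i].server_ok ? ENC_CHECK_OK
                                        : ENC_CHECK_CLIENT_ONLY;
    }
    return ENC_CHECK_UNKNOWN;
}
```

Distinguishing "unknown" from "client-only" lets the error message tell the user that the name is valid but cannot be used for a database.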

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#26Karel Zak
zakkr@zf.jcu.cz
In reply to: Hiroshi Inoue (#23)
Re: encoding: ODBC, createdb

On Tue, Aug 21, 2001 at 10:00:21AM +0900, Hiroshi Inoue wrote:

Karel Zak wrote:

I found some other things:

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
all encoding identificators.

IMHO we can use for ODBC same solution as for libpq and compile it
with encname.c file too.

ODBC under Windows needs no source/header files in PostgreSQL
other than in src/interfaces/odbc. It's not preferable for
psqlodbc driver to be sensitive about other PostgreSQL changes
because the driver has to be able to talk multiple versions of
PostgreSQL servers. In fact the current driver could talk to
any server whose version >= 6.2(according to a person).
As for pg_wchar.h I'm not sure if it could be an exception
and we could expect for the maintainer to take care of ODBC.
If I were he, I would hate it.

In odbc/multibyte.h there is

if (strstr(str, "%27SJIS%27") || strstr(str, "'SJIS'") ||
strstr(str, "'sjis'"))

... and the same line for BIG5.

I will only add the new names 'Shift_JIS' and 'Big5' here.
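
A minimal sketch of how that substring test could take more spellings without growing the condition. The function names and the lists are assumptions for illustration. The SJIS entries repeat the three strings quoted above plus 'Shift_JIS', and the BIG5 entries are guessed by analogy, since the actual BIG5 line is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Quoted spellings to look for; each list ends with NULL. */
static const char *const sjis_spellings[] = {
    "%27SJIS%27", "'SJIS'", "'sjis'", "'Shift_JIS'", NULL
};

/* Assumed by analogy with the SJIS list. */
static const char *const big5_spellings[] = {
    "%27BIG5%27", "'BIG5'", "'big5'", "'Big5'", NULL
};

/* Returns 1 if str contains any of the spellings, else 0. */
static int
mentions_any(const char *str, const char *const *spellings)
{
    assert(str != NULL && spellings != NULL);
    for (; *spellings != NULL; spellings++)
    {
        if (strstr(str, *spellings) != NULL)
            return 1;
    }
    return 0;
}

int
conn_settings_use_sjis(const char *str)
{
    return mentions_any(str, sjis_spellings);
}

int
conn_settings_use_big5(const char *str)
{
    return mentions_any(str, big5_spellings);
}
```

This keeps the driver free of backend headers, in line with the concern quoted above, while a new spelling becomes a one-line addition to a list.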

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#27Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#25)
Re: encoding: ODBC, createdb

On Tue, Aug 21, 2001 at 10:00:50AM +0900, Tatsuo Ishii wrote:

I found some other things:

- why database encoding for new DB check 'createdb' script and not
CREATE DATABASE statement? (means client only encodings, like BIG5)?

Bug?

Oh, that must be a bug. Do yo want to take care of it by yourself?

I check and fix it. The 'createdb' script needn't check somethig, all
must be in backend.

Agreed.
--
Tatsuo Ishii

#28Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#17)
Re: encoding names

On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.

We are not, as far as I know. I have made my changes without doing
that.

Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
add a new function which returns official encoding names.

Ok, Is here some suggestion for name of this function?

The new function returns "canonical database encoding names". So

"get_canonical_database_encoding" or a shorter name looks appropriate
to me.
--
Tatsuo Ishii

#29Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#28)
Re: encoding names

On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:

The new function returns "canonical database encoding names". So

"get_canonical_database_encoding" or shorter name looks appropriate
to me.

Oops, I overlooked this mail in my inbox. Hmm... I used getdbencoding(),
but we can change it later (before the 7.2 release, of course). It's
a cosmetic change.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#30Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#29)
Re: encoding names

On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:

The new function returns "canonical database encoding names". So

"get_canonical_database_encoding" or shorter name looks appropriate
to me.

Oops, I overlook this mail in my inbox. Hmm .. I use getdbencoding(),
but we can change it later (before 7.2 release of course). It's
cosmetic change.

I don't think you need to change the function name "getdbencoding".
"get_canonical_database_encoding" is too long anyway.
--
Tatsuo Ishii

#31Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#30)
Re: encoding names

Tatsuo Ishii writes:

I don't think you need to change the function name "getdbencoding".
"get_canonical_database_encoding" is too long anyway.

But getdbencoding isn't semantically different from the old
getdatabaseencoding. "encoding" isn't the right term anyway, methinks, it
should be "character set". So maybe database_character_set()? (No "get"
please.)

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#32Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#31)
Re: encoding names

Tatsuo Ishii writes:

I don't think you need to change the function name "getdbencoding".
"get_canonical_database_encoding" is too long anyway.

But getdbencoding isn't semantically different from the old
getdatabaseencoding. "encoding" isn't the right term anyway, methinks, it
should be "character set". So maybe database_character_set()? (No "get"
please.)

I'm not a native English speaker, so please feel free to choose more
appropriate name.

BTW, what's wrong with "encoding"? I don't think, for example EUC-JP
or utf-8, are character set names.
--
Tatsuo Ishii

#33Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#32)
Re: [PATCHES] encoding names

Tatsuo Ishii writes:

But getdbencoding isn't semantically different from the old
getdatabaseencoding. "encoding" isn't the right term anyway, methinks, it
should be "character set". So maybe database_character_set()? (No "get"
please.)

I'm not a native English speaker, so please feel free to choose more
appropriate name.

BTW, what's wrong with "encoding"? I don't think, for example EUC-JP
or utf-8, are character set names.

Hmm, SQL talks of character sets, it has a CHARACTER_SETS view and such.
It's slightly incorrect, I agree.

Maybe we should not touch getdatabaseencoding() right now, given that the
names we currently use are apparently almost correct anyway and
considering the pain it creates to alter them, and instead implement the
information schema views in the future?

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#34Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#33)
Re: [PATCHES] encoding names

BTW, what's wrong with "encoding"? I don't think, for example EUC-JP
or utf-8, are character set names.

Hmm, SQL talks of character sets, it has a CHARACTER_SETS view and such.
It's slightly incorrect, I agree.

Maybe we should not touch getdatabaseencoding() right now, given that the
names we currently use are apparently almost correct anyway and
considering the pain it creates to alter them, and instead implement the
information schema views in the future?

I thought the schema stuff would be introduced in 7.2, but apparently
that will not happen...
--
Tatsuo Ishii

#35Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#34)
Re: [PATCHES] encoding names

BTW, what's wrong with "encoding"? I don't think, for example EUC-JP
or utf-8, are character set names.

Hmm, SQL talks of character sets, it has a CHARACTER_SETS view and such.
It's slightly incorrect, I agree.

Maybe we should not touch getdatabaseencoding() right now, given that the
names we currently use are apparently almost correct anyway and
considering the pain it creates to alter them, and instead implement the
information schema views in the future?

I thought schema stuffs would be introduced in 7.2 but apparently it
would not happen...

I thought I could do it but ran out of time.

#36Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#34)
Re: [PATCHES] encoding names

Tatsuo Ishii writes:

Maybe we should not touch getdatabaseencoding() right now, given that the
names we currently use are apparently almost correct anyway and
considering the pain it creates to alter them, and instead implement the
information schema views in the future?

I thought schema stuffs would be introduced in 7.2 but apparently it
would not happen...

True, but right now we'd have to do rather elaborate changes just to
switch a couple of names to "more correct" versions. Accepting them as
input is good, but maybe we should hold back on the output part a bit
until we can do it correctly.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#37Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#36)
Re: Re: [PATCHES] encoding names

I thought schema stuffs would be introduced in 7.2 but apparently it
would not happen...

True, but right now we'd have to do rather elaborate changes just to
switch a couple of names to "more correct" versions. Accepting them as
input is good, but maybe we should hold back on the output part a bit
until we can do it correctly.

Agreed.
--
Tatsuo Ishii

#38Karel Zak
zakkr@zf.jcu.cz
In reply to: Peter Eisentraut (#36)
Re: [PATCHES] encoding names

On Fri, Aug 24, 2001 at 09:29:06PM +0200, Peter Eisentraut wrote:

Tatsuo Ishii writes:

Maybe we should not touch getdatabaseencoding() right now, given that the
names we currently use are apparently almost correct anyway and
considering the pain it creates to alter them, and instead implement the
information schema views in the future?

I thought schema stuffs would be introduced in 7.2 but apparently it
would not happen...

True, but right now we'd have to do rather elaborate changes just to
switch a couple of names to "more correct" versions. Accepting them as
input is good, but maybe we should hold back on the output part a bit
until we can do it correctly.

Changing the output is very easy work (edit the strings in one array). The
important thing is to clean up the internal code for encoding names so
that it is faster and not duplicated (the same code for FE and BE).

Well, I will prepare it with fully backward-compatible output for all
current routines (pg_char_to_encoding too), and the new names will be
visible through new routines only (the suggested database_character_set(),
etc). Right?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#39Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Karel Zak (#22)
Re: encoding: ODBC, createdb

Was this completed?

I found some other things:

- why database encoding for new DB check 'createdb' script and not
CREATE DATABASE statement? (means client only encodings, like BIG5)?

Bug?

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
all encoding identificators.

IMHO we can use for ODBC same solution as for libpq and compile it
with encname.c file too.

#40Karel Zak
zakkr@zf.jcu.cz
In reply to: Bruce Momjian (#39)
Re: encoding: ODBC, createdb

On Fri, Sep 07, 2001 at 04:11:25PM -0400, Bruce Momjian wrote:

Was this completed?

I found some other things:

- why database encoding for new DB check 'createdb' script and not
CREATE DATABASE statement? (means client only encodings, like BIG5)?

It was included in my large multibyte patch and it's complete in
dbcommands.c (it was an unreported bug in previous releases).

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
all encoding identificators.

Probably done; that is for the ODBC maintainer to check.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz