Proposal: CREATE CONVERSION

Started by Tatsuo Ishii · over 23 years ago · 50 messages
#1 Tatsuo Ishii
t-ishii@sra.co.jp

Here is my proposal for a new CREATE CONVERSION command, which makes it
possible to define a new encoding conversion mapping between two
encodings on the fly.

The background:

We are getting more and more encoding conversion tables. So far they
amount to 385,352 source lines and over 3MB in compiled form in total.
They are statically linked into the backend. I know this by itself is
not a problem, since modern OSes have smart memory management that
fetches only the necessary pages from disk. However, I'm worried about
the unbounded growth of these static tables. I think users won't love a
50MB PostgreSQL backend load module.

The second problem is more serious. The conversion definitions between
certain encodings, such as Unicode and others, are not well
defined. For example, there are several conversion tables between
Japanese Shift JIS and Unicode. This is because each vendor has its own
"special characters" and defines the table so that the conversion fits
its purpose.

The solution:

The proposed new CREATE CONVERSION will solve these problems. A
particular conversion table is statically linked into a dynamically
loaded function, and CREATE CONVERSION tells PostgreSQL that if a
conversion from encoding A to encoding B is requested, function C
should be used. In this way, conversion tables are no longer statically
linked into the backend.

Users could also easily define their own conversion tables that best
fit their purposes. Needless to say, people could also define new
conversions that PostgreSQL does not yet support.

Syntax proposal:

CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>
;
DROP CONVERSION <conversion name>;

Example usage:

CREATE OR REPLACE FUNCTION euc_jp_to_utf8(TEXT, TEXT, INTEGER)
RETURNS INTEGER AS 'euc_jp_to_utf8.so' LANGUAGE 'C';
CREATE CONVERSION euc_jp_to_utf8
SOURCE EUC_JP DESTINATION UNICODE
FROM euc_jp_to_utf8;

Implementation:

Implementation would be quite straightforward. Create a new system
table, and have CREATE CONVERSION store its info there.
pg_find_encoding_converters() (utils/mb/mbutils.c) and friends need to
be modified so that they recognize dynamically defined conversions.
psql would also need the capability to print conversion definition
info.

Comments?
--
Tatsuo Ishii

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Syntax proposal:
CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>

Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?

Basically I'd like to see a spec for the API of the conversion
function...

Also, is there anything in SQL99 that we ought to try to be
compatible with?

regards, tom lane

#3 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#2)
Re: Proposal: CREATE CONVERSION

CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>

Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?

No, only one function is sufficient. What else are you thinking of?

Basically I'd like to see a spec for the API of the conversion
function...

That would be very simple (the previous example I gave was
unnecessarily complex). The function signature would look like:

conversion_function(TEXT) RETURNS TEXT

It receives the source text, converts it, and returns the result.
That's all.

Also, is there anything in SQL99 that we ought to try to be
compatible with?

As far as I know there's no equivalent in SQL99.
--
Tatsuo Ishii

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#3)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?

No, only one function is sufficient. What else are you thinking of?

I see two different functions linked to from each pg_wchar_table
entry... although perhaps those are associated with encodings,
not with conversions.

Basically I'd like to see a spec for the API of the conversion
function...

That would be very simple (the previous example I gave was
unnecessarily complex). The function signature would look like:
conversion_function(TEXT) RETURNS TEXT
It receives the source text, converts it, and returns the result. That's all.

IIRC the existing conversion functions deal in C string pointers and
lengths. I'm a little worried about the extra overhead implicit
in converting to a TEXT object and back again; that probably means at
least two more palloc and memcpy operations. I think you'd be better
off sticking to a C-level API, because I really don't believe that
anyone is going to code conversion functions in (say) plpgsql.

regards, tom lane

#5 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#1)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii wrote:

Here is my proposal for a new CREATE CONVERSION command, which makes it
possible to define a new encoding conversion mapping between two
encodings on the fly.

The background:

We are getting more and more encoding conversion tables. So far they
amount to 385,352 source lines and over 3MB in compiled form in total.
They are statically linked into the backend. I know this by itself is
not a problem, since modern OSes have smart memory management that
fetches only the necessary pages from disk. However, I'm worried about
the unbounded growth of these static tables. I think users won't love a
50MB PostgreSQL backend load module.

Yes, those conversion tables are getting huge in the tarball too:

$ pwd
/pg/backend/utils/mb
$ du
4 ./CVS
7 ./Unicode/CVS
9541 ./Unicode
15805 .

Look at these two file alone:

-rw-r--r-- 1 postgres wheel 1427492 Jun 13 04:28 gb18030_to_utf8.map
-rw-r--r-- 1 postgres wheel 1427492 Jun 13 04:28 utf8_to_gb18030.map

If we can make these loadable, that would be good. What would be really
interesting is if we could split these out into a separate
directory/project, so that development on them could take place
independently. This would probably stimulate even more encoding options
for users.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#4)
Re: Proposal: CREATE CONVERSION

I see two different functions linked to from each pg_wchar_table
entry... although perhaps those are associated with encodings
not with conversions.

Yes, those are not directly associated with conversions.

IIRC the existing conversion functions deal in C string pointers and
lengths. I'm a little worried about the extra overhead implicit
in converting to a TEXT object and back again; that probably means at
least two more palloc and memcpy operations. I think you'd be better
off sticking to a C-level API, because I really don't believe that
anyone is going to code conversion functions in (say) plpgsql.

I am worried about that too. But if we stick to a C-level API, how can
we define an argument data type suitable for a C string? I don't see
such a data type. Maybe you are suggesting that we should not use
CREATE FUNCTION?
--
Tatsuo Ishii

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#6)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I am worried about that too. But if we stick to a C-level API, how can
we define an argument data type suitable for a C string? I don't see
such a data type. Maybe you are suggesting that we should not use
CREATE FUNCTION?

Well, you'd have to use the same cheat that's used for selectivity
estimation functions, triggers, I/O functions and everything else that
deals in internal datatypes: declare the function as taking and
returning OPAQUE. This is moderately annoying but I don't think
there's anything really wrong with it in practice.
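
For illustration, here is a minimal sketch (hypothetical code, just to
show the shape of the C-level API being discussed, not anything from a
patch) of such an OPAQUE-declared conversion routine:

/*
 * Hypothetical sketch, not PostgreSQL source: a conversion routine that
 * deals in plain C string pointers and a length.  At the SQL level it
 * would be declared along the lines of
 *
 *   CREATE FUNCTION latin1_to_utf8(OPAQUE, OPAQUE, INTEGER)
 *     RETURNS INTEGER AS 'latin1_to_utf8.so' LANGUAGE 'C';
 */
static void
latin1_to_utf8(const unsigned char *src, unsigned char *dest, int len)
{
    int i;

    for (i = 0; i < len; i++)
    {
        unsigned char c = src[i];

        if (c < 0x80)
            *dest++ = c;            /* ASCII maps to itself */
        else
        {
            /* LATIN1 0x80-0xFF becomes a two-byte UTF-8 sequence */
            *dest++ = 0xC0 | (c >> 6);
            *dest++ = 0x80 | (c & 0x3F);
        }
    }
    *dest = '\0';                   /* null-terminated C string out */
}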

regards, tom lane

#8 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#7)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I am worried about that too. But if we stick to a C-level API, how can
we define an argument data type suitable for a C string? I don't see
such a data type. Maybe you are suggesting that we should not use
CREATE FUNCTION?

Well, you'd have to use the same cheat that's used for selectivity
estimation functions, triggers, I/O functions and everything else that
deals in internal datatypes: declare the function as taking and
returning OPAQUE. This is moderately annoying but I don't think
there's anything really wrong with it in practice.

Oh, I see.
--
Tatsuo Ishii

#9 Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#3)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii writes:

Also, is there anything in SQL99 that we ought to try to be
compatible with?

As far as I know there's no equivalent in SQL99.

Sure:

11.34 <translation definition>

Function

Define a character translation.

Format

<translation definition> ::=
CREATE TRANSLATION <translation name>
FOR <source character set specification>
TO <target character set specification>
FROM <translation source>

<source character set specification> ::= <character set specification>

<target character set specification> ::= <character set specification>

<translation source> ::=
<existing translation name>
| <translation routine>

<existing translation name> ::= <translation name>

<translation routine> ::= <specific routine designator>

That's pretty much exactly what you are describing.

What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,
I have 176 "modules" under /usr/lib/gconv. They should be useful for
something.
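
As a minimal sketch (assuming only the POSIX iconv(3) interface,
nothing PostgreSQL-specific), a conversion routine delegating to the C
library could look like this:

#include <iconv.h>

/* Convert inlen bytes from one named encoding to another via iconv(3).
 * Returns 0 on success, -1 on failure.  The caller must supply an
 * output buffer large enough for the converted, null-terminated result.
 */
static int
convert_with_iconv(const char *from, const char *to,
                   char *in, size_t inlen,
                   char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);  /* note order: (tocode, fromcode) */
    size_t  rc;

    if (cd == (iconv_t) -1)
        return -1;                      /* conversion not supported */

    rc = iconv(cd, &in, &inlen, &out, &outlen);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return -1;                      /* invalid or unconvertible input */
    *out = '\0';
    return 0;
}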

--
Peter Eisentraut peter_e@gmx.net

#10 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#9)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii writes:

Also, is there anything in SQL99 that we ought to try to be
compatible with?

As far as I know there's no equivalent in SQL99.

Sure:

11.34 <translation definition>

I guess you are mixing up SQL99's "translate" and "convert".

As far as I know, SQL99's "translation" is exactly a translation, e.g.:

rr) translation: A method of translating characters in one character
repertoire into characters of the same or a different character
repertoire.

For example, a certain "translation" might take English text as input
and produce Japanese output (I don't know if we could implement such a
translation, though :-).

On the other hand "convert" just changes the "form-of-use" (SQL's
term, actually equivalent to "encoding"), keeping the character
repertoire.
--
Tatsuo Ishii

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#10)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I guess you are mixing up SQL99's "translate" and "convert".

No, I believe Peter has read the spec correctly. Further down they have

<character translation> is a function for changing each character
of a given string according to some many-to-one or one-to-one
mapping between two not necessarily distinct character sets.

So this is intended as a one-character-at-a-time mapping, not a language
translation (which would be far beyond what anyone would expect of a
database anyway).

One thing that's really unclear to me is what's the difference between
a <character translation> and a <form-of-use conversion>, other than
that they didn't provide a syntax for defining new conversions.

regards, tom lane

#12 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#11)
Re: Proposal: CREATE CONVERSION

Tom Lane writes:

One thing that's really unclear to me is what's the difference between
a <character translation> and a <form-of-use conversion>, other than
that they didn't provide a syntax for defining new conversions.

The standard has this messed up. In part 1, a form-of-use and an encoding
are two distinct things that can be applied to a character repertoire (see
clause 4.6.2.1), whereas in part 2 the term encoding is used in the
definition of form-of-use (clause 3.1.5 r).

When I sort it out, however, I think that what Tatsuo was describing is
indeed a form-of-use conversion. Note that in part 2, clause 4.2.2.1, it
says about form-of-use conversions,

It is intended,
though not enforced by this part of ISO/IEC 9075, that S2 be
exactly the same sequence of characters as S1, but encoded
according to some different form-of-use. A typical use might be to
convert a character string from two-octet UCS to one-octet Latin1
or vice versa.

This seems to match what we're doing.

A character translation does not make this requirement and it explicitly
calls out the possibility of "many-to-one or one-to-one mapping between
two not necessarily distinct character sets". I imagine that what this is
intended to do is to allow the user to create mappings such as
ö -> oe (as is common in German to avoid using characters with
diacritic marks), or ô -> o (as one might do in French to achieve the
same). In fact, it's a glorified sed command.

So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.

--
Peter Eisentraut peter_e@gmx.net

#13 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#12)
Re: Proposal: CREATE CONVERSION

So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.

Ok. What about this?

CREATE CONVERSION <conversion name>
FOR <encoding name>
TO <encoding name>
FROM <conversion routine name>

DROP CONVERSION <conversion name>

BTW, I wonder if we should invent a new access privilege for conversions.
--
Tatsuo Ishii

#14 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#13)
Re: Proposal: CREATE CONVERSION

Here is a proposal for a new pg_conversion system table. Comments?

/*-------------------------------------------------------------------------
*
* pg_conversion.h
* definition of the system "conversion" relation (pg_conversion)
* along with the relation's initial contents.
*
*
* Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $Id$
*
* NOTES
* the genbki.sh script reads this file and generates .bki
* information from the DATA() statements.
*
*-------------------------------------------------------------------------
*/
#ifndef PG_CONVERSION_H
#define PG_CONVERSION_H

/* ----------------
* postgres.h contains the system type definitions and the
* CATALOG(), BOOTSTRAP and DATA() sugar words so this file
* can be read by both genbki.sh and the C compiler.
* ----------------
*/

/* ----------------------------------------------------------------
* pg_conversion definition.
*
* cpp turns this into typedef struct FormData_pg_conversion
*
* conname name of the conversion
* connamespace name space which the conversion belongs to
* conowner owner of the conversion
* conforencoding FOR encoding id
* contoencoding TO encoding id
* conproc OID of the conversion proc
* ----------------------------------------------------------------
*/
CATALOG(pg_conversion)
{
NameData conname;
Oid connamespace;
int4 conowner;
int4 conforencoding;
int4 contoencoding;
Oid conproc;
} FormData_pg_conversion;

/* ----------------
* Form_pg_conversion corresponds to a pointer to a tuple with
* the format of pg_conversion relation.
* ----------------
*/
typedef FormData_pg_conversion *Form_pg_conversion;

/* ----------------
* compiler constants for pg_conversion
* ----------------
*/

#define Natts_pg_conversion 6
#define Anum_pg_conversion_conname 1
#define Anum_pg_conversion_connamespace 2
#define Anum_pg_conversion_conowner 3
#define Anum_pg_conversion_conforencoding 4
#define Anum_pg_conversion_contoencoding 5
#define Anum_pg_conversion_conproc 6

/* ----------------
* initial contents of pg_conversion
* ---------------
*/

/*
* prototypes for functions in pg_conversion.c
*/
extern Oid ConversionCreate(const char *conname, Oid connamespace,
int32 conowner,
int4 conforencoding, int4 contoencoding,
Oid conproc);

#endif /* PG_CONVERSION_H */
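
To make the intended use concrete, here is an illustrative,
self-contained sketch (stand-in types and a plain array in place of the
real catalog machinery) of the lookup pg_find_encoding_converters()
would perform against pg_conversion:

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct
{
    int conforencoding;     /* FOR (source) encoding id */
    int contoencoding;      /* TO (destination) encoding id */
    Oid conproc;            /* OID of the conversion function */
} ConversionEntry;

/* Return the conversion proc for an encoding pair, or InvalidOid if no
 * conversion is defined; the caller would then raise an error or fall
 * back to a built-in conversion.
 */
static Oid
find_conversion(const ConversionEntry *tbl, int n, int from, int to)
{
    int i;

    for (i = 0; i < n; i++)
        if (tbl[i].conforencoding == from && tbl[i].contoencoding == to)
            return tbl[i].conproc;
    return InvalidOid;
}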

#15 Karel Zak
zakkr@zf.jcu.cz
In reply to: Peter Eisentraut (#9)
Re: Proposal: CREATE CONVERSION

On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:

What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,

^^^^^^^

Very good point. Why use our own conversion routines/tables if there is
a common library for this?

An encoding API for PostgreSQL is a really cool idea.

I'm unsure about having only one argument for the encoding function.
What if I want to use one generic function for all encodings (for
example, as an API to iconv)? I think a better C interface is:

encode( TEXT data, NAME from, NAME to );

where from/to are encoding names. The other way is to use some struct
that handles this information -- like ARGS in trigger functions.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#16 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#15)
Re: Proposal: CREATE CONVERSION

On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:

What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,

^^^^^^^

Very good point. Why use our own conversion routines/tables if there is
a common library for this?

I'm still not sure about the details of the conversion maps used by
iconv. Japanese users have enough trouble with the conversion between
Unicode and other charsets. This is because there are many variations
of the conversion maps provided by vendors. For example, the conversion
map used for Unicode and SJIS in PostgreSQL has been carefully designed
to minimize the problems described above. Another issue is the
availability of iconv across platforms. If we are sure that a
particular iconv conversion routine is available on all platforms and
the conversion result is good enough, our conversion routine could be
replaced by a new one using iconv.

An encoding API for PostgreSQL is a really cool idea.

I'm unsure about having only one argument for the encoding function.
What if I want to use one generic function for all encodings (for
example, as an API to iconv)?

Use a simple wrapper function.
--
Tatsuo Ishii

#17 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#16)
Re: Proposal: CREATE CONVERSION

On Mon, Jul 08, 2002 at 09:59:44PM +0900, Tatsuo Ishii wrote:

On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:

What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,

^^^^^^^

Very good point. Why use our own conversion routines/tables if there is
a common library for this?

I'm still not sure about the details of the conversion maps used by
iconv. Japanese users have enough trouble with the conversion between
Unicode and other charsets. This is because there are many variations
of the conversion maps provided by vendors. For example, the conversion
map used for Unicode and SJIS in PostgreSQL has been carefully designed
to minimize the problems described above. Another issue is the
availability of iconv across platforms. If we are sure that a
particular iconv conversion routine is available on all platforms and
the conversion result is good enough, our conversion routine could be
replaced by a new one using iconv.

This is not a problem if we have some common API. You can use the
current conversion tables (maps), and I, for example, can use iconv
on my i386/Linux.

I don't want to replace the current maps if somebody needs them. I
would like an API.

I see iconv is included in glibc now.

I'm unsure about having only one argument for the encoding function.
What if I want to use one generic function for all encodings (for
example, as an API to iconv)?

Use a simple wrapper function.

How does this function know the to/from encodings?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#13)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

BTW, I wonder if we should invent a new access privilege for conversions.

I believe the spec just demands USAGE on the underlying function for
the TRANSLATE case, and I don't see why it should be different for
CONVERT. (In principle, if we didn't use a C-only API, you could
just call the underlying function directly; so there's little point
in having protection restrictions different from that case.)

regards, tom lane

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#14)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

CATALOG(pg_conversion)
{
NameData conname;
Oid connamespace;
int4 conowner;
int4 conforencoding;
int4 contoencoding;
Oid conproc;
} FormData_pg_conversion;

Should use type "regproc" for conproc, I think.

regards, tom lane

#20 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Peter Eisentraut (#12)
Re: Proposal: CREATE CONVERSION

...

So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.

Tatsuo, it seems that we should use SQL99 terminology and commands where
appropriate. We do not yet implement the SQL99 forms of character
support, and I'm not sure if our current system is modeled to fit the
SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
infringing on SQL99 syntax to allow us to use that sometime later?

- Thomas

#21 Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#14)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii writes:

Here is a proposal for a new pg_conversion system table. Comments?

I wonder if the encodings themselves shouldn't be represented in some
system table, too. Admittedly, this is nearly orthogonal to the
proposed system table, except perhaps for the data type of the two
encoding fields.

--
Peter Eisentraut peter_e@gmx.net

#22 Peter Eisentraut
peter_e@gmx.net
In reply to: Thomas Lockhart (#20)
Re: Proposal: CREATE CONVERSION

Thomas Lockhart writes:

Tatsuo, it seems that we should use SQL99 terminology and commands where
appropriate. We do not yet implement the SQL99 forms of character
support, and I'm not sure if our current system is modeled to fit the
SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
infringing on SQL99 syntax to allow us to use that sometime later?

SQL99 says that the method by which conversions are created is
implementation-defined. Tatsuo is defining the implementation.

--
Peter Eisentraut peter_e@gmx.net

#23 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#20)
Re: Proposal: CREATE CONVERSION

Tatsuo, it seems that we should use SQL99 terminology and commands where
appropriate. We do not yet implement the SQL99 forms of character
support, and I'm not sure if our current system is modeled to fit the
SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
infringing on SQL99 syntax to allow us to use that sometime later?

I'm not sure I understand your question, but I would say that I'd like
to follow SQL99 as much as possible.

When you say "We do not yet implement the SQL99 forms of character
support", I think you mean the ability to specify a charset per column
(or even per string). I don't think this will happen for 7.3 (or 8.0,
whatever), but sometime later I would like to make it a reality.
--
Tatsuo Ishii

#24 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#17)
Re: Proposal: CREATE CONVERSION

Use a simple wrapper function.

How does this function know the to/from encodings?

For example, if you want to define a LATIN1 to UNICODE conversion, the
function would look like:

function_for_latin1_to_utf8(from_string opaque, to_string opaque,
                            length integer)
{
    ...
    generic_function_using_iconv(from_str, to_str, "ISO-8859-1",
                                 "UTF-8", length);
}

CREATE FUNCTION function_for_latin1_to_utf8(opaque, opaque, integer)
RETURNS integer;
CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
function_for_latin1_to_utf8;

#25 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#21)
Re: Proposal: CREATE CONVERSION

Here is a proposal for a new pg_conversion system table. Comments?

I wonder if the encodings themselves shouldn't be represented in some
system table, too. Admittedly, this is nearly orthogonal to the
proposed system table, except perhaps for the data type of the two
encoding fields.

That would be ideal, but I think that will happen at the same time as
CREATE CHARACTER SET is implemented.
--
Tatsuo Ishii

#26 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#18)
Re: Proposal: CREATE CONVERSION

I believe the spec just demands USAGE on the underlying function for
the TRANSLATE case, and I don't see why it should be different for
CONVERT. (In principle, if we didn't use a C-only API, you could
just call the underlying function directly; so there's little point
in having protection restrictions different from that case.)

Ok, so:

(1) a CONVERSION can only be dropped by the superuser or its owner.
(2) the GRANT syntax for a CONVERSION is:

GRANT USAGE ON CONVERSION <conversion_name> TO
{<user_name> | GROUP <group_name> | PUBLIC} [, ...]
--
Tatsuo Ishii

#27 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Peter Eisentraut (#12)
Re: Proposal: CREATE CONVERSION

When you say "We do not yet implement the SQL99 forms of character
support", I think you mean the ability to specify a charset per column
(or even per string). I don't think this will happen for 7.3 (or 8.0,
whatever), but sometime later I would like to make it a reality.

Right.

An aside: I was thinking about this some, from the PoV of using our
existing type system to handle this (as you might remember, this is an
inclination I've had for quite a while). I think that most things line
up fairly well to allow this (and having transaction-enabled features
may require it), but do notice that the SQL feature of allowing a
different character set for every column *name* does not map
particularly well to our underlying structures.

- Thomas

#28 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#27)
Re: Proposal: CREATE CONVERSION

An aside: I was thinking about this some, from the PoV of using our
existing type system to handle this (as you might remember, this is an
inclination I've had for quite a while). I think that most things line
up fairly well to allow this (and having transaction-enabled features
may require it), but do notice that the SQL feature of allowing a
different character set for every column *name* does not map
particularly well to our underlying structures.

I've been thinking about this for a while too. What about collations?
If we add new charsets A and B, and each has 10 collations, then we are
going to have 20 new types? That seems like overkill to me.
--
Tatsuo Ishii

#29 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Thomas Lockhart (#20)
Re: Proposal: CREATE CONVERSION

I've been thinking about this for a while too. What about collations?
If we add new charsets A and B, and each has 10 collations, then we are
going to have 20 new types? That seems like overkill to me.

Well, afaict all of the operations we would ask of a type we will be
required to provide for character sets and collations. So ordering,
conversions, operators, index methods, etc etc are all required. It
*does* seem like a lot of work, but the type system is specifically
designed to do exactly this. Lifting those capabilities out of the type
system only to reimplement them elsewhere seems all trouble with no
upside.

Perhaps the current concept of "binary compatible types" could help
reduce the complexity, if it were made extensible, which it needs
anyway. But in most cases the character set/collation pair is a unique
combination, with a limited set of possibilities for other character
set/collation pairs with equivalent forms of use, which would keep us
from being able to reuse pieces anyway.

For most installations, we would install just those character sets the
installation/database requires, so in practice the database size need
not grow much beyond what it already is. And we could have conventions
on how functions and operators are named for a character set and/or
collation, so we could auto-generate the SQL definitions given an
implementation which meets a template standard.

Hmm, an aside which might be relevant: I've been looking at the
"national character string" syntax (you know, the N'string' convention)
and at the binary and hex string syntax (B'101010' and X'AB1D', as
examples) and would like to implement them in the lexer and parser by
having the string preceded with a type identifier as though they were
something like

NATIONAL CHARACTER 'string'
BIN '101010'
HEX 'AB1D'

where both BIN and HEX result in the *same* underlying data type once
ingested (or at least a reasonable facsimile). I won't be allowed to
create two data types with the same type OID, but maybe if I assign them
to be binary compatible then I won't have to flesh out the hex data type
but only provide an input and output function.

- Thomas

#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#26)
Re: Proposal: CREATE CONVERSION

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I believe the spec just demands USAGE on the underlying function for
the TRANSLATE case, and I don't see why it should be different for
CONVERT. (In principle, if we didn't use a C-only API, you could
just call the underlying function directly; so there's little point
in having protection restrictions different from that case.)

Ok, so:

(1) a CONVERSION can only be dropped by the superuser or its owner.

Okay ...

(2) the GRANT syntax for a CONVERSION is:

GRANT USAGE ON CONVERSION <conversion_name> TO
{<user_name> | GROUP <group_name> | PUBLIC} [, ...]

No, I don't think a conversion has any privileges of its own at all.
You either have USAGE on the underlying function, or not.

regards, tom lane

#31 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#29)
Re: Proposal: CREATE CONVERSION

I've been thinking about this for a while too. What about collations?
If we add new charsets A and B, and each has 10 collations, then we are
going to have 20 new types? That seems like overkill to me.

Well, afaict all of the operations we would ask of a type we will be
required to provide for character sets and collations. So ordering,
conversions, operators, index methods, etc etc are all required. It
*does* seem like a lot of work, but the type system is specifically
designed to do exactly this. Lifting those capabilities out of the type
system only to reimplement them elsewhere seems all trouble with no
upside.

If so, what about the "coercibility" property?
The standard defines four distinct coercibility properties. So in my
example above, you are actually going to define 80 new types?
(Also, a collation can be either "PAD SPACE" or "NO PAD", so you might
have 160 new types.)
--
Tatsuo Ishii

#32 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#30)
Re: Proposal: CREATE CONVERSION

(1) a CONVERSION can only be dropped by the superuser or its owner.

Okay ...

(2) the GRANT syntax for a CONVERSION is:

GRANT USAGE ON CONVERSION <conversion_name> TO
{<user_name> | GROUP <group_name> | PUBLIC} [, ...]

No, I don't think a conversion has any privileges of its own at all.
You either have USAGE on the underlying function, or not.

I see.
--
Tatsuo Ishii

#33 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Thomas Lockhart (#27)
Re: Proposal: CREATE CONVERSION

If so, what about the "coercibility" property?
The standard defines four distinct coercibility properties. So in my
example above, you are actually going to define 80 new types?
(Also, a collation can be either "PAD SPACE" or "NO PAD", so you might
have 160 new types.)

Well, yes I suppose so. The point is that these relationships *must be
defined anyway*. Allowed and forbidden conversions must be defined,
collation order must be defined, indexing operations must be defined,
etc etc etc. In fact, everything typically associated with a type must
be defined, including the allowed conversions between other types
(character sets/collations).

So, how are we going to do this *in a general way* without carrying the
infrastructure of a (the) type system along with it? What would we be
able to leave out or otherwise get for free if we use another mechanism?
And is that mechanism fundamentally simpler than (re)using the type
system that we already have?

- Thomas

#34 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#33)
Re: Proposal: CREATE CONVERSION

If so, what about the "coercibility" property?
The standard defines four distinct coercibility properties. So in my
example above, you are actually going to define 80 new types?
(Also, a collation can be either "PAD SPACE" or "NO PAD", so you might
have 160 new types.)

Well, yes I suppose so. The point is that these relationships *must be
defined anyway*. Allowed and forbidden conversions must be defined,
collation order must be defined, indexing operations must be defined,
etc etc etc. In fact, everything typically associated with a type must
be defined, including the allowed conversions between other types
(character sets/collations).

So, how are we going to do this *in a general way* without carrying the
infrastructure of a (the) type system along with it? What would we be
able to leave out or otherwise get for free if we use another mechanism?
And is that mechanism fundamentally simpler than (re)using the type
system that we already have?

Well, I think charset/collation/coercibility/pad are all properties
specific to string data types, not common to any other data types. So
it seems more appropriate for the type system not to have such
type-specific knowledge. For example,

S1 < S2

should raise an error if S1 has "no collating properties" and S2 has
"implicit collating properties", while it is OK if S1 has "no collating
properties" and S2 has "explicit collating properties". It would be
very hard for the type system to handle this kind of case, since it
requires special knowledge about string data types.

Alternative?

Why don't we have these properties in the string data itself?
(Probably we do not need to have them in the on-disk storage.) The
existing text data type has length + data. I suggest extending it like:

length + charset + collation + pad + coercibility + data

With this, the above example could be easily handled by the < operator.
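
As a purely hypothetical layout sketch (field widths are guesses, not a
concrete proposal), the extended datum might look like:

#include <stdint.h>

typedef struct
{
    int32_t length;         /* total datum length, varlena-style */
    int16_t charset;        /* character set id */
    int16_t collation;      /* collation id */
    int8_t  pad;            /* PAD SPACE vs. NO PAD */
    int8_t  coercibility;   /* none / implicit / explicit / coercible */
    char    data[1];        /* string bytes follow */
} CharStringDatum;

The < operator could then compare the coercibility fields of its two
arguments and decide whether to raise an error, without the type system
itself knowing anything about collations.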

For indexes, maybe we could dynamically replace the varstr_cmp function
according to the collation, though I have not actually examined this
idea closely.
--
Tatsuo Ishii

#35 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#1)
Re: Proposal: CREATE CONVERSION

SQL99 allows on-the-fly encoding conversion:

CONVERT('aaa' USING myconv)

So there could be more than one conversion for a particular encoding
pair. This leads to an ambiguity about the "default" conversion used
for the automatic frontend/backend encoding conversion. Can we add a
flag indicating that this is the "default" conversion? The new proposed
syntax would be:

CREATE CONVERSION <conversion name>
FOR <source encoding name>
TO <destination encoding name>
FROM <conversion function name>
[DEFAULT]

Comments?
--
Tatsuo Ishii

#36 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#24)
Re: Proposal: CREATE CONVERSION

On Tue, Jul 09, 2002 at 10:07:11AM +0900, Tatsuo Ishii wrote:

Use a simple wrapper function.

How does this function know the to/from encodings?

For example, if you want to define a LATIN1 to UNICODE conversion, the
function would look like:

function_for_latin1_to_utf8(from_string opaque, to_string opaque,
                            length integer)
{
    ...
    generic_function_using_iconv(from_str, to_str, "ISO-8859-1",
                                 "UTF-8", length);
}

CREATE FUNCTION function_for_latin1_to_utf8(opaque, opaque, integer)
RETURNS integer;
CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
function_for_latin1_to_utf8;

Hmm, but it requires defining a "function_for_..." for each conversion.
A trigger function, for example, I needn't define for each table; I can
use a single PostgreSQL function for an arbitrary table.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#37 Hannu Krosing
hannu@tm.ee
In reply to: Tatsuo Ishii (#28)
Re: Proposal: CREATE CONVERSION

On Tue, 2002-07-09 at 03:47, Tatsuo Ishii wrote:

An aside: I was thinking about this some, from the PoV of using our
existing type system to handle this (as you might remember, this is an
inclination I've had for quite a while). I think that most things line
up fairly well to allow this (and having transaction-enabled features
may require it), but do notice that the SQL feature of allowing a
different character set for every column *name* does not map
particularly well to our underlying structures.

I've been thinking about this for a while too. What about collations?
If we add new charsets A and B, and each has 10 collations, then we are
going to have 20 new types? That seems like overkill to me.

Can't we do all collating in Unicode, and convert charsets A and B to
and from it?

I would even recommend going a step further and storing all 'national'
character sets in Unicode.

--------------
Hannu

#38 Peter Eisentraut
peter_e@gmx.net
In reply to: Thomas Lockhart (#27)
Re: Proposal: CREATE CONVERSION

Thomas Lockhart writes:

An aside: I was thinking about this some, from the PoV of using our
existing type system to handle this (as you might remember, this is an
inclination I've had for quite a while). I think that most things line
up fairly well to allow this (and having transaction-enabled features
may require it), but do notice that the SQL feature of allowing a
different character set for every column *name* does not map
particularly well to our underlying structures.

The more I think about it, the more I come to the conclusion that the
SQL framework for "character sets" is both bogus and a red herring. (And
it begins with figuring out exactly what a character set is, as opposed
to a form-of-use, a.k.a.(?) encoding, but let's ignore that.)

The ability to store each column value in a different encoding sounds
interesting, because it allows you to create tables such as

product_id | product_name_en | product_name_kr | product_name_jp

but you might as well create a table such as

product_id | lang | product_name

with product_name in Unicode, and have a more extensible application that
way, too.

I think it's fine to have the encoding fixed for the entire database. It
sure makes coding easier. If you want to be international, you use
Unicode. If not you can "optimize" your database by using a more
efficient encoding. In fact, I think we should consider making UTF-8 the
default encoding sometime.

The real issue is the collation. But the collation is a small subset of
the whole locale/character set gobbledygook. Standardized collation rules
in standardized forms exist. Finding/creating routines to interpret and
apply them should be the focus. SQL's notion of funneling the decision
of which collation rule to apply through the character sets is bogus. It's
impossible to pick a default collation rule for many character sets
without applying bias.

--
Peter Eisentraut peter_e@gmx.net

#39 Peter Eisentraut
peter_e@gmx.net
In reply to: Hannu Krosing (#37)
Re: Proposal: CREATE CONVERSION

Hannu Krosing writes:

Can't we do all collating in Unicode, and convert charsets A and B to
and from it?

I would even recommend going a step further and storing all 'national'
character sets in Unicode.

Sure. However, Tatsuo maintains that the customary Japanese character
sets don't map very well onto Unicode. Personally, I believe that this
is an issue that should be fixed, not avoided, but I don't understand
the issues well enough.

--
Peter Eisentraut peter_e@gmx.net

#40 Tim Allen
tim@proximity.com.au
In reply to: Peter Eisentraut (#39)
Re: Proposal: CREATE CONVERSION

On Wed, 10 Jul 2002 08:21, Peter Eisentraut wrote:

Hannu Krosing writes:

...

I would even recommend going a step further and storing all 'national'
character sets in Unicode.

Sure. However, Tatsuo maintains that the customary Japanese character
sets don't map very well onto Unicode. Personally, I believe that this
is an issue that should be fixed, not avoided, but I don't understand
the issues well enough.

Presumably improving the Unicode support to cover the full UTF-32 (or
UCS-4) range would help with this. Last time I checked, PostgreSQL only
supported the UCS-2 subset of Unicode, i.e. 16 bits. From the Unicode
propaganda I've read, it seems that one of the main goals of expanding
the range beyond 16 bits was to answer the complaints of Japanese
users.

Tim

--
-----------------------------------------------
Tim Allen tim@proximity.com.au
Proximity Pty Ltd http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/

#41 Curt Sampson
cjs@cynic.net
In reply to: Peter Eisentraut (#39)
Re: Proposal: CREATE CONVERSION

On Wed, 10 Jul 2002, Peter Eisentraut wrote:

Sure. However, Tatsuo maintains that the customary Japanese character
sets don't map very well onto Unicode. Personally, I believe that this
is an issue that should be fixed, not avoided, but I don't understand
the issues well enough.

I hear this all the time, but I have yet to have someone show me what,
in ISO-2022-JP, EUC-JP or SJIS, cannot be transparently translated into
Unicode and back.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#42 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#36)
Re: Proposal: CREATE CONVERSION

For example, if you want to define a LATIN1 to UNICODE conversion, the
function would look like:

function_for_latin1_to_utf8(from_string opaque, to_string opaque,
                            length integer)
{
    ...
    generic_function_using_iconv(from_str, to_str, "ISO-8859-1",
                                 "UTF-8", length);
}

CREATE FUNCTION function_for_latin1_to_utf8(opaque, opaque, integer)
RETURNS integer;
CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
function_for_latin1_to_utf8;

Hmm, but it requires defining a "function_for_..." for each conversion.
A trigger function, for example, I needn't define for each table; I can
use a single PostgreSQL function for an arbitrary table.

I don't think this is a big problem.

However, thinking about it more, I came to the conclusion that passing
encoding ids would be a good thing. With the encoding id parameters,
the function could check whether it is called with the correct
encodings, and this would prevent disaster. New interface proposal:

pgconv(
INTEGER, -- source encoding id
INTEGER, -- destination encoding id
OPAQUE, -- source string (null-terminated C string)
OPAQUE, -- destination string (null-terminated C string)
INTEGER -- source string length
) RETURNS INTEGER; -- dummy; returns nothing, actually.
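
To show the shape of such a function (a sketch with stand-in encoding
ids and a trivial body, not an actual implementation):

#include <string.h>

#define PG_LATIN1  8    /* stand-in ids, not the real values */
#define PG_UNICODE 6

/* Refuse to run unless called for the encoding pair this function
 * actually implements -- the check that "prevents disaster".
 */
static int
latin1_to_unicode_conv(int src_id, int dest_id,
                       const unsigned char *src,
                       unsigned char *dest, int len)
{
    if (src_id != PG_LATIN1 || dest_id != PG_UNICODE)
        return -1;              /* wrong conversion requested */

    /* the real conversion work would go here */
    memcpy(dest, src, len);
    dest[len] = '\0';
    return 0;
}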

--
Tatsuo Ishii

#43 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#42)
Re: Proposal: CREATE CONVERSION

On Thu, Jul 11, 2002 at 03:37:49PM +0900, Tatsuo Ishii wrote:

CREATE FUNCTION function_for_latin1_to_utf8(opaque, opaque, integer)
RETURNS integer;
CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
function_for_latin1_to_utf8;

Hmm, but it requires defining a "function_for_..." for each conversion.
A trigger function, for example, I needn't define for each table; I can
use a single PostgreSQL function for an arbitrary table.

I don't think this is a big problem.

However, thinking about it more, I came to the conclusion that passing
encoding ids would be a good thing. With the encoding id parameters,
the function could check whether it is called with the correct
encodings, and this would prevent disaster. New interface proposal:

OK.

pgconv(
INTEGER, -- source encoding id
INTEGER, -- destination encoding id

Where/how is the mapping between encoding ids and encoding names
defined? (Maybe I overlooked something :-) I expect the new encoding
system will be extendable and the list of encodings will not be
hardcoded as it is now. (extendable = add a new encoding without a
PostgreSQL rebuild)

BTW, the client side needs routines for working with encoding names too
(pg_char_to_encoding()). Hmm... it can't be extendable, or can it?

OPAQUE, -- source string (null-terminated C string)
OPAQUE, -- destination string (null-terminated C string)
INTEGER -- source string length
) RETURNS INTEGER; -- dummy; returns nothing, actually.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#44 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#43)
Re: Proposal: CREATE CONVERSION

Where/how is the mapping between encoding ids and encoding names
defined? (Maybe I overlooked something :-) I expect the new encoding
system will be extendable and the list of encodings will not be
hardcoded as it is now. (extendable = add a new encoding without a
PostgreSQL rebuild)

User-defined charsets (encodings) are under discussion, and I believe
they will not happen for 7.3.

BTW, the client side needs routines for working with encoding names too
(pg_char_to_encoding()). Hmm... it can't be extendable, or can it?

pg_char_to_encoding() is already in libpq. Or am I missing something?
--
Tatsuo Ishii

#45 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#44)
Re: Proposal: CREATE CONVERSION

On Thu, Jul 11, 2002 at 05:26:01PM +0900, Tatsuo Ishii wrote:

Where/how is the mapping between encoding ids and encoding names
defined? (Maybe I overlooked something :-) I expect the new encoding
system will be extendable and the list of encodings will not be
hardcoded as it is now. (extendable = add a new encoding without a
PostgreSQL rebuild)

User-defined charsets (encodings) are under discussion, and I believe
they will not happen for 7.3.

BTW, the client side needs routines for working with encoding names too
(pg_char_to_encoding()). Hmm... it can't be extendable, or can it?

pg_char_to_encoding() is already in libpq. Or am I missing something?

It works with the encoding table (pg_enc2name_tbl), which is compiled
into the backend and the client too. That means the number of encodings
cannot change after compilation, and you (the user) can't add a new
encoding without changing pg_enc2name_tbl[]. I originally thought we
could add new encodings on the fly in 7.3 :-) You're right.

IMHO implementing "user-defined charsets (encodings)" will be a problem
for the current libpq design.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#46 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#45)
Re: Proposal: CREATE CONVERSION

pg_char_to_encoding() is already in libpq. Or am I missing something?

It works with the encoding table (pg_enc2name_tbl), which is compiled
into the backend and the client too. That means the number of encodings
cannot change after compilation, and you (the user) can't add a new
encoding without changing pg_enc2name_tbl[]. I originally thought we
could add new encodings on the fly in 7.3 :-) You're right.

IMHO implementing "user-defined charsets (encodings)" will be a problem
for the current libpq design.

No, it's not a libpq problem, but a more general "client/server"
problem, IMO. It's very hard to share dynamically created objects
(info) effectively between client and server.
--
Tatsuo Ishii

#47 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#46)
Re: Proposal: CREATE CONVERSION

On Thu, Jul 11, 2002 at 05:52:18PM +0900, Tatsuo Ishii wrote:

pg_char_to_encoding() is already in libpq. Or am I missing something?

It works with the encoding table (pg_enc2name_tbl), which is compiled
into the backend and the client too. That means the number of encodings
cannot change after compilation, and you (the user) can't add a new
encoding without changing pg_enc2name_tbl[]. I originally thought we
could add new encodings on the fly in 7.3 :-) You're right.

IMHO implementing "user-defined charsets (encodings)" will be a problem
for the current libpq design.

No, it's not a libpq problem, but a more general "client/server"
problem, IMO. It's very hard to share dynamically created objects
(info) effectively between client and server.

IMHO the server will keep the dynamic objects, and the client must ask
the server for the information it wants.

Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#48 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#47)
Re: Proposal: CREATE CONVERSION

No, it's not a libpq problem, but a more general "client/server"
problem, IMO. It's very hard to share dynamically created objects
(info) effectively between client and server.

IMHO the server will keep the dynamic objects, and the client must ask
the server for the information it wants.

I agree with you. However, the real problem is how fast it could be.
For example, pg_mblen() is called for each word processed by libpq, to
determine the byte length of the word. If each call to pg_mblen()
accessed the backend, the performance might be unacceptably slow.
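
For illustration, the client-side scanning in question looks roughly
like this (a sketch; PQmblen() is the existing libpq call, and the
encoding id could come from e.g. PQclientEncoding()):

#include <libpq-fe.h>

/* Walk a multibyte string one character at a time on the client. */
static void
walk_string(const char *s, int encoding)
{
    while (*s)
    {
        int w = PQmblen(s, encoding);   /* bytes in this character */

        /* ... process one w-byte character ... */
        s += w;
    }
}
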
--
Tatsuo Ishii

#49 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tatsuo Ishii (#48)
Re: Proposal: CREATE CONVERSION

On Thu, Jul 11, 2002 at 06:30:48PM +0900, Tatsuo Ishii wrote:

No, it's not a libpq problem, but a more general "client/server"
problem, IMO. It's very hard to share dynamically created objects
(info) effectively between client and server.

IMHO the server will keep the dynamic objects, and the client must ask
the server for the information it wants.

I agree with you. However, the real problem is how fast it could be.
For example, pg_mblen() is called for each word processed by libpq, to
determine the byte length of the word. If each call to pg_mblen()
accessed the backend, the performance might be unacceptably slow.

It must load all the relevant information about the actual encoding(s)
and cache it in libpq.

IMHO basic encoding information like names and ids is not a problem.
PQmblen() is the big problem. A strange question: is PQmblen() really
needed? I see it's used for printing results, but why doesn't the
backend mark the size of each field (word) in the result? If the
backend knows the size of the data, why not send this information to
the client together with the data?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#50 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Karel Zak (#49)
Re: Proposal: CREATE CONVERSION

IMHO basic encoding information like names and ids is not a problem.
PQmblen() is the big problem. A strange question: is PQmblen() really
needed? I see it's used for printing results, but why doesn't the
backend mark the size of each field (word) in the result? If the
backend knows the size of the data, why not send this information to
the client together with the data?

PQmblen() is used by psql in many places. It is used for parsing query
text supplied by the user, not only for data sent from the backend.
--
Tatsuo Ishii