Proposal: CREATE CONVERSION
Here is my proposal for a new CREATE CONVERSION command, which makes it
possible to define a new encoding conversion mapping between two encodings
on the fly.
The background:
We are accumulating more and more encoding conversion tables. By now
they amount to 385352 source lines and over 3MB in compiled form in
total, and they are statically linked into the backend. I know this by
itself is not a problem, since modern OSs have smart memory management
capable of fetching only the necessary pages from disk. However, I'm
worried about the unbounded growth of these static tables. I don't think
users will love a 50MB PostgreSQL backend load module.
The second problem is more serious. The conversion definitions between
certain encodings, such as Unicode and others, are not well
defined. For example, there are several conversion tables between Japanese
Shift JIS and Unicode. This is because each vendor has its own
"special characters" and defines the table so that the conversion
fits its purpose.
The solution:
The proposed new CREATE CONVERSION command will solve these problems. A
particular conversion table is statically linked into a dynamically loaded
function, and CREATE CONVERSION tells PostgreSQL that if
a conversion from encoding A to encoding B is requested, then function C
should be used. This way, conversion tables are no longer statically
linked into the backend.
Users could also easily define their own conversion tables that best
fit their purposes. Needless to say, people could also define
new conversions that PostgreSQL does not support yet.
Syntax proposal:
CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>
;
DROP CONVERSION <conversion name>;
Example usage:
CREATE OR REPLACE FUNCTION euc_jp_to_utf8(TEXT, TEXT, INTEGER)
RETURNS INTEGER AS 'euc_jp_to_utf8.so' LANGUAGE 'C';
CREATE CONVERSION euc_jp_to_utf8
SOURCE EUC_JP DESTINATION UNICODE
FROM euc_jp_to_utf8;
Implementation:
Implementation would be quite straightforward. Create a new system
table, and have CREATE CONVERSION store its info
there. pg_find_encoding_converters (utils/mb/mbutils.c) and friends need
to be modified so that they recognize dynamically defined conversions.
psql would also need some capability to display conversion
definitions.
Comments?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
Syntax proposal:
CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>
Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?
Basically I'd like to see a spec for the API of the conversion
function...
Also, is there anything in SQL99 that we ought to try to be
compatible with?
regards, tom lane
CREATE CONVERSION <conversion name>
SOURCE <source encoding name>
DESTINATION <destination encoding name>
FROM <conversion function name>
Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?
No, only one function is sufficient. What else are you thinking of?
Basically I'd like to see a spec for the API of the conversion
function...
That would be very simple (the previous example I gave was unnecessarily
complex). The function signature would look like:
conversion_function(TEXT) RETURNS TEXT
It receives the source text, converts it, and returns the result. That's all.
Also, is there anything in SQL99 that we ought to try to be
compatible with?
As far as I know there's no equivalent in SQL99.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?
No, only one function is sufficient. What else are you thinking of?
I see two different functions linked to from each pg_wchar_table
entry... although perhaps those are associated with encodings,
not with conversions.
Basically I'd like to see a spec for the API of the conversion
function...
That would be very simple (the previous example I gave was unnecessarily
complex). The function signature would look like:
conversion_function(TEXT) RETURNS TEXT
It receives the source text, converts it, and returns the result. That's all.
IIRC the existing conversion functions deal in C string pointers and
lengths. I'm a little worried about the extra overhead implicit
in converting to a TEXT object and back again; that probably means at
least two more palloc and memcpy operations. I think you'd be better
off sticking to a C-level API, because I really don't believe that
anyone is going to code conversion functions in (say) plpgsql.
regards, tom lane
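[Editor's note: for illustration, a C-level conversion API along the lines Tom suggests might look like the sketch below. All names here are hypothetical, not PostgreSQL's actual interface; the toy mapping merely demonstrates the pointer-and-length calling convention that avoids building TEXT datums (and hence the extra palloc/memcpy round trips).]

```c
#include <stddef.h>

/* Hypothetical C-level conversion signature: source bytes and length in,
 * destination buffer out, no TEXT datum construction involved.
 * Returns the number of bytes written to dest. */
typedef size_t (*conv_func) (const unsigned char *src, size_t srclen,
                             unsigned char *dest, size_t destlen);

/* Toy stand-in for a real charset mapping: fold ASCII to upper case. */
size_t
toy_convert(const unsigned char *src, size_t srclen,
            unsigned char *dest, size_t destlen)
{
    size_t      i;

    for (i = 0; i < srclen && i < destlen; i++)
    {
        unsigned char c = src[i];

        dest[i] = (c >= 'a' && c <= 'z') ? (unsigned char) (c - ('a' - 'A')) : c;
    }
    return i;
}
```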
Tatsuo Ishii wrote:
Here is my proposal for a new CREATE CONVERSION command, which makes it
possible to define a new encoding conversion mapping between two encodings
on the fly.
The background:
We are accumulating more and more encoding conversion tables. By now
they amount to 385352 source lines and over 3MB in compiled form in
total, and they are statically linked into the backend. I know this by
itself is not a problem, since modern OSs have smart memory management
capable of fetching only the necessary pages from disk. However, I'm
worried about the unbounded growth of these static tables. I don't think
users will love a 50MB PostgreSQL backend load module.
Yes, those conversion tables are getting huge in the tarball too:
$ pwd
/pg/backend/utils/mb
$ du
4 ./CVS
7 ./Unicode/CVS
9541 ./Unicode
15805 .
Look at these two files alone:
-rw-r--r-- 1 postgres wheel 1427492 Jun 13 04:28 gb18030_to_utf8.map
-rw-r--r-- 1 postgres wheel 1427492 Jun 13 04:28 utf8_to_gb18030.map
If we can make these loadable, that would be good. What would be really
interesting is if we could split them out into a separate
directory/project so that development on them could take place
independently. This would probably stimulate even more encoding
options for users.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I see two different functions linked to from each pg_wchar_table
entry... although perhaps those are associated with encodings,
not with conversions.
Yes, those are not directly associated with conversions.
IIRC the existing conversion functions deal in C string pointers and
lengths. I'm a little worried about the extra overhead implicit
in converting to a TEXT object and back again; that probably means at
least two more palloc and memcpy operations. I think you'd be better
off sticking to a C-level API, because I really don't believe that
anyone is going to code conversion functions in (say) plpgsql.
I am worried about that too. But if we stick to a C-level API, how can we
declare an argument data type suitable for C strings? I don't see such
a data type. Maybe you are suggesting that we should not use CREATE
FUNCTION?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
I am worried about that too. But if we stick to a C-level API, how can we
declare an argument data type suitable for C strings? I don't see such
a data type. Maybe you are suggesting that we should not use CREATE
FUNCTION?
Well, you'd have to use the same cheat that's used for selectivity
estimation functions, triggers, I/O functions and everything else that
deals in internal datatypes: declare the function as taking and
returning OPAQUE. This is moderately annoying but I don't think
there's anything really wrong with it in practice.
regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
I am worried about that too. But if we stick to a C-level API, how can we
declare an argument data type suitable for C strings? I don't see such
a data type. Maybe you are suggesting that we should not use CREATE
FUNCTION?
Well, you'd have to use the same cheat that's used for selectivity
estimation functions, triggers, I/O functions and everything else that
deals in internal datatypes: declare the function as taking and
returning OPAQUE. This is moderately annoying but I don't think
there's anything really wrong with it in practice.
Oh, I see.
--
Tatsuo Ishii
Tatsuo Ishii writes:
Also, is there anything in SQL99 that we ought to try to be
compatible with?
As far as I know there's no equivalent in SQL99.
Sure:
11.34 <translation definition>
Function
Define a character translation.
Format
<translation definition> ::=
CREATE TRANSLATION <translation name>
FOR <source character set specification>
TO <target character set specification>
FROM <translation source>
<source character set specification> ::= <character set specification>
<target character set specification> ::= <character set specification>
<translation source> ::=
<existing translation name>
| <translation routine>
<existing translation name> ::= <translation name>
<translation routine> ::= <specific routine designator>
That's pretty much exactly what you are describing.
What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,
I have 176 "modules" under /usr/lib/gconv. They should be useful for
something.
--
Peter Eisentraut peter_e@gmx.net
Tatsuo Ishii writes:
Also, is there anything in SQL99 that we ought to try to be
compatible with?
As far as I know there's no equivalent in SQL99.
Sure:
11.34 <translation definition>
I guess you are mixing up SQL99's "translate" and "convert".
As far as I know, SQL99's "translation" is exactly that: a translation, e.g.
rr) translation: A method of translating characters in one character
repertoire into characters of the same or a different character
repertoire.
For example, a certain "translation" might take English text as input
and produce Japanese text as output (I don't know whether we could
implement such a translation, though :-).
On the other hand, "convert" just changes the "form-of-use" (SQL's
term, essentially equivalent to "encoding") while keeping the character
repertoire the same.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
I guess you are mixing up SQL99's "translate" and "convert".
No, I believe Peter has read the spec correctly. Further down they have
<character translation> is a function for changing each character
of a given string according to some many-to-one or one-to-one
mapping between two not necessarily distinct character sets.
So this is intended as a one-character-at-a-time mapping, not a language
translation (which would be far beyond what anyone would expect of a
database anyway).
One thing that's really unclear to me is what's the difference between
a <character translation> and a <form-of-use conversion>, other than
that they didn't provide a syntax for defining new conversions.
regards, tom lane
Tom Lane writes:
One thing that's really unclear to me is what's the difference between
a <character translation> and a <form-of-use conversion>, other than
that they didn't provide a syntax for defining new conversions.
The standard has this messed up. In part 1, a form-of-use and an encoding
are two distinct things that can be applied to a character repertoire (see
clause 4.6.2.1), whereas in part 2 the term encoding is used in the
definition of form-of-use (clause 3.1.5 r).
When I sort it out, however, I think that what Tatsuo was describing is
indeed a form-of-use conversion. Note that in part 2, clause 4.2.2.1, it
says about form-of-use conversions,
It is intended,
though not enforced by this part of ISO/IEC 9075, that S2 be
exactly the same sequence of characters as S1, but encoded
according some different form-of-use. A typical use might be to
convert a character string from two-octet UCS to one-octet Latin1
or vice versa.
This seems to match what we're doing.
A character translation does not make this requirement and it explicitly
calls out the possibility of "many-to-one or one-to-one mapping between
two not necessarily distinct character sets". I imagine that what this is
intended to do is to allow the user to create mappings such as ö
-> oe (as is common in German to avoid using characters with diacritic
marks), or ô -> o (as one might do in French to achieve the same). In
fact, it's a glorified sed command.
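[Editor's note: such a translation could be sketched as a table-driven, character-at-a-time rewrite. The C below is purely illustrative; the table, function names, and Latin-1 byte values are invented for this example and come from neither the standard nor PostgreSQL.]

```c
#include <stddef.h>
#include <string.h>

/* One-to-many replacement table: Latin-1 o-umlaut (0xF6) -> "oe",
 * a-umlaut (0xE4) -> "ae"; every other byte passes through unchanged. */
struct trans_pair
{
    unsigned char from;
    const char *to;
};

static const struct trans_pair trans_table[] = {
    {0xF6, "oe"},
    {0xE4, "ae"},
};

/* Translate src into dest, returning the number of bytes written. */
size_t
toy_translate(const unsigned char *src, size_t srclen,
              char *dest, size_t destlen)
{
    size_t      out = 0;
    size_t      i, j;

    for (i = 0; i < srclen; i++)
    {
        const char *rep = NULL;

        for (j = 0; j < sizeof(trans_table) / sizeof(trans_table[0]); j++)
        {
            if (trans_table[j].from == src[i])
            {
                rep = trans_table[j].to;
                break;
            }
        }
        if (rep)
        {
            size_t      n = strlen(rep);

            if (out + n > destlen)
                break;
            memcpy(dest + out, rep, n);
            out += n;
        }
        else
        {
            if (out >= destlen)
                break;
            dest[out++] = (char) src[i];
        }
    }
    return out;
}
```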
So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.
--
Peter Eisentraut peter_e@gmx.net
So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.
Ok. What about this?
CREATE CONVERSION <conversion name>
FOR <encoding name>
TO <encoding name>
FROM <conversion routine name>
DROP CONVERSION <conversion name>
BTW, I wonder if we should invent a new access privilege for conversions.
--
Tatsuo Ishii
Here is a proposal for new pg_conversion system table. Comments?
/*-------------------------------------------------------------------------
*
* pg_conversion.h
* definition of the system "conversion" relation (pg_conversion)
* along with the relation's initial contents.
*
*
* Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $Id$
*
* NOTES
* the genbki.sh script reads this file and generates .bki
* information from the DATA() statements.
*
*-------------------------------------------------------------------------
*/
#ifndef PG_CONVERSION_H
#define PG_CONVERSION_H
/* ----------------
* postgres.h contains the system type definitions and the
* CATALOG(), BOOTSTRAP and DATA() sugar words so this file
* can be read by both genbki.sh and the C compiler.
* ----------------
*/
/* ----------------------------------------------------------------
* pg_conversion definition.
*
* cpp turns this into typedef struct FormData_pg_conversion
*
* conname name of the conversion
* connamespace name space which the conversion belongs to
* conowner owner of the conversion
* conforencoding FOR encoding id
* contoencoding TO encoding id
* conproc OID of the conversion proc
* ----------------------------------------------------------------
*/
CATALOG(pg_conversion)
{
NameData conname;
Oid connamespace;
int4 conowner;
int4 conforencoding;
int4 contoencoding;
Oid conproc;
} FormData_pg_conversion;
/* ----------------
* Form_pg_conversion corresponds to a pointer to a tuple with
* the format of pg_conversion relation.
* ----------------
*/
typedef FormData_pg_conversion *Form_pg_conversion;
/* ----------------
* compiler constants for pg_conversion
* ----------------
*/
#define Natts_pg_conversion 6
#define Anum_pg_conversion_conname 1
#define Anum_pg_conversion_connamespace 2
#define Anum_pg_conversion_conowner 3
#define Anum_pg_conversion_conforencoding 4
#define Anum_pg_conversion_contoencoding 5
#define Anum_pg_conversion_conproc 6
/* ----------------
* initial contents of pg_conversion
* ---------------
*/
/*
* prototypes for functions in pg_conversion.c
*/
extern Oid ConversionCreate(const char *conname, Oid connamespace,
int32 conowner,
int4 conforencoding, int4 contoencoding,
Oid conproc);
#endif /* PG_CONVERSION_H */
On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,
^^^^^^^
Very good point. Why use our own conversion routines/tables if there is
a common library for this?
The encoding API for PostgreSQL is a really cool idea.
I'm unsure about having only one argument for the encoding function. What
if I want to use one generic function for all encodings (for example, as
an API to iconv)? I think a better C interface is:
encode( TEXT data, NAME from, NAME to );
where from/to are encoding names. The other way is to use some struct
that carries this information -- like ARGS in trigger functions.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,
^^^^^^^
Very good point. Why use our own conversion routines/tables if there is
a common library for this?
I'm still not sure about the details of the conversion maps used by
iconv. Japanese users have enough trouble with the conversion between
Unicode and other charsets, because there are many variations of
conversion maps provided by vendors. For example, the conversion map
used between Unicode and SJIS in PostgreSQL has been carefully designed to
minimize the problems described above. Another issue is the availability of
iconv across platforms. If we are sure that a particular iconv
conversion routine is available on all platforms and the conversion
result is good enough, our conversion routines could be replaced by new
ones using iconv.
The encoding API for PostgreSQL is a really cool idea.
I'm unsure about having only one argument for the encoding function. What
if I want to use one generic function for all encodings (for example, as
an API to iconv)?
Use a simple wrap function.
--
Tatsuo Ishii
On Mon, Jul 08, 2002 at 09:59:44PM +0900, Tatsuo Ishii wrote:
On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library. For example,
^^^^^^^
Very good point. Why use our own conversion routines/tables if there is
a common library for this?
I'm still not sure about the details of the conversion maps used by
iconv. Japanese users have enough trouble with the conversion between
Unicode and other charsets, because there are many variations of
conversion maps provided by vendors. For example, the conversion map
used between Unicode and SJIS in PostgreSQL has been carefully designed to
minimize the problems described above. Another issue is the availability of
iconv across platforms. If we are sure that a particular iconv
conversion routine is available on all platforms and the conversion
result is good enough, our conversion routines could be replaced by new
ones using iconv.
This is not a problem if we have some common API. You can use the
current conversion tables (maps) and, for example, I can use iconv
on my i386/Linux.
I don't want to replace the current maps if somebody needs them. I would
just like a common API.
I see iconv is included in glibc now.
I'm unsure about having only one argument for the encoding function. What
if I want to use one generic function for all encodings (for example, as
an API to iconv)?
Use a simple wrap function.
How does this wrap function know the to/from encodings?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
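[Editor's note: to illustrate Tatsuo's "simple wrap function" answer to Karel's question, a generic iconv(3)-based converter can take the encoding names as arguments, while each registered conversion is a thin wrapper that hard-wires its own from/to pair. This is only a sketch assuming iconv is available (e.g. in glibc); none of these function names are actual PostgreSQL APIs.]

```c
#include <stddef.h>
#include <iconv.h>

/* Generic converter: encoding names are explicit arguments.
 * Returns bytes written to dest, or (size_t) -1 on failure. */
static size_t
generic_convert(const char *fromcode, const char *tocode,
                const char *src, size_t srclen,
                char *dest, size_t destlen)
{
    iconv_t     cd = iconv_open(tocode, fromcode);
    char       *in = (char *) src;
    char       *out = dest;
    size_t      inleft = srclen;
    size_t      outleft = destlen;

    if (cd == (iconv_t) -1)
        return (size_t) -1;
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t) -1)
    {
        iconv_close(cd);
        return (size_t) -1;
    }
    iconv_close(cd);
    return destlen - outleft;
}

/* One thin wrapper per registered conversion: this is how the function
 * "knows" its to/from encodings -- they are fixed at definition time. */
size_t
utf8_to_latin1(const char *src, size_t srclen, char *dest, size_t destlen)
{
    return generic_convert("UTF-8", "ISO-8859-1", src, srclen, dest, destlen);
}
```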
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
BTW, I wonder if we should invent a new access privilege for conversions.
I believe the spec just demands USAGE on the underlying function for
the TRANSLATE case, and I don't see why it should be different for
CONVERT. (In principle, if we didn't use a C-only API, you could
just call the underlying function directly; so there's little point
in having protection restrictions different from that case.)
regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
CATALOG(pg_conversion)
{
NameData conname;
Oid connamespace;
int4 conowner;
int4 conforencoding;
int4 contoencoding;
Oid conproc;
} FormData_pg_conversion;
Should use type "regproc" for conproc, I think.
regards, tom lane
...
So I withdraw my earlier comment. But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.
Tatsuo, it seems that we should use SQL99 terminology and commands where
appropriate. We do not yet implement the SQL99 forms of character
support, and I'm not sure whether our current system is modeled to fit the
SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
infringing on SQL99 syntax, so that we can implement that sometime later?
- Thomas