Deprecate custom encoding conversions
Hi,
PostgreSQL allows writing custom encoding conversion functions between
any character encodings, using the CREATE CONVERSION command. It's
pretty flexible, you can define default and non-default conversions, and
the conversions live in schemas so you can have multiple conversions
installed in a system and you can switch between them by changing
search_path.
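For reference, a sketch of that flexibility at the SQL level (the schema, conversion, and function names here are hypothetical):

```sql
-- A non-default conversion living in its own schema; my_utf8_to_latin1()
-- stands in for a function with the required conversion signature.
CREATE SCHEMA my_conversions;
CREATE CONVERSION my_conversions.my_utf8_to_latin1
    FOR 'UTF8' TO 'LATIN1' FROM my_conversions.my_utf8_to_latin1;

-- Putting the schema first on search_path switches which conversion a
-- by-name lookup finds:
SET search_path = my_conversions, public;
```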
However:
We never use non-default conversions for anything. All code that
performs encoding conversions only cares about the default ones.
I think this flexibility is kind of ridiculous anyway. If a built-in
conversion routine doesn't handle some characters correctly, surely we
should fix the built-in function, rather than provide a mechanism for
having your own conversion functions. If you truly need to perform the
conversions differently than the built-in routines do, you can always
perform the conversion in the client instead.
Note that we don't support adding completely new custom encodings, all
this is just for conversions between the built-in encodings that we have.
I propose that we add a notice to the CREATE CONVERSION docs to say that
it is deprecated, and remove it in a few years.
Any objections? Anyone using custom encoding conversions in production?
- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I propose that we add a notice to the CREATE CONVERSION docs to say that
it is deprecated, and remove it in a few years.
While I agree that it's probably not that useful, what would we gain
by removing it? If you intend to remove those catalogs, what lookup
mechanism would replace them? We can't exactly drop the encoding
conversion functionality.
regards, tom lane
On 02/12/2020 18:18, Tom Lane wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I propose that we add a notice to the CREATE CONVERSION docs to say that
it is deprecated, and remove it in a few years.
While I agree that it's probably not that useful, what would we gain
by removing it? If you intend to remove those catalogs, what lookup
mechanism would replace them? We can't exactly drop the encoding
conversion functionality.
I'm thinking of replacing the catalog with a hard-coded 2D array of
function pointers. Or something along those lines.
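A minimal sketch of what such a hard-coded table could look like (all names and the encoding count are invented for illustration; this is not the actual PostgreSQL code):

```c
#include <stddef.h>
#include <assert.h>

/* Sketch only: a hard-coded conversion dispatch table with invented
 * names; the real encoding IDs and function signatures are not used. */

#define N_ENCODINGS 4   /* stand-in for the real number of built-in encodings */

typedef int (*conv_func)(const unsigned char *src, int srclen,
                         unsigned char *dest, int destlen);

/* NULL entries mean "no direct conversion between this pair". */
static conv_func conv_table[N_ENCODINGS][N_ENCODINGS];

/* A trivial identity "conversion" used to populate one slot. */
static int
identity_conv(const unsigned char *src, int srclen,
              unsigned char *dest, int destlen)
{
    int n = srclen < destlen ? srclen : destlen;

    for (int i = 0; i < n; i++)
        dest[i] = src[i];
    return n;
}

/* Constant-time array lookup replaces the catalog scan. */
static conv_func
lookup_conversion(int src_encoding, int dest_encoding)
{
    if (src_encoding < 0 || src_encoding >= N_ENCODINGS ||
        dest_encoding < 0 || dest_encoding >= N_ENCODINGS)
        return NULL;
    return conv_table[src_encoding][dest_encoding];
}
```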
I had this idea while looking at the encoding conversions performed in
COPY. The current conversion functions return a null-terminated,
palloc'd string, which is a bit awkward for the callers. The caller
needs to call strlen() on the result, and it would be nice to reuse the
same buffer for all the conversions. And I've got a hunch that it'd be
faster to convert the whole 64 kB raw input buffer in one go, rather
than convert each line separately, but the current function signature
doesn't make that easy either.
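To make the contrast concrete, here is a hedged sketch of the two shapes of the API (invented names, with a trivial identity "conversion" standing in for real recoding; these are not the actual PostgreSQL declarations):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Today's shape: each call allocates a fresh NUL-terminated result, so
 * the caller must strlen() it and cannot reuse one buffer across calls. */
static char *
convert_line(const char *src, int len)
{
    char *result = malloc(len + 1);

    memcpy(result, src, len);
    result[len] = '\0';
    return result;
}

/* A caller-friendlier shape: convert into a caller-supplied buffer and
 * return the number of bytes produced, so a single buffer can be reused
 * for the whole 64 kB raw input. */
static int
convert_buffer(const char *src, int srclen, char *dest, int destlen)
{
    int n = srclen < destlen ? srclen : destlen;

    memcpy(dest, src, n);
    return n;
}
```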
So I'm looking at refactoring the conversion routines to be more
convenient for the callers. But the current function signature is
exposed at the SQL level, thanks to CREATE CONVERSION.
- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 02/12/2020 18:18, Tom Lane wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I propose that we add a notice to the CREATE CONVERSION docs to say that
it is deprecated, and remove it in a few years.
While I agree that it's probably not that useful, what would we gain
by removing it? If you intend to remove those catalogs, what lookup
mechanism would replace them? We can't exactly drop the encoding
conversion functionality.
I'm thinking of replacing the catalog with a hard-coded 2D array of
function pointers. Or something along those lines.
I like the current structure in which the encoding functions are in
separate .so's, so that you don't load the ones you don't need.
It's not real clear how to preserve that if we hard-code things.
So I'm looking at refactoring the conversion routines to be more
convenient for the callers. But the current function signature is
exposed at the SQL level, thanks to CREATE CONVERSION.
I'd be the first to agree that the current API for conversion functions is
not terribly well-designed. But what if we just change it? That can't be
any worse of a compatibility hit than removing CREATE CONVERSION
altogether.
regards, tom lane
From: Heikki Linnakangas <hlinnaka@iki.fi>
I propose that we add a notice to the CREATE CONVERSION docs to say that
it is deprecated, and remove it in a few years.
Any objections? Anyone using custom encoding conversions in production?
I can't answer the deeper questions because I'm not familiar with character sets, but I saw some Japanese web articles that use CREATE CONVERSION to handle external characters. So, I think we may as well retain it. (OTOH, I'm wondering how other DBMSs support external characters and what Postgres should implement.)
Also, the SQL standard has CREATE CHARACTER SET and CREATE TRANSLATION. I don't know how these are useful, but the mechanism of CREATE CONVERSION can be used to support them.
CREATE CHARACTER SET <character set name> [ AS ]
<character set source> [ <collate clause> ]
CREATE TRANSLATION <transliteration name> FOR <source character set specification>
TO <target character set specification> FROM <transliteration source>
Regards
Takayuki Tsunakawa
"tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> writes:
From: Heikki Linnakangas <hlinnaka@iki.fi>
Any objections? Anyone using custom encoding conversions in production?
I can't answer the deeper questions because I'm not familiar with character sets, but I saw some Japanese web articles that use CREATE CONVERSION to handle external characters. So, I think we may as well retain it. (OTOH, I'm wondering how other DBMSs support external characters and what Postgres should implement.)
I recall a discussion in which someone explained that things are messy for
Japanese because there's not really just one version of SJIS; there are
several, because various groups extended the original standard in not-
quite-compatible ways. In turn this means that there's not really one
well-defined conversion to/from Unicode --- which is the reason why
allowing custom conversions seemed like a good idea. I don't know
whether anyone over there is actually using custom conversions, but
I'd be hesitant to just nuke the feature altogether.
Having said that, it doesn't seem like the conversion API is necessarily
any more set in stone than any of our other C-level APIs. We break things
at the C API level whenever there's sufficient reason.
regards, tom lane
On Thu, Dec 03, 2020 at 12:54:56AM +0000, tsunakawa.takay@fujitsu.com wrote:
I can't answer deeper questions because I'm not familiar with
character sets, but I saw some Japanese web articles that use CREATE
CONVERSION to handle external characters. So, I think we may as
well retain it. (OTOH, I'm wondering how other DBMSs support
external characters and what Postgres should implement it.)
Tsunakawa-san, could you post a link to this article, if possible? I
am curious about their problem and why they used CREATE CONVERSION as
a way to solve it. That's fine even if it is in Japanese.
--
Michael
I recall a discussion in which someone explained that things are messy for
Japanese because there's not really just one version of SJIS; there are
several, because various groups extended the original standard in not-
quite-compatible ways. In turn this means that there's not really one
well-defined conversion to/from Unicode
That's true.
--- which is the reason why allowing custom conversions seemed like a good idea. I don't know whether anyone over there is actually using custom conversions, but I'd be hesitant to just nuke the feature altogether.
By Googling I found an instance which is using CREATE CONVERSION
(Japanese article).
http://grep.blog49.fc2.com/blog-entry-87.html
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
From: Michael Paquier <michael@paquier.xyz>
Tsunakawa-san, could you post a link to this article, if possible? I am curious
about their problem and why they used CREATE CONVERSION as a way to
solve it. That's fine even if it is in Japanese.
I just pulled info from my old memory in my previous mail. Now the following links seem to be relevant. (They should be different from what I saw in the past.)
https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html
https://teratail.com/questions/295988
IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.
Regards
Takayuki Tsunakawa
On 2020/12/03 11:48, tsunakawa.takay@fujitsu.com wrote:
From: Michael Paquier <michael@paquier.xyz>
Tsunakawa-san, could you post a link to this article, if possible? I am curious
about their problem and why they used CREATE CONVERSION as a way to
solve it. That's fine even if it is in Japanese.
I just pulled info from my old memory in my previous mail. Now the following links seem to be relevant. (They should be different from what I saw in the past.)
https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html
https://teratail.com/questions/295988
IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.
FWIW, about four years ago, for a real project, I wrote
the extension [1] that provides yet another encoding conversion
from UTF-8 to EUC_JP, to handle some external characters.
This extension uses CREATE CONVERSION.
[1]: https://github.com/MasaoFujii/pg_fallback_utf8_to_euc_jp
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/12/03 1:04, Heikki Linnakangas wrote:
Hi,
PostgreSQL allows writing custom encoding conversion functions between any character encodings, using the CREATE CONVERSION command. It's pretty flexible, you can define default and non-default conversions, and the conversions live in schemas so you can have multiple conversions installed in a system and you can switch between them by changing search_path.
However:
We never use non-default conversions for anything. All code that performs encoding conversions only cares about the default ones.
Yes. I had to update pg_conversion.condefault directly so that the
custom conversion would be used after I registered it. Directly updating
pg_conversion is of course not a good thing, so I was wondering
if we should have something like ALTER CONVERSION SET DEFAULT.
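A sketch of what that workaround looks like, and the command that might replace it (the conversion name is an example; ALTER CONVERSION ... SET DEFAULT is hypothetical syntax, not implemented):

```sql
-- Superuser-only catalog hacking that is needed today; in practice the
-- built-in default for the same encoding pair would also have to have
-- its condefault flag cleared.
UPDATE pg_conversion
   SET condefault = true
 WHERE conname = 'my_fallback_utf8_to_euc_jp';

-- Hypothetical replacement:
-- ALTER CONVERSION my_fallback_utf8_to_euc_jp SET DEFAULT;
```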
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
On 2020/12/03 1:04, Heikki Linnakangas wrote:
We never use non-default conversions for anything. All code that performs encoding conversions only cares about the default ones.
Yes. I had to update pg_conversion.condefault directly so that the
custom conversion would be used after I registered it. Directly updating
pg_conversion is of course not a good thing, so I was wondering
if we should have something like ALTER CONVERSION SET DEFAULT.
Perhaps, but what I'd think is more urgent is having some way for
a session to select a non-default conversion for client I/O.
regards, tom lane
On 2020/12/03 13:07, Tom Lane wrote:
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
On 2020/12/03 1:04, Heikki Linnakangas wrote:
We never use non-default conversions for anything. All code that performs encoding conversions only cares about the default ones.
Yes. I had to update pg_conversion.condefault directly so that the
custom conversion would be used after I registered it. Directly updating
pg_conversion is of course not a good thing, so I was wondering
if we should have something like ALTER CONVERSION SET DEFAULT.
Perhaps, but what I'd think is more urgent is having some way for
a session to select a non-default conversion for client I/O.
+1
What about adding a new libpq parameter like encoding_conversion
(similar to client_encoding) so that a client can specify which conversion to use?
The setting would be sent to the server while establishing the connection.
We would probably need to verify that the setting of this new parameter
is consistent with that of client_encoding.
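For illustration, such a parameter (hypothetical; encoding_conversion is not an existing libpq option) might sit in a connection string alongside client_encoding:

```
psql "dbname=app client_encoding=EUC_JP encoding_conversion=my_fallback_utf8_to_euc_jp"
```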
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION