proposal: UTF8 to_ascii function

Started by Pavel Stehuleover 17 years ago17 messages
#1Pavel Stehule
pavel.stehule@gmail.com

Hello,

combination functions to_ascii and convert_to is broken now. Problem
is in convert_to function. It doesn't support 8bit output encoding.

Current workaround:

CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

I don't expect column collate for 8.4, so we need to have workable
to_ascii function.

I propose function to_ascii(text, name) that internally convert text
from utf8 encoding when it's necessary.

Regards
Pavel Stehule

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Pavel Stehule (#1)
Re: proposal: UTF8 to_ascii function

Pavel Stehule wrote:

Hello,

combination functions to_ascii and convert_to is broken now. Problem
is in convert_to function. It doesn't support 8bit output encoding.

Current workaround:

CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

I don't expect column collate for 8.4, so we need to have workable
to_ascii function.

I propose function to_ascii(text, name) that internally convert text
from utf8 encoding when it's necessary.cheers

convert_to is not broken. It returns a bytea, and it is up to you to
de-escape it if you get the text representation.

We are surely not going to go back to a situation where we have
functions returning text in any encoding other than the database
encoding. That becomes a vehicle for storing wrongly encoded data in the
database, and we have just gone through the exercise of plugging those
holes. I privately predicted when we did this work that it might
motivate people who had been abusing convert_to to get proper support
for multiple encodings done. That is the right way to go, not re-opening
holes we have just very deliberately plugged.

cheers

andrew

#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#2)
Re: proposal: UTF8 to_ascii function

Hello

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

Hello,

combination functions to_ascii and convert_to is broken now. Problem
is in convert_to function. It doesn't support 8bit output encoding.

Current workaround:

CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

I don't expect column collate for 8.4, so we need to have workable
to_ascii function.

I propose function to_ascii(text, name) that internally convert text
from utf8 encoding when it's necessary.cheers

convert_to is not broken. It returns a bytea, and it is up to you to
de-escape it if you get the text representation.

ok, I talked about combination convert_to and to_ascii. to_ascii
doesn't support bytea, what is probably correct. We cannot use
descape, because it remove 8bit. This issue was noticed more times -
http://archives.postgresql.org/pgsql-general/2008-06/msg00495.php

We are surely not going to go back to a situation where we have functions
returning text in any encoding other than the database encoding. That
becomes a vehicle for storing wrongly encoded data in the database, and we
have just gone through the exercise of plugging those holes. I privately
predicted when we did this work that it might motivate people who had been
abusing convert_to to get proper support for multiple encodings done. That
is the right way to go, not re-opening holes we have just very deliberately
plugged.

to_ascii isn't related to multiple encodings. And actually there is
only one man who works on it. We will be happy for database collation
in 8.4. So without any change this feature will be broken more than
two years.

Regards
Pavel

Show quoted text

cheers

andrew

#4Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#2)
Re: proposal: UTF8 to_ascii function

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

Hello,

combination functions to_ascii and convert_to is broken now. Problem
is in convert_to function. It doesn't support 8bit output encoding.

Current workaround:

CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

I don't expect column collate for 8.4, so we need to have workable
to_ascii function.

I propose function to_ascii(text, name) that internally convert text
from utf8 encoding when it's necessary.cheers

convert_to is not broken. It returns a bytea, and it is up to you to
de-escape it if you get the text representation.

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

Regards
Pavel Stehule

Show quoted text

We are surely not going to go back to a situation where we have functions
returning text in any encoding other than the database encoding. That
becomes a vehicle for storing wrongly encoded data in the database, and we
have just gone through the exercise of plugging those holes. I privately
predicted when we did this work that it might motivate people who had been
abusing convert_to to get proper support for multiple encodings done. That
is the right way to go, not re-opening holes we have just very deliberately
plugged.

cheers

andrew

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Pavel Stehule (#4)
Re: proposal: UTF8 to_ascii function

Pavel Stehule wrote:

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to
ASCII by folding the chars with high bits set, so what we get is a pure
ASCII result which is safe in any server encoding, as they are all ASCII
supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

cheers

andrew

#6Jan Urbański
j.urbanski@students.mimuw.edu.pl
In reply to: Andrew Dunstan (#5)
Re: proposal: UTF8 to_ascii function

Andrew Dunstan wrote:

Pavel Stehule wrote:

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to
ASCII by folding the chars with high bits set, so what we get is a pure
ASCII result which is safe in any server encoding, as they are all ASCII
supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

I do. Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to 8.3...).
Or maybe there's a better way to do it?

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Jan Urbański (#6)
Re: proposal: UTF8 to_ascii function

Jan Urbański wrote:

Andrew Dunstan wrote:

Pavel Stehule wrote:

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets
to ASCII by folding the chars with high bits set, so what we get is a
pure ASCII result which is safe in any server encoding, as they are
all ASCII supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

I do. Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to
8.3...).
Or maybe there's a better way to do it?

Well, my first question would be "Why aren't you using a database
encoding that supports to_ascii()?"

However, I suppose that your use case would support this signature:

to_ascii(bytea, name)

where it would just error out if the encoding name were something other
than LATIN1, LATIN2, LATIN9, or WIN1250.

But what would be the meaning of this?:

to_ascii(bytea, integer)

cheers

andrew

#8Jan Urbański
j.urbanski@students.mimuw.edu.pl
In reply to: Andrew Dunstan (#7)
Re: proposal: UTF8 to_ascii function

Andrew Dunstan wrote:

Jan Urbański wrote:

Andrew Dunstan wrote:

Pavel Stehule wrote:
What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets
to ASCII by folding the chars with high bits set, so what we get is a
pure ASCII result which is safe in any server encoding, as they are
all ASCII supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

I do. Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to
8.3...).
Or maybe there's a better way to do it?

Well, my first question would be "Why aren't you using a database
encoding that supports to_ascii()?"

Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes
input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode
characters like (U+2026) HORIZONTAL ELLIPSIS.

I'd like to have
to_ascii(text, [error_handling]) returns text

So no bytea, to_ascii would accept text that's legal in my current
database encoding and return text in that encoding. And error_handling
would be something like:
- 'error' (the default, throw an error if a character is untranslable to
ASCII)
- 'ignore' (omit untranslable characters)
- 'transliterate' (do your best to transliterate the character, or leave
it as it is if impossible).

Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written
zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written
中國' (in reality)\

These would have the property, that:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never
throws an error
to_ascii(X) is sometimes non-ASCII data and sometimes throws an error

It's something like PHP's iconv that can have //TRANSLIT or somesuch
(forgive me for giving PHP as an example...). Now I'd love to hear
people punch holes in my daydreaming design ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#5)
Re: proposal: UTF8 to_ascii function

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to
ASCII by folding the chars with high bits set, so what we get is a pure
ASCII result which is safe in any server encoding, as they are all ASCII
supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

It's typical case in czech language, where some searchings are accents
insensitive - Stěhule, Stehule, Novotný, Novotny.

Show quoted text

cheers

andrew

#10Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#7)
Re: proposal: UTF8 to_ascii function

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Jan Urbański wrote:

Andrew Dunstan wrote:

Pavel Stehule wrote:

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to
ASCII by folding the chars with high bits set, so what we get is a pure
ASCII result which is safe in any server encoding, as they are all ASCII
supersets.

But what conversion rule will you use for the gazillions of Unicode
characters?

I honestly do not understand the use case for this at all.

I do. Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for 'łódź'
returns 'lodz'. So the use case is there (in fact, the lack of such facility
made me consider not upgrading particular client to 8.3...).
Or maybe there's a better way to do it?

Well, my first question would be "Why aren't you using a database encoding
that supports to_ascii()?"

However, I suppose that your use case would support this signature:

to_ascii(bytea, name)

where it would just error out if the encoding name were something other than
LATIN1, LATIN2, LATIN9, or WIN1250.

But what would be the meaning of this?:

to_ascii(bytea, integer)

it's symmetric. Nothing more.

Show quoted text

cheers

andrew

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Pavel Stehule (#10)
Re: proposal: UTF8 to_ascii function

Pavel Stehule wrote:

But what would be the meaning of this?:

to_ascii(bytea, integer)

it's symmetric. Nothing more.

Symmetric to what? What is the second argument supposed to be?

cheers

andrew

#12Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#11)
Re: proposal: UTF8 to_ascii function

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

But what would be the meaning of this?:

to_ascii(bytea, integer)

it's symmetric. Nothing more.

Symmetric to what? What is the second argument supposed to be?

postgres=# \df to_ascii
List of functions
Schema | Name | Result data type | Argument data types
------------+----------+------------------+---------------------
pg_catalog | to_ascii | text | text
pg_catalog | to_ascii | text | text, integer
pg_catalog | to_ascii | text | text, name

postgres=# select to_ascii('pavel',8);
to_ascii
----------
pavel
(1 row)

Show quoted text

cheers

andrew

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Pavel Stehule (#12)
Re: proposal: UTF8 to_ascii function

Pavel Stehule wrote:

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

But what would be the meaning of this?:

to_ascii(bytea, integer)

it's symmetric. Nothing more.

Symmetric to what? What is the second argument supposed to be?

postgres=# \df to_ascii
List of functions
Schema | Name | Result data type | Argument data types
------------+----------+------------------+---------------------
pg_catalog | to_ascii | text | text
pg_catalog | to_ascii | text | text, integer
pg_catalog | to_ascii | text | text, name

postgres=# select to_ascii('pavel',8);
to_ascii
----------
pavel
(1 row)

Hmm. That's not documented, and I suspect shouldn't be there. Everywhere
else pretty much that I can think of we pass the encoding as a name, and
I think we should be consistent about it.

cheers

andrew

#14Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#13)
Re: proposal: UTF8 to_ascii function

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:

Pavel Stehule wrote:

But what would be the meaning of this?:

to_ascii(bytea, integer)

it's symmetric. Nothing more.

Symmetric to what? What is the second argument supposed to be?

postgres=# \df to_ascii
List of functions
Schema | Name | Result data type | Argument data types
------------+----------+------------------+---------------------
pg_catalog | to_ascii | text | text
pg_catalog | to_ascii | text | text, integer
pg_catalog | to_ascii | text | text, name

postgres=# select to_ascii('pavel',8);
to_ascii
----------
pavel
(1 row)

Hmm. That's not documented, and I suspect shouldn't be there. Everywhere
else pretty much that I can think of we pass the encoding as a name, and I
think we should be consistent about it.

I don't need it

regards
Pavel

Show quoted text

cheers

andrew

#15Peter Eisentraut
peter_e@gmx.net
In reply to: Jan Urbański (#6)
Re: proposal: UTF8 to_ascii function

On Monday 11 August 2008 16:23:29 Jan Urbański wrote:

Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to 8.3...).

These are valid ideas, but then please design a new function that addresses
your use case in a well-defined way, and don't overload questionable old
interfaces for new purposes.

In the Unicode standard you can find well-defined methods to decompose
characters into diacritic marks, and then you could strip them off. But this
has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can
have diacritic marks as well, for example.

#16Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Eisentraut (#15)
Re: proposal: UTF8 to_ascii function

2008/8/11 Peter Eisentraut <peter_e@gmx.net>:

On Monday 11 August 2008 16:23:29 Jan Urbański wrote:

Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to 8.3...).

These are valid ideas, but then please design a new function that addresses
your use case in a well-defined way, and don't overload questionable old
interfaces for new purposes.

In the Unicode standard you can find well-defined methods to decompose
characters into diacritic marks, and then you could strip them off. But this
has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can
have diacritic marks as well, for example.

Hi Peter,

changes to_ascii from text to bytea is more bugfix than new feature
and should be done immediately. Correct conversions are related to
colum collation - is not necessary repeat same work and same code from
some unicode libs.

Regards
Pavel

#17Jan Urbański
j.urbanski@students.mimuw.edu.pl
In reply to: Peter Eisentraut (#15)
Re: proposal: UTF8 to_ascii function

Peter Eisentraut wrote:

On Monday 11 August 2008 16:23:29 Jan Urbański wrote:

Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to 8.3...).

These are valid ideas, but then please design a new function that addresses
your use case in a well-defined way, and don't overload questionable old
interfaces for new purposes.

In the Unicode standard you can find well-defined methods to decompose
characters into diacritic marks, and then you could strip them off. But this
has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can
have diacritic marks as well, for example.

OK, I was envisioning something like that:
http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
but now that I think of it, I can always just write a plperlu function
that uses that module. The only inconvenience is having to have plperlu
in the db, but I can live with that.
Postgres extensibility rocks and I rest my case.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin