BUG #13440: unaccent does not remove all diacritics
The following bug has been logged on the website:
Bug reference: 13440
Logged by: Mike Gradek
Email address: mike@busbud.com
PostgreSQL version: 9.3.5
Operating system: Mac OS X 10.10.3 (14D136)
Description:
Sorry, I couldn't install the most recent minor release, but I did try this
on several different versions. I used Heroku to try a 9.4.3 build, and got
the same results
select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
(1 row)
Let me know how I can help resolve this bug, or if it's expected.
Best regards,
Mike
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
mike@busbud.com writes:
Sorry, I couldn't install the most recent minor release, but I did try this
on several different versions. I used Heroku to try a 9.4.3 build, and got
the same results
select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
(1 row)
Hm, I do see
ţ t
in unaccent.rules, so the transformation ought to happen. I suspect
an encoding issue, eg your terminal window is not transmitting characters
in the encoding Postgres thinks you're using. You did not provide any
info about server encoding, client encoding, or client LC_xxx environment,
so it's hard to debug from here.
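As an illustration of the failure mode Tom suspects (my sketch, not part of the thread; the choice of Latin-1 is just one example of a mismatched client encoding):

```python
# UTF-8 bytes for ț, misread under a different encoding, become a
# completely different string -- which would defeat any rules lookup.
text = "\u021b"                       # ț, LATIN SMALL LETTER T WITH COMMA BELOW
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                     # b'\xc8\x9b'
misread = utf8_bytes.decode("latin-1")
print(misread == text)                # False: the server would see other characters
```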
regards, tom lane
Hi Tom,
Thanks for looking into this issue. Would this help?
psql -l
List of databases
          Name           |     Owner     | Encoding |   Collate   |    Ctype    | Access privileges
-------------------------+---------------+----------+-------------+-------------+-------------------
 grand-central           | michaelgradek | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
Here's a case showing the transformation failing, and another succeeding
psql grand-central
psql (9.4.1, server 9.3.5)
Type "help" for help.
grand-central=# select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
(1 row)
grand-central=# select 'é' as input, unaccent('é') as observed, 'e' as expected;
input | observed | expected
-------+----------+----------
é | e | e
(1 row)
On Sun, Jun 14, 2015 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
mike@busbud.com writes:
Sorry, I couldn't install the most recent minor release, but I did try this
on several different versions. I used Heroku to try a 9.4.3 build, and got
the same results
select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
(1 row)
Hm, I do see
ţ t
in unaccent.rules, so the transformation ought to happen. I suspect
an encoding issue, eg your terminal window is not transmitting characters
in the encoding Postgres thinks you're using. You did not provide any
info about server encoding, client encoding, or client LC_xxx environment,
so it's hard to debug from here.
regards, tom lane
--
Cheers,
Mike
--
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike@busbud.com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*
On Mon, Jun 15, 2015 at 5:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
mike@busbud.com writes:
Sorry, I couldn't install the most recent minor release, but I did try this
on several different versions. I used Heroku to try a 9.4.3 build, and got
the same results
select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
(1 row)
Hm, I do see
ţ t
in unaccent.rules, so the transformation ought to happen. I suspect
an encoding issue, eg your terminal window is not transmitting characters
in the encoding Postgres thinks you're using. You did not provide any
info about server encoding, client encoding, or client LC_xxx environment,
so it's hard to debug from here.
The one that is in unaccent.rules is apparently t-cedilla, from Gagauz
and Romanian:
https://en.wiktionary.org/wiki/%C5%A3
The one that is referred to above is apparently t-comma, from Livonian
and Romanian, but "[o]ften replaced by Ţ / ţ (t with cedilla),
especially in computing":
https://en.wiktionary.org/wiki/%C8%9B
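The distinction is easy to confirm with Python's unicodedata module (an illustration of the two code points involved, not part of the original report):

```python
import unicodedata

t_comma = "\u021b"    # ț, the character in Mike's query
t_cedilla = "\u0163"  # ţ, the character in unaccent.rules

name_comma = unicodedata.name(t_comma)
name_cedilla = unicodedata.name(t_cedilla)
print(name_comma)     # LATIN SMALL LETTER T WITH COMMA BELOW
print(name_cedilla)   # LATIN SMALL LETTER T WITH CEDILLA
```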
--
Thomas Munro
http://www.enterprisedb.com
Michael Gradek wrote:
grand-central=# select 'ț' as input, unaccent('ț') as observed, 't' as expected;
input | observed | expected
-------+----------+----------
ț | ț | t
Hm, I do see
ţ t
My terminal shows these characters to be different. One is
http://graphemica.com/%C8%9B
latin small letter t with comma below (U+021B)
The other is
http://graphemica.com/%C5%A3
latin small letter t with cedilla (U+0163)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
My terminal shows these characters to be different. One is
http://graphemica.com/%C8%9B
latin small letter t with comma below (U+021B)
The other is
http://graphemica.com/%C5%A3
latin small letter t with cedilla (U+0163)
Ah-hah -- I did not look closely enough. So the immediate answer for
Michael is to add another entry to his unaccent.rules file.
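In rules-file terms, the fix is a line mapping the t-comma character to "t". A toy Python sketch of the kind of one-character mapping unaccent.rules expresses (illustrative only; the real module is C code driven by the rules file):

```python
# Toy model of unaccent's per-character mapping; the dict entries mirror
# rules-file lines such as "ţ t". None of this is the actual extension code.
rules = {
    "\u0163": "t",  # ţ (t with cedilla): already present in unaccent.rules
    "\u021b": "t",  # ț (t with comma below): the missing entry
    "\u00e9": "e",  # é: already present
}
table = str.maketrans(rules)

print("\u021b".translate(table))  # -> t, once the new entry exists
```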
Should we add the missing character to the standard unaccent.rules file?
I should think so in HEAD at least, but what about back-patching?
regards, tom lane
On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
My terminal shows these characters to be different. One is
http://graphemica.com/%C8%9B
latin small letter t with comma below (U+021B)
The other is
http://graphemica.com/%C5%A3
latin small letter t with cedilla (U+0163)
Ah-hah -- I did not look closely enough. So the immediate answer for
Michael is to add another entry to his unaccent.rules file.
Should we add the missing character to the standard unaccent.rules file?
It looks like Romanian also has s with comma. Perhaps we should have
all these characters:
$ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
702
That's quite a lot more than the 187 we currently have. Of those, I
think only the following ligature characters don't fit the above
pattern: Æ, æ, IJ, ij, Œ, œ, ß. Incidentally, I don't believe that the
way we "unaccent" ligatures is correct anyway. Maybe they should be
expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
O, o, S as we have it, but I guess it depends what the purpose of
unaccent is...
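The multi-letter expansion suggested above can be sketched as a simple replacement map (an illustration of the proposal, not existing unaccent behavior):

```python
# Hypothetical ligature expansions as proposed: Æ -> AE, ß -> ss, etc.,
# instead of the current single-letter mappings.
ligatures = {
    "\u00c6": "AE", "\u00e6": "ae",  # Æ æ
    "\u0132": "IJ", "\u0133": "ij",  # IJ ij
    "\u0152": "OE", "\u0153": "oe",  # Œ œ
    "\u00df": "ss",                  # ß
}

def expand_ligatures(s: str) -> str:
    # One-to-many replacement, unlike a plain character translate table.
    return "".join(ligatures.get(c, c) for c in s)

print(expand_ligatures("\u0153uf"))  # "œuf" -> "oeuf"
```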
--
Thomas Munro
http://www.enterprisedb.com
Thanks everyone, I've been comparing the behavior to that of
https://github.com/andrewrk/node-diacritics/blob/master/index.js if that
can be of any help.
On Monday, June 15, 2015, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
...snip...
--
Cheers,
Mike
--
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike@busbud.com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*
On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
...snip...
It looks like Romanian also has s with comma. Perhaps we should have
all these characters:
$ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
702
Here is an unaccent.rules file that maps those 702 characters from
Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
..." to their base letter, plus 14 extra cases to match the existing
unaccent.rules file. If you sort and diff this and the existing file,
you can see that this file only adds new lines. Also, here is the
script I used to build it from UnicodeData.txt.
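The attached script itself parses UnicodeData.txt; a rough equivalent of the name-pattern approach can be sketched with Python's unicodedata module (an assumption about how the script works on my part, and the interpreter's Unicode tables are newer than 7.0, so counts will differ from 702):

```python
# Map characters named "LATIN (SMALL|CAPITAL) LETTER X WITH ..." to their
# base letter, using names from the running Python's Unicode database.
import re
import unicodedata

pattern = re.compile(r"LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")

def name_based_rules(limit=0x1000):
    rules = {}
    for cp in range(limit):
        ch = chr(cp)
        m = pattern.match(unicodedata.name(ch, ""))
        if m:
            base = m.group(2)
            rules[ch] = base if m.group(1) == "CAPITAL" else base.lower()
    return rules

rules = name_based_rules()
print(rules["\u021b"])  # t: the t-comma character is caught by the pattern
```

Note that ß ("LATIN SMALL LETTER SHARP S") contains no "WITH", which is why such characters need the hand-added extra cases.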
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
...snip...
Here is an unaccent.rules file that maps those 702 characters from
Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
..." to their base letter, plus 14 extra cases to match the existing
unaccent.rules file. If you sort and diff this and the existing file,
you can see that this file only adds new lines. Also, here is the
script I used to build it from UnicodeData.txt.
Hm. The "extra cases" are pretty disturbing, because some of them sure
look like bugs; which makes me wonder how closely the unaccent.rules
file was vetted to begin with. For those following along at home,
here are Thomas' extra cases, annotated by me with the Unicode file's
description of each source character:
print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
print_record(0x00e6, "a") # LATIN SMALL LETTER AE
print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
print_record(0x0138, "k") # LATIN SMALL LETTER KRA
print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
print_record(0x014b, "n") # LATIN SMALL LETTER ENG
print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO
I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?
Also unclear why we're dealing with KRA and ENG but not any of the
other marginal letters that Unicode labels as LATIN (what the heck
is an "AFRICAN D", for instance?)
Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal. Comments from German speakers welcome of course.
Likewise dubious about those Cyrillic entries, although I suppose
Teodor probably had good reasons for including them.
On the other side of the coin, I think Thomas' regex might have swept up a
bit too much. I did this to see what sort of decorations were described:
$ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
34 ACUTE
2 ACUTE AND DOT ABOVE
4 BAR
2 BELT
12 BREVE
2 BREVE AND ACUTE
2 BREVE AND DOT BELOW
2 BREVE AND GRAVE
2 BREVE AND HOOK ABOVE
2 BREVE AND TILDE
2 BREVE BELOW
34 CARON
2 CARON AND DOT ABOVE
22 CEDILLA
2 CEDILLA AND ACUTE
2 CEDILLA AND BREVE
26 CIRCUMFLEX
6 CIRCUMFLEX AND ACUTE
6 CIRCUMFLEX AND DOT BELOW
6 CIRCUMFLEX AND GRAVE
6 CIRCUMFLEX AND HOOK ABOVE
6 CIRCUMFLEX AND TILDE
12 CIRCUMFLEX BELOW
4 COMMA BELOW
4 CROSSED-TAIL
7 CURL
8 DESCENDER
19 DIAERESIS
4 DIAERESIS AND ACUTE
2 DIAERESIS AND CARON
2 DIAERESIS AND GRAVE
6 DIAERESIS AND MACRON
2 DIAERESIS BELOW
8 DIAGONAL STROKE
39 DOT ABOVE
4 DOT ABOVE AND MACRON
38 DOT BELOW
2 DOT BELOW AND DOT ABOVE
4 DOT BELOW AND MACRON
4 DOUBLE ACUTE
2 DOUBLE BAR
12 DOUBLE GRAVE
1 DOUBLE MIDDLE TILDE
1 FISHHOOK
1 FISHHOOK AND MIDDLE TILDE
5 FLOURISH
16 GRAVE
2 HIGH STROKE
30 HOOK
12 HOOK ABOVE
1 HOOK AND TAIL
1 HOOK TAIL
4 HORN
4 HORN AND ACUTE
4 HORN AND DOT BELOW
4 HORN AND GRAVE
4 HORN AND HOOK ABOVE
4 HORN AND TILDE
12 INVERTED BREVE
1 INVERTED LAZY S
3 LEFT HOOK
17 LINE BELOW
1 LONG LEFT LEG
1 LONG LEFT LEG AND LOW RIGHT RING
1 LONG LEG
2 LONG RIGHT LEG
2 LONG STROKE OVERLAY
4 LOOP
1 LOW RIGHT RING
1 LOW RING INSIDE
14 MACRON
4 MACRON AND ACUTE
2 MACRON AND DIAERESIS
4 MACRON AND GRAVE
2 MIDDLE DOT
1 MIDDLE RING
13 MIDDLE TILDE
1 NOTCH
10 OBLIQUE STROKE
10 OGONEK
2 OGONEK AND MACRON
17 PALATAL HOOK
9 RETROFLEX HOOK
1 RETROFLEX HOOK AND BELT
1 RIGHT HALF RING
1 RIGHT HOOK
6 RING ABOVE
2 RING ABOVE AND ACUTE
2 RING BELOW
1 SERIF
2 SHORT RIGHT LEG
2 SMALL LETTER J
1 SMALL LETTER Z
2 SQUIRREL TAIL
36 STROKE
2 STROKE AND ACUTE
2 STROKE AND DIAGONAL STROKE
4 STROKE THROUGH DESCENDER
4 SWASH TAIL
3 TAIL
16 TILDE
4 TILDE AND ACUTE
2 TILDE AND DIAERESIS
2 TILDE AND MACRON
6 TILDE BELOW
4 TOPBAR
Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL? Is removing those sorts of things even
legitimately "unaccenting"? I dunno, but I think it would be good to
have some consensus about what we want this file to do. I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.
regards, tom lane
On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
...snip...
I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?
Perhaps these conversions are intended only for comparisons, full text
indexing etc but not showing the converted text to a user, in which
case it doesn't matter too much if the conversions are a bit weird
(œuf and oeuf are interchangeable in French, but euf is nonsense).
But can we actually change them? That could cause difficulty for
users with existing unaccented data stored/indexed... But I suppose
even adding new mappings could cause problems.
...snip...
Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL? Is removing those sorts of things even
legitimately "unaccenting"? I dunno, but I think it would be good to
have some consensus about what we want this file to do. I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.
Right, that does seem a little bit weak. Instead of making
assumptions about the format of those names, we could make use of the
precomposed -> composed character mappings in the file. We could look
for characters in the "letters" category where there is decomposition
information (ie combining characters for the individual accents) and
the base character is [a-zA-Z]. See attached. This produces 411
mappings (including the 14 extras). I didn't spend the time to figure
out which 300 odd characters were dropped but I noticed that our
Romanian characters of interest are definitely in.
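The decomposition-based selection can be sketched the same way (again using Python's unicodedata rather than parsing UnicodeData.txt directly, so totals reflect the interpreter's Unicode version, not the 411 reported):

```python
# Select letters whose canonical (NFD) decomposition starts with a plain
# ASCII letter, and map them to that base letter.
import unicodedata

def decomposition_rules(limit=0x1000):
    rules = {}
    for cp in range(limit):
        ch = chr(cp)
        if not unicodedata.category(ch).startswith("L"):
            continue  # only letters
        decomp = unicodedata.normalize("NFD", ch)
        base = decomp[0]
        if len(decomp) > 1 and ("a" <= base <= "z" or "A" <= base <= "Z"):
            rules[ch] = base
    return rules

rules = decomposition_rules()
print(rules["\u021b"])  # t: ț decomposes to t + combining comma below
```

Characters with no decomposition (ß, dotless i, the ligatures Æ/Œ) fall through, which is exactly where the 14 manual extras come from.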
(There is a separate can of worms here about whether to deal with
decomposed text...)
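On that separate can of worms: text that arrives already decomposed ('e' plus a combining acute, rather than precomposed 'é') would miss single-character rules entirely. A common workaround outside unaccent itself is to normalize and drop combining marks (a generic technique, not something the module does):

```python
# Strip accents by decomposing to NFD and removing combining characters.
import unicodedata

def strip_marks(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

composed = "\u00e9"     # é as one code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
print(strip_marks(composed), strip_marks(decomposed))  # e e
```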
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
make_rules_decompose.py (text/x-python-script)
Tom Lane <tgl@sss.pgh.pa.us> wrote on 17.06.2015 00:01:48:
Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal. Comments from German speakers welcome of course.
The sharp-s 'ß' is historically a ligature of two different kinds of s,
of which the first one looks more like an f and the second one looks
either like a normal 's' or a 'z' (that's why it is called 'szlig' in
html). It is usually considered to be a lower-case only character, even
though an uppercase sharp-s has recently been defined. If you are using
an encoding that doesn't support 'ß', the rule is to substitute it with
'ss'. If you want to capitalize a word containing a 'ß', you substitute
it with 'SS'. For sorting purposes, DIN 5007 says that 'ß' should be
treated as 'ss'.
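For what it's worth, Python's case machinery encodes the same convention (an aside of mine, not from the thread):

```python
# Uppercasing ß yields SS, and case folding yields ss -- the ss substitution,
# never a bare s.
print("\u00df".upper())      # SS
print("Stra\u00dfe".upper()) # STRASSE
print("\u00df".casefold())   # ss
```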
That's just the German point of view. Things can be a little bit
different in other German-speaking countries, e.g. in Switzerland, where
you may always substitute 'ß' with 'ss' (even if your encoding has an 'ß').
In short: I would think that replacing 'ß' with 's' is wrong, and
certainly not an accent removal.
Best regards,
Curd
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?
Perhaps these conversions are intended only for comparisons, full text
indexing etc but not showing the converted text to a user, in which
case it doesn't matter too much if the conversions are a bit weird
(œuf and oeuf are interchangeable in French, but euf is nonsense).
But can we actually change them? That could cause difficulty for
users with existing unaccented data stored/indexed... But I suppose
even adding new mappings could cause problems.
Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched. Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.
Right, that does seem a little bit weak. Instead of making
assumptions about the format of those names, we could make use of the
precomposed -> composed character mappings in the file. We could look
for characters in the "letters" category where there is decomposition
information (ie combining characters for the individual accents) and
the base character is [a-zA-Z]. See attached. This produces 411
mappings (including the 14 extras). I didn't spend the time to figure
out which 300 odd characters were dropped but I noticed that our
Romanian characters of interest are definitely in.
I took a quick look at this list and it seems fairly sane as far as the
automatically-generated items go, except that I see it hits a few
LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
I'm still quite dubious that that is appropriate; at least, if we do it
I think we should be expanding out to the equivalent multi-letter form,
not simply taking one of the letters and dropping the rest. Anybody else
have an opinion on how to handle ligatures?
The manually added special cases don't look any saner than they did
before :-(. Anybody have an objection to removing those (except maybe
dotless i) in HEAD?
regards, tom lane
On 2015-06-18 15:30:46 -0400, Tom Lane wrote:
Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched. Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.
Hm. Why is it safe to add new mappings? If previously something has been
indexed with accents because unaccent didn't remove them and you're now
adding a new mapping to unaccent, the tsearch query will lookup the
wrong key (with accents removed)?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2015-06-18 15:30:46 -0400, Tom Lane wrote:
Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched. Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.
Hm. Why is it safe to add new mappings? If previously something has been
indexed with accents because unaccent didn't remove them and you're now
adding a new mapping to unaccent, the tsearch query will lookup the
wrong key (with accents removed)?
This is the same situation as any change whatsoever to tsearch
dictionaries. The party line on that is that it usually doesn't
matter much, and if it does you can rebuild your indexes.
regards, tom lane
On 2015-06-18 16:30:46 -0400, Tom Lane wrote:
This is the same situation as any change whatsoever to tsearch
dictionaries. The party line on that is that it usually doesn't
matter much, and if it does you can rebuild your indexes.
I think that's an acceptable answer if the user changes their
dictionary, but if we do it for them it's different.
Andres Freund <andres@anarazel.de> writes:
On 2015-06-18 16:30:46 -0400, Tom Lane wrote:
This is the same situation as any change whatsoever to tsearch
dictionaries. The party line on that is that it usually doesn't
matter much, and if it does you can rebuild your indexes.
I think that's an acceptable answer if the user changes their
dictionary, but if we do it for them it's different.
So you're arguing that unaccent.rules is forever frozen, no matter whether
it's obviously broken or not?
regards, tom lane
On 2015-06-18 16:36:02 -0400, Tom Lane wrote:
...snip...
So you're arguing that unaccent.rules is forever frozen, no matter whether
it's obviously broken or not?
I think it's perfectly sensible to update the rules in master (even if
that has consequences for pg_upgraded databases). I'm just doubtful
about the merits of backpatching changes like this. But I'm not going to
fight hard/any further...
Andres Freund <andres@anarazel.de> writes:
I think it's perfectly sensible to update the rules in master (even if
that has consequences for pg_upgraded databases). I'm just doubtful
about the merits of backpatching changes like this.
Well, that's certainly a fair position. How do others feel?
regards, tom lane
Tom Lane wrote:
...snip...
Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched. Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.
To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ascii); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches. If we say "okay, but that funny
char is not an accent so let's leave it alone" then the charter doesn't
sound so useful to me.
The cases I care about are okay anyway, because all the funny chars in
spanish are already covered; and maybe German people always enter their
queries using the funny ss thing I can't even write, and then this is
not a problem for them.
Regarding back-patching unaccent.rules changes as discussed downthread,
I think it's okay to simply document that any indexes using the module
should be reindexed immediately after upgrading to that minor version.
The consequence of not doing so is not *that* serious anyway. But then,
since I'm not actually affected in any way, I'm not strongly holding
this position either.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services