BUG #10589: hungarian.stop file spelling error

Started by Noname, almost 12 years ago, 10 messages, bugs
#1 Noname
zsoros@gmail.com

The following bug has been logged on the website:

Bug reference: 10589
Logged by: Sörös Zoltán
Email address: zsoros@gmail.com
PostgreSQL version: 9.3.4
Operating system: Linux
Description:

Hi!
The 'hungarian.stop' file (for tsearch, located in
src/backend/snowball/stopwords in the source tarball) contains the õ
('otilde' in HTML) character instead of the correct 'ő' character. (There
are 7 occurrences in this file.)

Our database uses latin2 encoding, where we use the correct 'ő' characters.
Here's an excerpt from today's log:

< 2014-06-10 08:49:24.416 CEST >ERROR: character with byte sequence 0xc3
0xb5 in encoding "UTF8" has no equivalent in encoding "LATIN2"
< 2014-06-10 08:49:24.416 CEST >CONTEXT: line 58 of configuration file
"/usr/pgsql-9.3/share/tsearch_data/hungarian.stop"

After I replaced the tilde-capped letters in hungarian.stop file, the
problem vanished, and tsearch works fine.
I'm sorry, I can't give you the utf8 byte sequence for 'ő', but I can send
the corrected hungarian.stop file if needed.
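
The error in the log excerpt can be reproduced outside PostgreSQL. The following is a minimal Python sketch, assuming Python's "iso8859_2" codec is a fair stand-in for the server's LATIN2 encoding; the helper name fits_latin2 is made up for illustration.

```python
# The two bytes named in the error message, as read from the shipped file.
bad_bytes = bytes([0xC3, 0xB5])
ch = bad_bytes.decode("utf-8")
print(f"{ch!r} is U+{ord(ch):04X}")   # 'õ' is U+00F5


def fits_latin2(text: str) -> bool:
    """Return True if every character has an ISO-8859-2 (LATIN2) code."""
    try:
        text.encode("iso8859_2")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin2(ch))        # o-tilde, the character found in the file
print(fits_latin2("\u0151"))  # o-double-acute, the character that was intended
```

The o-tilde has no LATIN2 code, which is consistent with the conversion failure the server reports when it loads the stopword file.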

Please fix this file in the next release.

Thanks in advance,
Zoltán Sörös

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Noname (#1)
Re: BUG #10589: hungarian.stop file spelling error

"zsoros@gmail.com" <zsoros@gmail.com> wrote:

I'm sorry, I can't give you the utf8 byte sequence for 'ő'

A quick copy/paste from your email into psql (using UTF-8 encoding)
shows:

test=# select to_hex(ascii('ő'));
 to_hex
--------
 151
(1 row)

test=# select E'\u0151', convert_to(E'\u0151', 'UTF8');
 ?column? | convert_to
----------+------------
 ő        | \xc591
(1 row)
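
The same two bytes can be cross-checked without a database. A small Python sketch that builds the UTF-8 form of U+0151 by hand from the two-byte bit layout and compares it with the built-in encoder (the variable names are only illustrative):

```python
ch = "\u0151"                     # LATIN SMALL LETTER O WITH DOUBLE ACUTE
cp = ord(ch)

# Two-byte UTF-8 form, valid for U+0080..U+07FF: 110xxxxx 10xxxxxx
lead = 0xC0 | (cp >> 6)           # top five bits of the code point
cont = 0x80 | (cp & 0x3F)         # low six bits of the code point
by_hand = bytes([lead, cont])

print(hex(cp))                    # 0x151
print(by_hand.hex())              # c591
print(ch.encode("utf-8").hex())   # c591
```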

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#2)
Re: BUG #10589: hungarian.stop file spelling error

Kevin Grittner <kgrittn@ymail.com> writes:

"zsoros@gmail.com" <zsoros@gmail.com> wrote:

I'm sorry, I can't give you the utf8 byte sequence for 'ő'

A quick copy/paste from your email into psql (using UTF-8 encoding)
shows:
[ it's U+0151 ]

I believe that the way we got this file in the first place was to
scrape it from
http://snowball.tartarus.org/algorithms/hungarian/stop.txt
since it's not in the Snowball distribution. It looks to me like the
webserver delivers that page in LATIN1 (ISO-8859-1) encoding, which would
go far towards explaining the encoding problem, since U+0151 isn't
representable in LATIN1. So now I'm wondering what other similar mistakes
there may be in the non-LATIN1 languages.

I have an inquiry in to the upstream Snowball list asking if there's a
safer way to obtain copies of their stopword files.

regards, tom lane


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: BUG #10589: hungarian.stop file spelling error

I wrote:

[ we seem to have gotten a misencoded version of hungarian.stop ]

Actually, it looks like things are even worse than that: the Hungarian
stemmer code seems to be confused about this too. In the first place,
we've got a LATIN1 version of that stemmer, which I would imagine is
entirely useless; and in the second place, the UTF8 version has no
reference to any non-LATIN1 characters.

Again, I'm suspecting this problem goes further than Hungarian,
because the set of stem_ISO_8859_1_foo.c files in
src/backend/snowball/libstemmer/ covers a lot more languages than
I think LATIN1 is meant to cope with. I'm not sure how much of this
is broken in the original Snowball code and how much is our error
while importing the code.

regards, tom lane


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: BUG #10589: hungarian.stop file spelling error

I wrote:

[ we seem to have gotten a misencoded version of hungarian.stop ]

After further analysis, it appears that:

1. The cause of the immediately complained-of problem is that we took
the stopword file we got from the Snowball website to be in LATIN1,
whereas it evidently was meant to be in LATIN2. The problematic
characters were code 0xF5 in the file, which we translated to U+00F5,
but the correct translation is U+0151. (There is another discrepancy
between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
none of those in the stopword file.)

2. The Snowball people were just as confused as we were about the
appropriate encoding to use for Hungarian: their code claims that the
Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
character codes used in it:

/* special characters (in ISO Latin I) */

stringdef a' hex 'E1' //a-acute
stringdef e' hex 'E9' //e-acute
stringdef i' hex 'ED' //i-acute
stringdef o' hex 'F3' //o-acute
stringdef o" hex 'F6' //o-umlaut
stringdef oq hex 'F5' //o-double acute
stringdef u' hex 'FA' //u-acute
stringdef u" hex 'FC' //u-umlaut
stringdef uq hex 'FB' //u-double acute

Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
and u-double-acute don't appear in LATIN1 at all, and the codes shown here
are really for LATIN2.

I've reported this issue upstream and there are fixes pending.

3. While I was concerned that there might be similar bugs in the other
Snowball stemmers, it appears after a bit of research that LATIN1 is
commonly used as an encoding for all the other languages the Snowball
code claims it can be used for, even though in a few cases there are
seldom-used characters that LATIN1 can't represent. So there's not a
clear reason to think there are any other undetected problems (and
I would certainly not be the man to find them if they exist).
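
Points 1 and 2 can be checked mechanically. The sketch below decodes each byte value from the stringdef table quoted above under both encodings and reports where they disagree. It assumes Python's "latin-1" and "iso8859_2" codecs correspond to LATIN1 and LATIN2; the dictionary and helper names are made up for illustration.

```python
# Byte values copied from the Snowball stringdef table quoted above.
STRINGDEFS = {
    "a-acute": 0xE1,
    "e-acute": 0xE9,
    "i-acute": 0xED,
    "o-acute": 0xF3,
    "o-umlaut": 0xF6,
    "o-double acute": 0xF5,
    "u-acute": 0xFA,
    "u-umlaut": 0xFC,
    "u-double acute": 0xFB,
}


def decode_byte(value: int, encoding: str) -> str:
    """Decode one byte value under the given single-byte encoding."""
    return bytes([value]).decode(encoding)


# Collect only the entries whose meaning depends on the encoding.
mismatches = {}
for name, code in STRINGDEFS.items():
    as_latin1 = decode_byte(code, "latin-1")
    as_latin2 = decode_byte(code, "iso8859_2")
    if as_latin1 != as_latin2:
        mismatches[name] = (as_latin1, as_latin2)

for name, (l1, l2) in mismatches.items():
    print(f"{name}: LATIN1 U+{ord(l1):04X}, LATIN2 U+{ord(l2):04X}")
```

Only the two double-acute entries differ: 0xF5 is U+00F5 in LATIN1 but U+0151 in LATIN2, and 0xFB is U+00FB in LATIN1 but U+0171 in LATIN2.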

I've gone ahead and committed the encoding fix for hungarian.stop in all
active branches. I'm going to wait for Snowball upstream to accept the
proposed patches before I think about incorporating the code changes.
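
For anyone who needs to repair a local copy by hand, one possible approach is to undo the LATIN1 misreading and reinterpret the bytes as LATIN2. This is only a sketch of that idea and not the committed fix; the function name and the sample word are hypothetical, and the sample is not claimed to be a line from the file.

```python
def redecode_latin1_as_latin2(text: str) -> str:
    """Undo a LATIN1 misreading of text whose bytes were really LATIN2.

    Raises UnicodeEncodeError if the text contains characters outside
    LATIN1, since those cannot have come from a LATIN1 misreading.
    """
    raw = text.encode("latin-1")      # recover the original single bytes
    return raw.decode("iso8859_2")    # reinterpret them as LATIN2


damaged = "\u00f5ket"                 # hypothetical sample with o-tilde
repaired = redecode_latin1_as_latin2(damaged)
print(repaired == "\u0151ket")        # o-tilde became o-double-acute
```

Characters whose codes agree in both encodings, including plain ASCII and the acute and umlaut vowels, pass through unchanged, so the round trip only touches bytes whose meaning differs between LATIN1 and LATIN2.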

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text. If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release. Perhaps we should only make the code changes
in HEAD and 9.4?

regards, tom lane


#6 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Tom Lane (#5)
Re: BUG #10589: hungarian.stop file spelling error

On 11/06/14 15:09, Tom Lane wrote:

[ full text of the analysis in #5 snipped ]

Not saying there is any problem, but you might like to check how the EUR
currency symbol is handled (it is in LATIN2, but not in LATIN1):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € euro sign
(HTML: |&#8364;| |&euro;|)

Cheers,
Gavin

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Gavin Flower (#6)
Re: BUG #10589: hungarian.stop file spelling error

Gavin Flower wrote:

Not saying there is any problem, but you might like to check how the
EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € euro sign
(HTML: |&#8364;| |&euro;|)

Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented
IIUC.
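
Which of these encodings carry a euro sign can be checked quickly. A small Python sketch, assuming Python's codec tables match the ISO 8859 parts that PostgreSQL calls LATIN1, LATIN2, LATIN9 and LATIN10 (the helper name euro_byte is made up for illustration):

```python
EURO = "\u20ac"


def euro_byte(encoding: str):
    """Return the single-byte code for the euro sign, or None if absent."""
    try:
        return EURO.encode(encoding)[0]
    except UnicodeEncodeError:
        return None


for label, codec in [
    ("LATIN1", "latin-1"),
    ("LATIN2", "iso8859_2"),
    ("LATIN9", "iso8859_15"),
    ("LATIN10", "iso8859_16"),
]:
    code = euro_byte(codec)
    print(label, "absent" if code is None else hex(code))
```

Under that assumption, the euro sign is missing from both LATIN1 and LATIN2 and sits at 0xA4 in LATIN9 and LATIN10.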

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#7)
Re: BUG #10589: hungarian.stop file spelling error

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Gavin Flower wrote:

Not saying there is any problem, but you might like to check how the
EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):

Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented
IIUC.

Yeah, I doubt there's much to be learned from the euro-sign case.
The Snowball stemmers certainly don't care about euro --- they
only work with alphabetic characters.

Actually, an interesting point is that we could probably use one of the
single-byte-encoding LATIN1 stemmers when the database encoding is LATIN9,
and thereby save a translation to UTF8 and back, since the stemmer logic
isn't going to care about euro signs. Likewise for LATIN2 vs LATIN10.
Not sure it's worth the trouble though.
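
As a rough way to size up that idea, the sketch below lists every byte value that decodes differently under LATIN1 and LATIN9. It assumes Python's "latin-1" and "iso8859_15" codecs match those encodings, and the function name is hypothetical. Whether any stemmer's rules actually touch the affected characters is not something this sketch can tell.

```python
def differing_bytes(enc_a: str, enc_b: str) -> dict:
    """Map each byte value that decodes differently to its two characters."""
    diffs = {}
    for value in range(0x100):
        raw = bytes([value])
        a = raw.decode(enc_a)
        b = raw.decode(enc_b)
        if a != b:
            diffs[value] = (a, b)
    return diffs


diffs = differing_bytes("latin-1", "iso8859_15")

# Positions where the LATIN9 character is a letter rather than a symbol.
letters = {v: pair for v, pair in diffs.items() if pair[1].isalpha()}

for value, (l1, l9) in sorted(diffs.items()):
    print(f"0x{value:02X}: LATIN1 U+{ord(l1):04X}, LATIN9 U+{ord(l9):04X}")
```

The two encodings differ at eight byte values. One is the euro sign at 0xA4; the other seven are letters in LATIN9 that replace LATIN1 symbols, so the shortcut would need a check that no LATIN1 stemmer assigns meaning to those positions.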

regards, tom lane


#9 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: BUG #10589: hungarian.stop file spelling error

On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text. If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release. Perhaps we should only make the code changes
in HEAD and 9.4?

Does this affect any tsvectors stored in earlier major releases that
would read differently after this patch? Does it cause a pg_upgrade
problem?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: BUG #10589: hungarian.stop file spelling error

Bruce Momjian <bruce@momjian.us> writes:

On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text. If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release. Perhaps we should only make the code changes
in HEAD and 9.4?

Does this affect any tsvectors stored in earlier major releases that
would read differently after this patch? Does it cause a pg_upgrade
problem?

My guess is the field usage of the Hungarian stemmer is near zero,
or somebody would've complained about this before. Hence, I'm not
thinking we should expend any huge effort to work around problems.

In any case, Oleg and Teodor have opined in the past that small changes
in dictionary behavior don't cause major practical problems; the worst
case is that some words aren't found by searches because the current
dictionary normalizes them differently than what's in the index.
You can get around that if you have to by entering the tsquery manually
rather than going through to_tsquery.

regards, tom lane
