BUG #15805: Problem with lower function for greek sigma (Σ) letter

Started by PG Bug reporting formalmost 7 years ago9 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15805
Logged by: Sergey kuznetsov
Email address: iksss.88@gmail.com
PostgreSQL version: 9.6.10
Operating system: linux
Description:

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

Thanks,
Sergey Kuznetsov

#2Daniel Gustafsson
daniel@yesql.se
In reply to: PG Bug reporting form (#1)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

On 15 May 2019, at 09:57, PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 15805
Logged by: Sergey kuznetsov
Email address: iksss.88@gmail.com
PostgreSQL version: 9.6.10
Operating system: linux
Description:

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

cheers ./daniel

Attachments:

sigma-wordfinal.patchapplication/octet-stream; name=sigma-wordfinal.patch; x-unix-mode=0644Download+25-1
#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Gustafsson (#2)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

On 2019-May-15, Daniel Gustafsson wrote:

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2019-May-15, Daniel Gustafsson wrote:

This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

I do not think the patch as given is acceptable in any case:

1. assumes without any evidence whatsoever that the system's wide-character
representation is Unicode code points;

2. assumes without checking that the locale is one that would allow this
conversion (counterexample: C locale);

3. unreasonable hard-coded assumption about what the "not a word character"
condition is.

It's possible that 1 and 2 could be finessed by checking both that the
original character is Σ and the new one is σ (in Unicode). We'd still
theoretically be taking a risk of the wrong substitution if the wchar
representation is not Unicode, but the odds seem fairly small. As for
point 3, why aren't you using iswalpha() on the next character?

regards, tom lane

#5Daniel Verite
daniel@manitou-mail.org
In reply to: PG Bug reporting form (#1)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

PG Bug reporting form wrote:

lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου"

With PostgreSQL version 10 or newer, you could use an ICU
locale. lower() would produce the expected result:

psql (11.3 (Debian 11.3-1.pgdg90+1))

=> select lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ' collate "und-x-icu");
lower
-----------------
δημοτες φαιστου
(1 row)

This case looks comparable to the case of the german ß (sharp s),
which should be upcased into 'SS', but the locales backed by libc
don't do that:

=> select upper(''Ich muß');
upper
---------
ICH MUß

For that exemple as well, an ICU locale produces a correct
result with regard to linguistic rules:

=> select upper('Ich muß' collate "und-x-icu");
upper
----------
ICH MUSS

The libc library provides an API with character-by-character
case conversions (tolower/toupper), which is too limited
to deal with the above cases, and PostgreSQL basically
just uses this API.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#5)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

"Daniel Verite" <daniel@manitou-mail.org> writes:

PG Bug reporting form wrote:

lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου"

With PostgreSQL version 10 or newer, you could use an ICU
locale. lower() would produce the expected result:

Oh, if using ICU already fixes this, I think we might as well just
say that you have to use ICU if you want the right behavior for such
cases.

The libc library provides an API with character-by-character
case conversions (tolower/toupper), which is too limited
to deal with the above cases, and PostgreSQL basically
just uses this API.

Right.

regards, tom lane

#7Daniel Gustafsson
daniel@yesql.se
In reply to: Alvaro Herrera (#3)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-May-15, Daniel Gustafsson wrote:

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

Hebrew doesn’t have case, so it doesn’t apply in this case.

cheers ./daniel

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Gustafsson (#7)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

On 2019-May-15, Daniel Gustafsson wrote:

On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

Hebrew doesn’t have case, so it doesn’t apply in this case.

Yeah, I realized that afterwards. Sorry for the noise.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#6)
Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

On 15 May 2019, at 17:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Daniel Verite" <daniel@manitou-mail.org> writes:

PG Bug reporting form wrote:

lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου"

With PostgreSQL version 10 or newer, you could use an ICU
locale. lower() would produce the expected result:

Oh, if using ICU already fixes this, I think we might as well just
say that you have to use ICU if you want the right behavior for such
cases.

Seems reasonable. Maybe it warrants a mention in the docs on the string
function page since it may suprise users?

cheers ./daniel

Attachments:

casefolding.diffapplication/octet-stream; name=casefolding.diff; x-unix-mode=0644Download+4-0