Case Insensitive Comparison with Postgres 12

Started by Igal @ Lucee.orgover 6 years ago10 messagesgeneral
Jump to latest
#1Igal @ Lucee.org
igal@lucee.org

I am trying to test a simple case insensitive comparison. Most likely the
collation that I chose is wrong, but I'm not sure how to choose the correct
one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

#2Morris de Oryx
morrisdeoryx@gmail.com
In reply to: Igal @ Lucee.org (#1)
Re: Case Insensitive Comparison with Postgres 12

As I understand it, custom collation are not applied globally. Meaning, you
have to associate a collation with a column or en expression with COLLATE.

#3Pavel Křehula
pavel.krehula@nlm.cz
In reply to: Igal @ Lucee.org (#1)
Re: Case Insensitive Comparison with Postgres 12

Hello,
use correct locale identifier, in your case it should be:
create collation "case_insensitive" (provider=icu,
locale="en-US-u-ks-level2", deterministic = false);

See
http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
for available options.

--
Pavel

Dne 09.10.2019 0:51:52, "Igal Sapir" <igal@lucee.org> napsal:

Show quoted text

I am trying to test a simple case insensitive comparison. Most likely
the collation that I chose is wrong, but I'm not sure how to choose the
correct one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

#4Wim Bertels
wim.bertels@ucll.be
In reply to: Igal @ Lucee.org (#1)
Re: Case Insensitive Comparison with Postgres 12

Using the datatype citext might be an alternative solution

Igal Sapir <igal@lucee.org> schreef op October 8, 2019 10:51:52 PM UTC:

I am trying to test a simple case insensitive comparison. Most likely
the
collation that I chose is wrong, but I'm not sure how to choose the
correct
one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

--
Verstuurd vanaf een fairphone met K-9 Mail. Excuseer mijn beknoptheid.

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Igal @ Lucee.org (#1)
Re: Case Insensitive Comparison with Postgres 12

Igal Sapir wrote:

I am trying to test a simple case insensitive comparison. Most likely the
collation that I chose is wrong, but I'm not sure how to choose the correct
one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Yes, the LOCALE is wrong. Use

create collation case_insensitive (
provider=icu, locale='en-US-u-ks-level2', deterministic=false
);

The name of the locale defines it.

My blog post can give a simple introduction:
https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#6Igal @ Lucee.org
igal@lucee.org
In reply to: Laurenz Albe (#5)
Re: Case Insensitive Comparison with Postgres 12

On 10/9/2019 12:34 AM, Laurenz Albe wrote:

Igal Sapir wrote:

I am trying to test a simple case insensitive comparison. Most likely the
collation that I chose is wrong, but I'm not sure how to choose the correct
one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Yes, the LOCALE is wrong. Use

create collation case_insensitive (
provider=icu, locale='en-US-u-ks-level2', deterministic=false
);

The name of the locale defines it.

My blog post can give a simple introduction:
https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Thank you all for replying.  I tried to use the locale suggested by both
Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
simple comparison of 'Abc' = 'abc'.  I tried the locale both as a
'string' and as an "identifier":

select version();

version |
-------------------------------------------------------------------------------------------------------|
PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit|

drop collation if exists case_insensitive;

create collation case_insensitive (

   provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false   |

What am I doing wrong here?

Thanks,

Igal

#7Thomas Kellerer
spam_eater@gmx.net
In reply to: Igal @ Lucee.org (#6)
Re: Case Insensitive Comparison with Postgres 12

Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:

Thank you all for replying. I tried to use the locale suggested by
both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
false for a simple comparison of 'Abc' = 'abc'. I tried the locale
both as a 'string' and as an "identifier":

drop collation if exists case_insensitive;

create collation case_insensitive (

   provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false   |

What am I doing wrong here?

Check the version of libicu that your Linux is using.
That locale format requires version 54 or later.
(My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53).

In another thread about ICU problems, Daniel Verite explained that in more detail:

With ICU 53 or older, instead of the locale above, we must use the old-style syntax:

locale = 'de-DE@colStrength=secondary'

In your case I guess, it should be

locale = 'en-US@colStrength=secondary'

Thomas

#8Igal @ Lucee.org
igal@lucee.org
In reply to: Thomas Kellerer (#7)
Re: Case Insensitive Comparison with Postgres 12

Thomas,

On 10/10/2019 6:22 AM, Thomas Kellerer wrote:

Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:

Thank you all for replying. I tried to use the locale suggested by
both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
false for a simple comparison of 'Abc' = 'abc'. I tried the locale
both as a 'string' and as an "identifier":

drop collation if exists case_insensitive;
create collation case_insensitive (

   provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false   |

What am I doing wrong here?

Check the version of libicu that your Linux is using.
That locale format requires version 54 or later.
(My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53).

In another thread about ICU problems, Daniel Verite explained that in more detail:

With ICU 53 or older, instead of the locale above, we must use the old-style syntax:

locale = 'de-DE@colStrength=secondary'

In your case I guess, it should be

locale = 'en-US@colStrength=secondary'

That works, thank you!

I also have CentOS installed on that machine: CentOS Linux release
7.7.1908 (Core), showing libicu Version 50.2 via `yum info libicu`.

Best,

Igal

#9Igal @ Lucee.org
igal@lucee.org
In reply to: Igal @ Lucee.org (#1)
Re: Case Insensitive Comparison with Postgres 12

On Fri, Oct 11, 2019 at 1:09 AM stan <stanb@panix.com> wrote:

On Thu, Oct 10, 2019 at 05:41:47AM -0700, Igal @ Lucee.org wrote:

On 10/9/2019 12:34 AM, Laurenz Albe wrote:

Igal Sapir wrote:

I am trying to test a simple case insensitive comparison. Most

likely the

collation that I chose is wrong, but I'm not sure how to choose the

correct

one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Yes, the LOCALE is wrong. Use

create collation case_insensitive (
provider=icu, locale='en-US-u-ks-level2', deterministic=false
);

The name of the locale defines it.

My blog post can give a simple introduction:

https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Thank you all for replying.?? I tried to use the locale suggested by both
Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
simple comparison of 'Abc' = 'abc'.?? I tried the locale both as a

'string'

and as an "identifier":

select version();

version |

-------------------------------------------------------------------------------------------------------|

PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5

20150623

(Red Hat 4.8.5-39), 64-bit|

drop collation if exists case_insensitive;

create collation case_insensitive (

???? provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false???? |

What am I doing wrong here?

Out of curiosity is there a eason not to use the citext type for th?

Using the collation seems like a much cleaner approach, and I trust ICU to
do a better job at comparing strings according to language rules etc.

Igal

#10Daniel Verite
daniel@manitou-mail.org
In reply to: Igal @ Lucee.org (#9)
Re: Case Insensitive Comparison with Postgres 12

Igal Sapir wrote:

Out of curiosity is there a eason not to use the citext type for th?

Using the collation seems like a much cleaner approach, and I trust ICU to
do a better job at comparing strings according to language rules etc.

One notable difference between citext and case-insensitive collations
by ICU is that the latter recognizes canonically equivalent sequences
of codepoints [1]https://en.wikipedia.org/wiki/Unicode_equivalence as equal, while the former does not.

For instance:

=# CREATE COLLATION ci (locale='und@colStrength=secondary',
provider='icu', deterministic=false);

=# SELECT E'E\u0302TES'::citext = 'Êtes'::citext AS "citext-equal",
E'E\u0302TES' = 'Êtes' collate "ci" AS "ci-equal";
citext-equal | ci-equal
--------------+----------
f | t

Another significant difference is that building or rebuilding an index on a
text column with a CI collation appears to be way faster than with citext
(I've seen 10:1 ratios, but do your own tests).

On the minus side, substring matching with LIKE or other methods
is not possible with CI collations whereas it does work with citext.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite