Add CASEFOLD() function.

Started by Jeff Davisover 1 year ago33 messageshackers

pgsql@j-davis.com

over 1 year ago

Unicode case folding is a way to convert a string to a canonical case
for the purpose of case-insensitive matching.

Users have long used LOWER() for that purpose, but there are a few edge
case problems:

* Some characters have more than two cased forms, such as "Σ" (U+03A3),
which can be lowercased as "σ" (U+03C3) or "ς" (U+03C2). The CASEFOLD()
function converts all cased forms of the character to "σ".

* The character "İ" (U+0130, capital I with dot) is lowercased to "i",
which can be a problem in locales that don't expect that.

* If new lower case characters are added to Unicode, the results of
LOWER() may change.

The CASEFOLD() function solves these problems.

Patch attached.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Ian Lawrence Barwick

barwick@gmail.com

over 1 year ago

In reply to: Jeff Davis (#1)

Re: Add CASEFOLD() function.

2024年12月12日(木) 18:00 Jeff Davis <pgsql@j-davis.com>:

Unicode case folding is a way to convert a string to a canonical case
for the purpose of case-insensitive matching.

Users have long used LOWER() for that purpose, but there are a few edge
case problems:

* Some characters have more than two cased forms, such as "Σ" (U+03A3),
which can be lowercased as "σ" (U+03C3) or "ς" (U+03C2). The CASEFOLD()
function converts all cased forms of the character to "σ".

* The character "İ" (U+0130, capital I with dot) is lowercased to "i",
which can be a problem in locales that don't expect that.

* If new lower case characters are added to Unicode, the results of
LOWER() may change.

The CASEFOLD() function solves these problems.

Patch attached.

I took a quick look at this as it sounds useful for the described issue,
and it seems to work as advertised, except the function is named "FOLDCASE()"
in the patch, so I'm wondering which is intended? A quick search indicates
there are no functions of either name in other databases; Python has a
"casefold()"
function [1]https://docs.python.org/3/library/stdtypes.html#str.casefold and PHP a "foldCase()" function [2]https://www.php.net/manual/en/intlchar.foldcase.php, so it doesn't seem there's a
de-facto standard for this.

[1]: https://docs.python.org/3/library/stdtypes.html#str.casefold
[2]: https://www.php.net/manual/en/intlchar.foldcase.php

Regards

Ian Barwick

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Ian Lawrence Barwick (#2)

Re: Add CASEFOLD() function.

On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:

and it seems to work as advertised, except the function is named
"FOLDCASE()"
in the patch, so I'm wondering which is intended?

Thank you for looking into this, I went back and forth on the name, and
mistyped it a few times.

ICU seems to use "foldcase":

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html

and it seems to be slightly more proper grammatically (with "fold"
being the verb). However, "case folding" is common terminology in
Postgres and Unicode, so "casefold" can be seen as the verb instead.

I don't have a strong opinion here so I will just go with whatever
seems like the popular choice.

Regards,
Jeff Davis

Joe Conway

mail@joeconway.com

over 1 year ago

In reply to: Jeff Davis (#3)

Re: Add CASEFOLD() function.

On 12/12/24 13:30, Jeff Davis wrote:

On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:

and it seems to work as advertised, except the function is named
"FOLDCASE()"
in the patch, so I'm wondering which is intended?

Thank you for looking into this, I went back and forth on the name, and
mistyped it a few times.

ICU seems to use "foldcase":

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html

and it seems to be slightly more proper grammatically (with "fold"
being the verb). However, "case folding" is common terminology in
Postgres and Unicode, so "casefold" can be seen as the verb instead.

I don't have a strong opinion here so I will just go with whatever
seems like the popular choice.

FWIW I prefer casefold()

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Joe Conway (#4)

Re: Add CASEFOLD() function.

On Thu, 2024-12-12 at 13:55 -0500, Joe Conway wrote:

I don't have a strong opinion here so I will just go with whatever
seems like the popular choice.

FWIW I prefer casefold()

Done. I just noticed that it now matches $SUBJECT. The fact that my
code didn't match the email subject before further supports the idea
that "foldcase" was never quite as natural -- so I agree that
"casefold" is the way to go.

One question I have is whether we want this function to normalize the
output.

I believe most usecases would want the output normalized, because
normalization differences (e.g. "a" U+0061 followed by "combining
acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
in case.

Of course, a user could wrap it with the normalize() function, but
that's verbose and easy to forget. I'm also not sure that it can be
made as fast as a combined function that does both.

And a follow-up question: if it does normalize, the second parameter
would be the requested normal form. But to accept the keyword forms
(NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
that a reasonable thing to do?

Regards,
Jeff Davis

Joe Conway

mail@joeconway.com

over 1 year ago

In reply to: Jeff Davis (#5)

Re: Add CASEFOLD() function.

On 12/16/24 12:49, Jeff Davis wrote:

One question I have is whether we want this function to normalize the
output.

I believe most usecases would want the output normalized, because
normalization differences (e.g. "a" U+0061 followed by "combining
acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
in case.

Of course, a user could wrap it with the normalize() function, but
that's verbose and easy to forget. I'm also not sure that it can be
made as fast as a combined function that does both.

Perhaps a one arg version that always casefolds and a two arg version
that accepts nfc, nfd, none (or something similar)?

And a follow-up question: if it does normalize, the second parameter
would be the requested normal form. But to accept the keyword forms
(NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
that a reasonable thing to do?

SQL 2023 seems to include the NORMALIZE syntax, but the only case
folding considered is UPPER and LOWER. As such, I think it ought to be a
function but not part of the grammar.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Andreas Karlsson

andreas.karlsson@percona.com

over 1 year ago

In reply to: Jeff Davis (#1)

Re: Add CASEFOLD() function.

On 12/12/24 10:00 AM, Jeff Davis wrote:

Patch attached.

I have not looked at the patch yet but +1 to the idea. I am leaning
towards that the function also optionally normalizing the codepoints
would be handy too since I think that is what most usecases want.
Otherwise people would have to always type normalize(casefold(str)).

Andreas

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Joe Conway (#6)

Re: Add CASEFOLD() function.

On Mon, 2024-12-16 at 16:27 -0500, Joe Conway wrote:

SQL 2023 seems to include the NORMALIZE syntax, but the only case
folding considered is UPPER and LOWER. As such, I think it ought to
be a
function but not part of the grammar.

Should the standard support something like the Unicode idea of case
folding? If so, do we need to be careful of conflicts?

Regards,
Jeff Davis

Peter Eisentraut

peter_e@gmx.net

over 1 year ago

In reply to: Jeff Davis (#5)

Re: Add CASEFOLD() function.

On 16.12.24 18:49, Jeff Davis wrote:

One question I have is whether we want this function to normalize the
output.

I believe most usecases would want the output normalized, because
normalization differences (e.g. "a" U+0061 followed by "combining
acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
in case.

Can you explain this in further detail? I don't quite follow why this
would be required.

Of course, a user could wrap it with the normalize() function, but
that's verbose and easy to forget. I'm also not sure that it can be
made as fast as a combined function that does both.

And a follow-up question: if it does normalize, the second parameter
would be the requested normal form. But to accept the keyword forms
(NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
that a reasonable thing to do?

That's maybe one reason to keep it separate.

Another might be that's not entirely clear how this should work in
encodings other than UTF-8. For example, the normalized string might
not be representable in the encoding.

#10

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Peter Eisentraut (#9)

Re: Add CASEFOLD() function.

On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:

Can you explain this in further detail? I don't quite follow why
this
would be required.

I am unsure now.

My initial reasoning was based on the idea that users would want to use
CASEFOLD(t) in a unique expression index as an improvement over
LOWER(t). And if you do that, you'd be surprised if some equivalent
strings ended up in the index. I don't think that's a huge problem,
because in other contexts we leave it up to the user to keep things
normalized consistently, and a CHECK(t IS NFC NORMALIZED) is a good way
to do that.

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistently with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.

I'm not sure if that problem is solvable, though, because what if the
input string is in both NFC and NFD, how do we know which normal form
to preserve?

We could tell users to use an expression index on
NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
boilerplate.

Another might be that's not entirely clear how this should work in
encodings other than UTF-8. For example, the normalized string might
not be representable in the encoding.

That's a good point.

Regards,
Jeff Davis

#11

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Jeff Davis (#10)

Re: Add CASEFOLD() function.

On Thu, 2024-12-19 at 09:51 -0800, Jeff Davis wrote:

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not
be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistently with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.

I'm not sure if that problem is solvable, though, because what if the
input string is in both NFC and NFD, how do we know which normal form
to preserve?

The options as I see it are:

1. Normalize the output (either by using an extra parameter or just
always normalizing to NFC). As you said, the problem here is that the
encoding might not work with normalization. One solution might be that
CASEFOLD() only works in UTF8, like NORMALIZE().

2. Try to preserve normalization as long as the encoding supports it.
The problem here is that we don't know what normal form to preserve,
because the input string might be in both NFC and NFD. We could
document that it preserves NFC form iff the input is NFC.

3. Allow CASEFOLD() to break the normal form of the input string. The
problem here is that the user may be surprised that the output is not
normalized even when all of their data is normalized. It's not clear to
me whether it still works for caseless matching -- it might if the
string is in a consistent form, even if not normalized.

Out of those I think #1 is the most appealing. Most users, and
especially users that care about these edge cases enough to use Full
Case Folding, are almost certainly going to be on UTF8 anyway. It's
also the most user-friendly.

Regards,
Jeff Davis

#12

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Jeff Davis (#11)

Re: Add CASEFOLD() function.

On Wed, 2025-01-08 at 15:19 -0800, Jeff Davis wrote:

3. Allow CASEFOLD() to break the normal form of the input string. The
problem here is that the user may be surprised that the output is not
normalized even when all of their data is normalized. It's not clear
to
me whether it still works for caseless matching -- it might if the
string is in a consistent form, even if not normalized.

Looking at the Unicode standard again, it distinguishes between
"default caseless matching" and "canonical caseless matching". The
latter accounts for a few nuances that the former does not. See Unicode
16.0 section 3.13.4 D144 & D145. Using Default Caseless Matching
simplifies things quite a bit.

We could argue that it would be nice to have canonical caseless
matching, but that seems to be going above and beyond what Unicode
suggests. And normalization is expensive -- if we combine case folding
and normalization, there's no way for the user to avoid the cost. So
I'm changing my answer to #3, and we just document that it does not
preserve normalization. I believe this means that Peter and I are now
in agreement[1]/messages/by-id/8c384b0d-00f2-4515-8e60-ff7d0d4c093a@eisentraut.org, though I'm not sure if his reasoning is the same.

New patch series attached.

Regards,
Jeff Davis

[1]: /messages/by-id/8c384b0d-00f2-4515-8e60-ff7d0d4c093a@eisentraut.org
/messages/by-id/8c384b0d-00f2-4515-8e60-ff7d0d4c093a@eisentraut.org

#13

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Jeff Davis (#12)

Re: Add CASEFOLD() function.

On Fri, 2025-01-10 at 16:27 -0800, Jeff Davis wrote:

New patch series attached.

v5 attached.

This version is rebased over the Full Case Mapping support, and
supports Default Case Folding when using the PG_UNICODE_FAST collation.

That means that "ẞ", "ß", "SS", "Ss", and "ss" all fold to "ss"; and
"Σ", "σ", and "ς" all fold to "σ".

CASEFOLD() is better (according to Unicode, anyway) than LOWER() for
caseless matching, or in an expression index to enforce case-
insensitive uniqueness without relying on ICU.

Additionally, the infrastructure in this patch (as well as 286a365b9c)
can be used in the future for better case-insensitive pattern matching,
or casefolding identifiers in the parser without relying on libc.

I feel this is about ready for commit. The main point of discussion was
whether CASEFOLD() would do normalization, and if so, what the SQL API
would look like. I concluded upthread that it was unnecessary to meet
the Unicode Default Case Folding behavior, and we should just leave
normalization as a separate process. If someone disagrees with
reasoning, please let me know.

Regards,
Jeff Davis

[1]: /messages/by-id/610a56de2bd958e96c149ca60420db30e7d51588.camel@j-davis.com
/messages/by-id/610a56de2bd958e96c149ca60420db30e7d51588.camel@j-davis.com

#14

Jeff Davis

pgsql@j-davis.com

over 1 year ago

In reply to: Jeff Davis (#13)

Re: Add CASEFOLD() function.

On Fri, 2025-01-17 at 16:34 -0800, Jeff Davis wrote:

v5 attached.

v6 attached. I plan to commit this soon.

A couple things to note:

* The ICU API for lower/title/uppercasing is slightly different from
folding. The former accept a locale, while the latter just has an
option which is relevant only to languages 'az' and 'tr'. So the patch
checks for those two languages to enable the option, so that ICU is
consistently locale-aware for all the functions. I also added ICU
tests.

* I'm leaving out the normalization, which is not required for Unicode
Default Caseless Matchinng, as mentioned in the last email. That
simplifies the SQL function as well as the implementation. There was
some discussion on normalization upthread.

Regards,
Jeff Davis

#15

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Jeff Davis (#14)

Re: Add CASEFOLD() function.

Jeff Davis <pgsql@j-davis.com> writes:

v6 attached. I plan to commit this soon.

The documentation for this function is giving the PDF docs build
indigestion:

[WARN] FOUserAgent - Glyph "?" (0x3a3, Sigma) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x3c3, sigma) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x3c2, sigma1) not available in font "Courier".

Found characters that cannot be output in the PDF document; see README.non-ASCII

Not sure about a good workaround for this. Are there any characters
within LATIN-1 that have interesting case-folding behavior?

regards, tom lane

#16

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Tom Lane (#15)

Re: Add CASEFOLD() function.

On Sat, 2025-01-25 at 00:00 -0500, Tom Lane wrote:

Found characters that cannot be output in the PDF document; see
README.non-ASCII

Thank you, fixed.

Not sure about a good workaround for this. Are there any characters
within LATIN-1 that have interesting case-folding behavior?

I just removed that example. There's already another example using ß
(U+00DF), though that only applies to PG_UNICODE_FAST (the new
collation that performs full case mapping and now full case folding).

Regards,
Jeff Davis

#17

Vik Fearing

vik@postgresfriends.org

10 months ago

In reply to: Jeff Davis (#5)

Re: Add CASEFOLD() function.

On 16/12/2024 18:49, Jeff Davis wrote:

One question I have is whether we want this function to normalize the
output.

Yes, we do.

I am sorry that I am so late to the party, but I am currently writing
the Change Proposal for the SQL Standard for this function.

For <fold> (which includes LOWER() and UPPER()), the text says in
Section 6.35 GR 7.e:

If the character set of <character factor> is UTF8, UTF16, or UTF32,
then FR is replaced by
    Case:
        i) If the <search condition> S IS NORMALIZED evaluates to True,
then NORMALIZE (FR)
        ii) Otherwise, FR.

Here, FR is the result of the function and S is its argument.

It does not appear to me that our LOWER and UPPER functions obey this
rule, so there is a valid argument that we should continue to ignore it.
Or, we can say that we have at least one of three compliant.

Vik Fearing

#18

Jeff Davis

pgsql@j-davis.com

10 months ago

In reply to: Vik Fearing (#17)

Re: Add CASEFOLD() function.

On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:

If the character set of <character factor> is UTF8, UTF16, or UTF32,
then FR is replaced by
     Case:
         i) If the <search condition> S IS NORMALIZED evaluates to
True, then NORMALIZE (FR)
         ii) Otherwise, FR.

I read that as "if the input is normalized, then the output should be
normalized", IOW preserve the normalization. But does it mean "preserve
whatever the input normal form is" or "preserve NFC if the input is
NFC, otherwise the normalization is undefined"?

The above wording seems to mean "preserve NFC if the input is NFC",
because that's what NORMALIZE(FR) does when the normal form is
unspecified.

It does not appear to me that our LOWER and UPPER functions obey this
rule,

You are correct:

WITH s(t) AS
(SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
?column?
----------
f

so there is a valid argument that we should continue to ignore it.
Or, we can say that we have at least one of three compliant.

What do other databases do?

Given how costly normalization can be, imposing that on every caller
seems like a bit much. And favoring NFC for the user unconditionally
might not be the best thing. Then again, NFC is good most of the time,
and there are patches to speed up normalization.

I tend to think that a lot of users who want casefolding would also
want normalization, but it's hard to weigh that against the performance
cost. It might not matter outside of a few edge cases, though I'm not
sure exactly how many.

Regards,
Jeff Davis

#19

Vik Fearing

vik@postgresfriends.org

10 months ago

In reply to: Jeff Davis (#18)

Re: Add CASEFOLD() function.

On 17/06/2025 20:14, Jeff Davis wrote:

On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:

If the character set of <character factor> is UTF8, UTF16, or UTF32,
then FR is replaced by
     Case:
         i) If the <search condition> S IS NORMALIZED evaluates to
True, then NORMALIZE (FR)
         ii) Otherwise, FR.

I read that as "if the input is normalized, then the output should be
normalized", IOW preserve the normalization. But does it mean "preserve
whatever the input normal form is" or "preserve NFC if the input is
NFC, otherwise the normalization is undefined"?

The above wording seems to mean "preserve NFC if the input is NFC",
because that's what NORMALIZE(FR) does when the normal form is
unspecified.

Yes, and that is also the default for <normalized predicate>.

It does not appear to me that our LOWER and UPPER functions obey this
rule,

You are correct:

WITH s(t) AS
(SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
?column?
----------
f

so there is a valid argument that we should continue to ignore it.
Or, we can say that we have at least one of three compliant.

What do other databases do?

I don't know. I am just pointing out what the Standard says. I think
we should either comply, or say that we don't do it for LOWER and UPPER
so let's keep things implementation-consistent.

Given how costly normalization can be, imposing that on every caller
seems like a bit much.

How much does it cost to check for NFC? I honestly don't know the
answer to that question, but that is the only case where we need to
maintain normalization.

And favoring NFC for the user unconditionally
might not be the best thing. Then again, NFC is good most of the time,
and there are patches to speed up normalization.

It's not unconditionally, it's only if the input was NFC.

I tend to think that a lot of users who want casefolding would also
want normalization, but it's hard to weigh that against the performance
cost. It might not matter outside of a few edge cases, though I'm not
sure exactly how many.

I defer to you and others in the thread to make this decision.

Vik Fearing

#20

Jeff Davis

pgsql@j-davis.com

10 months ago

In reply to: Vik Fearing (#19)

Re: Add CASEFOLD() function.

On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:

I don't know. I am just pointing out what the Standard says. I
think
we should either comply, or say that we don't do it for LOWER and
UPPER
so let's keep things implementation-consistent.

For the standard, I see two potential philosophies:

I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
preserve NFC in the same way.

II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
text value that is useful for caseless matching, but should not
ordinarily be used for display or sent to the application (those things
would be allowed, just not encouraged). For normalization, either:
(A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
don't require any kind of normalization; or
(B) Follow Unicode Canonical Caseless Matching (D145), and require
that the input and output are normalized appropriately, but leave the
precise normal form as implementation-defined.

The current implementation could either be seen as philosophy (I) where
we've chosen to ignore the normalization part for the sake of
consistency with LOWER()/UPPER(); or it could be seen as philosophy
(II)(A).

How much does it cost to check for NFC? I honestly don't know the
answer to that question, but that is the only case where we need to
maintain normalization.

I attached a very rough patch and ran a very simple test on strings
averaging 36 bytes in length, all already in NFC and the result is also
NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
seconds, afterward about 8.

There's a patch to optimize some of the normalization paths, which I
haven't had a chance to review yet. So those numbers might come down.

It's not unconditionally, it's only if the input was NFC.

Optimizing the case where the input is _not_ NFC seems strange to me.
If we are normalizing the output, I'd say we should just make the
output always NFC. Being more strict, this seems likely to comply with
the eventual standard.

Additionally, if we are normalizing the output, then we should also do
the input fixup for U+0345, which would make the result usable for
Canonical Caseless Matching. Again, this seems likely to comply with
the eventual standard.

So I only see two reasonable implementations:

1. The current CASEFOLD() implementation.

2. Do the input fixup for U+0345 and unconditionally normalize the
output in NFC.

If there's a case to be made for both implementations, we could also
consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
for #2. I'm not sure whether we'd want to standardize one or both of
those functions.

And if you think there's likely to be a collision with the standard
that's hard to anticipate and fix now, then we should consider
reverting CASEFOLD() for 18 and wait for more progress on the
standardization. What's the likelihood that the name changes or
something like that?

Regards,
Jeff Davis

#21