Initcap works differently with different locale providers

Started by Oleg Tselebrovskiy over 1 year ago, 16 messages, docs
#1Oleg Tselebrovskiy
o.tselebrovskiy@postgrespro.ru

Greetings, everyone!

One of our clients has found a difference in the behaviour of the
initcap function when using different locale providers, shown below:

postgres=# create database test_db_1 locale_provider=icu
locale="ru_RU.UTF-8" template=template0;
NOTICE: using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
CREATE DATABASE
postgres=# \c test_db_1;
You are now connected to database "test_db_1" as user "postgres".
test_db_1=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.ю.
(1 row)
test_db_1=# select initcap('joHn d.e.');
initcap
-----------
John D.e.
(1 row)
postgres=# create database test_db_2 locale_provider=libc
locale="ru_RU.UTF-8" template=template0;
CREATE DATABASE
postgres=# \c test_db_2
You are now connected to database "test_db_2" as user "postgres".
test_db_2=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.Ю.
(1 row)
test_db_2=# select initcap('joHn d.e.');
initcap
-----------
John D.E.
(1 row)

And an easier reproduction (should work for REL_12_STABLE and up)

postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
initcap
--------------
First.second
(1 row)
postgres=# SELECT initcap('first.second' COLLATE "en_US");
initcap
--------------
First.Second
(1 row)

This behaviour is reproducible on REL_12_STABLE and up to master

I don't believe that this is erroneous behaviour, just differing
behaviour, hence only a documentation change is proposed.

I suggest adding a clarification that this function works differently
with the libc and ICU providers, because they differ in what they
consider a "word".

In libc, a word is a sequence of alphanumeric characters, separated by
non-alphanumeric characters (as the documentation currently says).
In ICU, words are divided according to Unicode® Standard Annex #29 [1].
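
To make the two definitions concrete, here is a minimal Python sketch. It is an illustration only, not the code PostgreSQL or ICU runs. The names `initcap_sketch` and `MID_LETTERS` are assumptions made for this example, and `MID_LETTERS` is only a small subset of the characters that UAX #29 (rules WB6/WB7) keeps inside a word when they sit directly between two letters.

```python
# Assumption: a tiny subset of the UAX #29 MidLetter / MidNumLet characters.
MID_LETTERS = frozenset({".", "'", "\u2019", ":"})


def initcap_sketch(text: str, mid_letters: frozenset = frozenset()) -> str:
    """Upper-case the first character of each word and lower-case the rest.

    With an empty mid_letters set, every non-alphanumeric character ends a
    word (the rule described for libc). With MID_LETTERS, a listed character
    that sits directly between two letters does not end the word, which
    roughly approximates the UAX #29 behaviour seen with ICU in this thread.
    """
    out = []
    at_word_start = True
    for i, ch in enumerate(text):
        if ch.isalnum():
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
            continue
        out.append(ch)
        joins_letters = (
            ch in mid_letters
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        )
        if not joins_letters:
            at_word_start = True
    return "".join(out)


print(initcap_sketch("first.second"))               # First.Second
print(initcap_sketch("first.second", MID_LETTERS))  # First.second
print(initcap_sketch("joHn d.e."))                  # John D.E.
print(initcap_sketch("joHn d.e.", MID_LETTERS))     # John D.e.
```

The trailing period in `d.e.` still ends the word in the second variant, because no letter follows it. That is consistent with the `John D.e.` result reported for the ICU database above.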

A similar issue was briefly discussed in [2].

The suggested documentation patch is attached (versions for
REL_13_STABLE+ and for REL_12_STABLE only).

[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
[2]: /messages/by-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA@mail.gmail.com

Oleg Tselebrovskiy, Postgres Professional

Attachments:

v1-0001-string-functions.patch (text/x-diff, +5 -2)
v1-0002-string-functions-REL_12.patch (text/x-diff, +5 -2)
#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Oleg Tselebrovskiy (#1)
Re: Initcap works differently with different locale providers

Hi, Oleg!

On 25 Sep 2024, at 18:13, Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru> wrote:

[...]

I can confirm initcap works with libc and ICU as you stated. The documentation patch looks good to me. I’ve written a commit message. The REL_12_STABLE branch is not relevant anymore as it’s out of support. I’m going to push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v2-0001-Clarify-documentation-for-the-initcap-function.patch (application/octet-stream, +5 -3)
#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#2)
Re: Initcap works differently with different locale providers

On Mon, Jul 28, 2025 at 1:20 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

[...]

I'm sorry for so many messages. My email client just went crazy.
It should be fixed now.

------
Regards,
Alexander Korotkov
Supabase

#4Oleg Tselebrovskiy
o.tselebrovskiy@postgrespro.ru
In reply to: Alexander Korotkov (#3)
Re: Initcap works differently with different locale providers

Alexander Korotkov wrote at 2025-07-28 17:23:

[...]

The commit message looks good to me, and I have no objections to
ignoring REL_12_STABLE :)
Thank you!

Regards, Oleg Tselebrovskiy

#5Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#2)
Re: Initcap works differently with different locale providers

On Mon, 2025-07-28 at 13:20 +0300, Alexander Korotkov wrote:

I can confirm initcap works with libc and libicu as you stated. The
documentation patch looks good to me. I’ve written a commit message.
The REL_12_STABLE branch is not relevant anymore as it’s out of
support. I’m going to push this if no objections.

Apologies for the late review.

First, it doesn't mention the "builtin" provider, which uses the same
word break rules as libc.

Second, word boundaries can be complex, and I'm wondering if we should
not be so precise about what ICU does or doesn't do. For instance, ICU
has options like U_TITLECASE_ADJUST_TO_CASED,
U_TITLECASE_NO_BREAK_ADJUSTMENT, etc. [1], and I'm not sure exactly
which one of those we use.

I'd prefer that we try to explain that INITCAP() is intended for
convenient display, and the specific result should not be relied upon
(at least for ICU; maybe for all providers). If you want specific word
boundary rules, write your own function.
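
As an illustration of what "write your own function" could look like, here is a minimal Python sketch in which the application owns the word rule as a regular expression. The names `WORD`, `ALNUM_WORD` and `initcap_custom` are hypothetical, and the apostrophe rule is just one example of a choice an application might make. It is not what any PostgreSQL locale provider does.

```python
import re

# Hypothetical application-owned rule: a word is a run of letters that may
# continue across a single apostrophe (ASCII or U+2019).
WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

# For comparison: a word is any run of alphanumeric characters.
ALNUM_WORD = re.compile(r"[^\W_]+")


def initcap_custom(text: str, word: re.Pattern = WORD) -> str:
    """Capitalize every match of `word`; text between matches is unchanged."""
    return word.sub(lambda m: m.group(0).capitalize(), text)


print(initcap_custom("o'neil's PUB"))              # O'neil's Pub
print(initcap_custom("o'neil's PUB", ALNUM_WORD))  # O'Neil'S Pub
```

Because the pattern is explicit, the result does not depend on the locale provider or on the ICU version installed.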

Regards,
Jeff Davis

[1]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d

#6Oleg Tselebrovskiy
o.tselebrovskiy@postgrespro.ru
In reply to: Jeff Davis (#5)
Re: Initcap works differently with different locale providers

Jeff Davis wrote at 2025-07-31 02:58:

Apologies for the late reply to the review.

First, it doesn't mention the "builtin" provider, which uses the same
word break rules as libc.

Completely forgot about builtin provider in the first patch, my bad

Second, word boundaries can be complex, and I'm wondering if we should
not be so precise about what ICU does or doesn't do. For instance, ICU
has options like U_TITLECASE_ADJUST_TO_CASED,
U_TITLECASE_NO_BREAK_ADJUSTMENT, etc., and I'm not sure exactly
which one of those we use.

While [1] describes the default word boundary rules and could be useful
as a starting point, I agree that in reality it is probably more
complicated. I didn't find any place where
U_TITLECASE_ADJUST_TO_CASED and the like are set in non-test code, but
U_TITLECASE_ADJUST_TO_CASED was used as a default prior to ICU 60,
so initcap() will also behave differently depending on the ICU version.

I'd prefer that we try to explain that INITCAP() is intended for
convenient display, and the specific result should not be relied upon
(at least for ICU; maybe for all providers). If you want specific word
boundary rules, write your own function.

The first patch just adds this warning about not relying on the exact
result of initcap(). The second one is the same, but removes the "what
is a word" part, since it could be moot: we recommend writing custom
functions, so understanding what a word is isn't strictly needed. I'm
still on the fence about which patch is better, though.

Thoughts?

[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries

Regards, Oleg Tselebrovskiy

Attachments:

v2-0001-initcap-documentation.patch (text/x-diff, +11 -4)
v2-0002-initcap-documentation.patch (text/x-diff, +8 -5)
#7Jeff Davis
pgsql@j-davis.com
In reply to: Oleg Tselebrovskiy (#6)
Re: Initcap works differently with different locale providers

On Mon, 2025-08-04 at 12:30 +0700, Oleg Tselebrovskiy wrote:

First patch just adds this warning about not relying on initcap() exact
result. The second one is the same, but removes the part "what is a
word" since it could be moot because we recommend writing custom
functions, so understanding what is a word is not exactly needed. Still
on the fence about which patch is better, though

One more thing: we should also change it to "... to upper case (or
title case) and the rest to lower case...". Title case is for scripts
that have characters like 'Dž' (U+01C5).
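
The upper case versus title case distinction can be checked with Python's standard library. This is a minimal sketch that only shows the Unicode case mappings for this digraph, not how PostgreSQL implements them. The helper name `describe` is made up for the example.

```python
import unicodedata

lower_dz = "\u01c6"           # LATIN SMALL LETTER DZ WITH CARON
upper_dz = lower_dz.upper()   # full upper case: both letters capitalized
title_dz = lower_dz.title()   # title case: only the first letter capitalized


def describe(ch: str) -> str:
    """Return code point, general category and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}"


print(describe(upper_dz))  # U+01C4 Lu LATIN CAPITAL LETTER DZ WITH CARON
print(describe(title_dz))  # U+01C5 Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
```

Only a handful of characters have a title case form that differs from their upper case form, which is why the distinction rarely shows up in practice.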

Other than that I like the second version, which un-documents the
specific word boundary rules. I'll admit I'm not quite sure how people
use this function in practice, but I expect that it's mostly convenient
(or lazy) display.

Alexander, is there a reason you backported this change? I don't
normally backport doc improvements like this, but I'm not sure what
standard others use. The fact that it's on 7 branches makes me more
reluctant to commit these extra improvements on top. Can you take care
of these follow-up patches? Or, just revert the change and I can make
the improvements in master.

Regards,
Jeff Davis

#8Oleg Tselebrovskiy
o.tselebrovskiy@postgrespro.ru
In reply to: Jeff Davis (#7)
Re: Initcap works differently with different locale providers

Jeff Davis wrote at 2025-08-05 03:59:

One more thing: we should also change it to "... to upper case (or
title case) and the rest to lower case...". Title case is for scripts
that have characters like 'Dž' (U+01C5).

Done, based on the second version of the previous patch. Again, there
are two versions: the first one mentions digraphs, like 'Dž' (U+01C5),
and the second one doesn't. And again, I don't know which version is
better: title case without mentioning digraphs could be interpreted
as "don't capitalize articles and prepositions" or just "don't
capitalize articles", since the definition of "title case" is vague.
We have a "write your own function" clause, but still.

Maybe we should add an example of a digraph to the first patch to
make it more clear, if we go that path.

Attachments:

v3-0001-initcap-documentation.patch (text/x-diff, +9 -6)
v3-0002-initcap-documentation.patch (text/x-diff, +9 -6)
#9Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#7)
Re: Initcap works differently with different locale providers

On 04.08.25 22:59, Jeff Davis wrote:

On Mon, 2025-08-04 at 12:30 +0700, Oleg Tselebrovskiy wrote:

First patch just adds this warning about not relying on initcap() exact
result. The second one is the same, but removes the part "what is a
word" since it could be moot because we recommend writing custom
functions, so understanding what is a word is not exactly needed. Still
on the fence about which patch is better, though

One more thing: we should also change it to "... to upper case (or
title case) and the rest to lower case...". Title case is for scripts
that have characters like 'Dž' (U+01C5).

Other than that I like the second version, which un-documents the
specific word boundary rules. I'll admit I'm not quite sure how people
use this function in practice, but I expect that it's mostly convenient
(or lazy) display.

It's meant to be an Oracle-compatible function, so maybe someone can
check there for some details.

https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/INITCAP.html

I think we should try to document the behavior more precisely. But we
probably first have to agree what it should be.

Alexander, is there a reason you backported this change? I don't
normally backport doc improvements like this, but I'm not sure what
standard others use. The fact that it's on 7 branches makes me more
reluctant to commit these extra improvements on top. Can you take care
of these follow-up patches? Or, just revert the change and I can make
the improvements in master.

Yes, I was not in favor of backpatching this, since it was not a bug
fix. And it turns out it was incomplete. I think we should revert all
the backpatches and iterate on getting the documentation the way we want
in master.

#10Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Eisentraut (#9)
Re: Initcap works differently with different locale providers

On Wed, Aug 6, 2025 at 2:44 PM Peter Eisentraut <peter@eisentraut.org> wrote:

[...]

Yes, I was not in favor of backpatching this, since it was not a bug
fix. And it turns out it was incomplete. I think we should revert all
the backpatches and iterate on getting the documentation the way we want
in master.

Got it. Sorry for the confusion. I'll revert the patches from the back
branches and then continue to work on the subject for master.

------
Regards,
Alexander Korotkov
Supabase

#11Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#9)
Re: Initcap works differently with different locale providers

On Wed, 2025-08-06 at 13:44 +0200, Peter Eisentraut wrote:

It's meant to be an Oracle-compatible function, so maybe someone can
check there for some details.

If it's purely a compatibility function, then using ICU's sophisticated
word break iterator doesn't make sense.

https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/INITCAP.html

I think we should try to document the behavior more precisely.

I don't think ICU purely follows Unicode on this point (does it?), so
we'd have to point to the ICU documentation.

But we probably first have to agree what it should be.

I still don't fully understand the use case here. I've used the
function a few times to assemble a few strings into a page heading, but
that was some time ago so I don't even clearly remember my use case. It
seems plausible there are quite a few people doing something similar,
and they'd benefit from ICU's more sophisticated approach.

But if the primary use case is for compatibility, then we might be
trying too hard to make this a provider-specific feature.

Yes, I was not in favor of backpatching this, since it was not a bug
fix. And it turns out it was incomplete. I think we should revert all
the backpatches and iterate on getting the documentation the way we want
in master.

+1.

Regards,
Jeff Davis

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#11)
Re: Initcap works differently with different locale providers

On Wed, Aug 6, 2025 at 9:21 PM Jeff Davis <pgsql@j-davis.com> wrote:

Yes, I was not in favor of backpatching this, since it was not a bug
fix. And it turns out it was incomplete. I think we should revert all
the backpatches and iterate on getting the documentation the way we want
in master.

+1.

Done, reverted everywhere except master.

------
Regards,
Alexander Korotkov
Supabase

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#5)
Re: Initcap works differently with different locale providers

On Wed, Jul 30, 2025 at 10:58 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2025-07-28 at 13:20 +0300, Alexander Korotkov wrote:

I can confirm initcap works with libc and libicu as you stated. The
documentation patch looks good to me. I’ve written a commit message.
The REL_12_STABLE branch is not relevant anymore as it’s out of
support. I’m going to push this if no objections.

Apologies for the late review.

First, it doesn't mention the "builtin" provider, which uses the same
word break rules as libc.

Second, word boundaries can be complex, and I'm wondering if we should
not be so precise about what ICU does or doesn't do. For instance, ICU
has options like U_TITLECASE_ADJUST_TO_CASED,
U_TITLECASE_NO_BREAK_ADJUSTMENT, etc.[1], and I'm not sure exactly
which one of those we use.

I think none of these options is used, because options could be
processed by ucasemap_toTitle() [1] while we use u_strToTitle() [2]
which takes no options.

Links
1. https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html#aa49d8b403bd91c52f127fe80679bac11
2. https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ustring_8h.html#a47602e2c2012d77ee91908b9bbfdc063

------
Regards,
Alexander Korotkov
Supabase

#14Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#11)
Re: Initcap works differently with different locale providers

On Wed, Aug 6, 2025 at 9:21 PM Jeff Davis <pgsql@j-davis.com> wrote:

[...]

I'd like to propose a new version of the patch. It specifies the
behavior for each locale provider, including the particular ICU
function. It also keeps the note from upthread that initcap() is
intended for display convenience. What do you think about that?

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v4-0001-Further-clarify-documentation-for-the-initcap-fun.patch (application/octet-stream, +13 -7)
#15Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#14)
Re: Initcap works differently with different locale providers

On Mon, 2025-08-18 at 00:44 +0300, Alexander Korotkov wrote:

I'd like to propose a new version of patch.

Nit: it only uses the title case in ICU or builtin PG_UNICODE_FAST.

Regards,
Jeff Davis

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#15)
Re: Initcap works differently with different locale providers

On Thu, Aug 21, 2025 at 12:55 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2025-08-18 at 00:44 +0300, Alexander Korotkov wrote:

I'd like to propose a new version of patch.

Nit: it only uses the title case in ICU or builtin PG_UNICODE_FAST.

Corrected, thank you. Any objections if I push this?

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v5-0001-Further-clarify-documentation-for-the-initcap-fun.patch (application/octet-stream, +15 -7)