Move defaults toward ICU in 16?

Started by Jeff Davis, over 3 years ago, 53 messages, hackers

pgsql@j-davis.com

over 3 years ago

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

If so, is version 16 the right time to adjust defaults to favor ICU?

* At build time, default to --with-icu (-Dicu=enabled); users who
don't want ICU can specify --without-icu (-Dicu=disabled/auto)
* At initdb time, default to --locale-provider=icu if built with
ICU support
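The proposed initdb behavior can be sketched as a small decision function. This is a Python illustration only, not initdb's code; the name `choose_locale_provider` and its parameters are hypothetical, and the accepted values are just the two providers named in this thread.

```python
from typing import Optional


def choose_locale_provider(built_with_icu: bool,
                           requested: Optional[str] = None) -> str:
    """Pick a locale provider under the proposed defaults (hypothetical sketch).

    An explicit --locale-provider value always wins; otherwise ICU becomes
    the default only when the server was built with ICU support.
    """
    if requested is not None:
        if requested not in ("icu", "libc"):
            raise ValueError(f"unknown locale provider: {requested}")
        if requested == "icu" and not built_with_icu:
            raise ValueError("ICU requested, but the server was built without ICU")
        return requested
    return "icu" if built_with_icu else "libc"


print(choose_locale_provider(built_with_icu=True))
print(choose_locale_provider(built_with_icu=False))
print(choose_locale_provider(built_with_icu=True, requested="libc"))
```

The point of the sketch is that a build without ICU keeps today's behavior, and that users can still opt out explicitly.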

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Regards,
Jeff Davis

robertmhaas@gmail.com

over 3 years ago

In reply to: Jeff Davis (#1)

Re: Move defaults toward ICU in 16?

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

--
Robert Haas
EDB: http://www.enterprisedb.com

pgsql@j-davis.com

over 3 years ago

In reply to: Robert Haas (#2)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

Yes:

* ICU is more featureful: e.g. it supports case-insensitive collations (the
citext docs suggest looking at ICU instead).
* It's faster: a simple non-contrived sort is something like 70%
faster[1] than one using glibc.
* It can provide consistent semantics across platforms.
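To make the first point concrete, here is a minimal sketch in plain Python (not ICU, and not how PostgreSQL implements collations) contrasting code point order with a case-insensitive comparison. `str.casefold` is only a rough stand-in for a case-insensitive ICU collation, and `equal_ignoring_case` is a hypothetical helper.

```python
words = ["banana", "Apple", "apple", "Banana"]

# Code point order: every uppercase ASCII letter sorts before any lowercase one.
by_code_point = sorted(words)

# Case-insensitive order: compare case-folded keys; ties keep their input order.
case_insensitive = sorted(words, key=str.casefold)


def equal_ignoring_case(a: str, b: str) -> bool:
    """Rough stand-in for equality under a case-insensitive collation."""
    return a.casefold() == b.casefold()


print(by_code_point)
print(case_insensitive)
print(equal_ignoring_case("Apple", "apple"))
```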

I believe the above reasons are enough to call ICU "better", but it
also seems like a better path for addressing/mitigating collator
versioning problems:

* Easier for users to control what library version is available on
their system. We can also ask packagers to keep some old versions of
ICU available for an extended period of time.
* If one of the ICU multilib patches makes it in, it will be easier
for users to select which of the library versions Postgres will use.
* Reports versions for individual collators, distinct from the
library version.

The biggest disadvantage (rather, the flip side of its advantages) is
that it's a separate dependency. Will ICU still be maintained in 10
years or will we end up stuck maintaining it ourselves? Then again,
we've already been shipping it, so I don't know if we can avoid that
problem entirely now even if we wanted to.

I don't mean that ICU solves all of our problems -- far from it. But
you asked if I think it's better, and my answer is yes.

Regards,
Jeff Davis

[1]: /messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com

thomas.munro@gmail.com

over 3 years ago

In reply to: Jeff Davis (#3)

Re: Move defaults toward ICU in 16?

On Fri, Feb 3, 2023 at 5:31 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

Yes:

* ICU is more featureful: e.g. it supports case-insensitive collations (the
citext docs suggest looking at ICU instead).
* It's faster: a simple non-contrived sort is something like 70%
faster[1] than one using glibc.
* It can provide consistent semantics across platforms.

+1

* Easier for users to control what library version is available on
their system. We can also ask packagers to keep some old versions of
ICU available for an extended period of time.
* If one of the ICU multilib patches makes it in, it will be easier
for users to select which of the library versions Postgres will use.
* Reports versions for individual collators, distinct from the
library version.

+1

The biggest disadvantage (rather, the flip side of its advantages) is
that it's a separate dependency. Will ICU still be maintained in 10
years or will we end up stuck maintaining it ourselves? Then again,
we've already been shipping it, so I don't know if we can avoid that
problem entirely now even if we wanted to.

It has a pretty special status, with an absolutely enormous amount of
technology depending on it.

http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html
https://unicode.org/consortium/consort.html
https://home.unicode.org/membership/members/
https://home.unicode.org/about-unicode/

I mean, who knows what the future holds, but ultimately what we're
doing here is taking the de facto reference implementation of the
Unicode collation algorithm. Are Unicode and the consortium still
going to be here in 10 years? We're all in on Unicode, and it's also
tangled up with ISO standards, as are parts of the collation stuff.
Sure, there could be a clean-room implementation that replaces it in
some sense (just as there is a Java implementation) but it would very
likely be "the same" because the real thing we're buying here is the
set of algorithms and data maintenance that the whole industry has
agreed on.

Unless Britain decides to exit the Latin alphabet, terminate
membership of ISO and revert to anglo-saxon runes with a sort order
that is defined in the new constitution as "the opposite of whatever
Unicode says", it's hard to see obstacles to ICU's long term universal
applicability.

It's still important to have libc support as an option, though,
because it's a totally reasonable thing to want sort order to agree
with the "sort" command on the same host, and you are willing to deal
with all the complexities that we're trying to escape.

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#4)

Re: Move defaults toward ICU in 16?

On Thu, Feb 2, 2023 at 2:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Sure, there could be a clean-room implementation that replaces it in
some sense (just as there is a Java implementation) but it would very
likely be "the same" because the real thing we're buying here is the
set of algorithms and data maintenance that the whole industry has
agreed on.

I don't think that a clean room implementation is implausible. They
seem to already exist, and be explicitly provided for by CLDR, which
is not joined at the hip to ICU:

https://github.com/elixir-cldr/cldr

Most of the value that we tend to think of as coming from ICU actually
comes from CLDR itself, as well as related Unicode Consortium and IETF
standards/RFCs such as BCP-47.

Unless Britain decides to exit the Latin alphabet, terminate
membership of ISO and revert to anglo-saxon runes with a sort order
that is defined in the new constitution as "the opposite of whatever
Unicode says", it's hard to see obstacles to ICU's long term universal
applicability.

It would have to literally be defined as "not unicode" for it to
present a real problem. A key goal of Unicode is to accommodate
political and cultural shifts, since even countries can come and go.
In principle Unicode should be able to accommodate just about any
change in preferences, except when there is an irreconcilable
difference of opinion among people that are from the same natural
language group. For example it can accommodate relatively minor
differences of opinion about how text should be sorted among groups
that each speak a regional dialect of the same language. Hardly
anybody even notices this.

Accommodating these variations can only come from making a huge
investment. Most of the work is actually done by natural language
scholars, not technologists. That effort is very unlikely to be
duplicated by some other group with its own conflicting goals. AFAICT
there is no great need for any schisms, since differences of opinion
can usually be accommodated under the umbrella of Unicode.

--
Peter Geoghegan

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#4)

Re: Move defaults toward ICU in 16?

Thomas Munro <thomas.munro@gmail.com> writes:

It's still important to have libc support as an option, though,
because it's a totally reasonable thing to want sort order to agree
with the "sort" command on the same host, and you are willing to deal
with all the complexities that we're trying to escape.

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

regards, tom lane

pgsql@j-davis.com

over 3 years ago

In reply to: Tom Lane (#6)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

To get the default locale, the patch initializes a UCollator with NULL
for the locale name, and then queries it for the locale name. Then it's
converted to a language tag, which is consistent with the initial
collation import. I'm not sure that's the best way, but it seems
reasonable.

If it's a user-provided locale (--icu-locale=), then the patch leaves
it as-is, and does not convert it to a language tag (consistent with
CREATE COLLATION and CREATE DATABASE).
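The two cases above can be sketched as follows. This is a Python illustration of the decision only, not the patch itself: `to_language_tag` is a deliberately simplified, hypothetical stand-in for ICU's `uloc_toLanguageTag()` that handles only plain IDs such as `en_US`, and `resolve_icu_locale` is likewise a made-up name.

```python
from typing import Optional


def to_language_tag(locale_id: str) -> str:
    """Simplified stand-in for ICU's uloc_toLanguageTag().

    Assumption: only plain IDs like "en_US" are handled. The real function
    also deals with keywords, variants, and canonicalization.
    """
    if locale_id == "":
        return "und"  # the root locale is written "und" as a language tag
    return locale_id.replace("_", "-")


def resolve_icu_locale(user_locale: Optional[str], default_locale_id: str) -> str:
    """Return the ICU locale string to store (hypothetical sketch).

    A user-provided locale (--icu-locale=) is kept as-is; only the default
    locale obtained from the collator is converted to a language tag.
    """
    if user_locale is not None:
        return user_locale
    return to_language_tag(default_locale_id)


print(resolve_icu_locale(None, "en_US"))
print(resolve_icu_locale("en_US", "de_DE"))
```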

I opened another discussion about whether we want to try harder to
validate or canonicalize the locale name:

/messages/by-id/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#7)

Re: Move defaults toward ICU in 16?

On 2023-02-08 12:16:46 -0800, Jeff Davis wrote:

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

Unfortunately this fails widely on CI, with both compile time and runtime
issues:
https://cirrus-ci.com/build/5116408950947840

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#8)

Re: Move defaults toward ICU in 16?

On Wed, 2023-02-08 at 18:22 -0800, Andres Freund wrote:

On 2023-02-08 12:16:46 -0800, Jeff Davis wrote:

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

Unfortunately this fails widely on CI, with both compile time and
runtime

New patches attached.

0001: build defaults to requiring ICU
0002: initdb defaults to using ICU (if server built with ICU)

One CI test is failing: "Windows - Server 2019, VS 2019 - Meson &
ninja"; if I apply Andres' patch (
https://github.com/anarazel/postgres/commit/dde7c68 ), then it works.

I ran into one annoyance with pg_upgrade, which is that a v15 cluster
initialized with the defaults requires that the v16 cluster is
initialized with --locale-provider=libc, because otherwise the old and
new cluster will have mismatching template databases. Simple to fix
once you see the error, but I wonder how many initdb scripts might be
broken? I suppose it's just the cost of changing a default? Would an
environment variable help for cases where it's difficult to pass that
extra option down through a script?
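The mismatch described above amounts to a check like the one below. This is a hedged Python sketch, not pg_upgrade's actual code: `check_locale_providers` is a hypothetical helper, and the provider names simply mirror the `--locale-provider` values mentioned in this thread.

```python
def check_locale_providers(old_provider: str, new_provider: str) -> None:
    """Raise if the new cluster's template databases use a different provider.

    Hypothetical sketch: the values are "libc" or "icu", as accepted by
    initdb's --locale-provider option.
    """
    if old_provider != new_provider:
        raise RuntimeError(
            f"locale providers do not match: old cluster uses {old_provider!r}, "
            f"new cluster uses {new_provider!r}; "
            f"re-run initdb with --locale-provider={old_provider}"
        )


# A v15 cluster created with defaults uses libc; a v16 cluster created with
# the proposed defaults would use ICU, so the check fails until the new
# cluster is initialized with --locale-provider=libc.
try:
    check_locale_providers("libc", "icu")
except RuntimeError as err:
    print(err)
```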

I also considered posting another patch to change the default for
CREATE COLLATION, but there are a few issues I'm not sure about. Should
the default be based on whether ICU support is available? Or the
datlocprovider for the current database? And/or some kind of
compatibility GUC?

Notes on the tests I needed to fix, in case they are interesting or
point to some kind of larger problem:

* ecpg has a test that involves setting the client_encoding to LATIN1
which required a compatible server encoding so it was setting
ENCODING=SQL_ASCII, which ICU doesn't support. The ecpg test did not
look particularly sensitive to the locale, so I changed it to use
client_encoding=SQL_ASCII instead, so that the server encoding doesn't
matter.
* citext has a test involving Turkish characters, which works for all
libc locales, but in ICU the test only works in Turkish locales. I skip
the test if datlocprovider='i', because citext doesn't seem very
important in an ICU world.
* unaccent is broken if the database provider is ICU and LC_CTYPE=C,
because the t_isspace() (etc.) functions do not properly handle ICU.
Probably some other things are broken with that combination, but only
this test seems to exercise it. I just skipped the test for that broken
combination, but perhaps it should be fixed in the future.
* initdb was being built with ICU as a dependency in meson, but not
autoconf. I assume it's fine to link ICU into initdb, so I changed the
Makefile.
* I changed a couple tests to initialize with --locale-provider=libc.
They were testing that creating a database with the ICU provider but no
ICU locale fails, and that's easiest to test if the template is libc.
* The CI test CompilerWarnings:mingw_cross_warning was failing because
ICU is not available. I added --without-icu in the .cirrus.yml file and
it works.
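As background for the citext note above, Turkish case mapping is a common source of locale sensitivity. The sketch below uses plain Python, whose `str.upper()` and `str.lower()` apply locale-independent Unicode mappings; it does not reproduce citext or ICU behavior, and `turkish_upper` is a hypothetical helper covering only the two letters in question.

```python
DOTLESS_I = "\u0131"      # ı LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Locale-independent mappings: both "i" and dotless "ı" uppercase to "I",
# and "İ" lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
print("i".upper() == "I")
print(DOTLESS_I.upper() == "I")
print(DOTTED_CAP_I.lower() == "i\u0307")


def turkish_upper(text: str) -> str:
    """Uppercase using Turkish rules for i/ı only (hypothetical sketch).

    In Turkish, "i" uppercases to "İ" and "ı" uppercases to "I".
    """
    return text.replace("i", DOTTED_CAP_I).replace(DOTLESS_I, "I").upper()


print(turkish_upper("istanbul") == "\u0130STANBUL")
print("istanbul".upper() == "ISTANBUL")
```

Because the two rule sets disagree on "i" and "I", a case-insensitive comparison of such strings can give different answers depending on whether Turkish rules are in effect.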

--
Jeff Davis
PostgreSQL Contributor Team - AWS

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#9)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-10 16:17:00 -0800, Jeff Davis wrote:

One CI test is failing: "Windows - Server 2019, VS 2019 - Meson &
ninja"; if I apply Andres' patch (
https://github.com/anarazel/postgres/commit/dde7c68 ), then it works.

Until something like my patch above is made more generally applicable, I think
your patch should disable ICU on Windows. Can't just fail to build.

Perhaps we don't need to force ICU on with the meson build, given that
it defaults to auto-detection?

I ran into one annoyance with pg_upgrade, which is that a v15 cluster
initialized with the defaults requires that the v16 cluster is
initialized with --locale-provider=libc, because otherwise the old and
new cluster will have mismatching template databases. Simple to fix
once you see the error, but I wonder how many initdb scripts might be
broken? I suppose it's just the cost of changing a default? Would an
environment variable help for cases where it's difficult to pass that
extra option down through a script?

That seems problematic to me.

But, shouldn't pg_upgrade be able to deal with this? As long as the databases
are created with template0, we can create the collations at that point?

@@ -15323,7 +15311,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];

This stuff shouldn't be in here, it's due to a debian patched autoconf.

Greetings,

Andres Freund

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#1)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 05:13 -0800, Jeff Davis wrote:

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

One consideration here is security. Any vulnerability in ICU collation
routines could easily become a vulnerability in Postgres.

I looked at these lists:

https://www.cvedetails.com/vulnerability-list/vendor_id-17477/Icu-project.html
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=icu
https://unicode-org.atlassian.net/issues/?jql=labels%20%3D%20%22security%22
https://unicode-org.atlassian.net/issues/?jql=labels%20%3D%20%22was_sensitive%22

Here are the recent CVEs:

CVE-2021-30535 https://unicode-org.atlassian.net/browse/ICU-21587
CVE-2020-21913 https://unicode-org.atlassian.net/browse/ICU-20850
CVE-2020-10531 https://unicode-org.atlassian.net/browse/ICU-20958

But there are quite a few JIRAs that look concerning that don't have a
CVE assigned:

2021 https://unicode-org.atlassian.net/browse/ICU-21537
2021 https://unicode-org.atlassian.net/browse/ICU-21597
2021 https://unicode-org.atlassian.net/browse/ICU-21676
2021 https://unicode-org.atlassian.net/browse/ICU-21749

Not sure which of these are exploitable, and if they are, why they
don't have a CVE. If someone else finds more issues, please let me
know.

The good news is that the Chrome/Chromium projects are actively finding
and reporting issues.

I didn't look for comparable information about glibc, but I would guess
that exploitable memory errors in setlocale/strcoll are very rare,
otherwise it would be a security disaster for many projects.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#10)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-10 at 18:00 -0800, Andres Freund wrote:

Until something like my patch above is made more generally applicable,
I think your patch should disable ICU on Windows. Can't just fail to
build.

Perhaps we don't need to force ICU on with the meson build, given that
it defaults to auto-detection?

Done. I changed it back to 'auto', and tests pass.

But, shouldn't pg_upgrade be able to deal with this? As long as the
databases are created with template0, we can create the collations at
that point?

Are you saying that the upgraded cluster could have a different default
collation for the template databases than the original cluster?

That would be wrong to do, at least by default, but I could see it
being a useful option.

Or maybe I misunderstand what you're saying?

This stuff shouldn't be in here, it's due to a debian patched autoconf.

Removed, thank you.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#12)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-14 09:48:08 -0800, Jeff Davis wrote:

On Fri, 2023-02-10 at 18:00 -0800, Andres Freund wrote:

But, shouldn't pg_upgrade be able to deal with this? As long as the
databases are created with template0, we can create the collations at
that point?

Are you saying that the upgraded cluster could have a different default
collation for the template databases than the original cluster?

That would be wrong to do, at least by default, but I could see it
being a useful option.

Or maybe I misunderstand what you're saying?

I am saying that pg_upgrade should be able to deal with the difference. The
details of how to implement that don't matter that much.

FWIW, I don't think it matters much what collation template0 has, since we
allow changing the locale provider when using template0 as the template.

We could easily update template0, if we think that's necessary. But I don't
think it really is. As long as the newly created databases have the right
provider, I'd lean towards not touching template0. But whatever...

Greetings,

Andres Freund

Jonathan S. Katz

jkatz@postgresql.org

over 3 years ago

In reply to: Jeff Davis (#11)

Re: Move defaults toward ICU in 16?

On 2/13/23 8:11 PM, Jeff Davis wrote:

On Thu, 2023-02-02 at 05:13 -0800, Jeff Davis wrote:

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

One consideration here is security. Any vulnerability in ICU collation
routines could easily become a vulnerability in Postgres.

Would it be any different than a vulnerability in OpenSSL et al? I know
that's a general, nuanced question but it would be good to understand if
we are exposing ourselves to any more vulnerabilities. And would it be
any different than today, given people can build PG with libicu as is?

Continuing on $SUBJECT, I wanted to understand performance comparisons.
I saw your comments[1] in response to Robert's question, looked at your
benchmarks[2] and one that ICU ran on older versions[3]. It seems that
in general, users would see performance gains switching to ICU. The only
result in [3] that stood out to me was that the "ko_KR" collation
underperformed on a list of Korean names, but maybe that is better in
newer versions.

I agree with most of your points in [1]. The platform-consistent
behavior is a good point, especially with more PG deployments running on
different systems. While taking on a new dependency is a concern, ICU
was released in 1999[4], has an active community, and seems to follow
standards (i.e. the Unicode Consortium).

I do wonder about upgrades, beyond the ongoing work with pg_upgrade. I
think the logical methods (pg_dumpall, logical replication) should
generally be OK, but we should ensure we think of things that could go
wrong and how we'd answer them.

Based on the available data, I think it's OK to move towards ICU as the
default, or preferred, collation provider. I agree (for now) in not
taking a hard dependency on ICU.

Thanks,

Jonathan

[1]: /messages/by-id/b676252eeb57ab8da9dbb411d0ccace95caeda0a.camel@j-davis.com
[2]: /messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com
[3]: https://icu.unicode.org/charts/collation-icu4c48-glibc
[4]: https://en.wikipedia.org/wiki/International_Components_for_Unicode

pgsql@j-davis.com

over 3 years ago

In reply to: Jonathan S. Katz (#14)

Re: Move defaults toward ICU in 16?

On Tue, 2023-02-14 at 16:27 -0500, Jonathan S. Katz wrote:

Would it be any different than a vulnerability in OpenSSL et al?

In principle, no, but sometimes the details matter. I'm just trying to
add data to the discussion.

It seems that
in general, users would see performance gains switching to ICU.

That's great news, and consistent with my experience. I don't think it
should be a driving factor though. If the choice is between
platform-independent semantics (ICU) and performance,
platform independence should be the default.

I agree with most of your points in [1]. The platform-consistent
behavior is a good point, especially with more PG deployments running
on different systems.

Now I think semantics are the most important driver, being consistent
across platforms and based on some kind of trusted independent
organization that we can point to.

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consortium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

Many users have other systems in their data infrastructure, running on
a variety of platforms, and could (in theory) try to synchronize around
a common ICU version to avoid subtle bugs in their data pipeline.

Based on the available data, I think it's OK to move towards ICU as
the default, or preferred, collation provider. I agree (for now) in not
taking a hard dependency on ICU.

I count several favorable responses, so I'll take it that we (as a
community) are intending to change the default for build and initdb in
v16.

Robert expressed some skepticism[1], though I don't see an objection.
If I read his concerns correctly, he's mainly concerned with quality
issues like documentation, bugs, etc. I understand those concerns (I'm
the one that raised them), but they seem like the kind of issues that
one finds any time they dig into a dependency enough. "Setting our
sights very high"[1], to me, would just be ICU with a bit more rigorous
attention to quality issues.

[1]: /messages/by-id/CA+TgmoYmeGJaW=Py9tAZtrnCP+_Q+zRQthv=zn_HyA_nqEDM-A@mail.gmail.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

robertmhaas@gmail.com

over 3 years ago

In reply to: Jeff Davis (#15)

Re: Move defaults toward ICU in 16?

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consortium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

--
Robert Haas
EDB: http://www.enterprisedb.com

laurenz.albe@cybertec.at

over 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-16 at 15:05 +0530, Robert Haas wrote:

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consortium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

Unless I misunderstand, the lack of Windows support is not a matter
of principle and can be added later on, right?

I am in favor of changing the default. It might be good to add a section
to the documentation in "Server setup and operation" recommending that
if you go with the default choice of ICU, you should configure your
package manager not to upgrade the ICU library.

Yours,
Laurenz Albe

Jonathan S. Katz

jkatz@postgresql.org

over 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

On 2/16/23 4:35 AM, Robert Haas wrote:

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consortium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

This is one reason why we're discussing ICU as the "preferred default"
vs. "the default." While it may not completely eliminate platform
dependent behavior for collations, it takes a step forward.

And AIUI, it does sound like ICU is available on newer versions of
Windows[1].

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

We should continue to improve our documentation. Personally, I found the
biggest challenge was understanding how to set ICU locales / rules,
particularly for nondeterministic collations as it was challenging to
find where these were listed. I was able to overcome this with the
examples in our docs + blogs, but I agree it's an area we can continue
to improve upon.

Thanks,

Jonathan

[1]: https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Greetings,

Andres Freund

robertmhaas@gmail.com

over 3 years ago

In reply to: Andres Freund (#19)

Re: Move defaults toward ICU in 16?

On Thu, Feb 16, 2023 at 9:45 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Uh, I had the contrary impression from the discussion upthread, but it
sounds like I might be misunderstanding the situation?

--
Robert Haas
EDB: http://www.enterprisedb.com

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#20)

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Robert Haas (#20)

andres@anarazel.de

over 3 years ago

In reply to: Michael Paquier (#22)

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#13)

laurenz.albe@cybertec.at

over 3 years ago

In reply to: Michael Paquier (#22)

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#24)

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#26)

pavel.stehule@gmail.com

over 3 years ago

In reply to: Jeff Davis (#26)

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#27)

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#29)

pryzby@telsasoft.com

over 3 years ago

In reply to: Jeff Davis (#26)

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#30)

pgsql@j-davis.com

over 3 years ago

In reply to: Pavel Stehule (#28)

andres@anarazel.de

over 3 years ago

In reply to: Jeff Davis (#32)

pgsql@j-davis.com

over 3 years ago

In reply to: Andres Freund (#34)

pavel.stehule@gmail.com

over 3 years ago

In reply to: Jeff Davis (#33)

robertmhaas@gmail.com

over 3 years ago

In reply to: Jeff Davis (#26)

Peter Eisentraut

peter_e@gmx.net

over 3 years ago

In reply to: Jeff Davis (#33)

pgsql@j-davis.com

over 3 years ago

In reply to: Peter Eisentraut (#38)

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#35)

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#40)

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Jeff Davis (#41)

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#40)

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#43)

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#44)

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#45)

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#46)

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#47)

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Peter Eisentraut (#48)

pryzby@telsasoft.com

about 3 years ago

In reply to: Peter Eisentraut (#49)

pgsql@j-davis.com

about 3 years ago

In reply to: Justin Pryzby (#50)

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Jeff Davis (#51)

Jonathan S. Katz

jkatz@postgresql.org

about 3 years ago

In reply to: Tom Lane (#52)