pgsql: Use ICU by default at initdb time.

Started by Jeff Davis, about 3 years ago, 7 messages, hackers
#1 Jeff Davis
pgsql@j-davis.com

Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: /messages/by-id/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
Reviewed-by: Peter Eisentraut

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/27b62377b47f9e7bf58613608bc718c86ea91e91

Modified Files
--------------
contrib/citext/expected/citext_utf8.out | 9 +++-
contrib/citext/expected/citext_utf8_1.out | 9 +++-
contrib/citext/sql/citext_utf8.sql | 9 +++-
contrib/unaccent/expected/unaccent.out | 9 ++++
contrib/unaccent/expected/unaccent_1.out | 8 ++++
contrib/unaccent/sql/unaccent.sql | 11 +++++
doc/src/sgml/ref/initdb.sgml | 53 +++++++++++++--------
src/bin/initdb/Makefile | 4 +-
src/bin/initdb/initdb.c | 54 +++++++++++++++++++++-
src/bin/initdb/t/001_initdb.pl | 7 +--
src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
src/bin/scripts/t/020_createdb.pl | 2 +-
src/interfaces/ecpg/test/Makefile | 3 --
src/interfaces/ecpg/test/connect/test5.pgc | 2 +-
src/interfaces/ecpg/test/expected/connect-test5.c | 2 +-
.../ecpg/test/expected/connect-test5.stderr | 2 +-
src/interfaces/ecpg/test/meson.build | 1 -
src/test/icu/t/010_database.pl | 2 +-
18 files changed, 147 insertions(+), 42 deletions(-)

#2 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: pgsql: Use ICU by default at initdb time.

On Thu, 2023-03-09 at 19:11 +0000, Jeff Davis wrote:

> Use ICU by default at initdb time.

I'm seeing a failure on hoverfly:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2023-03-09%2021%3A51%3A45&stg=initdb-en_US.8859-15

That's because ICU always uses UTF-8 by default. ICU works just fine
with many other encodings; is there a reason it doesn't take it from
the environment just like for provider=libc?

Of course, we still need to default to UTF-8 when the encoding from the
environment isn't supported by ICU.
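
For readers unfamiliar with how the environment implies an encoding, here is a minimal sketch of the lookup on a POSIX system, using setlocale() and nl_langinfo(CODESET). The helper name codeset_from_locale is hypothetical and this is not the initdb code; PostgreSQL's own pg_get_encoding_from_locale() additionally maps the codeset name to a PostgreSQL encoding, which this sketch does not attempt.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only, not PostgreSQL code).
 *
 * Copy the codeset implied by the named locale into buf. An empty
 * name ("") means "take the locale from the environment".
 * Returns 0 on success, -1 if the locale cannot be set or buf is
 * too small to hold the codeset name.
 */
int
codeset_from_locale(const char *name, char *buf, size_t buflen)
{
    const char *current;
    char       *saved;
    const char *codeset;
    int         rc = -1;

    assert(name != NULL && buf != NULL && buflen > 0);

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, name) != NULL)
    {
        /* The codeset name is platform-specific, e.g. "UTF-8". */
        codeset = nl_langinfo(CODESET);
        if (strlen(codeset) < buflen)
        {
            strcpy(buf, codeset);
            rc = 0;
        }
    }

    /* Put the previous locale back before returning. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return rc;
}
```

Calling codeset_from_locale("", buf, sizeof buf) reports whatever the environment's LC_CTYPE implies. The exact spelling of the returned name varies between platforms, so a real implementation needs a mapping table on top of it.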

Patch attached. Requires a few test fixups to adapt.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v1-0001-initdb-obtain-encoding-from-environment-by-defaul.patch (text/x-patch; charset=UTF-8), +13 -21

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#2)
Re: pgsql: Use ICU by default at initdb time.

On 10.03.23 03:26, Jeff Davis wrote:

> That's because ICU always uses UTF-8 by default. ICU works just fine
> with many other encodings; is there a reason it doesn't take it from
> the environment just like for provider=libc?

I think originally the locale forced the encoding. With ICU, we have a
choice. We could either stick to the encoding suggested by the OS, or
pick our own.

Arguably, if we are going to nudge toward ICU, maybe we should nudge
toward UTF-8 as well.

#4 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#3)
Re: pgsql: Use ICU by default at initdb time.

On Fri, 2023-03-10 at 10:59 +0100, Peter Eisentraut wrote:

> I think originally the locale forced the encoding. With ICU, we have a
> choice. We could either stick to the encoding suggested by the OS, or
> pick our own.

We still need LC_COLLATE and LC_CTYPE to match the database encoding
though. If we get those from the environment (which are connected to an
encoding), then I think we need to get the encoding from the
environment, too, right?

> Arguably, if we are going to nudge toward ICU, maybe we should nudge
> toward UTF-8 as well.

The OSes are already doing a pretty good job of that. Regardless, we
first need to remove the dependence on LC_CTYPE and LC_COLLATE when the
provider is ICU (we're close to that point but not quite there).

Regards,
Jeff Davis

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#4)
Re: pgsql: Use ICU by default at initdb time.

On 10.03.23 15:38, Jeff Davis wrote:

> On Fri, 2023-03-10 at 10:59 +0100, Peter Eisentraut wrote:
>
>> I think originally the locale forced the encoding. With ICU, we have a
>> choice. We could either stick to the encoding suggested by the OS, or
>> pick our own.
>
> We still need LC_COLLATE and LC_CTYPE to match the database encoding
> though. If we get those from the environment (which are connected to an
> encoding), then I think we need to get the encoding from the
> environment, too, right?

Yes, of course. So we can't really do what I was thinking of.

#6 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#5)
Re: pgsql: Use ICU by default at initdb time.

On Fri, 2023-03-10 at 15:48 +0100, Peter Eisentraut wrote:

> Yes, of course. So we can't really do what I was thinking of.

OK, I plan to commit something like the patch in this thread soon. I
just need to add an explanatory comment.

It passes CI, but it's possible that there could be more buildfarm
failures that I'll need to look at afterward, so I'll count this as a
"trial fix".

Regards,
Jeff Davis

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#6)
Re: pgsql: Use ICU by default at initdb time.

On Fri, 2023-03-10 at 07:48 -0800, Jeff Davis wrote:

> On Fri, 2023-03-10 at 15:48 +0100, Peter Eisentraut wrote:
>
>> Yes, of course. So we can't really do what I was thinking of.
>
> OK, I plan to commit something like the patch in this thread soon. I
> just need to add an explanatory comment.

Committed a slightly narrower fix that derives the default encoding the
same way for both libc and ICU, except that ICU still uses UTF-8 for
C/POSIX/--no-locale (because ICU doesn't work with SQL_ASCII).
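
As a rough illustration of that rule (not the committed initdb.c code), the decision can be sketched as below. All names here (LocaleProvider, is_c_or_posix, default_encoding) are hypothetical, and the sketch assumes that the libc provider ends up with SQL_ASCII for the C and POSIX locales and that the encoding implied by the locale has already been looked up by the caller.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; not PostgreSQL's actual API. */
typedef enum
{
    PROVIDER_LIBC,
    PROVIDER_ICU
} LocaleProvider;

/* True for the locale names that carry no encoding information. */
int
is_c_or_posix(const char *locale)
{
    return strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0;
}

/*
 * Pick a default encoding name for the case where -E was not given.
 *
 * locale_encoding is the encoding the locale implies, e.g. "LATIN9"
 * for an ISO 8859-15 locale. It is passed in rather than looked up
 * so the sketch stays self-contained.
 */
const char *
default_encoding(LocaleProvider provider, const char *locale,
                 const char *locale_encoding)
{
    assert(locale != NULL && locale_encoding != NULL);

    /*
     * C/POSIX is the one case where the providers differ: ICU does
     * not work with SQL_ASCII, so it falls back to UTF-8 instead.
     */
    if (is_c_or_posix(locale))
        return provider == PROVIDER_ICU ? "UTF8" : "SQL_ASCII";

    /* Otherwise both providers follow the locale's encoding. */
    return locale_encoding;
}
```

Under these assumptions, an ISO 8859-15 environment like the one in the hoverfly failure would yield LATIN9 for either provider, while --no-locale would yield UTF8 for ICU and SQL_ASCII for libc.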

That seemed more consistent with the comments around
pg_get_encoding_from_locale() and it was also easier to document the -E
switch in initdb.

I'll keep an eye on the buildfarm to see if this fixes the problem or
causes other issues. But it seems like the right change.

Regards,
Jeff Davis