pgsql: Use ICU by default at initdb time.
Use ICU by default at initdb time.
If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.
Discussion: /messages/by-id/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
Reviewed-by: Peter Eisentraut
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/27b62377b47f9e7bf58613608bc718c86ea91e91
Modified Files
--------------
contrib/citext/expected/citext_utf8.out | 9 +++-
contrib/citext/expected/citext_utf8_1.out | 9 +++-
contrib/citext/sql/citext_utf8.sql | 9 +++-
contrib/unaccent/expected/unaccent.out | 9 ++++
contrib/unaccent/expected/unaccent_1.out | 8 ++++
contrib/unaccent/sql/unaccent.sql | 11 +++++
doc/src/sgml/ref/initdb.sgml | 53 +++++++++++++--------
src/bin/initdb/Makefile | 4 +-
src/bin/initdb/initdb.c | 54 +++++++++++++++++++++-
src/bin/initdb/t/001_initdb.pl | 7 +--
src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
src/bin/scripts/t/020_createdb.pl | 2 +-
src/interfaces/ecpg/test/Makefile | 3 --
src/interfaces/ecpg/test/connect/test5.pgc | 2 +-
src/interfaces/ecpg/test/expected/connect-test5.c | 2 +-
.../ecpg/test/expected/connect-test5.stderr | 2 +-
src/interfaces/ecpg/test/meson.build | 1 -
src/test/icu/t/010_database.pl | 2 +-
18 files changed, 147 insertions(+), 42 deletions(-)
On Thu, 2023-03-09 at 19:11 +0000, Jeff Davis wrote:
Use ICU by default at initdb time.
I'm seeing a failure on hoverfly:
That's because ICU always uses UTF-8 by default. ICU works just fine
with many other encodings; is there a reason it doesn't take it from
the environment just like for provider=libc?
Of course, we still need to default to UTF-8 when the encoding from the
environment isn't supported by ICU.
Patch attached. Requires a few test fixups to adapt.
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
v1-0001-initdb-obtain-encoding-from-environment-by-defaul.patchtext/x-patch; charset=UTF-8; name=v1-0001-initdb-obtain-encoding-from-environment-by-defaul.patchDownload
From d1e101df09e4485976949dd47c9505dbda1de071 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 9 Mar 2023 16:00:45 -0800
Subject: [PATCH v1] initdb: obtain encoding from environment by default for
ICU.
The libc provider already did so, this just makes ICU
consistent. Previously, if the provider was ICU, initdb defaulted to
UTF-8, which might be in conflict with the locale from the
environment.
Per buildfarm failure on system "hoverfly" related to commit
27b62377b4.
---
contrib/unaccent/meson.build | 2 +-
src/bin/initdb/initdb.c | 12 +++++-------
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 15 +++++----------
src/bin/scripts/t/020_createdb.pl | 2 +-
src/test/icu/t/010_database.pl | 2 +-
5 files changed, 13 insertions(+), 20 deletions(-)
diff --git a/contrib/unaccent/meson.build b/contrib/unaccent/meson.build
index 613dd0be22..284d34ee29 100644
--- a/contrib/unaccent/meson.build
+++ b/contrib/unaccent/meson.build
@@ -37,6 +37,6 @@ tests += {
'sql': [
'unaccent',
],
- 'regress_args': ['--encoding=UTF8'],
+ 'regress_args': ['--encoding=UTF8', '--no-locale'],
},
}
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index bf88cd2439..04a6d58377 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2346,18 +2346,16 @@ setup_locale_encoding(void)
lc_time);
}
- if (!encoding && locale_provider == COLLPROVIDER_ICU)
- {
- encodingid = PG_UTF8;
- printf(_("The default database encoding has been set to \"%s\".\n"),
- pg_encoding_to_char(encodingid));
- }
- else if (!encoding)
+ if (!encoding)
{
int ctype_enc;
ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
+ if (locale_provider == COLLPROVIDER_ICU &&
+ (ctype_enc == -1 || !is_encoding_supported_by_icu(ctype_enc)))
+ ctype_enc = PG_UTF8;
+
if (ctype_enc == -1)
{
/* Couldn't recognize the locale's codeset */
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 1b5df730e9..90669f3c6d 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -108,7 +108,7 @@ if ($oldnode->pg_version >= 11)
my $original_encoding = "6"; # UTF-8
my $original_provider = "c";
-my $original_collate = "C";
+my $original_locale = "C";
my $original_iculocale = "";
my $provider_field = "'c' AS datlocprovider";
my $iculocale_field = "NULL AS daticulocale";
@@ -123,7 +123,7 @@ if ($oldnode->pg_version >= 15 && $ENV{with_icu} eq 'yes')
my @initdb_params = @custom_opts;
push @initdb_params, ('--encoding', 'UTF-8');
-push @initdb_params, ('--lc-collate', $original_collate);
+push @initdb_params, ('--locale', $original_locale);
if ($original_provider eq "i")
{
push @initdb_params, ('--locale-provider', 'icu');
@@ -136,16 +136,12 @@ $oldnode->start;
my $result;
$result = $oldnode->safe_psql(
- 'postgres', "SELECT encoding, $provider_field, datcollate, $iculocale_field
+ 'postgres', "SELECT encoding, $provider_field, datcollate, datctype, $iculocale_field
FROM pg_database WHERE datname='template0'");
-is($result, "$original_encoding|$original_provider|$original_collate|$original_iculocale",
+is($result, "$original_encoding|$original_provider|$original_locale|$original_locale|$original_iculocale",
"check locales in original cluster"
);
-# check ctype, which was acquired from environment by initdb
-my $original_ctype = $oldnode->safe_psql(
- 'postgres', q{SELECT datctype FROM pg_database WHERE datname='template0'});
-
# The default location of the source code is the root of this directory.
my $srcdir = abs_path("../../..");
@@ -224,7 +220,6 @@ my $newnode = PostgreSQL::Test::Cluster->new('new_node');
# cluster.
push @initdb_params, ('--encoding', 'SQL_ASCII');
push @initdb_params, ('--locale-provider', 'libc');
-push @initdb_params, ('--lc-ctype', 'C');
$node_params{extra} = \@initdb_params;
$newnode->init(%node_params);
@@ -401,7 +396,7 @@ if (-d $log_path)
$result = $newnode->safe_psql(
'postgres', "SELECT encoding, $provider_field, datcollate, datctype, $iculocale_field
FROM pg_database WHERE datname='template0'");
-is($result, "$original_encoding|$original_provider|$original_collate|$original_ctype|$original_iculocale",
+is($result, "$original_encoding|$original_provider|$original_locale|$original_locale|$original_iculocale",
"check that locales in new cluster match original cluster"
);
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 8ec58cdd64..af3b1492e3 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -41,7 +41,7 @@ if ($ENV{with_icu} eq 'yes')
[
'createdb', '-T',
'template0', '-E', 'UTF8', '--locale-provider=icu',
- '--icu-locale=en', 'foobar5'
+ '--locale=C', '--icu-locale=en', 'foobar5'
],
qr/statement: CREATE DATABASE foobar5 .* LOCALE_PROVIDER icu ICU_LOCALE 'en'/,
'create database with ICU locale specified');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 45d77c319a..715b1bffd6 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -54,7 +54,7 @@ b),
# Test error cases in CREATE DATABASE involving locale-related options
my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu TEMPLATE template0 ENCODING UTF8});
+ q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
isnt($ret, 0,
"ICU locale must be specified for ICU provider: exit code not 0");
like(
--
2.34.1
On 10.03.23 03:26, Jeff Davis wrote:
That's because ICU always uses UTF-8 by default. ICU works just fine
with many other encodings; is there a reason it doesn't take it from
the environment just like for provider=libc?
I think originally the locale forced the encoding. With ICU, we have a
choice. We could either stick to the encoding suggested by the OS, or
pick our own.
Arguably, if we are going to nudge toward ICU, maybe we should nudge
toward UTF-8 as well.
On Fri, 2023-03-10 at 10:59 +0100, Peter Eisentraut wrote:
I think originally the locale forced the encoding. With ICU, we have
a
choice. We could either stick to the encoding suggested by the OS,
or
pick our own.
We still need LC_COLLATE and LC_CTYPE to match the database encoding
though. If we get those from the environment (which are connected to an
encoding), then I think we need to get the encoding from the
environment, too, right?
Arguably, if we are going to nudge toward ICU, maybe we should nudge
toward UTF-8 as well.
The OSes are already doing a pretty good job of that. Regardless, we
need to remove the dependence on LC_CTYPE and LC_COLLATE when the
provider is ICU first (we're close to that point but not quite there).
Regards,
Jeff Davis
On 10.03.23 15:38, Jeff Davis wrote:
On Fri, 2023-03-10 at 10:59 +0100, Peter Eisentraut wrote:
I think originally the locale forced the encoding. With ICU, we have
a
choice. We could either stick to the encoding suggested by the OS,
or
pick our own.We still need LC_COLLATE and LC_CTYPE to match the database encoding
though. If we get those from the environment (which are connected to an
encoding), then I think we need to get the encoding from the
environment, too, right?
Yes, of course. So we can't really do what I was thinking of.
On Fri, 2023-03-10 at 15:48 +0100, Peter Eisentraut wrote:
Yes, of course. So we can't really do what I was thinking of.
OK, I plan to commit something like the patch in this thread soon. I
just need to add an explanatory comment.
It passes CI, but it's possible that there could be more buildfarm
failures that I'll need to look at afterward, so I'll count this as a
"trial fix".
Regards,
Jeff Davis
On Fri, 2023-03-10 at 07:48 -0800, Jeff Davis wrote:
On Fri, 2023-03-10 at 15:48 +0100, Peter Eisentraut wrote:
Yes, of course. So we can't really do what I was thinking of.
OK, I plan to commit something like the patch in this thread soon. I
just need to add an explanatory comment.
Committed a slightly narrower fix that derives the default encoding the
same way for both libc and ICU; except that ICU still uses UTF-8 for
C/POSIX/--no-locale (because ICU doesn't work with SQL_ASCII).
That seemed more consistent with the comments around
pg_get_encoding_from_locale() and it was also easier to document the -E
switch in initdb.
I'll keep an eye on the buildfarm to see if this fixes the problem or
causes other issues. But it seems like the right change.
Regards,
Jeff Davis