pgsql: Add option to use ICU as global locale provider

Started by Peter Eisentrautalmost 4 years ago26 messages
#1Peter Eisentraut
peter@eisentraut.org

Add option to use ICU as global locale provider

This adds the option to use ICU as the default locale provider for
either the whole cluster or a database. New options for initdb,
createdb, and CREATE DATABASE are used to select this.

Since some (legacy) code still uses the libc locale facilities
directly, we still need to set the libc global locale settings even if
ICU is otherwise selected. So pg_database now has three
locale-related fields: the existing datcollate and datctype, which are
always set, and a new daticulocale, which is only set if ICU is
selected. A similar change is made in pg_collation for consistency,
but in that case, only the libc-related fields or the ICU-related
field is set, never both.

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: /messages/by-id/5e756dd6-0e91-d778-96fd-b1bcb06c161a@2ndquadrant.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/f2553d43060edb210b36c63187d52a632448e1d2

Modified Files
--------------
doc/src/sgml/catalogs.sgml | 9 ++
doc/src/sgml/charset.sgml | 102 ++++++++++++++++
doc/src/sgml/ref/create_database.sgml | 32 +++++
doc/src/sgml/ref/createdb.sgml | 19 +++
doc/src/sgml/ref/initdb.sgml | 72 +++++++++---
src/backend/catalog/pg_collation.c | 18 ++-
src/backend/commands/collationcmds.c | 97 +++++++++------
src/backend/commands/dbcommands.c | 157 +++++++++++++++++++++----
src/backend/utils/adt/pg_locale.c | 144 ++++++++++++++---------
src/backend/utils/init/postinit.c | 21 +++-
src/bin/initdb/Makefile | 4 +-
src/bin/initdb/initdb.c | 97 +++++++++++++--
src/bin/initdb/t/001_initdb.pl | 27 +++++
src/bin/pg_dump/pg_dump.c | 30 ++++-
src/bin/pg_upgrade/check.c | 13 ++
src/bin/pg_upgrade/info.c | 18 ++-
src/bin/pg_upgrade/pg_upgrade.h | 2 +
src/bin/psql/describe.c | 23 +++-
src/bin/psql/tab-complete.c | 3 +-
src/bin/scripts/Makefile | 2 +
src/bin/scripts/createdb.c | 20 ++++
src/bin/scripts/t/020_createdb.pl | 28 +++++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_collation.dat | 3 +-
src/include/catalog/pg_collation.h | 20 +++-
src/include/catalog/pg_database.dat | 4 +-
src/include/catalog/pg_database.h | 6 +
src/include/utils/pg_locale.h | 5 +
src/test/Makefile | 6 +-
src/test/icu/.gitignore | 2 +
src/test/icu/Makefile | 25 ++++
src/test/icu/README | 27 +++++
src/test/icu/t/010_database.pl | 58 +++++++++
src/test/regress/expected/collate.icu.utf8.out | 10 +-
src/test/regress/sql/collate.icu.utf8.sql | 8 +-
35 files changed, 947 insertions(+), 167 deletions(-)

#2Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#1)
Re: pgsql: Add option to use ICU as global locale provider

Hi Peter,

On Thu, Mar 17, 2022 at 10:22:32AM +0000, Peter Eisentraut wrote:

Add option to use ICU as global locale provider

This adds the option to use ICU as the default locale provider for
either the whole cluster or a database. New options for initdb,
createdb, and CREATE DATABASE are used to select this.

Since some (legacy) code still uses the libc locale facilities
directly, we still need to set the libc global locale settings even if
ICU is otherwise selected. So pg_database now has three
locale-related fields: the existing datcollate and datctype, which are
always set, and a new daticulocale, which is only set if ICU is
selected. A similar change is made in pg_collation for consistency,
but in that case, only the libc-related fields or the ICU-related
field is set, never both.

FYI, prion is complaining here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&amp;dt=2022-03-18%2001%3A43%3A13

Some details:
# Failed test 'fails for invalid ICU locale: matches'
# at t/001_initdb.pl line 107.
# '2022-03-18 01:54:58.563 UTC [504] FATAL: could
# not open collator for locale "@colNumeric=lower":
# U_ILLEGAL_ARGUMENT_ERROR
--
Michael

#3Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#2)
Re: pgsql: Add option to use ICU as global locale provider

Hi,

On Fri, Mar 18, 2022 at 11:01:11AM +0900, Michael Paquier wrote:

FYI, prion is complaining here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&amp;dt=2022-03-18%2001%3A43%3A13

Some details:
# Failed test 'fails for invalid ICU locale: matches'
# at t/001_initdb.pl line 107.
# '2022-03-18 01:54:58.563 UTC [504] FATAL: could
# not open collator for locale "@colNumeric=lower":
# U_ILLEGAL_ARGUMENT_ERROR

That's very strange, apparently initdb doesn't detect any problem when checking
ucol_open() for initial checks, since it's expecting:

# doesn't match '(?^:initdb: error: could not open collator for locale)'

but then postgres in single-backend mode does detect the problem, with the
exact same check, so it's not like --icu-locale=@colNumeric=lower wasn't
correctly interpreted. Unfortunately we don't have the full initdb output, so
we can't check what setup_locale_encoding reported. The only difference is
that in initdb's setlocale(), the result of ucol_open is discarded, maybe the
compiler is optimizing it away for some reason, even though it seems really
unlikely.

That being said, we could save the result and explicitly close the collator.
That wouldn't make much difference in initdb (but may be a bit cleaner), but I
see that there's a similar coding in createdb(), which seems like it could leak
some memory according to ucol_close man page.

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#3)
Re: pgsql: Add option to use ICU as global locale provider

On Fri, Mar 18, 2022 at 4:12 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Mar 18, 2022 at 11:01:11AM +0900, Michael Paquier wrote:

FYI, prion is complaining here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&amp;dt=2022-03-18%2001%3A43%3A13

Some details:
# Failed test 'fails for invalid ICU locale: matches'
# at t/001_initdb.pl line 107.
# '2022-03-18 01:54:58.563 UTC [504] FATAL: could
# not open collator for locale "@colNumeric=lower":
# U_ILLEGAL_ARGUMENT_ERROR

That's very strange, apparently initdb doesn't detect any problem when checking
ucol_open() for initial checks, since it's expecting:

# doesn't match '(?^:initdb: error: could not open collator for locale)'

but then postgres in single-backend mode does detect the problem, with the
exact same check, so it's not like --icu-locale=@colNumeric=lower wasn't
correctly interpreted. Unfortunately we don't have the full initdb output, so
we can't check what setup_locale_encoding reported. The only difference is
that in initdb's setlocale(), the result of ucol_open is discarded, maybe the
compiler is optimizing it away for some reason, even though it seems really
unlikely.

That being said, we could save the result and explicitly close the collator.
That wouldn't make much difference in initdb (but may be a bit cleaner), but I
see that there's a similar coding in createdb(), which seems like it could leak
some memory according to ucol_close man page.

No idea what's happening here but one observation is that that animal
is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6
may hold a clue...

#5Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#4)
Re: pgsql: Add option to use ICU as global locale provider

On Fri, Mar 18, 2022 at 06:15:47PM +1300, Thomas Munro wrote:

No idea what's happening here but one observation is that that animal
is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6
may hold a clue...

Right. I'm setting up a similar podman environment, hopefully more info soon.

#6Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#5)
Re: pgsql: Add option to use ICU as global locale provider

On Fri, Mar 18, 2022 at 02:36:48PM +0800, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 06:15:47PM +1300, Thomas Munro wrote:

No idea what's happening here but one observation is that that animal
is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6
may hold a clue...

Right. I'm setting up a similar podman environment, hopefully more info soon.

And indeed b8f9a2a6 is the problem. We would need some form of
icu_set_collation_attributes() on the frontend side if we want to detect such a
problem on older ICU version at the expected moment rather than when
bootstrapping the info. A similar check is also needed in createdb().

I was thinking that this could be the cause of problem reported by Andres on
centos 7 (which seems to ship ICU 50), but postinit.c calls
make_icu_collator(), which sets the attribute as expected. Maybe it's because
old ICU version simply don't understand locale ID like "en-u-kf-upper" and
silently falls back to the root collator?

#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#6)
1 attachment(s)
Re: pgsql: Add option to use ICU as global locale provider

(moving to -hackers)

On Fri, Mar 18, 2022 at 03:40:51PM +0800, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 02:36:48PM +0800, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 06:15:47PM +1300, Thomas Munro wrote:

No idea what's happening here but one observation is that that animal
is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6
may hold a clue...

Right. I'm setting up a similar podman environment, hopefully more info soon.

And indeed b8f9a2a6 is the problem. We would need some form of
icu_set_collation_attributes() on the frontend side if we want to detect such a
problem on older ICU version at the expected moment rather than when
bootstrapping the info. A similar check is also needed in createdb().

I'm attaching a patch that fixes both issues for me with ICU 50. Note that
there's already a test that would have failed for CREATE DATABASE if initdb
tests didn't fail first, so no new test needed.

I ended up copy/pasting icu_set_collation_attributes() in initdb.c. There
shouldn't be new attributes added in old ICU versions, and there are enough
differences to make it work in the frontend that it didn't seems worth to have
a single function.

Attachments:

v1-0001-Fix-global-icu-collations-for-ICU-54.patchtext/plain; charset=us-asciiDownload
From 52ae34ed5413f3274bc6c9a96847ec762f6e388d Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 18 Mar 2022 16:20:01 +0800
Subject: [PATCH v1] Fix global icu collations for ICU < 54

Multiple call sites didn't check for collation attributes validity, which has
to be done explicitly on ICU < 54.
---
 src/backend/commands/dbcommands.c |   9 +--
 src/backend/utils/adt/pg_locale.c |  24 +++++++
 src/bin/initdb/initdb.c           | 109 +++++++++++++++++++++++++++++-
 src/include/utils/pg_locale.h     |   1 +
 4 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e1af642e4..104e45b66d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -460,14 +460,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
 #ifdef USE_ICU
-		UErrorCode  status;
-
-		status = U_ZERO_ERROR;
-		ucol_open(dbiculocale, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							dbiculocale, u_errorName(status))));
+		check_icu_locale(dbiculocale);
 #else
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 4019255f8e..bc0a038168 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1854,6 +1854,9 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * function's result is the number of bytes generated (not counting nul).
  *
  * The result string is nul-terminated.
+ *
+ * If you modify this function, modify accordingly the similar frontend
+ * function in initdb.c
  */
 int32_t
 icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
@@ -1884,6 +1887,27 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
 	return len_result;
 }
 
+/*
+ * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
+ */
+void
+check_icu_locale(const char *icu_locale)
+{
+	UCollator  *collator;
+	UErrorCode  status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not open collator for locale \"%s\": %s",
+						icu_locale, u_errorName(status))));
+
+	if (U_ICU_VERSION_MAJOR_NUM < 54)
+		icu_set_collation_attributes(collator, icu_locale);
+	ucol_close(collator);
+}
+
 /*
  * Parse collation attributes and apply them to the open collator.  This takes
  * a string like "und@colStrength=primary;colCaseLevel=yes" and parses and
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index cbcd55288f..ee6243560b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -282,6 +282,10 @@ static int	locale_date_order(const char *locale);
 static void check_locale_name(int category, const char *locale,
 							  char **canonname);
 static bool check_locale_encoding(const char *locale, int encoding);
+#ifdef USE_ICU
+static void icu_set_collation_attributes(UCollator *collator,
+										 const char *loc);
+#endif
 static void setlocales(void);
 static void usage(const char *progname);
 void		setup_pgdata(void);
@@ -2145,6 +2149,105 @@ check_locale_encoding(const char *locale, int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+/*
+ * Frontend version of the same function in pg_locale.c.
+ *
+ * If you modify this function, modify accordingly the similar backend function
+ * in pg_locale.c
+ */
+static void
+icu_set_collation_attributes(UCollator *collator, const char *loc)
+{
+	char	   *str = pg_strdup(loc);
+
+	str = strchr(str, '@');
+	if (!str)
+		return;
+	str++;
+
+	for (char *token = strtok(str, ";"); token; token = strtok(NULL, ";"))
+	{
+		char	   *e = strchr(token, '=');
+
+		if (e)
+		{
+			char	   *name;
+			char	   *value;
+			UColAttribute uattr;
+			UColAttributeValue uvalue;
+			UErrorCode	status;
+
+			status = U_ZERO_ERROR;
+
+			*e = '\0';
+			name = token;
+			value = e + 1;
+
+			/*
+			 * See attribute name and value lists in ICU i18n/coll.cpp
+			 */
+			if (pg_strcasecmp(name, "colstrength") == 0)
+				uattr = UCOL_STRENGTH;
+			else if (pg_strcasecmp(name, "colbackwards") == 0)
+				uattr = UCOL_FRENCH_COLLATION;
+			else if (pg_strcasecmp(name, "colcaselevel") == 0)
+				uattr = UCOL_CASE_LEVEL;
+			else if (pg_strcasecmp(name, "colcasefirst") == 0)
+				uattr = UCOL_CASE_FIRST;
+			else if (pg_strcasecmp(name, "colalternate") == 0)
+				uattr = UCOL_ALTERNATE_HANDLING;
+			else if (pg_strcasecmp(name, "colnormalization") == 0)
+				uattr = UCOL_NORMALIZATION_MODE;
+			else if (pg_strcasecmp(name, "colnumeric") == 0)
+				uattr = UCOL_NUMERIC_COLLATION;
+			else
+				/* ignore if unknown */
+				continue;
+
+			if (pg_strcasecmp(value, "primary") == 0)
+				uvalue = UCOL_PRIMARY;
+			else if (pg_strcasecmp(value, "secondary") == 0)
+				uvalue = UCOL_SECONDARY;
+			else if (pg_strcasecmp(value, "tertiary") == 0)
+				uvalue = UCOL_TERTIARY;
+			else if (pg_strcasecmp(value, "quaternary") == 0)
+				uvalue = UCOL_QUATERNARY;
+			else if (pg_strcasecmp(value, "identical") == 0)
+				uvalue = UCOL_IDENTICAL;
+			else if (pg_strcasecmp(value, "no") == 0)
+				uvalue = UCOL_OFF;
+			else if (pg_strcasecmp(value, "yes") == 0)
+				uvalue = UCOL_ON;
+			else if (pg_strcasecmp(value, "shifted") == 0)
+				uvalue = UCOL_SHIFTED;
+			else if (pg_strcasecmp(value, "non-ignorable") == 0)
+				uvalue = UCOL_NON_IGNORABLE;
+			else if (pg_strcasecmp(value, "lower") == 0)
+				uvalue = UCOL_LOWER_FIRST;
+			else if (pg_strcasecmp(value, "upper") == 0)
+				uvalue = UCOL_UPPER_FIRST;
+			else
+				status = U_ILLEGAL_ARGUMENT_ERROR;
+
+			if (status == U_ZERO_ERROR)
+				ucol_setAttribute(collator, uattr, uvalue, &status);
+
+			/*
+			 * Pretend the error came from ucol_open(), for consistent error
+			 * message across ICU versions.
+			 */
+			if (U_FAILURE(status))
+			{
+				pg_log_error("could not open collator for locale \"%s\": %s",
+							 loc, u_errorName(status));
+				exit(1);
+			}
+		}
+	}
+}
+#endif
+
 /*
  * set up the locale variables
  *
@@ -2209,16 +2312,20 @@ setlocales(void)
 		 */
 #ifdef USE_ICU
 		{
+			UCollator  *collator;
 			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
-			ucol_open(icu_locale, &status);
+			collator = ucol_open(icu_locale, &status);
 			if (U_FAILURE(status))
 			{
 				pg_log_error("could not open collator for locale \"%s\": %s",
 							 icu_locale, u_errorName(status));
 				exit(1);
 			}
+			if (U_ICU_VERSION_MAJOR_NUM < 54)
+				icu_set_collation_attributes(collator, icu_locale);
+			ucol_close(collator);
 		}
 #else
 		pg_log_error("ICU is not supported in this build");
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 9b158f24a0..297d14d918 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -115,6 +115,7 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern void check_icu_locale(const char *icu_locale);
 #endif
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
-- 
2.33.1

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Julien Rouhaud (#3)
Re: pgsql: Add option to use ICU as global locale provider

Julien Rouhaud <rjuju123@gmail.com> writes:

That being said, we could save the result and explicitly close the collator.
That wouldn't make much difference in initdb (but may be a bit cleaner), but I
see that there's a similar coding in createdb(), which seems like it could leak
some memory according to ucol_close man page.

FYI, I verified using valgrind that (as of HEAD) there is a leak
when creating a database with ICU collation that doesn't appear
when creating one with libc collation. It's not a lot, a few
hundred bytes per iteration, but it's there.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: pgsql: Add option to use ICU as global locale provider

I found a different problem with src/test/icu/: it fails altogether
if the prevailing locale is "C", because then the database encoding
defaults to SQL_ASCII which our ICU code won't cope with. I'm not
sure if that explains any of the buildfarm failures, but it broke
my local build (yeah, I'm that guy). I un-broke it for the moment
by forcing the test to use UTF8 encoding, but do we want to do
anything smarter than that?

regards, tom lane

#10Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Julien Rouhaud (#7)
Re: pgsql: Add option to use ICU as global locale provider

On 18.03.22 10:27, Julien Rouhaud wrote:

I'm attaching a patch that fixes both issues for me with ICU 50. Note that
there's already a test that would have failed for CREATE DATABASE if initdb
tests didn't fail first, so no new test needed.

I ended up copy/pasting icu_set_collation_attributes() in initdb.c. There
shouldn't be new attributes added in old ICU versions, and there are enough
differences to make it work in the frontend that it didn't seems worth to have
a single function.

Another option is that we just don't do the check in initdb. As the
tests show, you will then get an error from the backend call, so it's
really just a question of when the error is reported.

Why does your patch introduce a function check_icu_locale() that is only
called once? Did you have further plans for that?

#11Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Tom Lane (#9)
Re: pgsql: Add option to use ICU as global locale provider

On 18.03.22 18:29, Tom Lane wrote:

I found a different problem with src/test/icu/: it fails altogether
if the prevailing locale is "C", because then the database encoding
defaults to SQL_ASCII which our ICU code won't cope with. I'm not
sure if that explains any of the buildfarm failures, but it broke
my local build (yeah, I'm that guy). I un-broke it for the moment
by forcing the test to use UTF8 encoding, but do we want to do
anything smarter than that?

This is an appropriate solution, I think.

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#10)
Re: pgsql: Add option to use ICU as global locale provider

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

Another option is that we just don't do the check in initdb. As the
tests show, you will then get an error from the backend call, so it's
really just a question of when the error is reported.

+1 ... seems better to not have two copies of the code.

regards, tom lane

#13Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#10)
Re: pgsql: Add option to use ICU as global locale provider

Hi,

On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:

Why does your patch introduce a function check_icu_locale() that is only
called once? Did you have further plans for that?

I like that it moves ICU code out of dbcommands.c - imo there should be few
calls to ICU functions outside of pg_locale.c. There might be an argument for
moving *more* into such a function though.

Greetings,

Andres Freund

#14Julien Rouhaud
rjuju123@gmail.com
In reply to: Tom Lane (#12)
Re: pgsql: Add option to use ICU as global locale provider

On Fri, Mar 18, 2022 at 04:04:10PM -0400, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

Another option is that we just don't do the check in initdb. As the
tests show, you will then get an error from the backend call, so it's
really just a question of when the error is reported.

+1 ... seems better to not have two copies of the code.

Ok, I also prefer to not have two copies of the code but wasn't sure that
having the error in the boostrapping phase was ok. I will change that.

#15Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#13)
1 attachment(s)
Re: pgsql: Add option to use ICU as global locale provider

On Fri, Mar 18, 2022 at 03:09:59PM -0700, Andres Freund wrote:

Hi,

On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:

Why does your patch introduce a function check_icu_locale() that is only
called once? Did you have further plans for that?

I like that it moves ICU code out of dbcommands.c

Yes, it seemed cleaner this way. But more importantly code outside pg_locale.c
really shouldn't have to deal with ICU specific version code.

I'm attaching a v2, addressing Peter and Tom comments to not duplicate the old
ICU versions attribute function. I removed the ICU locale check entirely (for
consistency across ICU version) thus removing any need for ucol.h include in
initdb.

For the problem you reported at [1]/messages/by-id/20220318000140.vzri3qw3p4aebn5p@alap3.anarazel.de with the meson branch, I changed createdb
tests with s/en-u-kf-upper/en@colCaseFirst=upper/, as older ICU versions don't
understand the former notation. check-world now pass for me, using either ICU
< 54 or >= 54.

imo there should be few
calls to ICU functions outside of pg_locale.c. There might be an argument for
moving *more* into such a function though.

I think it would be a good improvement. I can work on that next week if
needed.

[1]: /messages/by-id/20220318000140.vzri3qw3p4aebn5p@alap3.anarazel.de

Attachments:

v2-0001-Fix-global-icu-collations-for-ICU-54.patchtext/plain; charset=us-asciiDownload
From f3884b483884a4e39b577dc01b72bab5176964bb Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 18 Mar 2022 16:20:01 +0800
Subject: [PATCH v2] Fix global icu collations for ICU < 54

createdb() didn't check for collation attributes validity, which has
to be done explicitly on ICU < 54.  It also forgot to close the ICU collator
opened during the check which leaks some memory.

To fix both, add a new check_icu_locale() that does all the appropriate
verification and close the ICU collator.

initdb also had some partial check for ICU < 54.  To have consistent error
reporting across major ICU versions, and get rid of the need to include ucol.h,
remove the partial check there.  The backend will report an error if needed
during the post-boostrap iniitialization phase.
---
 src/backend/commands/dbcommands.c |  9 +--------
 src/backend/utils/adt/pg_locale.c | 21 +++++++++++++++++++++
 src/bin/initdb/initdb.c           | 22 +++-------------------
 src/bin/initdb/t/001_initdb.pl    |  2 +-
 src/include/utils/pg_locale.h     |  1 +
 src/test/icu/t/010_database.pl    |  4 ++--
 6 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 962e11dd8f..15b6eb8e0e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -460,14 +460,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
 #ifdef USE_ICU
-		UErrorCode  status;
-
-		status = U_ZERO_ERROR;
-		ucol_open(dbiculocale, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							dbiculocale, u_errorName(status))));
+		check_icu_locale(dbiculocale);
 #else
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 4019255f8e..af69f2729e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1884,6 +1884,27 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
 	return len_result;
 }
 
+/*
+ * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
+ */
+void
+check_icu_locale(const char *icu_locale)
+{
+	UCollator  *collator;
+	UErrorCode  status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not open collator for locale \"%s\": %s",
+						icu_locale, u_errorName(status))));
+
+	if (U_ICU_VERSION_MAJOR_NUM < 54)
+		icu_set_collation_attributes(collator, icu_locale);
+	ucol_close(collator);
+}
+
 /*
  * Parse collation attributes and apply them to the open collator.  This takes
  * a string like "und@colStrength=primary;colCaseLevel=yes" and parses and
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index cbcd55288f..5e36943ef3 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -55,10 +55,6 @@
 #include <signal.h>
 #include <time.h>
 
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
 #ifdef HAVE_SHM_OPEN
 #include "sys/mman.h"
 #endif
@@ -2205,22 +2201,10 @@ setlocales(void)
 		}
 
 		/*
-		 * Check ICU locale ID
+		 * In supported builds, the ICU locale ID will be checked by the
+		 * backend when performing the post-boostrap initialization.
 		 */
-#ifdef USE_ICU
-		{
-			UErrorCode	status;
-
-			status = U_ZERO_ERROR;
-			ucol_open(icu_locale, &status);
-			if (U_FAILURE(status))
-			{
-				pg_log_error("could not open collator for locale \"%s\": %s",
-							 icu_locale, u_errorName(status));
-				exit(1);
-			}
-		}
-#else
+#ifndef USE_ICU
 		pg_log_error("ICU is not supported in this build");
 		fprintf(stderr, _("You need to rebuild PostgreSQL using %s.\n"), "--with-icu");
 		exit(1);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index c636bf3ab2..a3397777cf 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -105,7 +105,7 @@ if ($ENV{with_icu} eq 'yes')
 		'option --icu-locale');
 
 	command_fails_like(['initdb', '--no-sync', '--locale-provider=icu', '--icu-locale=@colNumeric=lower', "$tempdir/dataX"],
-		qr/initdb: error: could not open collator for locale/,
+		qr/FATAL:  could not open collator for locale/,
 		'fails for invalid ICU locale');
 }
 else
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 9b158f24a0..297d14d918 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -115,6 +115,7 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern void check_icu_locale(const char *icu_locale);
 #endif
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index d50941b53d..07a1084b09 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -16,11 +16,11 @@ $node1->init;
 $node1->start;
 
 $node1->safe_psql('postgres',
-	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'en-u-kf-upper' ENCODING 'UTF8' TEMPLATE template0});
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'en@colCaseFirst=upper' ENCODING 'UTF8' TEMPLATE template0});
 
 $node1->safe_psql('dbicu',
 q{
-CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
+CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');
 CREATE TABLE icu (def text, en text COLLATE "en-x-icu", upfirst text COLLATE upperfirst);
 INSERT INTO icu VALUES ('a', 'a', 'a'), ('b', 'b', 'b'), ('A', 'A', 'A'), ('B', 'B', 'B');
 });
-- 
2.33.1

#16Christoph Berg
myon@debian.org
In reply to: Peter Eisentraut (#1)
Re: pgsql: Add option to use ICU as global locale provider

Re: Peter Eisentraut

Since some (legacy) code still uses the libc locale facilities
directly, we still need to set the libc global locale settings even if
ICU is otherwise selected. So pg_database now has three
locale-related fields: the existing datcollate and datctype, which are
always set, and a new daticulocale, which is only set if ICU is
selected. A similar change is made in pg_collation for consistency,
but in that case, only the libc-related fields or the ICU-related
field is set, never both.

Since the intended usage seems to be that databases should either be
using libc, or the ICU locales, but probably not both at the same
time, does it make sense to clutter the already very wide `psql -l`
output with two new extra columns?

This hardly fits in normal-size terminals:

=# \l
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ ICU Locale │ Locale Provider │ Access privileges
───────────┼───────┼──────────┼────────────┼────────────┼────────────┼─────────────────┼───────────────────
postgres │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │
template0 │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │ =c/myon ↵
│ │ │ │ │ │ │ myon=CTc/myon
template1 │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │ =c/myon ↵
│ │ │ │ │ │ │ myon=CTc/myon
(3 rows)

(Even longer if the username is "postgres")

It also makes \l+ even harder to read when the most often only
relevant new column, the database size, is even more to the far right.

Couldn't that be a single "Locale" column, possibly extended by more
info in parentheses if the values differ?

Locale
de_DE.utf8
de-x-icu-whatever
de_DE.utf8 (Ctype: C.UTF-8)
SQL_ASCII (ICU Locale: en-x-something)

Christoph

#17Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Julien Rouhaud (#15)
Re: pgsql: Add option to use ICU as global locale provider

On 19.03.22 05:14, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 03:09:59PM -0700, Andres Freund wrote:

Hi,

On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:

Why does your patch introduce a function check_icu_locale() that is only
called once? Did you have further plans for that?

I like that it moves ICU code out of dbcommands.c

Yes, it seemed cleaner this way. But more importantly code outside pg_locale.c
really shouldn't have to deal with ICU specific version code.

I'm attaching a v2, addressing Peter and Tom comments to not duplicate the old
ICU versions attribute function. I removed the ICU locale check entirely (for
consistency across ICU version) thus removing any need for ucol.h include in
initdb.

committed

#18Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Christoph Berg (#16)
Re: pgsql: Add option to use ICU as global locale provider

On 19.03.22 18:53, Christoph Berg wrote:

Re: Peter Eisentraut

Since some (legacy) code still uses the libc locale facilities
directly, we still need to set the libc global locale settings even if
ICU is otherwise selected. So pg_database now has three
locale-related fields: the existing datcollate and datctype, which are
always set, and a new daticulocale, which is only set if ICU is
selected. A similar change is made in pg_collation for consistency,
but in that case, only the libc-related fields or the ICU-related
field is set, never both.

Since the intended usage seems to be that databases should either be
using libc, or the ICU locales, but probably not both at the same
time, does it make sense to clutter the already very wide `psql -l`
output with two new extra columns?

Good point, let me think about that.

#19Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#17)
Re: pgsql: Add option to use ICU as global locale provider

On Sun, Mar 20, 2022 at 11:03:38AM +0100, Peter Eisentraut wrote:

On 19.03.22 05:14, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 03:09:59PM -0700, Andres Freund wrote:

Hi,

On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:

Why does your patch introduce a function check_icu_locale() that is only
called once? Did you have further plans for that?

I like that it moves ICU code out of dbcommands.c

Yes, it seemed cleaner this way. But more importantly code outside pg_locale.c
really shouldn't have to deal with ICU specific version code.

I'm attaching a v2, addressing Peter and Tom comments to not duplicate the old
ICU versions attribute function. I removed the ICU locale check entirely (for
consistency across ICU version) thus removing any need for ucol.h include in
initdb.

committed

Thanks!

#20Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#17)
Re: pgsql: Add option to use ICU as global locale provider

Hi,

On 2022-03-20 11:03:38 +0100, Peter Eisentraut wrote:

committed

Thanks. Rebasing over that fixed the meson Centos 7 build in my meson
tree. https://cirrus-ci.com/build/5265480968568832

Greetings,

Andres Freund

#21Christoph Berg
myon@debian.org
In reply to: Peter Eisentraut (#18)
Re: pgsql: Add option to use ICU as global locale provider

Re: Peter Eisentraut

Since the intended usage seems to be that databases should either be
using libc, or the ICU locales, but probably not both at the same
time, does it make sense to clutter the already very wide `psql -l`
output with two new extra columns?

Good point, let me think about that.

A possible solution might be to rip out all the locale columns except
"Encoding" from \l, and leave them in place for \l+.

For \l+, I'd suggest moving the database size and the tablespace to
the front, after owner.

Christoph

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christoph Berg (#21)
Re: pgsql: Add option to use ICU as global locale provider

Christoph Berg <myon@debian.org> writes:

A possible solution might be to rip out all the locale columns except
"Encoding" from \l, and leave them in place for \l+.

I'd rather see a single column summarizing the locale situation.
Perhaps it could be COALESCE(daticulocale, datcollate), or
something using a CASE on datlocprovider?
Then \l+ could replace that with all the underlying columns.

For \l+, I'd suggest moving the database size and the tablespace to
the front, after owner.

I think it's confusing if the + and non-+ versions of a command
present their columns in inconsistent orders. I'm not dead set
against this, but -0.5 or so.

regards, tom lane

#23Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#16)
Inconsistent "ICU Locale" output on older server versions

Re: To Peter Eisentraut

This hardly fits in normal-size terminals:

=# \l
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ ICU Locale │ Locale Provider │ Access privileges
───────────┼───────┼──────────┼────────────┼────────────┼────────────┼─────────────────┼───────────────────
postgres │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │
template0 │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │ =c/myon ↵
│ │ │ │ │ │ │ myon=CTc/myon
template1 │ myon │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ │ libc │ =c/myon ↵
│ │ │ │ │ │ │ myon=CTc/myon
(3 rows)

Another gripe here: The above is the output when run against a PG15
cluster, created without an ICU locale set.

When running psql 15 against PG 14, the output is this:

$ psql -l
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ ICU Locale │ Locale Provider │ Access privileges
───────────┼──────────┼──────────┼────────────┼────────────┼────────────┼─────────────────┼───────────────────────
postgres │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │
template0 │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │ =c/postgres ↵
│ │ │ │ │ │ │ postgres=CTc/postgres
template1 │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │ =c/postgres ↵
│ │ │ │ │ │ │ postgres=CTc/postgres
(3 rows)

The "ICU Locale" column is now populated, that seems wrong.

The problem is in the else branch in src/bin/psql/describe.c around
line 900:

+   if (pset.sversion >= 150000)
+       appendPQExpBuffer(&buf,
+                         "       d.daticulocale as \"%s\",\n"
+                         "       CASE d.datlocprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\
+                         gettext_noop("ICU Locale"),
+                         gettext_noop("Locale Provider"));
+   else
+       appendPQExpBuffer(&buf,
+                         "       d.datcollate as \"%s\",\n"  <--- there
+                         "       'libc' AS \"%s\",\n",
+                         gettext_noop("ICU Locale"),
+                         gettext_noop("Locale Provider"));

I'd think this should rather be

+ " '' as \"%s\",\n"

Christoph

#24Euler Taveira
euler@eulerto.com
In reply to: Christoph Berg (#23)
Re: Inconsistent "ICU Locale" output on older server versions

On Fri, Apr 15, 2022, at 11:58 AM, Christoph Berg wrote:

When running psql 15 against PG 14, the output is this:

$ psql -l
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ ICU Locale │ Locale Provider │ Access privileges
───────────┼──────────┼──────────┼────────────┼────────────┼────────────┼─────────────────┼───────────────────────
postgres │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │
template0 │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │ =c/postgres ↵
│ │ │ │ │ │ │ postgres=CTc/postgres
template1 │ postgres │ UTF8 │ de_DE.utf8 │ de_DE.utf8 │ de_DE.utf8 │ libc │ =c/postgres ↵
│ │ │ │ │ │ │ postgres=CTc/postgres
(3 rows)

The "ICU Locale" column is now populated, that seems wrong.

Good catch!

The problem is in the else branch in src/bin/psql/describe.c around
line 900:

+   if (pset.sversion >= 150000)
+       appendPQExpBuffer(&buf,
+                         "       d.daticulocale as \"%s\",\n"
+                         "       CASE d.datlocprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\
+                         gettext_noop("ICU Locale"),
+                         gettext_noop("Locale Provider"));
+   else
+       appendPQExpBuffer(&buf,
+                         "       d.datcollate as \"%s\",\n"  <--- there
+                         "       'libc' AS \"%s\",\n",
+                         gettext_noop("ICU Locale"),
+                         gettext_noop("Locale Provider"));

I'd think this should rather be

+ " '' as \"%s\",\n"

Since dataiculocale allows NULL, my suggestion is to use NULL instead of an
empty string. It is consistent with a cluster whose locale provider is libc.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Euler Taveira (#24)
Re: Inconsistent "ICU Locale" output on older server versions

"Euler Taveira" <euler@eulerto.com> writes:

On Fri, Apr 15, 2022, at 11:58 AM, Christoph Berg wrote:

When running psql 15 against PG 14, the output is this:
The "ICU Locale" column is now populated, that seems wrong.

Good catch!

Indeed.

Since dataiculocale allows NULL, my suggestion is to use NULL instead of an
empty string. It is consistent with a cluster whose locale provider is libc.

Yeah, I agree. We should make the pre-v15 output match what you'd see
if looking at a non-ICU v15 database.

regards, tom lane

#26Christoph Berg
myon@debian.org
In reply to: Tom Lane (#22)
psql -l and locales (Re: pgsql: Add option to use ICU as global locale provider)

Re: Tom Lane

Christoph Berg <myon@debian.org> writes:

A possible solution might be to rip out all the locale columns except
"Encoding" from \l, and leave them in place for \l+.

I'd rather see a single column summarizing the locale situation.
Perhaps it could be COALESCE(daticulocale, datcollate), or
something using a CASE on datlocprovider?
Then \l+ could replace that with all the underlying columns.

Fwiw I still think the default psql -l output should be more concise.
Any chance to have that happen for PG15?

I can try creating a patch if it has chances of getting through.

Christoph