CREATE DATABASE command for non-libc providers
From the discussion here:
/messages/by-id/CAFCRh--rtqbOBpJYFDmPD9kYCYxsxKpLW7LHxYMYhHXa2XoStw@mail.gmail.com
the CREATE DATABASE command has a tendency to throw errors in confusing
ways when using non-libc providers. I have attached a patch 0001 that
fixes a misleading hint, but it's still not great.
When using ICU or the builtin provider, it still requires coming up
with some valid locale name for LC_COLLATE and LC_CTYPE, even though
those have little or no effect. And because LOCALE is the fallback when
LC_COLLATE and/or LC_CTYPE are unspecified, it's confusing to the user
because they aren't even trying to specify a libc locale name at all.
The solution, as I see it, is:
* Force the environment variables LC_COLLATE=C and LC_CTYPE=C
unconditionally, and pg_perm_setlocale() them. This requires closing a
few loose ends, but it should be doable[1]. Even the libc provider uses
the "_l()" functions already, and no longer depends on setlocale().
* When datlocprovider<>'c', force datcollate and datctype to NULL.
* If the user specifies LC_CTYPE or LC_COLLATE to CREATE DATABASE, and
the provider is not libc, then ignore LC_COLLATE/LC_CTYPE and emit a
WARNING, rather than trying to set it based on LOCALE and getting an
error.
Regards,
Jeff Davis
[1]: /messages/by-id/cd3517c7-ddb8-454e-9dd5-70e3d84ff6a2@eisentraut.org
Attachments:
v1-0001-Improve-CREATE-DATABASE-error-message-for-invalid.patch (text/x-patch)
From fea7ab4f0495330fae56f069520de374d75ae0b8 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 5 Jun 2025 16:40:53 -0700
Subject: [PATCH v1] Improve CREATE DATABASE error message for invalid libc
locale.
---
src/backend/commands/dbcommands.c | 41 +++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 8 deletions(-)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5fbbcdaabb1..c95eb945016 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1065,16 +1065,41 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
/* Check that the chosen locales are valid, and get canonical spellings */
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
- ereport(ERROR,
- (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
- errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
+ {
+ if (dblocprovider == COLLPROVIDER_BUILTIN)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to the builtin provider, use BUILTIN_LOCALE.")));
+ else if (dblocprovider == COLLPROVIDER_ICU)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to the ICU provider, use ICU_LOCALE.")));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate)));
+ }
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
- ereport(ERROR,
- (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
- errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
+ {
+ if (dblocprovider == COLLPROVIDER_BUILTIN)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to the builtin provider, use BUILTIN_LOCALE.")));
+ else if (dblocprovider == COLLPROVIDER_ICU)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to the ICU provider, use ICU_LOCALE.")));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype)));
+ }
+
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
--
2.43.0
Jeff Davis wrote:
I have attached a patch 0001 that
fixes a misleading hint, but it's still not great.
+1 for the patch
When using ICU or the builtin provider, it still requires coming up
with some valid locale name for LC_COLLATE and LC_CTYPE
No, since the following invocation does work:
CREATE DATABASE test
template='template0'
locale_provider='builtin'
builtin_locale='C.UTF-8';
Here we let 'locale' (or 'lc_collate'/'lc_ctype', which amounts to the
same thing) default from the template database.
In the discussion you mentioned, the error comes from the OP using
'locale' instead of 'builtin_locale'. At least that's my understanding.
This mistake is not surprising, because when you specify a locale
provider followed by a locale, intuitively you'd expect this locale
to refer to that locale provider. Yet that's not the case, mostly for
backward compatibility reasons.
* Force the environment variables LC_COLLATE=C and LC_CTYPE=C
unconditionally, and pg_perm_setlocale() them
Currently that would be a regression for some people, because
when LC_CTYPE=C, the FTS parser produces substandard results with
characters beyond ASCII.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Fri, 2025-06-06 at 22:03 +0200, Daniel Verite wrote:
+1 for the patch
Thank you, committed.
Here we let 'locale' (or 'lc_collate'/'lc_ctype', which amounts to the
same thing) default from the template database.
Right, in the normal case it's OK, but if anything goes wrong, it gets
fairly confusing.
* Force the environment variables LC_COLLATE=C and LC_CTYPE=C
unconditionally, and pg_perm_setlocale() them
Currently that would be a regression for some people, because
when LC_CTYPE=C, the FTS parser produces substandard results with
characters beyond ASCII.
In the other thread, I posted a patch:
/messages/by-id/a1396f17f462ee6561820f755caaf2d12eb9fd15.camel@j-davis.com
so that callers that rely on datctype (regardless of datlocprovider)
access the locale_t through a global and use the "_l" variants.
There should be no behavior change, and we still need to set LC_CTYPE,
so you are right that it's not a solution yet. I think it moves us in
the right direction, though.
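The "_l" approach can be illustrated with a minimal, self-contained sketch (the names database_ctype_locale, init_database_ctype, and db_isalpha are hypothetical, not the identifiers used in the actual patch): callers go through a process-global locale_t and the POSIX "_l" functions, so the setlocale()-level LC_CTYPE can stay pinned to C.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical global holding the database's ctype locale.  Once all
 * datctype-dependent callers go through this, nothing needs
 * setlocale(LC_CTYPE, ...) anymore. */
static locale_t database_ctype_locale = (locale_t) 0;

/* Initialize once, e.g. at backend startup, from datctype. */
void
init_database_ctype(const char *datctype)
{
    database_ctype_locale = newlocale(LC_CTYPE_MASK, datctype, (locale_t) 0);
}

/* Callers use the "_l" variants instead of the global C library state. */
int
db_isalpha(unsigned char c)
{
    return isalpha_l(c, database_ctype_locale) != 0;
}
```

Because classification is resolved against the explicit locale_t rather than the process-wide locale, the result no longer depends on what the environment's LC_CTYPE happens to be.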
If nothing else, we can easily identify the places whose behavior
depends on datctype, and I could have offered a clearer reply to
the user.
Regards,
Jeff Davis
On Fri, 2025-06-06 at 15:47 -0700, Jeff Davis wrote:
* Force the environment variables LC_COLLATE=C and LC_CTYPE=C
unconditionally, and pg_perm_setlocale() them
Currently that would be a regression for some people, because
when LC_CTYPE=C, the FTS parser produces substandard results with
characters beyond ASCII.
In the other thread, I posted a patch:
/messages/by-id/a1396f17f462ee6561820f755caaf2d12eb9fd15.camel@j-davis.com
so that callers that rely on datctype (regardless of datlocprovider)
access the locale_t through a global and use the "_l" variants.
There should be no behavior change, and we still need to set LC_CTYPE,
so you are right that it's not a solution yet. I think it moves us in
the right direction, though.
I'm not sure of the history here, but it looks like the reason full
text search doesn't use collation is that neither tsvector nor
tsquery is a collatable type. Is that something that can ever be
corrected, or are we just stuck with the current behavior forever?
Even if it's not a collatable type, it should use the database
collation rather than going straight to libc. Again, is that something
that can ever be fixed or are we just stuck with libc semantics for
full text search permanently, even if you initialize the cluster with a
different provider?
Regards,
Jeff Davis
Jeff Davis wrote:
Even if it's not a collatable type, it should use the database
collation rather than going straight to libc. Again, is that something
that can ever be fixed or are we just stuck with libc semantics for
full text search permanently, even if you initialize the cluster with a
different provider?
ISTM that what backend/tsearch/wparser_def.c needs is comparable
to what backend/regex/regc_pg_locale.c already does with the
PG_Locale_Strategy, and the pg_wc_isxxxx functions.
Looking at the git history, the current invocations of is[w]digit(),
is[w]alpha(), etc. in the FTS parser were modernized a bit by
ed87e1980706 (2017), but essentially this code dates back to the
original integration of FTS in core by 140d4ebcb46e (2007). These
calls are made through the p_is##type macro-expanded functions:
/*
* In C locale with a multibyte encoding, any non-ASCII symbol is considered
* an alpha character, but not a member of other char classes.
*/
p_iswhat(alnum, 1)
p_iswhat(alpha, 1)
p_iswhat(digit, 0)
p_iswhat(lower, 0)
p_iswhat(print, 0)
p_iswhat(punct, 0)
p_iswhat(space, 0)
p_iswhat(upper, 0)
p_iswhat(xdigit, 0)
That's why in a database with the builtin or ICU provider and lc_ctype=C,
the FTS parser is not Unicode-aware. I may be missing something, but I
don't see a technical reason why this code could not be taught to call
the equivalent functions of the current collation provider, following
the same principles as the regex code.
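A provider-dispatching classifier for the FTS parser might look roughly like the sketch below. It is self-contained and hypothetical: the regex code's real mechanism is PG_Locale_Strategy with the pg_wc_isxxxx functions, and a real builtin-provider branch would consult PostgreSQL's own Unicode tables rather than iswalpha(), which is used here only as a stand-in.

```c
#include <assert.h>
#include <ctype.h>
#include <wctype.h>

/* Hypothetical strategy enum, loosely modeled on the regex code's
 * PG_Locale_Strategy. */
typedef enum
{
    STRATEGY_C,                 /* plain C locale: ASCII only */
    STRATEGY_LIBC_WIDE,         /* libc wide-character classification */
    STRATEGY_BUILTIN            /* builtin provider (stand-in here) */
} CtypeStrategy;

/* What p_isalpha() could do instead of calling is[w]alpha() directly:
 * dispatch on the database's locale provider. */
int
fts_isalpha(CtypeStrategy strategy, wint_t c)
{
    switch (strategy)
    {
        case STRATEGY_C:
            return c <= 127 && isalpha((int) c);
        case STRATEGY_LIBC_WIDE:
            return iswalpha(c) != 0;
        case STRATEGY_BUILTIN:
            /* stand-in: real code would use the builtin provider's
             * Unicode-aware classification tables */
            return iswalpha(c) != 0;
    }
    return 0;
}
```

The point of the dispatch is that the parser's notion of "alpha" would follow the database's provider, instead of being hardwired to whatever the libc LC_CTYPE setting happens to be.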
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Tue, 2025-06-10 at 23:44 +0200, Daniel Verite wrote:
I may be missing something, but I don't see a technical reason why
this code could not be taught to call the equivalent functions of the
current collation provider, following the same principles as the
regex code.
The main challenge is backwards compatibility. Users of FTS would need
to recreate all of their tsvectors and indexes dependent on them. It's
even possible that some users only have tsvectors and don't store the
original data in the database, which would further complicate matters.
We could try to create a GUC to control this behavior, but behavior-
changing GUCs don't have a great history, and it would probably take
quite some time before we could really turn off libc for good.
There would be similar challenges for downcase_identifier() and maybe
pg_strcasecmp().
Regards,
Jeff Davis
Jeff Davis wrote:
The main challenge is backwards compatibility. Users of FTS would need
to recreate all of their tsvectors and indexes dependent on them. It's
even possible that some users only have tsvectors and don't store the
original data in the database, which would further complicate matters.
Why would it be that bad?
FTS indexes don't get corrupted that way. You may get different
lexemes before and after the upgrade for some documents, and then
what?
The FTS parser has seen user-visible changes in the past, and
regenerating tsvectors because of them was merely a suggestion:
commit 61d66c44f18c73094a50a2ef97d26cc03e171dc0
Author: Teodor Sigaev <teodor@sigaev.ru>
Date: Tue Mar 29 17:59:58 2016 +0300
Fix support of digits in email/hostnames.
When tsearch was implemented I did several mistakes in hostname/email
definition rules:
1) allow underscore in hostname, which is prohibited by RFC
2) forget to allow leading digits separated by hyphen (like 123-x.com)
in hostname
3) do not allow underscore/hyphen after leading digits in localpart of
email
Artur's patch resolves the two last issues, but by the way allows host
names like 123_x.com together with 123-x.com. RFC forbids underscore
usage in hostname but pg has allowed that since the initial tsearch
version in core, although only for non-digits. The patch syncs support
of digits and nondigits in both hostname and email.
Forbidding underscore in hostname may break existing usage of tsearch
and, anyhow, that should be done by a separate patch.
Author: Artur Zakirov
BUG: #13964
In the release notes:
Fix the default text search parser to allow leading digits in email
and host tokens (Artur Zakirov)
In most cases this will result in few changes in the parsing of
text. But if you have data where such addresses occur frequently, it
may be worth rebuilding dependent tsvector columns and indexes so
that addresses of this form will be found properly by text searches.
commit 2c265adea3129c917296b46a82786d67988ece2c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Apr 28 02:04:16 2010 +0000
Modify the built-in text search parser to handle URLs more nearly
according to RFC 3986. In particular, these characters now terminate
the path part of a URL: '"', '<', '>', '\', '^', '`', '{', '|', '}'.
The previous behavior was inconsistent and depended on whether a "?"
was present in the path.
Per gripe from Donald Fraser and spec research by Kevin Grittner.
This is a pre-existing bug, but not back-patching since the risks of
breaking existing applications seem to outweigh the benefits.
https://www.postgresql.org/docs/release/9.0.0/
E.24.3.5.1. Full Text Search
Use more standards-compliant rules for parsing URL tokens (Tom Lane)
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Fri, 2025-06-13 at 18:41 +0200, Daniel Verite wrote:
The main challenge is backwards compatibility.
Why would it be that bad?
FTS indexes don't get corrupted that way. You may get different
lexems before and after the upgrade for some documents, and then
what?
It would produce different results than if you started from scratch in
v19. It's hard for me to say whether that would be acceptable or not,
but I could see how that could be confusing to users if they notice.
Perhaps release notes are enough?
The FTS parser had seen user-visible changes in the past, and
regenerating tsvectors because of that were merely a suggestion.
Interesting, thank you for looking into the history here. It would
certainly be simpler to just make FTS fully collation-aware.
Regards,
Jeff Davis