Remaining dependency on setlocale()
After some previous work here:
/messages/by-id/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
we are less dependent on setlocale(), but it's still not completely
gone.
setlocale() counts as thread-unsafe, so it would be nice to eliminate
it completely.
The obvious answer is uselocale(), which sets the locale only for the
calling thread, and takes precedence over whatever is set with
setlocale().
But there are a couple problems:
1. I don't think it's supported on Windows.
2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().
Thoughts?
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
But there are a couple problems:
1. I don't think it's supported on Windows.
Can't help with that, but surely Windows has some thread-safe way.
2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().
What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1]/messages/by-id/14856.1348497531@sss.pgh.pa.us. On a lot of platforms you just get the input string
back again. If that's the only thing keeping us on setlocale,
I think we could drop it. (Perhaps we should do some canonicalization
of our own instead?)
regards, tom lane
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
But there are a couple problems:
1. I don't think it's supported on Windows.
Can't help with that, but surely Windows has some thread-safe way.
It does. It's not exactly the same, instead there is a thing you can
call that puts setlocale() itself into a thread-local mode, but last
time I checked that mode was missing on MinGW so that's a bit of an
obstacle.
How far can we get by using more _l() functions? For example, [1]/messages/by-id/CA+hUKGJ=ca39Cg=y=S89EaCYvvCF8NrZRO=uog-cnz0VzC6Kfg@mail.gmail.com
shows a use of strftime() that I think can be converted to
strftime_l() so that it doesn't depend on setlocale(). Since POSIX
doesn't specify every obvious _l function, we might need to provide
any missing wrappers that save/restore thread-locally with
uselocale(). Windows doesn't have uselocale(), but it generally
doesn't need such wrappers because it does have most of the obvious
_l() functions.
2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1]. On a lot of platforms you just get the input string
back again. If that's the only thing keeping us on setlocale,
I think we could drop it. (Perhaps we should do some canonicalization
of our own instead?)
+1
I know it does something on Windows (we know the EDB installer gives
it strings like "Language,Country" and it converts them to
"Language_Country.Encoding", see various threads about it all going
wrong), but I'm not sure it does anything we actually want to
encourage. I'm hoping we can gradually screw it down so that we only
have sane BCP 47 in the system on that OS, and I don't see why we
wouldn't just use them verbatim.
[1]: /messages/by-id/CA+hUKGJ=ca39Cg=y=S89EaCYvvCF8NrZRO=uog-cnz0VzC6Kfg@mail.gmail.com
On 8/7/24 03:07, Thomas Munro wrote:
How far can we get by using more _l() functions? For example, [1]
shows a use of strftime() that I think can be converted to
strftime_l() so that it doesn't depend on setlocale(). Since POSIX
doesn't specify every obvious _l function, we might need to provide
any missing wrappers that save/restore thread-locally with
uselocale(). Windows doesn't have uselocale(), but it generally
doesn't need such wrappers because it does have most of the obvious
_l() functions.
Most of the strtoX functions have an _l variant, but one to watch is
atoi, which is defined with a hardcoded call to strtol, at least with glibc:
8<----------
/* Convert a string to an int. */
int
atoi (const char *nptr)
{
return (int) strtol (nptr, (char **) NULL, 10);
}
8<----------
I guess in many/most places we use atoi we don't care, but maybe it
matters for some?
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote:
I guess in many/most places we use atoi we don't care, but maybe it
matters for some?
I think we should move in the direction of replacing atoi() calls with
strtol() and actually checking for errors. In many places where use
atoi(), it's unlikely that the string would be anything but an
integer, so error checks are arguably unnecessary. A backup label file
isn't likely to say "START TIMELINE: potaytoes". On the other hand, if
it did say that, I'd prefer to get an error about potaytoes than have
it be treated as if it said "START TIMELINE: 0". And I've definitely
found missing error-checks over the years. For example, on pg14,
"pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0"
because atoi("maximum") == 0. If we make a practice of checking
integer conversions for errors everywhere, we might avoid some such
silliness.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
How far can we get by using more _l() functions?
There are a ton of calls to, for example, isspace(), used mostly for
parsing.
I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.
So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.
Regards,
Jeff Davis
On 8/7/24 13:16, Jeff Davis wrote:
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
How far can we get by using more _l() functions?
There are a ton of calls to, for example, isspace(), used mostly for
parsing.I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.
FWIW I see all of these in glibc:
isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
isxdigit_l
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
FWIW I see all of these in glibc:
isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
isxdigit_l
On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
There are a ton of calls to, for example, isspace(), used mostly for
parsing.I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.
We do know of a few isspace() calls that are already questionable[1]/messages/by-id/CA+HWA9awUW0+RV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig@mail.gmail.com
(should be scanner_isspace(), or something like that). It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quote depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree, and in practice we know that macOS and
Windows think that the byte 0xa inside 'à' is a space while glibc and
FreeBSD don't. Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):
for i in range(1, 0xffff):
if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
print("%04x: %s" % (i, chr(i)))
[1]: /messages/by-id/CA+HWA9awUW0+RV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig@mail.gmail.com
On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
FWIW I see all of these in glibc:
isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
isxdigit_lOn my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.
Those (except isascii_l) are from POSIX 2008[1]https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html. They were absorbed
from "Extended API Set Part 4"[2]https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf, along with locale_t (that's why
there is a header <xlocale.h> on a couple of systems even though after
absorption they are supposed to be in <locale.h>). We already
decided that all computers have that stuff (commit 8d9a9f03), but the
reality is a little messier than that... NetBSD hasn't implemented
uselocale() yet[3]/messages/by-id/CWZBBRR6YA8D.8EHMDRGLCKCD@neon.tech, though it has a good set of _l functions. As
discussed in [3]/messages/by-id/CWZBBRR6YA8D.8EHMDRGLCKCD@neon.tech, ECPG code is therefore currently broken in
multithreaded clients because it's falling back to a setlocale() path,
and I think Windows+MinGW must be too (it lacks
HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l
functions. In that thread I tried to figure out how to use _l
functions to fix that problem, but ...
The issue there is that we have our own snprintf.c, that implicitly
requires LC_NUMERIC to be "C" (it is documented as always printing
floats a certain way ignoring locale and that's what the callers there
want in frontend and backend code, but in reality it punts to system
snprintf for floats, assuming that LC_NUMERIC is "C", which we
configure early in backend startup, but frontend code has to do it for
itself!). So we could use snprintf_l or strtod_l instead, but POSIX
hasn't got those yet. Or we could use own own Ryu code (fairly
specific), but integrating Ryu into our snprintf.c (and correctly
implementing all the %... stuff?) sounds like quite a hard,
devil-in-the-details kind of an undertaking to me. Or maybe it's
easy, I dunno. As for the _l functions, you could probably get away
with "every computer has either uselocale() or snprintf_() (or
strtod_()?)" and have two code paths in our snprintf.c. But then we'd
also need a place to track a locale_t for a long-lived newlocale("C"),
which was too messy in my latest attempt...
[1]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html
[2]: https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf
[3]: /messages/by-id/CWZBBRR6YA8D.8EHMDRGLCKCD@neon.tech
On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote:
FWIW I see all of these in glibc:
isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
isxdigit_l
My point was just that there are a lot of those call sites (especially
for isspace()) in various parsers. It feels like a lot of code churn to
change all of them, when a lot of them seem to be intended for ascii
anyway.
And where do we get the locale_t structure from? We can create our own
at database connection time, and supply it to each of those call sites,
but I'm not sure that's a huge advantage over just using uselocale().
Regards,
Jeff Davis
On 8/8/24 12:45 AM, Jeff Davis wrote:
My point was just that there are a lot of those call sites (especially
for isspace()) in various parsers. It feels like a lot of code churn to
change all of them, when a lot of them seem to be intended for ascii
anyway.And where do we get the locale_t structure from? We can create our own
at database connection time, and supply it to each of those call sites,
but I'm not sure that's a huge advantage over just using uselocale().
I am leaning towards that we should write our own pure ascii functions
for this. Since we do not support any non-ascii compatible encodings
anyway I do not see the point in having locale support in most of these
call-sites.
Andewas
On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
I am leaning towards that we should write our own pure ascii
functions
for this.
That makes sense for a lot of call sites, but it could cause breakage
if we aren't careful.
Since we do not support any non-ascii compatible encodings
anyway I do not see the point in having locale support in most of
these
call-sites.
An ascii-compatible encoding just means that the code points in the
ascii range are represented as ascii. I'm not clear on whether code
points in the ascii range can return different results for things like
isspace(), but it sounds plausible -- toupper() can return different
results for 'i' in tr_TR.
Also, what about the values outside 128-255, which are still valid
input to isspace()?
Regards,
Jeff Davis
On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote:
After some previous work here:
/messages/by-id/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
we are less dependent on setlocale(), but it's still not completely
gone.setlocale() counts as thread-unsafe, so it would be nice to eliminate
it completely.The obvious answer is uselocale(), which sets the locale only for the
calling thread, and takes precedence over whatever is set with
setlocale().But there are a couple problems:
1. I don't think it's supported on Windows.
2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().Thoughts?
Hey Jeff,
See this thread[0]/messages/by-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech for some work that I had previously done. Feel free
to take it over, or we could collaborate.
[0]: /messages/by-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech
--
Tristan Partin
Neon (https://neon.tech)
Hi,
On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote:
Hey Jeff,
See this thread[0] for some work that I had previously done. Feel
free
to take it over, or we could collaborate.
Sounds good, sorry I missed that.
Can you please rebase and we can discuss in that thread?
Regards,
Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
But there are a couple problems:
1. I don't think it's supported on Windows.
Can't help with that, but surely Windows has some thread-safe way.
It does. It's not exactly the same, instead there is a thing you can
call that puts setlocale() itself into a thread-local mode, but last
time I checked that mode was missing on MinGW so that's a bit of an
obstacle.
Actually the MinGW situation might be better than that these days. I
know of three environments where we currently have to keep code
working on MinGW: build farm animal fairywren (msys2 compiler
toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson"
task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning"
step (which actually runs on Linux, and uses configure rather than
meson). All three environments show that they have
_configthreadlocale. So could we could simply require it on Windows?
Then it might be possible to write a replacement implementation of
uselocale() that does a two-step dance with _configthreadlocale() and
setlocale(), restoring both afterwards if they changed. That's what
ECPG open-codes already.
The NetBSD situation is more vexing. I was trying to find out if
someone is working on it and unfortunately it looks like there is a
principled stand against adding it:
https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html
They're right that we really just want to use "C" in some places, and
their LC_C_LOCALE is a very useful system-provided value to be able to
pass into _l functions. It's a shame it's non-standard, because
without it you have to allocate a locale_t for "C" and keep it
somewhere to feed to _l functions...
I've posted a new attempt at ripping those ECPG setlocales() out on
the other thread that had the earlier version and discussion:
/messages/by-id/CA+hUKG+Yv+ps=nS2T8SS1UDU=iySHSr4sGHYiYGkPTpZx6Ooww@mail.gmail.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
How far can we get by using more _l() functions?
There are a ton of calls to, for example, isspace(), used mostly for
parsing.I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.
Here are two more cases that I don't think I've seen discussed.
1. The nl_langinfo() call in pg_get_encoding_from_locale(), can
probably be changed to nl_langinfo_l() (it is everywhere we currently
care about except Windows, which has a different already-thread-safe
alternative; AIX seems to lack the _l version, but someone writing a
patch to re-add support for that OS could supply the configure goo for
a uselocale() safe/restore implementation). One problem is that it
has callers that pass it NULL meaning the backend default, but we'd
perhaps use LC_C_GLOBAL for now and have to think about where we get
the database default locale_t in the future.
2. localeconv() is *doubly* non-thread-safe: it depends on the
current locale, and it also returns an object whose storage might be
clobbered by any other call to localeconv(), setlocale, or even,
according to POSIX, uselocale() (!!!). I think that effectively
closes off that escape hatch. On some OSes (macOS, BSDs) you find
localeconv_l() and then I think they give you a more workable
lifetime: as long as the locale_t lives, which makes perfect sense. I
am surprised that no one has invented localeconv_r() where you supply
the output storage, and you could wrap that in uselocale()
save/restore to deal with the other problem, or localeconv_r_l() or
something. I can't understand why this is so bad. The glibc
documentation calls it "a masterpiece of poor design". Ahh, so it
seems like we need to delete our use of localeconf() completely,
because we should be able to get all the information we need from
nl_langinfo_l() instead:
https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html
On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
1. The nl_langinfo() call in pg_get_encoding_from_locale(), can
probably be changed to nl_langinfo_l() (it is everywhere we currently
care about except Windows, which has a different already-thread-safe
alternative ...
... though if we wanted to replace all use of localeconv and struct
lconv with nl_langinfo_l() calls, it's not totally obvious how to do
that on Windows. Its closest thing is GetLocaleInfoEx(), but that has
complications: it takes wchar_t locale names, which we don't even have
and can't access when we only have a locale_t, and it must look them
up in some data structure every time, and it copies data out to the
caller as wchar_t so now you have two conversion problems and a
storage problem. If I understand correctly, the whole point of
nl_langinfo_l(item, loc) is that it is supposed to be fast, it's
really just an array lookup, and item is just an index, and the result
is supposed to be stable as long as loc hasn't been freed (and the
thread hasn't exited). So you can use it without putting your own
caching in front of it. One idea I came up with which I haven't tried
and it might turn out to be terrible, is that we could change our
definition of locale_t on Windows. Currently it's a typedef to
Windows' _locale_t, and we use it with a bunch of _XXX functions that
we access by macro to remove the underscore. Instead, we could make
locale_t a pointer to a struct of our own design in WIN32 builds,
holding the native _locale_t and also an array full of all the values
that nl_langinfo_l() can return. We'd provide the standard enums,
indexes into that array, in a fake POSIX-oid header <langinfo.h>.
Then nl_langinfo_l(item, loc) could be implemented as
loc->private_langinfo[item], and strcoll_l(.., loc) could be a static
inline function that does _strcol_l(...,
loc->private_windows_locale_t). These structs would be allocated and
freed with standard-looking newlocale() and freelocale(), so we could
finally stop using #ifdef WIN32-wrapped _create_locale() directly.
Then everything would look more POSIX-y, nl_langinfo_l() could be used
directly wherever we need fast access to that info, and we could, I
think, banish the awkward localeconv, right? I don't know if this all
makes total sense and haven't tried it, just spitballing here...
On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
... though if we wanted to replace all use of localeconv and struct
lconv with nl_langinfo_l() calls,
Gah. I realised while trying the above that you can't really replace
localeconv() with nl_langinfo_l() as the GNU documentation recommends,
because some of the lconv fields we're using are missing from
langinfo.h in POSIX (only GNU added them all, that was a good idea).
:-(
Next idea:
Windows: its localeconv() returns pointer to thread-local storage,
good, but it still needs setlocale(). So the options are: make our
own lconv-populator function for Windows, using GetLocaleInfoEx(), or
do that _configthreadlocale() dance (possibly excluding some MinGW
configurations from working)
Systems that have localeconv_l(): use that
POSIX: use uselocale() and also put a big global lock around
localeconv() call + accessing result (optionally skipping that on an
OS-by-OS basis after confirming that its implementation doesn't really
need it)
The reason the uselocale() + localeconv() seems to require a Big Lock
(by default at least) is that the uselocale() deals only with the
"current locale" aspect, not the output buffer aspect. Clearly the
standard allows for it to be thread-local storage (that's why since
2008 it says that after thread-exit you can't access the result, and I
guess that's how it works on real systems (?)), but it also seems to
allow for a single static buffer (that's why it says that it's not
re-entrant, and any call to localeconv() might clobber it). That
might be OK in practice because we tend to cache that stuff, eg when
assigning GUC lc_monetary (that cache would presumably become
thread-local in the first phase of the multithreading plan), so the
locking shouldn't really hurt.
The reason we'd have to have three ways, and not just two, is again
that NetBSD declined to implement uselocale().
I'll try this in a bit unless someone else has better ideas or plans
for this part... sorry for the drip-feeding.
On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I'll try this in a bit unless someone else has better ideas or plans
for this part... sorry for the drip-feeding.
And done, see commitfest entry #5170.
On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote:
The NetBSD situation is more vexing. I was trying to find out if
someone is working on it and unfortunately it looks like there is a
principled stand against adding it:https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html
The objection seems to be very general: that uselocale() modifies the
thread state and affects calls a long distance from uselocale(). I
don't disagree with the general sentiment. But in effect, that just
prevents people migrating away from setlocale(), to which the same
argument applies, and is additionally thread-unsafe.
The only alternative is to essentially ban the use of non-_l variants,
which is fine I suppose, but causes a fair amount of code churn.
They're right that we really just want to use "C" in some places, and
their LC_C_LOCALE is a very useful system-provided value to be able
to
pass into _l functions. It's a shame it's non-standard, because
without it you have to allocate a locale_t for "C" and keep it
somewhere to feed to _l functions...
If we're going to do that, why not just have ascii-only variants of our
own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
LC_C_LOCALE).
Regards,
Jeff Davis
On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote:
The only alternative is to essentially ban the use of non-_l variants,
which is fine I suppose, but causes a fair amount of code churn.
Let's zoom out a bit and consider some ways we could set up the
process, threads and individual calls:
1. The process global locale is always "C". If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in replacement
_l() functions for systems that lack them. You need to use _l()
functions for all non-"C" locales. The current database default needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions. The "C"
locale can be accessed implicitly with non-l() functions, or you could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for
"C". Or a name like PG_C_LOCALE, which, in backend code could be just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.
XXX Note that nailing LC_ALL to "C" in the backend would extend our
existing policy for LC_NUMERIC to all categories. That's why we can
use strtod() in the backend and expect the radix character to be '.'.
It's interesting to contemplate the strtod() calls in CF#5166: they
are code copied-and-pasted from backend and frontend; in the backend
we can use strtod() currently but in the frontend code I showed a
change to strtod_l(..., PG_C_LOCALE), in order to be able to delete
some ugly deeply nested uselocale()/setlocale() stuff of the exact
sort that NetBSD hackers (and I) hate. It's obviously a bit of a code
smell that it's copied-and-pasted in the first place, and should
really share code. Supposing some of that stuff finishes up in
src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that
could be allowed to take advantage of the knowledge that the global
locale is "C" in the backend. Just thoughts...
2. The process global locale is always "C". Each backend process (in
future: thread) calls uselocale() to set the thread-local locale to
the database default, so it can keep using the non-_l() functions as a
way to access the database default, and otherwise uses _l() functions
if it wants something else (as we do already). The "C" locale can be
accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc.
XXX This option is blocked by NetBSD's rejection of uselocale(). I
guess if you really wanted to work around NetBSD's policy you could
make your own wrapper for all affected functions, translating foo() to
foo_l(..., pg_thread_current_locale), so you could write uselocale(),
which is pretty much what every other libc does... But eughhh
3. The process global locale is inherited from the system or can be
set by the user however they want for the benefit of extensions, but
we never change it after startup or refer to it. Then we do the same
as 1 or 2, except if we ever want "C" we'll need a locale_t for that,
again perhaps using the PC_C_LOCALE mechanism. Non-_l() functions are
effectively useless except in cases where you really want to use the
system's settings inherited from startup, eg for messages, so they'd
mostly be banned.
What else?
They're right that we really just want to use "C" in some places, and
their LC_C_LOCALE is a very useful system-provided value to be able
to
pass into _l functions. It's a shame it's non-standard, because
without it you have to allocate a locale_t for "C" and keep it
somewhere to feed to _l functions...If we're going to do that, why not just have ascii-only variants of our
own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
LC_C_LOCALE).
Yeah, I agree there are some easy things we should do that way. In
fact we've already established that scanner_isspace() needs to be used
in lots more places for that, even aside from thread-safety.
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
1. The process global locale is always "C". If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in
replacement
_l() functions for systems that lack them. You need to use _l()
functions for all non-"C" locales. The current database default
needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions. The "C"
locale can be accessed implicitly with non-l() functions, or you
could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
for
"C". Or a name like PG_C_LOCALE, which, in backend code could be
just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.
+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.
We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.
Regards,
Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1]. On a lot of platforms you just get the input string
back again. If that's the only thing keeping us on setlocale,
I think we could drop it. (Perhaps we should do some canonicalization
of our own instead?)+1
I know it does something on Windows (we know the EDB installer gives
it strings like "Language,Country" and it converts them to
"Language_Country.Encoding", see various threads about it all going
wrong), but I'm not sure it does anything we actually want to
encourage. I'm hoping we can gradually screw it down so that we only
have sane BCP 47 in the system on that OS, and I don't see why we
wouldn't just use them verbatim.
Some more thoughts on check_locale() and canonicalisation:
I doubt the canonicalisation does anything useful on any Unix system,
as they're basically just file names. In the case of glibc, the
encoding part is munged before opening the file so it tolerates .utf8
or .UTF-8 or .u---T----f------8 on input, but it still returns
whatever you gave it so the return value isn't cleaning the input or
anything.
"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places. We could change that to newlocale("") + query instead, but
there is a portability pipeline problem getting the name out of it:
1. POSIX only just added getlocalename_l() in 2024[1]https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html[2]https://www.austingroupbugs.net/view.php?id=1220.
2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
3. The <xlocale.h> systems (macOS/*BSD) have non-standard
querylocale(mask, loc).
4. AFAIK there is no way to do it on pure POSIX 2008 systems.
5. For Windows, there is a completely different thing to get the
user's default locale, see CF#3772.
The systems in category 4 would in practice be Solaris and (if it
comes back) AIX. Given that, we probably just can't go that way soon.
So I think the solution could perhaps be something like: in some early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.
[1]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
[2]: https://www.austingroupbugs.net/view.php?id=1220
On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
So I think the solution could perhaps be something like: in some
early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.
+1.
Regards,
Jeff Davis
On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
So I think the solution could perhaps be something like: in some
early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.+1.
We currently nail down these categories:
/* We keep these set to "C" always. See pg_locale.c for explanation. */
init_locale("LC_MONETARY", LC_MONETARY, "C");
init_locale("LC_NUMERIC", LC_NUMERIC, "C");
init_locale("LC_TIME", LC_TIME, "C");
CF #5170 has patches to make it so that we stop changing them even
transiently, using locale_t interfaces to feed our caches of stuff
needed to work with those categories, so they really stay truly nailed
down.
It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():
if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
ereport(FATAL,
(errmsg("database locale is incompatible with
operating system"),
errdetail("The database was initialized with
LC_COLLATE \"%s\", "
" which is not recognized by setlocale().", collate),
errhint("Recreate the database with another locale or
install the missing locale.")));
if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
ereport(FATAL,
(errmsg("database locale is incompatible with
operating system"),
errdetail("The database was initialized with LC_CTYPE \"%s\", "
" which is not recognized by setlocale().", ctype),
errhint("Recreate the database with another locale or
install the missing locale.")));
How should that work? Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately. Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).
Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):
(void) pg_perm_setlocale(LC_MESSAGES, newval);
I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions. Current implementations don't have them
yet. Clearly we absolutely couldn't call pg_perm_setlocale() after
early startup -- but if gettext() is relying on the current locale to
affect far away code, then maybe this is one place where we'd just
have to use uselocale(). Perhaps we could plan some transitional
strategy where NetBSD users lose the ability to change the GUC without
restarting the server and it has to be the same for all sessions, or
something like that, until they produce either gettext_l() or
uselocale(), but I haven't thought hard about this part at all yet...
On 15.08.24 00:43, Thomas Munro wrote:
"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places. We could change that to newlocale("") + query instead, but
Where do we need that in the server?
It should just be initdb doing that and then initializing the server
with concrete values based on that.
I guess technically some of these GUC settings default to the
environment? But I think we could consider getting rid of that.
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 15.08.24 00:43, Thomas Munro wrote:
"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places. We could change that to newlocale("") + query instead, butWhere do we need that in the server?
Hmm. Yeah, right, the only way I've found so far to even reach that
code and that captures that result is:
create database db2 locale = '';
Thats puts 'en_NZ.UTF-8' or whatever in pg_database. In contrast,
create collation will accept '' but just store it verbatim, and the
GUCs for changing time, monetary, numeric accept it too and keep it
verbatim. We could simply ban '' in all user commands. I doubt
they're documented as acceptable values, once you get past initdb and
have a running system. Looking into that...
It should just be initdb doing that and then initializing the server
with concrete values based on that.
Right.
I guess technically some of these GUC settings default to the
environment? But I think we could consider getting rid of that.
Yeah.
On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
It should just be initdb doing that and then initializing the server
with concrete values based on that.Right.
I guess technically some of these GUC settings default to the
environment? But I think we could consider getting rid of that.Yeah.
Seems to make a lot of sense. I tried that out over in CF #5170.
(In case it's not clear why I'm splitting discussion between threads:
I was thinking of this thread as the one where we're discussing what
needs to be done, with other threads being spun off to become CF entry
with concrete patches. I realised re-reading some discussion that
that might not be obvious...)
On 8/9/24 8:24 PM, Jeff Davis wrote:
On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
I am leaning towards that we should write our own pure ascii
functions
for this.That makes sense for a lot of call sites, but it could cause breakage
if we aren't careful.Since we do not support any non-ascii compatible encodings
anyway I do not see the point in having locale support in most of
these
call-sites.An ascii-compatible encoding just means that the code points in the
ascii range are represented as ascii. I'm not clear on whether code
points in the ascii range can return different results for things like
isspace(), but it sounds plausible -- toupper() can return different
results for 'i' in tr_TR.Also, what about the values outside 128-255, which are still valid
input to isspace()?
My idea was that in a lot of those cases we only try to parse e.g. 0-9
as digits and always only . as the decimal separator so we should make
just make that obvious by either using locale C or writing our own ascii
only functions. These strings are meant to be read by machines, not
humans, primarily.
Andreas
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
1. The process global locale is always "C". If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in
replacement
_l() functions for systems that lack them. You need to use _l()
functions for all non-"C" locales. The current database default
needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions. The "C"
locale can be accessed implicitly with non-l() functions, or you
could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
for
"C". Or a name like PG_C_LOCALE, which, in backend code could be
just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.
Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?
Regards,
Jeff Davis
On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
1. The process global locale is always "C". If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in
replacement
_l() functions for systems that lack them. You need to use _l()
functions for all non-"C" locales. The current database default
needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions. The "C"
locale can be accessed implicitly with non-l() functions, or you
could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
for
"C". Or a name like PG_C_LOCALE, which, in backend code could be
just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?
I think this is the best way, and I haven't seen anyone supporting any
other idea. (I'm working on those setlocale()-removal patches I
mentioned, more very soon...)
On 13.12.24 10:44, Thomas Munro wrote:
On Fri, Dec 13, 2024 at 8:22â¯AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
1. The process global locale is always "C". If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in
replacement
_l() functions for systems that lack them. You need to use _l()
functions for all non-"C" locales. The current database default
needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions. The "C"
locale can be accessed implicitly with non-l() functions, or you
could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
for
"C". Or a name like PG_C_LOCALE, which, in backend code could be
just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?I think this is the best way, and I haven't seen anyone supporting any
other idea. (I'm working on those setlocale()-removal patches I
mentioned, more very soon...)
I also think this is the right direction, and we'll get closer with the
remaining patches that Thomas has lined up.
I think at this point, we could already remove all locale settings
related to LC_COLLATE. Nothing uses that anymore.
I think we will need to keep the global LC_CTYPE setting set to
something useful, for example so that system error messages come out in
the right encoding.
But I'm concerned about the the Perl_setlocale() dance in plperl.c.
Perl apparently does a setlocale(LC_ALL, "") during startup, and that
code is a workaround to reset everything back afterwards. We need to be
careful not to break that.
(Perl has fixed that in 5.19, but the fix requires that you set another
environment variable before launching Perl, which you can't do in a
threaded system, so we'd probably need another fix eventually. See
<https://github.com/Perl/perl5/issues/8274>.)
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
I think we will need to keep the global LC_CTYPE setting set to
something useful, for example so that system error messages come out
in
the right encoding.
Do we need to rely on the global LC_CTYPE setting? We already use
bind_textdomain_codeset().
But I'm concerned about the the Perl_setlocale() dance in plperl.c.
Perl apparently does a setlocale(LC_ALL, "") during startup, and that
code is a workaround to reset everything back afterwards. We need to
be
careful not to break that.(Perl has fixed that in 5.19, but the fix requires that you set
another
environment variable before launching Perl, which you can't do in a
threaded system, so we'd probably need another fix eventually. See
<https://github.com/Perl/perl5/issues/8274>.)
I don't fully understand that issue, but I would think the direction we
are going (keeping the global LC_CTYPE more consistent and relying on
it less) would make the problem better.
Regards,
Jeff Davis
On 17.12.24 19:10, Jeff Davis wrote:
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
I think we will need to keep the global LC_CTYPE setting set to
something useful, for example so that system error messages come out
in
the right encoding.Do we need to rely on the global LC_CTYPE setting? We already use
bind_textdomain_codeset().
I don't think that would cover messages from the C library (strerror,
dlerror, etc.).
But I'm concerned about the the Perl_setlocale() dance in plperl.c.
Perl apparently does a setlocale(LC_ALL, "") during startup, and that
code is a workaround to reset everything back afterwards. We need to
be
careful not to break that.(Perl has fixed that in 5.19, but the fix requires that you set
another
environment variable before launching Perl, which you can't do in a
threaded system, so we'd probably need another fix eventually. See
<https://github.com/Perl/perl5/issues/8274>.)I don't fully understand that issue, but I would think the direction we
are going (keeping the global LC_CTYPE more consistent and relying on
it less) would make the problem better.
Yes, I think it's the right direction, but we need to figure this issue
out eventually.
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
+1 to this approach. It makes things more consistent across
platforms
and avoids surprising dependencies on the global setting.I think this is the best way, and I haven't seen anyone supporting
any
other idea. (I'm working on those setlocale()-removal patches I
mentioned, more very soon...)I also think this is the right direction, and we'll get closer with
the
remaining patches that Thomas has lined up.I think at this point, we could already remove all locale settings
related to LC_COLLATE. Nothing uses that anymore.I think we will need to keep the global LC_CTYPE setting set to
something useful, for example so that system error messages come out
in
the right encoding.But I'm concerned about the the Perl_setlocale() dance in plperl.c.
Perl apparently does a setlocale(LC_ALL, "") during startup, and that
code is a workaround to reset everything back afterwards. We need to
be
careful not to break that.(Perl has fixed that in 5.19, but the fix requires that you set
another
environment variable before launching Perl, which you can't do in a
threaded system, so we'd probably need another fix eventually. See
<https://github.com/Perl/perl5/issues/8274>.)
To continue this thread, I did a symbol search in the meson build
directory like (patterns.txt attached):
for f in `find . -name *.o`; do
if ( nm --format=just-symbols $f | \
grep -xE -f /tmp/patterns.txt > /dev/null ); then
echo $f; fi; done
and it output:
./contrib/fuzzystrmatch/fuzzystrmatch.so.p/dmetaphone.c.o
./contrib/fuzzystrmatch/fuzzystrmatch.so.p/fuzzystrmatch.c.o
./contrib/isn/isn.so.p/isn.c.o
./contrib/spi/refint.so.p/refint.c.o
./contrib/ltree/ltree.so.p/crc32.c.o
./src/backend/postgres_lib.a.p/commands_copyfromparse.c.o
./src/backend/postgres_lib.a.p/utils_adt_pg_locale_libc.c.o
./src/backend/postgres_lib.a.p/tsearch_wparser_def.c.o
./src/backend/postgres_lib.a.p/parser_scansup.c.o
./src/backend/postgres_lib.a.p/utils_adt_inet_net_pton.c.o
./src/backend/postgres_lib.a.p/tsearch_ts_locale.c.o
./src/bin/psql/psql.p/meson-generated_.._tab-complete.c.o
./src/interfaces/ecpg/preproc/ecpg.p/meson-generated_.._preproc.c.o
./src/interfaces/ecpg/compatlib/libecpg_compat.so.3.18.p/informix.c.o
./src/interfaces/ecpg/compatlib/libecpg_compat.a.p/informix.c.o
./src/port/libpgport_srv.a.p/pgstrcasecmp.c.o
./src/port/libpgport_shlib.a.p/pgstrcasecmp.c.o
./src/port/libpgport.a.p/pgstrcasecmp.c.o
Not a short list, but not a long list, either, so seems tractable. Note
that this misses things like isdigit() which is inlined.
A few observations while spot-checking these files:
---------------------
pgstrcasecmp.c - has code like:
else if (IS_HIGHBIT_SET(ch) && islower(ch))
ch = toupper(ch);
and comments like "Note however that the whole thing is a bit bogus for
multibyte character sets."
Most of the callers are directly comparing with ascii literals, so I'm
not sure what the point is. There are probably some more interesting
callers hidden in there.
----------------------
pg_locale_libc.c -
char2wchar and wchar2char use mbstowcs and wcstombs when the input
locale is NULL. The main culprit seems to be full text search, which
has a bunch of /* TODO */ comments. Another caller is
get_iso_localename().
There are also a couple false positives where mbstowcs_l/wcstombs_l are
emulated with uselocale() and mbstowcs/wcstombs. In that case, it's not
actually sensitive to the global setting.
-----------------------
copyfromparse.c - the input is ASCII so it can use pg_ascii_tolower()
instead of tolower()
-----------------------
Regards,
Jeff Davis
Attachments:
On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
To continue this thread, I did a symbol search in the meson build
directory like (patterns.txt attached):
Attached a rough patch series which does what everyone seemed to agree
on:
* Change some trivial ASCII cases to use pg_ascii_* variants
* Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
* Introduce a new global_lc_ctype for callers that still need to use
operations that depend on datctype
There should be no behavior changes in this series.
Benefits:
* a tiny step toward multithreading, because connections to different
databases no longer require different setlocale() settings
* easier to identify dependence on datctype, because callers will
need to refer to global_lc_ctype.
* harder to accidentally depend on datctype or datcollate
Ideally, when the database locale provider is not libc, the user
shouldn't need to even think about a valid LC_CTYPE locale at all. But
that requires more work, and potentially risk of breakage.
Regards,
Jeff Davis
`
Attachments:
On 07.06.25 00:23, Jeff Davis wrote:
On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
To continue this thread, I did a symbol search in the meson build
directory like (patterns.txt attached):Attached a rough patch series which does what everyone seemed to agree
on:* Change some trivial ASCII cases to use pg_ascii_* variants
* Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
* Introduce a new global_lc_ctype for callers that still need to use
operations that depend on datctype
v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch
These look good to me.
v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch
This looks ok (but might depend on how patch 0006 turns out).
v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch
I think these need further individual analysis and explanation why these
should use the global lc_ctype setting. For example, you could argue
that the SQL-callable soundex(text) function should use the collation
object of its input value, not the global locale. But furthermore,
soundex_code() could actually just use pg_ascii_toupper() instead. And
in ts_locale.c, the isalnum_l() call should use mylocale that already
exists in that function. The problem to solve it getting a good value
into mylocale. Using the global setting confuses the issue a bit, I think.
v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch
Do we have any data what platforms we'd need these checks for?
Also, if you look into wparser_def.c what p_isxdigit is used for, it's
used for parsing XML (presumably HTML) files, so we just need ASCII-only
behavior and no locale dependency.
v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
As I mentioned earlier in the thread, I don't think we can do this for
LC_CTYPE, because otherwise system error messages would not come out in
the right encoding. For the LC_COLLATE settings, I think we could just
do the setting in main(), where the other non-database-specific locale
categories are set.
On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
As I mentioned earlier in the thread, I don't think we can do this
for
LC_CTYPE, because otherwise system error messages would not come out
in
the right encoding.
Is there any way around that? If all we need is the right encoding, do
we need a proper locale?
Regards,
Jeff Davis
On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patchThese look good to me.
Committed. (That means they're in 18, which was not my intention, but
others seemed to think it was harmless enough, so I didn't revert. I
will wait for the branch before I commit any more of these.)
v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch
This looks ok (but might depend on how patch 0006 turns out).
I changed this to a global_libc_locale that includes both LC_COLLATE
and LC_CTYPE (from datcollate and datctype), in case an extension is
relying on strcoll for some reason.
v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch
I think these need further individual analysis and explanation why
these
should use the global lc_ctype setting.
This patch series, at least so far, is designed to have zero behavior
changes. Anything with a potential for a behavior change should be a
separate commit, so that if we need to revert it, we can revert the
behavior change without reintroducing a setlocale() dependency.
For example, you could argue
that the SQL-callable soundex(text) function should use the collation
object of its input value, not the global locale.
That would be a behavior change.
But furthermore,
soundex_code() could actually just use pg_ascii_toupper() instead.
Is that a behavior change?
And
in ts_locale.c, the isalnum_l() call should use mylocale that already
exists in that function. The problem to solve it getting a good
value
into mylocale. Using the global setting confuses the issue a bit, I
think.
I reworked it to be less confusing by changing wchar2char/char2wchar to
take a locale_t instead of pg_locale_t. Hopefully it's an improvement.
In get_iso_localename(), there's a comment saying that it doesn't
matter which locale is used (because it's ASCII), but to use the "_l"
variants, we need to pick some locale. At that point it's not clear to
me that global_libc_locale will be set yet, so I used LC_C_LOCALE.
I'm not sure whether we can rely on LC_C_LOCALE being available, but it
passed in CI, and if it's not available somewhere it might be a good
idea to create it on those platforms anyway.
v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch
Do we have any data what platforms we'd need these checks for?
https://cirrus-ci.com/build/5167600088383488
Looks like windows doesn't have iswxdigit_l or isxdigit_l.
Also, if you look into wparser_def.c what p_isxdigit is used for,
it's
used for parsing XML (presumably HTML) files, so we just need ASCII-
only
behavior and no locale dependency.
iswxdigit() does seem to be dependent on locale, so this could be a
subtle behavior change.
v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
As I mentioned earlier in the thread, I don't think we can do this
for
LC_CTYPE, because otherwise system error messages would not come out
in
the right encoding.
Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
set to datctype.
Unfortunately, as long as LC_CTYPE is set to a real locale, there's a
danger of accidentally depending on that setting. Can the encoding be
controlled with LC_MESSAGES instead of LC_CTYPE?
Do you have an example of how things can go wrong?
For the LC_COLLATE settings, I think we could just
do the setting in main(), where the other non-database-specific
locale
categories are set.
Done.
Regards,
Jeff Davis
Attachments:
On Wed, 2025-06-11 at 12:15 -0700, Jeff Davis wrote:
I changed this to a global_libc_locale that includes both LC_COLLATE
and LC_CTYPE (from datcollate and datctype), in case an extension is
relying on strcoll for some reason.
..
This patch series, at least so far, is designed to have zero behavior
changes. Anything with a potential for a behavior change should be a
separate commit, so that if we need to revert it, we can revert the
behavior change without reintroducing a setlocale() dependency.
...
I reworked it to be less confusing by changing wchar2char/char2wchar
to
take a locale_t instead of pg_locale_t. Hopefully it's an
improvement.
...
Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
set to datctype.
Attached rebased v3.
Regards,
Jeff Davis
Attachments:
On Wed, 2025-06-11 at 12:15 -0700, Jeff Davis wrote:
v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
As I mentioned earlier in the thread, I don't think we can do this
for
LC_CTYPE, because otherwise system error messages would not come
out
in
the right encoding.Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
set to datctype.Unfortunately, as long as LC_CTYPE is set to a real locale, there's a
danger of accidentally depending on that setting. Can the encoding be
controlled with LC_MESSAGES instead of LC_CTYPE?Do you have an example of how things can go wrong?
I looked into this a bit, and if I understand correctly, the only
problem is with strerror() and strerror_r(), which depend on
LC_MESSAGES for the language but LC_CTYPE to find the right encoding.
I attached some example C code to illustrate how strerror() is affected
by both LC_MESSAGES and LC_CTYPE. For example:
$ ./strerror de_DE.UTF-8 de_DE.UTF-8
LC_CTYPE set to: de_DE.UTF-8
LC_MESSAGES set to: de_DE.UTF-8
Error message (from strerror(EILSEQ)): Ungültiges oder
unvollständiges Multi-Byte- oder Wide-Zeichen
$ ./strerror C de_DE.UTF-8
LC_CTYPE set to: C
LC_MESSAGES set to: de_DE.UTF-8
Error message (from strerror(EILSEQ)): Ung?ltiges oder
unvollst?ndiges Multi-Byte- oder Wide-Zeichen
On unix-based systems, we can use newlocale() to initialize a global
variable with both LC_CTYPE and LC_MESSAGES set. The LC_MESSAGES
portion would need to be updated every time the GUC changes, which is
not great.
Windows would be a different story, though: strerror() doesn't seem to
have a variant that accepts a _locale_t object, and even if it did, I
don't see a way to create a _locale_t object with LC_MESSAGES and
LC_CTYPE set to different values. One idea is to use
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE), and then use
setlocale(), which could enable us to use setlocale() similar to how we
use uselocale() on other systems. That would be awkward, though.
Thoughts? That seems like a lot of work just for the case of
strerror()/strerror_r().
Regards,
Jeff Davis
[1]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170
Attachments:
On Tue, 2025-07-01 at 08:06 -0700, Jeff Davis wrote:
Attached rebased v3.
And here's v4.
I changed the global variable to only hold the LC_CTYPE (not
LC_COLLATE), because windows doesn't support a _locale_t that
represents multiple categories with different locales.
This patch series is designed to not have any changes in behavior.
There was some feedback that I could go further, but I'll leave those
suggestions for future patches, in case one causes an unexpected
behavior change and needs to be reverted. I intend to start committing
these soon.
v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if
the buildfarm complains then I'll fix it or revert it.
Regards,
Jeff Davis
Attachments:
On Mon, 2025-07-07 at 17:56 -0700, Jeff Davis wrote:
I looked into this a bit, and if I understand correctly, the only
problem is with strerror() and strerror_r(), which depend on
LC_MESSAGES for the language but LC_CTYPE to find the right encoding.
...
Windows would be a different story, though: strerror() doesn't seem
to
have a variant that accepts a _locale_t object, and even if it did, I
don't see a way to create a _locale_t object with LC_MESSAGES and
LC_CTYPE set to different values.
I think I have an answer to the second part here:
"For information about the format of the locale argument, see Locale
names, Languages, and Country/Region strings."
and when I follow that link, I see:
"You can specify multiple category types, separated by semicolons.
Category types that aren't specified use the current locale setting.
For example, this code snippet sets the current locale for all
categories to de-DE, and then sets the categories LC_MONETARY to en-GB
and LC_TIME to es-ES:
_wsetlocale(LC_ALL, L"de-DE");
_wsetlocale(LC_ALL, L"LC_MONETARY=en-GB;LC_TIME=es-ES");"
So we just need to construct a string of the right form, and we can
have a _locale_t object representing the global locale for all
categories. I'm not sure exactly how we escape the individual locale
names, but it might be enough to just reject ';' in the locale name (at
least for windows).
The first problem -- how to affect the encoding of strings returned by
strerror() on windows -- may be solvable as well. It looks like
LC_MESSAGES is not supported at all on windows, so the only thing to be
concerned about is the encoding, which is affected by LC_CTYPE. But
windows doesn't offer uselocale() or strerror_l(). The only way seems
to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
back to "C" right afterward. Comments welcome.
Regards,
Jeff Davis
On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com> wrote:
The first problem -- how to affect the encoding of strings returned by
strerror() on windows -- may be solvable as well. It looks like
LC_MESSAGES is not supported at all on windows, so the only thing to be
concerned about is the encoding, which is affected by LC_CTYPE. But
windows doesn't offer uselocale() or strerror_l(). The only way seems
to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
back to "C" right afterward. Comments welcome.
FWIW there is an example of that in src/port/pg_localeconv_r.c.
On Tue, Jul 8, 2025 at 1:14 PM Jeff Davis <pgsql@j-davis.com> wrote:
v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if
the buildfarm complains then I'll fix it or revert it.
(Catching up with this thread...)
LC_C_LOCALE is definitely not portable: I've only seen it on macOS and
NetBSD. It would be a good thing to propose to POSIX, since no other
approach can be retrofitted quite so cromulently...
I tried to make a portable PG_C_LOCALE mechanism like that, but it was
reverted for reasons needing more investigation... see
8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
On Thu, 2025-07-10 at 11:53 +1200, Thomas Munro wrote:
On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com>
wrote:The first problem -- how to affect the encoding of strings returned
by
strerror() on windows -- may be solvable as well. It looks like
LC_MESSAGES is not supported at all on windows, so the only thing
to be
concerned about is the encoding, which is affected by LC_CTYPE. But
windows doesn't offer uselocale() or strerror_l(). The only way
seems
to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and
then
setlocale(LC_CTYPE, datctype) right before strerror(), and switch
it
back to "C" right afterward. Comments welcome.FWIW there is an example of that in src/port/pg_localeconv_r.c.
OK, so it seems we have a path forward here:
1. Have a global_libc_locale that represents all of the categories, and
keep it up to date with GUC changes. On windows, it requires keeping
the textual locale names handy (e.g. copies of datcollate and
datctype), and building the special locale string and doing
_create_locale(LC_ALL, "LC_ABC=somelocale;LC_XYZ=otherlocale").
2. When there's no _l() variant of a function, like strerror_r(), wrap
with uselocale(). On windows, this means using the trick above with
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE).
I don't have a great windows development environment, and it appears CI
and the buildfarm don't offer great coverage either. Can I ask for a
volunteer to do the windows side of this work?
Regards,
Jeff Davis
On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
I tried to make a portable PG_C_LOCALE mechanism like that, but it
was
reverted for reasons needing more investigation... see
8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
The revert seems to be related to pgport_shlib. At least for my current
work, I'm focused on removing setlocale() dependencies in the backend,
and a PG_C_LOCALE should work fine there.
Regards,
Jeff Davis
On Fri, Jul 11, 2025 at 6:33 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
I tried to make a portable PG_C_LOCALE mechanism like that, but it
was
reverted for reasons needing more investigation... see
8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
3c8e463b0d885e0d976f6a13a1fb78187b25c86f).The revert seems to be related to pgport_shlib. At least for my current
work, I'm focused on removing setlocale() dependencies in the backend,
and a PG_C_LOCALE should work fine there.
OK, I'll figure out what happened with that and try to post a new
version over on that other thread soon.
(FWIW I learned a couple of encouraging things about that topic:
glibc's newlocale(LC_ALL, NULL, 0) seems to give you a static
singleton anyway, no allocation happens, so it can't fail in practice
and it's cheap, and FreeBSD also supports LC_C_LOCALE like NetBSD and
macOS, it just doesn't have a name, you pass NULL instead (which is
the value that LC_C_LOCALE has on those other systems). I actually
rather like the macro, because how else are you supposed to test
whether this system can accept NULL there?)
On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
I don't have a great windows development environment, and it appears CI
and the buildfarm don't offer great coverage either. Can I ask for a
volunteer to do the windows side of this work?
Me neither but I'm willing to help with that, and have done lots of
closely related things through trial-by-CI...
On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
I don't have a great windows development environment, and it
appears CI
and the buildfarm don't offer great coverage either. Can I ask for
a
volunteer to do the windows side of this work?Me neither but I'm willing to help with that, and have done lots of
closely related things through trial-by-CI...
Attached a patch to separate the message translation (both gettext and
strerror translations) from setlocale(). That's a step towards thread
safety, and also a step toward setting LC_CTYPE=C permanently (more
work still required there).
The patch feels a bit over-engineered, but I'd like to know what you
think. It would be great if you could test/debug the windows NLS-
enabled paths.
I'm also not sure what to do about the NetBSD path. NetBSD has no
uselocale(), so I have to fall bad to temporary setlocale(), which is
not thread safe. And I'm getting a mysterious error in test_aio for
NetBSD, which I haven't investigated yet.
Regards,
Jeff Davis
Attachments:
On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
The patch feels a bit over-engineered, but I'd like to know what you
think. It would be great if you could test/debug the windows NLS-
enabled paths.
Let me explain how it ended up looking over-engineered, and perhaps
someone has a simpler solution.
For gettext, we already configure the encoding with
bind_textdomain_codeset(). All it needs is LC_MESSAGES set properly,
which can be done with uselocale(), as a semi-permanent setting until
the next GUC change, just like setlocale() today. There are a couple
minor problems for platforms without uselocale(). For windows, we could
just permanently do:
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE)
and then use _wsetlocale. For NetBSD, I don't have a solution, but
perhaps we can just reject new lc_messages settings after startup, or
just defer the problem until threading actually becomes a pressing
issue.
The main problem is with strerror_r(). To get the right LC_MESSAGES
setting, we need the separate path for windows (which has neither
uselocale() nor strerror_l()). Because we need to keep track of that
path anyway, I used it for gettext as well to have a cleaner separation
for the entire message translation locale. That means we can avoid
permanent locale settings, and reduce the chances that we accidentally
depend on the global locale.
Regards,
Jeff Davis
On Thu, 2025-07-24 at 11:10 -0700, Jeff Davis wrote:
The main problem is with strerror_r()...
Postgres messages, like "division by zero" are translated just fine
without LC_CTYPE; gettext() only needs LC_MESSAGES and the server
encoding. So these are fine.
We use strerror_r() to translate the system errno into a readable
message, like "No such file or directory", i.e. the %m replacements.
That needs LC_CTYPE set (just for the encoding, not the
language/region) as well as LC_MESSAGES (for the language/region).
When using a locale provider other than libc, it's unfortunate to
require LC_CTYPE to be set for just this one single purpose. The locale
itself, e.g. the "en_US" part, is not used at all; only the encoding
part of the setting is relevant. And there is no value other than "C"
that works on all platforms. It's fairly confusing to explain why the
LC_CTYPE setting is required for the builtin or ICU providers at all.
Also, while it's far from the biggest challenge when it comes to
multithreading, it does cause thread-safety headaches on platforms
without uselocale().
Perhaps we could get the ASCII message and run it through gettext()?
That would be extra work for translators, but perhaps not a lot, given
that it's a small and static set of messages in practice. That would
also have the benefit that either NLS is enabled or not -- right now,
since the translation happens in two different ways you can end up with
partially-translated messages. It would also result in consistent
translations across platforms.
Regards,
Jeff Davis
On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
wrote:I don't have a great windows development environment, and it
appears CI
and the buildfarm don't offer great coverage either. Can I ask
for
a
volunteer to do the windows side of this work?Me neither but I'm willing to help with that, and have done lots of
closely related things through trial-by-CI...
Attached a new patch series, v6.
Rather than creating new global locale_t objects, this series (along
with a separate patch for NLS[1]/messages/by-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com) removes the dependency on the global
LC_CTYPE entirely. It's a bunch of small patches that replace direct
calls to tolower()/toupper() with calls into the provider.
An assumption of these patches is that, in the UTF-8 encoding, the
logic in pg_tolower()/pg_toupper() is equivalent to
pg_ascii_tolower()/pg_ascii_toupper().
Generally these preserve existing behavior, but there are a couple
differences:
* If using the builtin C locale (not C.UTF-8) along with a datctype
that's a non-C locale with single-byte encoding, it could affect the
results of downcase_identifier(), ltree, and fuzzystrmatch on
characters > 127. For ICU, I went to a bit of extra effort to preserve
the existing behavior here, because it's more likely to be used for
single-byte encodings.
* When using ICU or builtin C.UTF-8, along with a datctype of
"tr_TR.UTF-8", then it will affect ltree's and fuzzystrmatch's
treatment of i/I.
If these are a concern we can fix them with some hacks, but those
behaviors seem fairly obscure to me.
Regards,
Jeff Davis
[1]: /messages/by-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
/messages/by-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
Attachments:
On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
Attached a new patch series, v6.
I'm eager to start committing this series so that we have plenty of
time to sort out any problems. I welcome feedback before or after
commit, and I can revert if necessary.
The goal here is to do a permanent:
setlocale(LC_CTYPE, "C")
in the postmaster, and instead use _l() variants where necessary.
Forcing the global LC_CTYPE to C will avoid platform-specific nuances
spread throughout the code, and prevent new code from accidentally
depending on platform-specific libc behavior. Instead, libc ctype
behavior will only happen through a pg_locale_t object.
It also takes us a step closer to thread safety.
LC_COLLATE was already permenently set to "C" (5e6e42e4), and most of
LC_CTYPE behavior already uses a pg_locale_t object. This series is
about removing the last few places that rely on raw calls to
tolower()/toupper() (sometimes through pg_tolower()). Where there isn't
a pg_locale_t immediately available it uses the database default locale
(which might or might not be libc).
There's another thread for what to do about strerror_r[1], which
depends on LC_CTYPE for the encoding:
/messages/by-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
pg_localeconv_r() does depend on the LC_CTYPE for the encoding, but it
already sets it from lc_monetary and lc_numeric, without using datctype
or the global setting. Then PGLC_localeconv() converts to the database
encoding, if necessary. So it's an exception to the rule that all ctype
behavior goes through a pg_locale_t, but it's not a problem. (Aside: we
could consider this approach as a narrower fix for strerror_r(), as
well.)
There may be a loose end around plperl, as well, but not sure if this
will make it any worse.
Some other LC_* settings still rely on setlocale(), which can be
considered separately unless there's some interaction that I missed.
Note that the datcollate and datctype fields are already mostly
irrelevant for non-libc providers. We could set those to NULL, but for
now I don't intend to do that.
Regards,
Jeff Davis
Jeff Davis wrote:
The goal here is to do a permanent:
setlocale(LC_CTYPE, "C")
in the postmaster, and instead use _l() variants where necessary.
What about code in extensions? AFAIU a user can control the
locale in effect by setting the LC_CTYPE argument of
CREATE DATABASE, which ends up in the environment
of backends serving that database.
If it's forced to "C", how can an extension use locale-aware
libc functions?
In theory it's the same problem with LC_COLLATE, except
that functions like tolower()/toupper() are much more likely
to be used in extensions than strcoll().
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
What about code in extensions? AFAIU a user can control the
locale in effect by setting the LC_CTYPE argument of
CREATE DATABASE, which ends up in the environment
of backends serving that database.
If it's forced to "C", how can an extension use locale-aware
libc functions?
Extensions often need to be updated for a new major version.
The extension should call pg_database_locale(), and pass that to a
function exposed in pg_locale.h. A recent commit exposed pg_iswalpha(),
etc., so there's a reasonable set of functions that should be suitable
for most purposes.
If it's not available in pg_locale.h, or the extension really needs to
use a different LC_CTYPE for some reason, it can use an _l() variant of
the function.
Regards,
Jeff Davis
On 30.10.25 18:17, Jeff Davis wrote:
On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
Attached a new patch series, v6.
I'm eager to start committing this series so that we have plenty of
time to sort out any problems. I welcome feedback before or after
commit, and I can revert if necessary.
What is one supposed to do with this statement? You post a series of 9
patches and the next day you say you are eager to commit it? Do you not
want to give others the time to properly review this? The patches say
they are "v6", but AFAICT the previous patches "v5" and "v4" in this
thread are substantially different from these.
The goal here is to do a permanent:
setlocale(LC_CTYPE, "C")
in the postmaster, and instead use _l() variants where necessary.
Forcing the global LC_CTYPE to C will avoid platform-specific nuances
spread throughout the code, and prevent new code from accidentally
depending on platform-specific libc behavior. Instead, libc ctype
behavior will only happen through a pg_locale_t object.It also takes us a step closer to thread safety.
At first glance, these patches seem reasonable steps into that direction.
But I'm not sure that we actually want to make that switch. It would be
good if our code is independent of the global locale settings, but that
doesn't mean that there couldn't be code in extensions, other libraries,
or other corners of the operating system that relies on this. In
general, and I haven't looked this up in the applicable standards, it
seems like a good idea to accurately declare what encoding you operate in.
Jeff Davis wrote:
On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
What about code in extensions? AFAIU a user can control the
locale in effect by setting the LC_CTYPE argument of
CREATE DATABASE, which ends up in the environment
of backends serving that database.
If it's forced to "C", how can an extension use locale-aware
libc functions?Extensions often need to be updated for a new major version.
I think forcing the C locale is not comparable to API changes,
and the consequences are not even necessarily fixable for extensions.
For instance, consider the following function, when run in a database
with en_US.utf8 as locale.
CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
use locale; return ($_[0] lt $_[1])?1:0;
$$ LANGUAGE plperlu;
select lt_test('a', 'B');
With PG 18 it returns true
With 19devel it returns false.
This is since commit 5e6e42e4 doing that:
+ * Collation is handled by pg_locale.c, and the behavior is dependent
on
+ * the provider. strcoll(), etc., should not be called directly.
+ */
+ init_locale("LC_COLLATE", LC_COLLATE, "C");
+
+ /*
Obviously libperl is not going to be updated to call Postgres
string comparisons functions instead of strcoll().
The same is probably true for other languages available as
extensions that expose POSIX locale-aware functions.
Extending this logic to LC_CTYPE will extend the breakage.
While I agree with the goal of not depending on setlocale()
in the core code for anything that should be locale-provider
dependent, making this goal leak into extensions seems
unnecessarily user-hostile. What it's saying to users is,
before v19 you could choose your locale, and starting
with v19 you'll have "C" whether you want it or not.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Fri, 2025-10-31 at 15:01 +0100, Daniel Verite wrote:
Extensions often need to be updated for a new major version.
I think forcing the C locale is not comparable to API changes,
and the consequences are not even necessarily fixable for extensions.
Are we in agreement that it's fine for C extensions?
For instance, consider the following function, when run in a database
with en_US.utf8 as locale.CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
use locale; return ($_[0] lt $_[1])?1:0;
$$ LANGUAGE plperlu;select lt_test('a', 'B');
Are you aware of PL code that does things like that? If the database
locale is ICU, that would be at least a little bit confusing.
Regards,
Jeff Davis
On Fri, 2025-10-31 at 10:40 +0100, Peter Eisentraut wrote:
But I'm not sure that we actually want to make that switch. It would
be
good if our code is independent of the global locale settings, but
that
doesn't mean that there couldn't be code in extensions, other
libraries,
or other corners of the operating system that relies on this.
This question has been brewing for a while. How should we make this
decision?
In
general, and I haven't looked this up in the applicable standards, it
seems like a good idea to accurately declare what encoding you
operate in.
One frustration (for me, at least) is that there is no way to set the
encoding without specifying the locale. LC_CTYPE=C.UTF-8 is close, but
the libc version is not available on all platforms and has some quirks.
That makes any changes to the initdb default logic difficult to sort
out. Some combinations which seem simple -- like ICU/UTF8 -- need to
handle the case when LC_CTYPE is not compatible with UTF-8, even though
the LC_CTYPE has no effect in that case.
Regards,
Jeff Davis
Jeff Davis wrote:
Extensions often need to be updated for a new major version.
I think forcing the C locale is not comparable to API changes,
and the consequences are not even necessarily fixable for extensions.Are we in agreement that it's fine for C extensions?
No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.
Forcing "C" is a disruptive change, that IMO does
not seem compensated by substantial advantages
that would justify the disruption.
CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
use locale; return ($_[0] lt $_[1])?1:0;
$$ LANGUAGE plperlu;select lt_test('a', 'B');
Are you aware of PL code that does things like that? If the database
locale is ICU, that would be at least a little bit confusing.
plperl users writing "use locale" should understand that
it's the libc locale, like when this code is run outside Postgres.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.
Is there a clean way to document this behavior? I have tried to improve
the documentation in this area before, but it's not easy because the
behavior is so nuanced.
The CREATE DATABASE command needs LOCALE (or LC_CTYPE/LC_COLLATE). But
it's easy to get LOCALE and ICU_LOCALE or BUILTIN_LOCALE confused. And
similarly for initdb. We could clean those up a lot if
LC_CTYPE/LC_COLLATE didn't even need to be set for non-libc providers.
Users can end up in a situation where they need to define
LC_CTYPE/LC_COLLATE, even though it has almost (but not entirely) no
effect:
/messages/by-id/f794e177b0b1ed8917e75258726ae315cf8fbbef.camel@j-davis.com
Reverting commit 5e6e42e may be the right thing, but I'd like to hear
what others have to say on this point first. In particualr, I'd like to
know whether such a revert is based on principle, a practical problem,
or just an abundance of caution.
Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
providers, but leave it as in v17 for libc providers. If someone
explicitly wants libc behavior (by specifying something like "use
locale" in plperl), they probably want to be using the libc provider
for the database.
Regards,
Jeff Davis
On Mon, 2025-11-03 at 11:59 -0800, Jeff Davis wrote:
Reverting commit 5e6e42e may be the right thing, but I'd like to hear
what others have to say on this point first. In particualr, I'd like
to
know whether such a revert is based on principle, a practical
problem,
or just an abundance of caution.Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
providers, but leave it as in v17 for libc providers. If someone
explicitly wants libc behavior (by specifying something like "use
locale" in plperl), they probably want to be using the libc provider
for the database.
Actually, there's yet another option: lc_ctype and lc_collate could be
GUCs again. If they don't affect any backend behavior, then why not?
The user would have complete control.
(Probably PGC_POSTMASTER, because one of the goals is to not rely on
setlocale() in the backends.)
Of course, we need to be sure that they are compatible with the
database encoding. We have code to do that, but not sure how reliable
it is across platforms.
Regards,
Jeff Davis
On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.
As phrased, that appears to be a promise that we will never support
thread-per-connection. setlocale() is not thread-safe, and uselocale()
is not available on NetBSD.
Regards,
Jeff Davis
On 03.11.25 20:14, Daniel Verite wrote:
No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.Forcing "C" is a disruptive change, that IMO does
not seem compensated by substantial advantages
that would justify the disruption.
From my perspective, the difference between LC_COLLATE and LC_CTYPE is
that LC_COLLATE has a quite limited impact area. Either your code uses
strcoll() (or strxfrm()) or it does not. And if it does, you can find
all the places and adjust them, and it probably won't be that many
places. The impact area of LC_CTYPE is much larger and more complicated
and possibly interacts with other settings and third-party libraries in
ways that we don't understand yet and might not be able to change.
That's why I'm more hesitant about it. But I don't see any reason to
keep LC_COLLATE set going forward.
On 29.10.25 01:19, Jeff Davis wrote:
On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
On Fri, Jul 11, 2025 at 6:22â¯AM Jeff Davis <pgsql@j-davis.com>
wrote:I don't have a great windows development environment, and it
appears CI
and the buildfarm don't offer great coverage either. Can I ask
for
a
volunteer to do the windows side of this work?Me neither but I'm willing to help with that, and have done lots of
closely related things through trial-by-CI...Attached a new patch series, v6.
Rather than creating new global locale_t objects, this series (along
with a separate patch for NLS[1]) removes the dependency on the global
LC_CTYPE entirely. It's a bunch of small patches that replace direct
calls to tolower()/toupper() with calls into the provider.An assumption of these patches is that, in the UTF-8 encoding, the
logic in pg_tolower()/pg_toupper() is equivalent to
pg_ascii_tolower()/pg_ascii_toupper().
I'm getting a bit confused by all these different variant function
names. Like we have now
tolower
TOLOWER
char_tolower
pg_tolower
pg_strlower
pg_ascii_tolower
downcase_identifier
and maybe more, and upper versions.
This patch set makes changes like
- else if (IS_HIGHBIT_SET(ch2) && isupper(ch2))
- ch2 = tolower(ch2);
+ else if (IS_HIGHBIT_SET(ch2))
+ ch2 = TOLOWER(ch2);
So there is apparently some semantic difference between tolower() and
TOLOWER(), which is represented by the fact that the function name is
all upper case? Actually, it's a macro and could mean different things
in different contexts.
And there is very little documentation accompanying all these different
functions. For example, struct collate_methods and struct ctype_methods
contain barely any documentation at all.
Many of these issues are pre-existing, but I just figured it has reached
a point where we need to do something about it.
On Wed, 2025-11-12 at 19:41 +0100, Peter Eisentraut wrote:
The impact area of LC_CTYPE is much larger and more complicated
and possibly interacts with other settings and third-party libraries
in
ways that we don't understand yet and might not be able to change.
That's why I'm more hesitant about it.
What do you think about making lc_ctype and/or lc_collate into GUCs
(like lc_messages), assuming we remove all known effects in the backend
first?
If we make the setting PGC_POSTMASTER, then it eliminates potential
problems with threading, because setlocale() would be called only once
before allowing connections. For platforms that support uselocale(), we
could allow it to be set more freely, for those who need it set to
different values in different backends.
It would also be easier to document, which would be nice. There could
be some confusion if various settings are inconsistent with each other,
but that's true currently. And we'd still enforce a ctype that's
consistent with the encoding, at least.
Regards,
Jeff Davis
On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
I'm getting a bit confused by all these different variant function
names.
One way of looking at it is that the functions in this patch series
mostly affect how identifiers are treated, whereas earlier collation-
related work affects how text data is treated. Ideally, they should be
similar, but for historical reasons they're not.
There are a lot of subtle behaviors for identifiers, which individually
make some sense, but over time have just become edge cases and sources
of inconsistency:
downcase_identifier() is a server function to casefold unquoted
identifiers during parsing (used by other callers, too). For non-ascii
characters in a single-byte encoding, it uses tolower(); otherwise the
lowercasing is ascii-only. Note: if an application is reliant on the
casefolding of non-ascii identifiers, such as SELECT * FROM É finding
the table named "é", that application would not work in UTF-8 even with
a dump/restore.
pg_strcasecmp() and pg_tolower() are used from the server and the
client to do case-insensitive comparison of option names. They're
supposed to use the same casing semantics as downcase_identifier(), but
they don't differentiate between single-byte and multi-byte encodings;
they just call tolower() on any non-ascii byte. That difference
probably doesn't matter for UTF8, because tolower() on a single byte in
a multibyte sequence should be a no-op, but perhaps it can matter in
non-UTF-8 multibyte encodings.
It's hard to avoid some confusion unless we're able to simplify some of
these behaviors. Let me know if you think we can tolerate some
simplifications in these edge cases without breaking anything too
badly.
Many of these issues are pre-existing, but I just figured it has
reached
a point where we need to do something about it.
Starting from first principles, individual character operations should
be mostly for parsing (e.g. tsearch) or pattern matching. Case folding
and caseless matching should be done with string operations. And
obviously all of this should be multibyte aware and work consistently
in different encodings (to the extent possible given the
representational constraints).
Our APIs in pg_locale.c do a good job of offering that, and do not
depend on the global LC_CTYPE. (There are a few things I'd like to add
or clean up, but it offers most of what we need.)
The problem, of course, is migrating the callers to use pg_locale.c
APIs without breaking things. This patch series is intended to make
everything locale-sensitive in the backend go through pg_locale_t
without any behavior changes. The benefit is that it would at least
remove the global LC_CTYPE dependency, but it ends up with hacky
compatibility methods like char_tolower(), which piles on to the
already-confusing set of tolower-like functions.
In an earlier approach:
/messages/by-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com
I added a strfold_ident() method. That's easier to understand for
downcase_identifier(), but didn't solve the problems for other
callsites that depend on tolower(), and so I'd need to add more methods
for those places, and started to look unpleasant.
And earlier in this thread, I had tried the approach of using a global
variable to hold a locale representing datctype. That felt a bit weird,
though, because it mostly only matters when datlocprovider='c', and in
that case, there's already a locale_t initialized along with the
default collation. So why not find a way to go through the default
collation?
I still favor the approach used in the current patch series to remove
the dependency on the global LC_CTYPE, but I'm open to suggestion.
Whatever we do will probably require some additional hacking later
anyway.
I tried to improve the comments in pgstrcasecmp.c, and I rebased.
Regards,
Jeff Davis
Attachments:
On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
Many of these issues are pre-existing, but I just figured it has
reached
a point where we need to do something about it.
I tried to simplify things in this patch series, assuming that we have
some tolerance for small behavior changes.
0001: No behavior change here, same patch as before. Uncontroversial
simplification, so I plan to commit this soon.
0002: change fuzzystrmatch to use ASCII semantics. As far as I can
tell, this only affects the results of soundex(). Before the patch, in
en_US.iso885915, soundex('réd') was 'RÉ30', after the patch it's
'Ré30'. I'm not sure whether the current behavior is intentional or
not. Other functions (daitch_mokotoff, levenshtein, and metaphone) are
unaffected as far as I can tell.
0003+0005: change ltree to use case folding instead of tolower(). I
believe this is a bug fix, because the current code is inconsistent
between ltree_strncasecmp() and ltree_crc32_sz().
0006-0007: Remove char_tolower() API. This also removes the
optimization for single-byte encodings with the libc provider and a
non-C locale, but simplifies the code (the optimization is retained for
the C locale). It's possible to make the lazy-folding optimization work
for all locales without the char_tolower() API by doing something
simlar to what 0004 does for ltree. But to make this work efficiently
for Generic_Text_IC_like() would be a bit more complex: we'd need to
adjust MatchText() to be able to fold the arguments lazily, and perhaps
introduce some kind of casemapping iterator. That's already a pretty
complex function, so I'm hesitant to do that work unless the
optimization is important.
These patches don't get us quite to the point of eliminating the
LC_CTYPE dependency (there's still downcase_identifier() and
pg_strcasecmp() to worry about, and some assorted isxyz() calls to
examine), but they simplify things enough that the path forward will be
easier.
Regards,
Jeff Davis
Attachments:
On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
Many of these issues are pre-existing, but I just figured it has
reached
a point where we need to do something about it.I tried to simplify things in this patch series, assuming that we
have
some tolerance for small behavior changes.0001: No behavior change here, same patch as before. Uncontroversial
simplification, so I plan to commit this soon.
Committed.
New series attached, which I tried to put in an order that would be
reasonable for commit.
0001-0004: Pure refactoring patches. I intend to commit a couple of
these soon.
0005: No behavioral change, and not much change at all. Computes the
"max_chr" for regexes (a performance optimization for low codepoints)
more consistently and simply based on the encoding.
0006: fixes longstanding ltree bug due to inconsistency between the
database locale and the global LC_CTYPE setting when using a non-libc
provider. The end result is also cleaner: use the database locale
consistently, like tsearch. I don't intend to backport this, unless
someone thinks it should be, but it should come with a release note to
reindex ltree indexes if using a non-libc provider.
0007: remove the char_tolower() API completely. We'd lose a pattern
matching optimization for single-byte encodings with libc and a non-C
locale, but it's a significant simplification. We could go even further
and change this to use casefolding rather than lower(), but that seems
like a separate change.
0008: Multibyte-aware extraction of pattern prefixes. The previous code
gave up on any byte that it didn't understand, which made prefixes
unnecessarily short. This patch is also cleaner.
0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
in the extension are unaffected, but soundex() can be affected, and I'm
not sure what exactly it's supposed to do with non-ASCII.
0010: For downcase_identifier(), use a new provider-specific
pg_strfold_ident() method. The ICU version of this method is a work-in-
progress, because right now it depends on libc. I suppose it should
decode to UTF-32, then go through u_tolower(), then re-encode -- but
can the re-encoding fail? In any case, it would be a behavior change
for identifier casefolding with ICU and a single-byte encoding, which
is probably OK but the risk is non-zero.
0011: POC patch to introduce lc_collate GUC. It would only affect
extensions, PLs, libraries, or other non-core code that happens to call
strcoll() or strxfrm(). This would address Daniel's complaint, but it's
more flexible. And by being a GUC, it's clear that we shouldn't depend
on it for any stored data. We can do something similar for LC_CTYPE
after we eliminate dependencies in core code.
Regards,
Jeff Davis
Attachments:
Hi Jeff,
I have reviewed 0001-0004 and got a few comments:
On Nov 25, 2025, at 07:57, Jeff Davis <pgsql@j-davis.com> wrote:
0001-0004: Pure refactoring patches. I intend to commit a couple of
these soon.
1 - 0001
```
+/*
+ * Fold a character to upper case, following C/POSIX locale rules.
+ */
+static inline unsigned char
+pg_ascii_toupper(unsigned char ch)
```
I was curious why “inline” is needed, then I figured out when I tried to build. Without “inline”, compile will raise warnings of “unused function”. So I guess it’s better to explain why “inline” is used in the function comment, otherwise other readers may get the same confusion.
2 - 0002
```
+ * three output codepoints. See Unicode 5.18.2, "Change in Length".
```
With “change in length”, I confirmed “Unicode 5.18.2” means the Unicode Standard Section 5.18.2 “Complications for Case Mapping”. Why don’t we just give the URL in the comment. https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675
3 - 0004
```
@@ -1282,10 +1327,18 @@ size_t
pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- if (locale->ctype->strfold)
- return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
- else
- return locale->ctype->strlower(dst, dstsize, src, srclen, locale);
+ if (locale->ctype == NULL)
+ {
+ int i;
+
+ srclen = (srclen >= 0) ? srclen : strlen(src);
+ for (i = 0; i < srclen && i < dstsize; i++)
+ dst[i] = pg_ascii_tolower(src[i]);
+ if (i < dstsize)
+ dst[i] = '\0';
+ return srclen;
+ }
+ return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
}
```
I don’t get this change. In old code, depending on locale->ctype->strfold, it calls strfold or strlower. But in this patch, it only calls strfold. Why? If that’s intentional, maybe better to add a comment to explain that.
4 - 0004
In pg_strfold, the ctype==NULL fallback code is exactly the same as pg_strlower. I guess you intentionally to not call pg_strlower here for performance consideration, is that true?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, 2025-11-25 at 09:44 +0800, Chao Li wrote:
I was curious why “inline” is needed, then I figured out when I tried
to build. Without “inline”, compile will raise warnings of “unused
function”. So I guess it’s better to explain why “inline” is used in
the function comment, otherwise other readers may get the same
confusion.
That's a typical pattern: to make it "inline", move it to a .h file and
declare it as "static inline". For common patterns like that, I don't
think we should explain them in comments, because it would mean we
would start adding comments in zillions of places.
With “change in length”, I confirmed “Unicode 5.18.2” means the
Unicode Standard Section 5.18.2 “Complications for Case Mapping”. Why
don’t we just give the URL in the comment.
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675
Done, thank you. (Though since we haven't actually moved to 17 yet, I
linked to 16 instead.)
I don’t get this change. In old code, depending on locale->ctype-
strfold, it calls strfold or strlower. But in this patch, it only
calls strfold. Why? If that’s intentional, maybe better to add a
comment to explain that.
I thought it would be slightly cleaner to just define the strfold
method in the libc provider as the same as strlower. I agree it's worth
a comment, so I added some in pg_locale_libc.c.
In pg_strfold, the ctype==NULL fallback code is exactly the same as
pg_strlower. I guess you intentionally to not call pg_strlower here
for performance consideration, is that true?
I made some static functions to clean that up, and added some comments.
Thank you.
New series attached with only these changes and a rebase.
Regards,
Jeff Davis
Attachments:
On Nov 26, 2025, at 01:20, Jeff Davis <pgsql@j-davis.com> wrote:
New series attached with only these changes and a rebase.
Regards,
Jeff Davis<v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>
I continued reviewing 0005-0008.
0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR should be the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there should be not a behavior change.
5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
int
ltree_strncasecmp(const char *a, const char *b, size_t s)
{
- char *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+ static pg_locale_t locale = NULL;
+ size_t al_sz = s + 1;
+ char *al = palloc(al_sz);
+ size_t bl_sz = s + 1;
+ char *bl = palloc(bl_sz);
+ size_t needed;
int res;
+ if (!locale)
+ locale = pg_database_locale();
+
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ if (needed + 1 > al_sz)
+ {
+ /* grow buffer if needed and retry */
+ al_sz = needed + 1;
+ al = repalloc(al, al_sz);
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ Assert(needed + 1 <= al_sz);
+ }
+
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ if (needed + 1 > bl_sz)
+ {
+ /* grow buffer if needed and retry */
+ bl_sz = needed + 1;
+ bl = repalloc(bl, bl_sz);
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ Assert(needed + 1 <= bl_sz);
+ }
+
res = strncmp(al, bl, s);
pfree(al);
```
I do think the new implementation has some problem.
* The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not needed because you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables: neededA and neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be perform strncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA).
* Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer, so when retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens, we have no information to decide from where to resume. From this perspective, in 0004, do we really need to take the try-the-best strategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer even if dest buffer doesn’t have enough space, then my patch [1]/messages/by-id/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com of changing strlower_libc_sb() should be considered.
6 - 0007
```
/*
- * For efficiency reasons, in the single byte case we don't call lower()
- * on the pattern and text, but instead call SB_lower_char on each
- * character. In the multi-byte case we don't have much choice :-(. Also,
- * ICU does not support single-character case folding, so we go the long
- * way.
+ * For efficiency reasons, in the C locale we don't call lower() on the
+ * pattern and text, but instead call SB_lower_char on each character.
*/
```
SB_lower_char should be changed to C_IMatchText.
7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
```
I think the comment should be updated accordingly, like “for ILIKE in the C locale”.
8 - 0008
```
+ /* for text types, use pg_wchar; for BYTEA, use char */
if (typeid != BYTEAOID)
{
- patt = TextDatumGetCString(patt_const->constvalue);
- pattlen = strlen(patt);
+ text *val = DatumGetTextPP(patt_const->constvalue);
+ pg_wchar *wpatt;
+ pg_wchar *wmatch;
+ char *match;
+
+ patt = VARDATA_ANY(val);
+ pattlen = VARSIZE_ANY_EXHDR(val);
+ wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+ wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+ pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+ match = palloc(pattlen + 1);
```
* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?
* match is not freed, but looks like it should be:
*prefix_const = string_to_const(match, typeid);
-> string_to_datum(str, datatype);
-> CStringGetTextDatum(str);
-> cstring_to_text(s)
-> cstring_to_text_with_len(s, strlen(s));
-> *result = (text *) palloc(len + VARHDRSZ);
So, match has been copied, it should be freed.
9 - 0008
```
- }
+ /* Backslash escapes the next character */
+ if (patt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
- match[match_pos] = '\0';
+ match[match_pos++] = pos;
+ }
```
Should “pos” be “part[pos]” assigning to match[match_pos++]?
I will review the rest 3 commits tomorrow.
[1]: /messages/by-id/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
I will review the rest 3 commits tomorrow.
10 - 0009
```
{
if (isalpha((unsigned char) c))
{
- c = toupper((unsigned char) c);
+ c = pg_ascii_toupper((unsigned char) c);
```
Just curious. As isaplha() and toupper() come from the same header file ctype.h, if we replace toupper with pg_ascii_toupper, does isapha also need to be handled?
11 - 0010
```
- for (i = 0; i < len; i++)
- {
- unsigned char ch = (unsigned char) ident[i];
+ dstsize = len + 1;
+ result = palloc(dstsize);
- if (ch >= 'A' && ch <= 'Z')
- ch += 'a' - 'A';
- else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
- ch = tolower(ch);
- result[i] = (char) ch;
- }
- result[i] = '\0';
+ needed = pg_strfold_ident(result, dstsize, ident, len);
+ Assert(needed + 1 == dstsize);
+ Assert(needed == len);
```
I think assert both dstsize and len are redundant, because dstsize=len+1, and no place to change their values.
12 - 0010
```
+/*
+ * Fold an identifier using the database default locale.
+ *
+ * For historical reasons, does not use ordinary locale behavior. Should only
+ * be used for identifier folding. XXX: can we make this equivalent to
+ * pg_strfold(..., default_locale)?
+ */
+size_t
+pg_strfold_ident(char *dest, size_t destsize, const char *src, ssize_t srclen)
+{
+ if (default_locale == NULL || default_locale->ctype == NULL)
+ {
+ int i;
+
+ for (i = 0; i < srclen && i < destsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dest[i] = (char) ch;
+ }
+
+ if (i < destsize)
+ dest[i] = '\0';
+
+ return srclen;
+ }
+ return default_locale->ctype->strfold_ident(dest, destsize, src, srclen,
+ default_locale);
+}
```
Given default_local can be NULL only at some specific moment, can we do something like
Local = default_local;
If (local == NULL || local->ctype == NULL)
Local = libc or other fallback;
Return default_locale->ctype->strfold_ident(dest, destsize, src, srclen, local);
This way avoids the duplicate code.
13 - 0011
```
+{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
+ short_desc => 'Sets the locale for text ordering in extensions.',
```
I just feel the GUC name is very misleading. Without carefully reading the doc, users may very easy to consider lc_collate the system’s locale. If it only affects extensions, then let’s name it accordingly, for example, “extension_lc_collate”, or “legacy_lc_collate”.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, 2025-11-26 at 09:50 +0800, Chao Li wrote:
* The retry logic implies that a single-byte char may become multiple
bytes after folding, otherwise retry is not needed because you have
allocated s+1 bytes for dest buffers. From this perspective, we
should use two needed variables: neededA and neededB, if neededA !=
neededB, then the two strings are different; if neededA == neededB,
then we should be perform strncmp, but here we should pass neededA
(or neededB as they are identical) to strncmp(al, bl, neededA).
Thank you.
It's actually worse than that: having a single 's' argument is just
completely wrong. Consider:
a: U&'x\0394\0394\0394'
b: U&'\0394\0394\0394'
There is no value for byte length 's' for which both 'a' and 'b' are
properly-encoded strings. So, the current code passes invalid byte
sequences to LOWER(), which is another pre-existing bug.
ltree_strncasecmp() is only used for checking equality of the first s
bytes of the query, so let's make it a safer API that just checks
prefix equality. Attached.
* Based on the logic you implemented in 0004, first pg_strfold() has
copied as many chars as possible to dest buffer, so when retry,
ideally we can should resume instead of start over. However, if
single-byte->multi-byte folding happens, we have no information to
decide from where to resume.
Right.
That suggests that we might want some kind of lazy or iterator-based
API for string folding. We'd need to find the right way to do that with
ICU. If we find that it's a performance problem somewhere, we can look
into that. Do you think we need that now?
From this perspective, in 0004, do we really need to take the try-
the-best strategy for strlower_c()? If there are some other use cases
that require data to be placed in dest buffer even if dest buffer
doesn’t have enough space, then my patch [1] of changing
strlower_libc_sb() should be considered.
I will look into that.
SB_lower_char should be changed to C_IMatchText.
Updated comment.
I think the comment should be updated accordingly, like “for ILIKE in
the C locale”.
Done, thank you.
* match is allocated with pattlen+1 bytes, is it long enough to hold
pattlen multiple-byte chars?* match is not freed, but looks like it should be:
...
Should “pos” be “part[pos]” assigning to match[match_pos++]?
All fixed, thank you! (I apologize for posting a patch in that state to
begin with...)
I also reorganized slightly to separate out the pg_iswcased() API into
its own patch, and moved the like_support.c changes from the ctype_is_c
patch (already committed: 1476028225) into the pattern prefixes patch.
Regards,
Jeff Davis
Attachments:
On Thu, 2025-11-27 at 09:08 +0800, Chao Li wrote:
On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
I will review the rest 3 commits tomorrow.
10 - 0009
Just curious. As isaplha() and toupper() come from the same header
file ctype.h, if we replace toupper with pg_ascii_toupper, does
isapha also need to be handled?
OK.
What do you think about the change overall? Is fuzzystrmatch inherently
ASCII-based? Does it cause behavior changes aside from soundex()? Does
the behavior change in soundex() matter?
11 - 0010
I think assert both dstsize and len are redundant, because
dstsize=len+1, and no place to change their values.
OK.
What do you think of the change overall?
If (local == NULL || local->ctype == NULL)
Local = libc or other fallback;
Return default_locale->ctype->strfold_ident(dest, destsize, src,
srclen, local);This way avoids the duplicate code.
OK. The fallback would still be ASCII though, right?
I just feel the GUC name is very misleading. Without carefully
reading the doc, users may very easy to consider lc_collate the
system’s locale. If it only affects extensions, then let’s name it
accordingly, for example, “extension_lc_collate”, or
“legacy_lc_collate”.
It is the system locale, it's just that we won't be using the system
locale for most purposes, so it has very little effect: PLs,
extensions, and libraries used by extensions that happen to rely on the
system locale. That is a bit confusing, which is why I previously just
set LC_COLLATE=C. This patch addresses Daniel's concern that people
might still want lc_collate set to something other than C. I'm not sure
we want this patch, it's just a POC.
I didn't attach a new series here yet, but will after some of the
earlier patches get committed.
Regards,
Jeff Davis
On 29.11.25 21:50, Jeff Davis wrote:
All fixed, thank you! (I apologize for posting a patch in that state to
begin with...)I also reorganized slightly to separate out the pg_iswcased() API into
its own patch, and moved the like_support.c changes from the ctype_is_c
patch (already committed: 1476028225) into the pattern prefixes patch.
I reviewed the v11 patches. But I wasn't able to apply them locally
(couldn't find a starting commit where they applied cleanly), so I
haven't tested them.
Patches 0001 through 0006 seem generally ok, with some small comments:
v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
The function comment reads "Check if b has a prefix of a." -- Is that
the same as "Check if a is a prefix of b."? The latter might be clearer.
v11-0004-Remove-char_tolower-API.patch
The updated comment reads
+ * For efficiency reasons, in the C locale we don't call lower()
on the
+ * pattern and text, but instead call SB_lower_char on each
character.
but the patch removes SB_lower_char().
v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
Might have a small typo in the commit message:
; and preserve and char-at-a-time logic for bytea.
For the remaining patches I have some more substantial questions.
v11-0007-fuzzystrmatch-use-pg_ascii_toupper.patch
dmetaphone.c has a comment
case '\xc7': /* C with cedilla */
so the premise that "fuzzystrmatch is designed for ASCII" does not
appear to be correct. Needs more analysis.
(But apparently it's not multibyte aware at all, so I don't know what to
do about that.)
v11-0008-downcase_identifier-use-method-table-from-locale.patch
I'm confused here about the name of the function pg_strfold_ident(). In
general, case "folding" results in an opaque string that is really only
useful for comparing against other case-folded strings. But for
identifiers we are actually interested lower-casing. I think this
should be corrected in the API naming.
v11-0009-Control-LC_COLLATE-with-GUC.patch
I know there were some complaints about compatibility with extensions,
but I don't think anything concrete was presented. I would like to see
more evidence that we need this.
Also, recall that we used to have a lc_collate GUC, and in the end
people got confused that it didn't actually show a meaningful value when
you used ICU. So we removed that. It seems adding this back in would
create a similar kind of confusion. So to avoid that, maybe this should
be called fallback_lc_collate or something like that.
If we were to proceed with this patch, it should have some documentation
and tests.
On Fri, 2025-12-05 at 16:01 +0100, Peter Eisentraut wrote:
v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
The function comment reads "Check if b has a prefix of a." -- Is that
the same as "Check if a is a prefix of b."? The latter might be
clearer.
Yes, fixed.
Note: I separated this into two patches. 0003 fixes the multibyte
mishandling issue, and 0004 consistently performs case folding. 0003 is
backpatchable, I believe.
but the patch removes SB_lower_char().
Fixed and committed.
v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
Might have a small typo in the commit message:
; and preserve and char-at-a-time logic for bytea.
Fixed.
I also changed it into two functions: like_fixed_prefix(), which is
almost unchanged from the original; and like_fixed_prefix_ci(), which
is multibyte and locale-aware. It was too confusing to have single-byte
and multi-byte logic in the same function, and they didn't share much
code anyway.
case '\xc7': /* C with cedilla */
so the premise that "fuzzystrmatch is designed for ASCII" does not
appear to be correct. Needs more analysis.(But apparently it's not multibyte aware at all, so I don't know what
to
do about that.)
I didn't notice that, thank you. Agreed, we need a bit more discussion
around this case as well as soundex().
v11-0008-downcase_identifier-use-method-table-from-locale.patch
I'm confused here about the name of the function pg_strfold_ident().
In
general, case "folding" results in an opaque string that is really
only
useful for comparing against other case-folded strings. But for
identifiers we are actually interested lower-casing. I think this
should be corrected in the API naming.
Agreed and fixed.
Also, I added 0006, which saves a locale_t object for ICU in this one
case where it's required. Surely that's not what we want in the long
term, but we don't have the infrastructure for decoding pg_wchar into
code points yet, and 0006 avoids the dependency on the global LC_CTYPE
setting.
v11-0009-Control-LC_COLLATE-with-GUC.patch
I know there were some complaints about compatibility with
extensions,
but I don't think anything concrete was presented. I would like to
see
more evidence that we need this.Also, recall that we used to have a lc_collate GUC, and in the end
people got confused that it didn't actually show a meaningful value
when
you used ICU. So we removed that. It seems adding this back in
would
create a similar kind of confusion. So to avoid that, maybe this
should
be called fallback_lc_collate or something like that.
Yes, this is a POC patch and needs more discussion.
What are your thoughts about a similar lc_ctype GUC, though? That has
slightly different trade-offs.
I believe v12 0001-0005 are about ready for commit, and 0003 should be
backported.
Regards,
Jeff Davis