Windows default locale vs initdb

Started by Thomas Munro over 4 years ago, 32 messages
#1Thomas Munro
thomas.munro@gmail.com

Hi,

Moving this topic into its own thread from the one about collation
versions, because it concerns pre-existing problems, and that thread
is long.

Currently initdb sets up template databases with old-style Windows
locale names reported by the OS, and they seem to have caused us quite
a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example,
"German_German.1252" vs "German_Switzerland.1252" which seem to get
confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but
the entire contents of win32setlocale.c supports the theory that
Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

I suppose that was the only form available at the time the code was
written, so there was no choice. The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases. It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2], which doesn't appear to have been
discussed before on this list. That doesn't allow you to ask for the
default for each individual category, but I don't know if that is even
a concept for Windows user settings. It may be that some of the other
nearby functions give a better answer for some reason. But one thing
is clear from a test that someone kindly ran for me: it reports
standardised strings like "en-NZ", not strings like "English_New
Zealand.1252".
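
To make the difference between the two name forms concrete, here is a
minimal sketch in C of a heuristic that tells a BCP 47 style name from a
traditional one. The function name looks_like_bcp47() is hypothetical; it
is not part of PostgreSQL or the Windows API, and it only checks the shape
of the string, not whether the locale actually exists.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): guess whether a locale name
 * is in the BCP 47 style ("en-NZ") rather than the traditional Windows
 * style ("English_New Zealand.1252").
 *
 * Heuristic: the name starts with a 2- or 3-letter language subtag and
 * otherwise contains only ASCII letters, digits and hyphens.
 */
static bool
looks_like_bcp47(const char *name)
{
	size_t		lang_len = 0;

	assert(name != NULL);

	/* measure the leading alphabetic language subtag */
	while (isalpha((unsigned char) name[lang_len]))
		lang_len++;
	if (lang_len < 2 || lang_len > 3)
		return false;

	/* the language subtag must be followed by end of string or a hyphen */
	if (name[lang_len] != '\0' && name[lang_len] != '-')
		return false;

	/* the rest may contain only alphanumerics and hyphens */
	for (const char *p = name + lang_len; *p; p++)
	{
		if (*p != '-' && !isalnum((unsigned char) *p))
			return false;
	}
	return true;
}
```

Under this heuristic "en-NZ" is accepted, while "English_New Zealand.1252"
is rejected because its leading word is longer than three letters.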

No patch, but I wondered if any Windows hackers have any feedback on
the relative sanity of trying to fix all these problems this way.

[1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2]: https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename

#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Thomas Munro (#1)
Re: Windows default locale vs initdb

Last weekend I talked with a user about an interesting (and messy)
issue. They needed to create a new database with a Czech collation on Azure
SAS. There was no entry in pg_collation for the Czech language. The reply
from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0'
ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.

Regards

Pavel

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Pavel Stehule (#2)
Re: Windows default locale vs initdb

On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Last weekend I talked with a user about an interesting (and messy)
issue. They needed to create a new database with a Czech collation on Azure
SAS. There was no entry in pg_collation for the Czech language. The reply
from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0'
ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.

My understanding from Microsoft staff at conferences is that Azure's
PostgreSQL SAS runs on Linux, not Windows.

cheers

andrew

#4Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrew Dunstan (#3)
Re: Windows default locale vs initdb

On Mon, Apr 19, 2021 at 12:52, Andrew Dunstan <andrew@dunslane.net>
wrote:

My understanding from Microsoft staff at conferences is that Azure's
PostgreSQL SAS runs on Linux, not Windows.

I had different information, but something was still wrong, because
there were no Czech locales in pg_collation.

#5Dave Page
dpage@pgadmin.org
In reply to: Andrew Dunstan (#3)
Re: Windows default locale vs initdb

On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:

My understanding from Microsoft staff at conferences is that Azure's
PostgreSQL SAS runs on Linux, not Windows.

This is from a regular Azure Database for PostgreSQL single server:

postgres=> select version();
                          version
------------------------------------------------------------
 PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row)

And this is from the new Flexible Server preview:

postgres=> select version();
                                                     version
-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)

So I guess it's a case of "it depends".

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#5)
Re: Windows default locale vs initdb

On 4/19/21 10:26 AM, Dave Page wrote:

So I guess it's a case of "it depends".

Good to know. A year or two back at more than one conference I tried to enlist some of these folks in helping us with Windows PostgreSQL and their reply was that they knew nothing about it because they were on Linux :-) I guess things change over time.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#7Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Thomas Munro (#1)
Re: Windows default locale vs initdb

On 19.04.21 07:42, Thomas Munro wrote:

It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2], which doesn't appear to have been
discussed before on this list. That doesn't allow you to ask for the
default for each individual category, but I don't know if that is even
a concept for Windows user settings.

pg_newlocale_from_collation() doesn't support collcollate != collctype
on Windows anyway, so that wouldn't be an issue.

#8Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#1)
Re: Windows default locale vs initdb

On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:

Currently initdb sets up template databases with old-style Windows
locale names reported by the OS, and they seem to have caused us quite
a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

I suppose that was the only form available at the time the code was
written, so there was no choice.

Right.

The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases. It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2]

No patch, but I wondered if any Windows hackers have any feedback on
relative sanity of trying to fix all these problems this way.

Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.

#9Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Noah Misch (#8)
1 attachment(s)
Re: Windows default locale vs initdb

On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:

The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases. It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2]

No patch, but I wondered if any Windows hackers have any feedback on
relative sanity of trying to fix all these problems this way.

Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.

The value returned by GetUserDefaultLocaleName() is a system configured
parameter, independent of what you set with setlocale(). It might be
reasonable for initdb but not for a backend in most cases.

You can get the POSIX-ish locale name using GetLocaleInfoEx(), but this is
no longer recommended, because using LCIDs is no longer recommended [1].
That said, this would work for legacy locales. Please find attached a POC
patch showing this approach.

[1]: https://docs.microsoft.com/en-us/globalization/locale/locale-names

Regards,

Juan José Santamaría Flecha

Attachments:

0001-POC-Make-Windows-locale-POSIX-looking.patch (application/octet-stream)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 403adb0..23bf991 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2040,6 +2040,28 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* set the locale with setlocale, to see if it accepts it. */
 	res = setlocale(category, locale);
+#if defined(WIN32) && _WIN32_WINNT >= 0x0600
+	{
+		wchar_t		localename[LOCALE_NAME_MAX_LENGTH];
+		char		localebuf[LOCALE_NAME_MAX_LENGTH];
+		LCID		lcid;
+		char	   *hyphen;
+
+		if (MultiByteToWideChar(CP_ACP, 0, res, -1, localename,
+								LOCALE_NAME_MAX_LENGTH) &&
+			GetLocaleInfoEx(localename, LOCALE_RETURN_NUMBER | LOCALE_ILANGUAGE,
+							(LPWSTR)&lcid, sizeof(lcid)) &&
+			LCIDToLocaleName(lcid, localename, LOCALE_NAME_MAX_LENGTH, 0) &&
+			WideCharToMultiByte(CP_ACP, 0, localename, -1, localebuf,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL))
+		{
+			hyphen = strchr(localebuf, '-');
+			if (hyphen)
+				*hyphen = '_';
+			res = pg_strdup(localebuf);
+		}
+	}
+#endif
 
 	/* save canonical name if requested. */
 	if (res && canonname)
#10Thomas Munro
thomas.munro@gmail.com
In reply to: Juan José Santamaría Flecha (#9)
2 attachment(s)
Re: Windows default locale vs initdb

On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:

The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases. It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2]

No patch, but I wondered if any Windows hackers have any feedback on
relative sanity of trying to fix all these problems this way.

Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.

The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb but not for a backend in most cases.

Agreed. Only for initdb, and only if you didn't specify a locale name
on the command line.

You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing this approach.

Now that museum-grade Windows has been defenestrated, we are free to
call GetUserDefaultLocaleName(). Here's a patch.

One thing you did in your patch that I disagree with, I think, was to
convert a BCP 47 name to a POSIX name early, that is, s/-/_/. I think
we should use the locale name exactly as Windows (really, under the
covers, ICU) spells it. There is only one place in the tree today
that really wants a POSIX locale name, and that's LC_MESSAGES,
accessed by GNU gettext, not Windows. We already had code to cope
with that.
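
For readers following along, the s/-/_/ conversion mentioned above can be
sketched in portable C as follows. This is only an illustration under
stated assumptions: bcp47_to_posix() is a hypothetical name, not a function
from the PostgreSQL tree, and like the POC patch it replaces only the first
hyphen, so names with script subtags such as "sr-Latn-RS" are not handled
properly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a BCP 47 name such as "en-US" into dst as a
 * POSIX-looking name such as "en_US".  Only the first hyphen is replaced;
 * script subtags and encodings are not handled.
 *
 * Returns dst, or NULL if the name does not fit in dstsize bytes.
 */
static char *
bcp47_to_posix(const char *src, char *dst, size_t dstsize)
{
	size_t		len;
	char	   *hyphen;

	assert(src != NULL && dst != NULL);

	len = strlen(src);
	if (len >= dstsize)
		return NULL;
	memcpy(dst, src, len + 1);	/* includes the terminating NUL */

	hyphen = strchr(dst, '-');
	if (hyphen)
		*hyphen = '_';
	return dst;
}
```

The point of keeping this in a small, separate function is that it would be
applied only where a POSIX name is really needed (LC_MESSAGES for gettext),
while the name stored for the OS stays exactly as Windows spells it.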

I think we should also convert to POSIX format when making the
collname in your pg_import_system_collations() proposal, so that
COLLATE "en_US" works (= a SQL identifier), but that's another
thread[1]. I don't think we should do it in collcollate or
datcollate, which is a string for the OS to interpret.

With my garbage collector hat on, I would like to rip out all of the
support for traditional locale names, eventually. Deleting kludgy
code is easy and fun -- 0002 is a first swing at that -- but there
remains an important unanswered question. How should someone
pg_upgrade an "English_Canada.1252" cluster if we now reject that name?
We'd need to do a conversion to "en-CA", or somehow tell the user to.
Hmmmm.
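
One conceivable shape for such a conversion is a lookup table from
traditional names to BCP 47 names. The sketch below is purely illustrative
and is not what either patch does: legacy_to_bcp47() and legacy_locale_map
are hypothetical names, the table holds only a few entries taken from
examples in this thread, and the comparison is case-sensitive even though
Windows setlocale() is not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative table only: a real conversion would need a complete,
 * verified list of traditional Windows names.
 */
static const struct
{
	const char *legacy;			/* <Language>_<Country>, no code page */
	const char *bcp47;
}			legacy_locale_map[] = {
	{"English_Canada", "en-CA"},
	{"English_New Zealand", "en-NZ"},
	{"German_Germany", "de-DE"},
	{"Swedish_Sweden", "sv-SE"},
};

/*
 * Hypothetical helper: map a traditional name, with or without a
 * ".<CodePage>" suffix, to a BCP 47 name.  Returns NULL if unknown.
 */
static const char *
legacy_to_bcp47(const char *name)
{
	const char *dot;
	size_t		len;
	size_t		nentries = sizeof(legacy_locale_map) / sizeof(legacy_locale_map[0]);

	assert(name != NULL);

	/* ignore any code page suffix such as ".1252" */
	dot = strchr(name, '.');
	len = dot ? (size_t) (dot - name) : strlen(name);

	for (size_t i = 0; i < nentries; i++)
	{
		const char *legacy = legacy_locale_map[i].legacy;

		if (strlen(legacy) == len && strncmp(legacy, name, len) == 0)
			return legacy_locale_map[i].bcp47;
	}
	return NULL;
}
```

Returning NULL for unknown names leaves room for the other option mentioned
above, namely telling the user to pick a new name by hand.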

[1]: /messages/by-id/CAC+AXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg@mail.gmail.com

Attachments:

0001-Default-to-BCP-47-locale-in-initdb-on-Windows.patch (text/x-patch; charset=UTF-8)
From d6d677fd185242590f0f716cf69d09e735122ff7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH 1/2] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, instead use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 ++++++++--
 src/bin/initdb/initdb.c   | 28 +++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..22e33f0f57 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on English words,
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..57c5ecf3cf 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2022,7 +2026,27 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name, error %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name, error %lu",
+					 GetLastError());
+		locale = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale = pg_strdup("");
+#endif
+	}
+	else
+		locale = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
 	res = setlocale(category, locale);
@@ -2054,6 +2078,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale);
 }
 
 /*
-- 
2.30.2

0002-Remove-support-for-old-Windows-locale-names.patch (text/x-patch; charset=US-ASCII)
From a440c6675d92682782f8a10667eec30d35e6fd9c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 08:53:08 +1200
Subject: [PATCH 2/2] Remove support for old Windows locale names.

We now use BCP 47 locale names by default and also advise those for
explicit use.  Remove support for munging the old unstable and
unsystematic English word-style locale names.

XXX When could we do this?
XXX How are you supposed to pg_upgrade a system with datcollate =
"English_Canada" to "en-CA"?
---
 configure                         |   6 -
 configure.ac                      |   1 -
 doc/src/sgml/charset.sgml         |   2 +-
 src/backend/utils/adt/pg_locale.c | 235 ++----------------------------
 src/include/port/win32_port.h     |   9 --
 src/port/win32setlocale.c         | 193 ------------------------
 src/tools/msvc/Mkvcbuild.pm       |   2 +-
 7 files changed, 18 insertions(+), 430 deletions(-)
 delete mode 100644 src/port/win32setlocale.c

diff --git a/configure b/configure
index a4f4d321fb..5b7704352d 100755
--- a/configure
+++ b/configure
@@ -17174,12 +17174,6 @@ esac
  ;;
 esac
 
-  case " $LIBOBJS " in
-  *" win32setlocale.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS win32setlocale.$ac_objext"
- ;;
-esac
-
   case " $LIBOBJS " in
   *" win32stat.$ac_objext "* ) ;;
   *) LIBOBJS="$LIBOBJS win32stat.$ac_objext"
diff --git a/configure.ac b/configure.ac
index 5bd29a4d2f..358580e6d6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1995,7 +1995,6 @@ if test "$PORTNAME" = "win32"; then
   AC_LIBOBJ(win32error)
   AC_LIBOBJ(win32ntdll)
   AC_LIBOBJ(win32security)
-  AC_LIBOBJ(win32setlocale)
   AC_LIBOBJ(win32stat)
   AC_DEFINE([HAVE_SYMLINK], 1,
             [Define to 1 if you have the `symlink' function.])
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 22e33f0f57..e02d7e0020 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -90,7 +90,7 @@ initdb --locale=sv_SE
     For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
     Windows also supports more verbose locale names based on English words,
     such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
-    but these are not recommended.
+    but these should not be used in PostgreSQL.
    </para>
 
    <para>
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 607a4b7340..044f62de2b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -118,7 +118,7 @@ static HTAB *collation_cache = NULL;
 
 
 #if defined(WIN32) && defined(LC_MESSAGES)
-static char *IsoLocaleName(const char *);	/* MSVC specific */
+static char *PosixLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
@@ -204,10 +204,7 @@ pg_perm_setlocale(int category, const char *locale)
 		case LC_MESSAGES:
 			envvar = "LC_MESSAGES";
 #ifdef WIN32
-			result = IsoLocaleName(locale);
-			if (result == NULL)
-				result = (char *) locale;
-			elog(DEBUG3, "IsoLocaleName() executed; locale: \"%s\"", result);
+			locale = PosixLocaleName(locale);
 #endif							/* WIN32 */
 			break;
 #endif							/* LC_MESSAGES */
@@ -905,218 +902,35 @@ cache_locale_time(void)
 
 #if defined(WIN32) && defined(LC_MESSAGES)
 /*
- * Convert a Windows setlocale() argument to a Unix-style one.
+ * Convert a Windows BCP 47 locale name to a POSIX one.
  *
  * Regardless of platform, we install message catalogs under a Unix-style
  * LL[_CC][.ENCODING][@VARIANT] naming convention.  Only LC_MESSAGES settings
  * following that style will elicit localized interface strings.
  *
- * Before Visual Studio 2012 (msvcr110.dll), Windows setlocale() accepted "C"
- * (but not "c") and strings of the form <Language>[_<Country>][.<CodePage>],
- * case-insensitive.  setlocale() returns the fully-qualified form; for
- * example, setlocale("thaI") returns "Thai_Thailand.874".  Internally,
- * setlocale() and _create_locale() select a "locale identifier"[1] and store
- * it in an undocumented _locale_t field.  From that LCID, we can retrieve the
- * ISO 639 language and the ISO 3166 country.  Character encoding does not
- * matter, because the server and client encodings govern that.
- *
- * Windows Vista introduced the "locale name" concept[2], closely following
- * RFC 4646.  Locale identifiers are now deprecated.  Starting with Visual
- * Studio 2012, setlocale() accepts locale names in addition to the strings it
- * accepted historically.  It does not standardize them; setlocale("Th-tH")
- * returns "Th-tH".  setlocale(category, "") still returns a traditional
- * string.  Furthermore, msvcr110.dll changed the undocumented _locale_t
- * content to carry locale names instead of locale identifiers.
- *
- * Visual Studio 2015 should still be able to do the same as Visual Studio
- * 2012, but the declaration of locale_name is missing in _locale_t, causing
- * this code compilation to fail, hence this falls back instead on to
- * enumerating all system locales by using EnumSystemLocalesEx to find the
- * required locale name.  If the input argument is in Unix-style then we can
- * get ISO Locale name directly by using GetLocaleInfoEx() with LCType as
- * LOCALE_SNAME.
- *
- * MinGW headers declare _create_locale(), but msvcrt.dll lacks that symbol in
- * releases before Windows 8. IsoLocaleName() always fails in a MinGW-built
- * postgres.exe, so only Unix-style values of the lc_messages GUC can elicit
- * localized messages. In particular, every lc_messages setting that initdb
- * can select automatically will yield only C-locale messages. XXX This could
- * be fixed by running the fully-qualified locale name through a lookup table.
- *
- * This function returns a pointer to a static buffer bearing the converted
- * name or NULL if conversion fails.
- *
- * [1] https://docs.microsoft.com/en-us/windows/win32/intl/locale-identifiers
- * [2] https://docs.microsoft.com/en-us/windows/win32/intl/locale-names
- */
-
-/*
- * Callback function for EnumSystemLocalesEx() in get_iso_localename().
- *
- * This function enumerates all system locales, searching for one that matches
- * an input with the format: <Language>[_<Country>], e.g.
- * English[_United States]
- *
- * The input is a three wchar_t array as an LPARAM. The first element is the
- * locale_name we want to match, the second element is an allocated buffer
- * where the Unix-style locale is copied if a match is found, and the third
- * element is the search status, 1 if a match was found, 0 otherwise.
+ * Historically, verbose, but unsystematic and unstable names like
+ * "Thai_Thailand.874" were supported, but now only BCP 47 input is expected.
+ * That means we just need to be able to convert "en-US" to "en_US".
  */
-static BOOL CALLBACK
-search_locale_enum(LPWSTR pStr, DWORD dwFlags, LPARAM lparam)
-{
-	wchar_t		test_locale[LOCALE_NAME_MAX_LENGTH];
-	wchar_t   **argv;
-
-	(void) (dwFlags);
-
-	argv = (wchar_t **) lparam;
-	*argv[2] = (wchar_t) 0;
-
-	memset(test_locale, 0, sizeof(test_locale));
-
-	/* Get the name of the <Language> in English */
-	if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHLANGUAGENAME,
-						test_locale, LOCALE_NAME_MAX_LENGTH))
-	{
-		/*
-		 * If the enumerated locale does not have a hyphen ("en") OR the
-		 * lc_message input does not have an underscore ("English"), we only
-		 * need to compare the <Language> tags.
-		 */
-		if (wcsrchr(pStr, '-') == NULL || wcsrchr(argv[0], '_') == NULL)
-		{
-			if (_wcsicmp(argv[0], test_locale) == 0)
-			{
-				wcscpy(argv[1], pStr);
-				*argv[2] = (wchar_t) 1;
-				return FALSE;
-			}
-		}
-
-		/*
-		 * We have to compare a full <Language>_<Country> tag, so we append
-		 * the underscore and name of the country/region in English, e.g.
-		 * "English_United States".
-		 */
-		else
-		{
-			size_t		len;
-
-			wcscat(test_locale, L"_");
-			len = wcslen(test_locale);
-			if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
-								test_locale + len,
-								LOCALE_NAME_MAX_LENGTH - len))
-			{
-				if (_wcsicmp(argv[0], test_locale) == 0)
-				{
-					wcscpy(argv[1], pStr);
-					*argv[2] = (wchar_t) 1;
-					return FALSE;
-				}
-			}
-		}
-	}
-
-	return TRUE;
-}
-
-/*
- * This function converts a Windows locale name to an ISO formatted version
- * for Visual Studio 2015 or greater.
- *
- * Returns NULL, if no valid conversion was found.
- */
-static char *
-get_iso_localename(const char *winlocname)
-{
-	wchar_t		wc_locale_name[LOCALE_NAME_MAX_LENGTH];
-	wchar_t		buffer[LOCALE_NAME_MAX_LENGTH];
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
-	char	   *period;
-	int			len;
-	int			ret_val;
-
-	/*
-	 * Valid locales have the following syntax:
-	 * <Language>[_<Country>[.<CodePage>]]
-	 *
-	 * GetLocaleInfoEx can only take locale name without code-page and for the
-	 * purpose of this API the code-page doesn't matter.
-	 */
-	period = strchr(winlocname, '.');
-	if (period != NULL)
-		len = period - winlocname;
-	else
-		len = pg_mbstrlen(winlocname);
-
-	memset(wc_locale_name, 0, sizeof(wc_locale_name));
-	memset(buffer, 0, sizeof(buffer));
-	MultiByteToWideChar(CP_ACP, 0, winlocname, len, wc_locale_name,
-						LOCALE_NAME_MAX_LENGTH);
-
-	/*
-	 * If the lc_messages is already a Unix-style string, we have a direct
-	 * match with LOCALE_SNAME, e.g. en-US, en_US.
-	 */
-	ret_val = GetLocaleInfoEx(wc_locale_name, LOCALE_SNAME, (LPWSTR) &buffer,
-							  LOCALE_NAME_MAX_LENGTH);
-	if (!ret_val)
-	{
-		/*
-		 * Search for a locale in the system that matches language and country
-		 * name.
-		 */
-		wchar_t    *argv[3];
-
-		argv[0] = wc_locale_name;
-		argv[1] = buffer;
-		argv[2] = (wchar_t *) &ret_val;
-		EnumSystemLocalesEx(search_locale_enum, LOCALE_WINDOWS, (LPARAM) argv,
-							NULL);
-	}
-
-	if (ret_val)
-	{
-		size_t		rc;
-		char	   *hyphen;
-
-		/* Locale names use only ASCII, any conversion locale suffices. */
-		rc = wchar2char(iso_lc_messages, buffer, sizeof(iso_lc_messages), NULL);
-		if (rc == -1 || rc == sizeof(iso_lc_messages))
-			return NULL;
-
-		/*
-		 * Simply replace the hyphen with an underscore.  See comments in
-		 * IsoLocaleName.
-		 */
-		hyphen = strchr(iso_lc_messages, '-');
-		if (hyphen)
-			*hyphen = '_';
-		return iso_lc_messages;
-	}
-
-	return NULL;
-}
-
 static char *
-IsoLocaleName(const char *winlocname)
+PosixLocaleName(const char *winlocname)
 {
-#if defined(_MSC_VER)
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char *hyphen;
 
 	if (pg_strcasecmp("c", winlocname) == 0 ||
 		pg_strcasecmp("posix", winlocname) == 0)
 	{
 		strcpy(iso_lc_messages, "C");
-		return iso_lc_messages;
 	}
 	else
-		return get_iso_localename(winlocname);
-
-#endif							/* defined(_MSC_VER) */
-	return NULL;				/* Not supported on this version of msvc/mingw */
+	{
+		strlcpy(iso_lc_messages, winlocname, sizeof(iso_lc_messages));
+		hyphen = strchr(iso_lc_messages, '-');
+		if (hyphen)
+			*hyphen = '_';
+	}
+	return pstrdup(iso_lc_messages);
 }
 #endif							/* WIN32 && LC_MESSAGES */
 
@@ -1680,33 +1494,16 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			ereport(ERROR,
 					(errmsg("could not load locale \"%s\"", collcollate)));
 #elif defined(WIN32)
-		/*
-		 * If we are targeting Windows Vista and above, we can ask for a name
-		 * given a collation name (earlier versions required a location code
-		 * that we don't have).
-		 */
 		NLSVERSIONINFOEX version = {sizeof(NLSVERSIONINFOEX)};
 		WCHAR		wide_collcollate[LOCALE_NAME_MAX_LENGTH];
 
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 							LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
-		{
-			/*
-			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
-			 * locale name like "English_United States.1252".  Until those
-			 * values can be prevented from entering the system, or 100%
-			 * reliably converted to the more useful tag format, tolerate the
-			 * resulting error and report that we have no version data.
-			 */
-			if (GetLastError() == ERROR_INVALID_PARAMETER)
-				return NULL;
-
 			ereport(ERROR,
 					(errmsg("could not get collation version for locale \"%s\": error code %lu",
 							collcollate,
 							GetLastError())));
-		}
 		collversion = psprintf("%ld.%ld,%ld.%ld",
 							   (version.dwNLSVersion >> 8) & 0xFFFF,
 							   version.dwNLSVersion & 0xFF,
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 5121c0c626..13ed144a8f 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -437,15 +437,6 @@ extern int	_pgstat64(const char *name, struct stat *buf);
 #undef setlocale
 #endif
 
-/*
- * Define our own wrapper macro around setlocale() to work around bugs in
- * Windows' native setlocale() function.
- */
-extern char *pgwin32_setlocale(int category, const char *locale);
-
-#define setlocale(a,b) pgwin32_setlocale(a,b)
-
-
 /* In backend/port/win32/signal.c */
 extern PGDLLIMPORT volatile int pg_signal_queue;
 extern PGDLLIMPORT int pg_signal_mask;
diff --git a/src/port/win32setlocale.c b/src/port/win32setlocale.c
deleted file mode 100644
index aadd09a4e9..0000000000
--- a/src/port/win32setlocale.c
+++ /dev/null
@@ -1,193 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32setlocale.c
- *		Wrapper to work around bugs in Windows setlocale() implementation
- *
- * Copyright (c) 2011-2022, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *	  src/port/win32setlocale.c
- *
- *
- * The setlocale() function in Windows is broken in two ways. First, it
- * has a problem with locale names that have a dot in the country name. For
- * example:
- *
- * "Chinese (Traditional)_Hong Kong S.A.R..950"
- *
- * For some reason, setlocale() doesn't accept that as argument, even though
- * setlocale(LC_ALL, NULL) returns exactly that. Fortunately, it accepts
- * various alternative names for such countries, so to work around the broken
- * setlocale() function, we map the troublemaking locale names to accepted
- * aliases, before calling setlocale().
- *
- * The second problem is that the locale name for "Norwegian (Bokm&aring;l)"
- * contains a non-ASCII character. That's problematic, because it's not clear
- * what encoding the locale name itself is supposed to be in, when you
- * haven't yet set a locale. Also, it causes problems when the cluster
- * contains databases with different encodings, as the locale name is stored
- * in the pg_database system catalog. To work around that, when setlocale()
- * returns that locale name, map it to a pure-ASCII alias for the same
- * locale.
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#undef setlocale
-
-struct locale_map
-{
-	/*
-	 * String in locale name to replace. Can be a single string (end is NULL),
-	 * or separate start and end strings. If two strings are given, the locale
-	 * name must contain both of them, and everything between them is
-	 * replaced. This is used for a poor-man's regexp search, allowing
-	 * replacement of "start.*end".
-	 */
-	const char *locale_name_start;
-	const char *locale_name_end;
-
-	const char *replacement;	/* string to replace the match with */
-};
-
-/*
- * Mappings applied before calling setlocale(), to the argument.
- */
-static const struct locale_map locale_map_argument[] = {
-	/*
-	 * "HKG" is listed here:
-	 * http://msdn.microsoft.com/en-us/library/cdax410z%28v=vs.71%29.aspx
-	 * (Country/Region Strings).
-	 *
-	 * "ARE" is the ISO-3166 three-letter code for U.A.E. It is not on the
-	 * above list, but seems to work anyway.
-	 */
-	{"Hong Kong S.A.R.", NULL, "HKG"},
-	{"U.A.E.", NULL, "ARE"},
-
-	/*
-	 * The ISO-3166 country code for Macau S.A.R. is MAC, but Windows doesn't
-	 * seem to recognize that. And Macau isn't listed in the table of accepted
-	 * abbreviations linked above. Fortunately, "ZHM" seems to be accepted as
-	 * an alias for "Chinese (Traditional)_Macau S.A.R..950". I'm not sure
-	 * where "ZHM" comes from, must be some legacy naming scheme. But hey, it
-	 * works.
-	 *
-	 * Note that unlike HKG and ARE, ZHM is an alias for the *whole* locale
-	 * name, not just the country part.
-	 *
-	 * Some versions of Windows spell it "Macau", others "Macao".
-	 */
-	{"Chinese (Traditional)_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese (Traditional)_Macao S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macao S.A.R..950", NULL, "ZHM"},
-	{NULL, NULL, NULL}
-};
-
-/*
- * Mappings applied after calling setlocale(), to its return value.
- */
-static const struct locale_map locale_map_result[] = {
-	/*
-	 * "Norwegian (Bokm&aring;l)" locale name contains the a-ring character.
-	 * Map it to a pure-ASCII alias.
-	 *
-	 * It's not clear what encoding setlocale() uses when it returns the
-	 * locale name, so to play it safe, we search for "Norwegian (Bok*l)".
-	 *
-	 * Just to make life even more complicated, some versions of Windows spell
-	 * the locale name without parentheses.  Translate that too.
-	 */
-	{"Norwegian (Bokm", "l)_Norway", "Norwegian_Norway"},
-	{"Norwegian Bokm", "l_Norway", "Norwegian_Norway"},
-	{NULL, NULL, NULL}
-};
-
-#define MAX_LOCALE_NAME_LEN		100
-
-static const char *
-map_locale(const struct locale_map *map, const char *locale)
-{
-	static char aliasbuf[MAX_LOCALE_NAME_LEN];
-	int			i;
-
-	/* Check if the locale name matches any of the problematic ones. */
-	for (i = 0; map[i].locale_name_start != NULL; i++)
-	{
-		const char *needle_start = map[i].locale_name_start;
-		const char *needle_end = map[i].locale_name_end;
-		const char *replacement = map[i].replacement;
-		char	   *match;
-		char	   *match_start = NULL;
-		char	   *match_end = NULL;
-
-		match = strstr(locale, needle_start);
-		if (match)
-		{
-			/*
-			 * Found a match for the first part. If this was a two-part
-			 * replacement, find the second part.
-			 */
-			match_start = match;
-			if (needle_end)
-			{
-				match = strstr(match_start + strlen(needle_start), needle_end);
-				if (match)
-					match_end = match + strlen(needle_end);
-				else
-					match_start = NULL;
-			}
-			else
-				match_end = match_start + strlen(needle_start);
-		}
-
-		if (match_start)
-		{
-			/* Found a match. Replace the matched string. */
-			int			matchpos = match_start - locale;
-			int			replacementlen = strlen(replacement);
-			char	   *rest = match_end;
-			int			restlen = strlen(rest);
-
-			/* check that the result fits in the static buffer */
-			if (matchpos + replacementlen + restlen + 1 > MAX_LOCALE_NAME_LEN)
-				return NULL;
-
-			memcpy(&aliasbuf[0], &locale[0], matchpos);
-			memcpy(&aliasbuf[matchpos], replacement, replacementlen);
-			/* includes null terminator */
-			memcpy(&aliasbuf[matchpos + replacementlen], rest, restlen + 1);
-
-			return aliasbuf;
-		}
-	}
-
-	/* no match, just return the original string */
-	return locale;
-}
-
-char *
-pgwin32_setlocale(int category, const char *locale)
-{
-	const char *argument;
-	char	   *result;
-
-	if (locale == NULL)
-		argument = NULL;
-	else
-		argument = map_locale(locale_map_argument, locale);
-
-	/* Call the real setlocale() function */
-	result = setlocale(category, argument);
-
-	/*
-	 * setlocale() is specified to return a "char *" that the caller is
-	 * forbidden to modify, so casting away the "const" is innocuous.
-	 */
-	if (result)
-		result = unconstify(char *, map_locale(locale_map_result, result));
-
-	return result;
-}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index e4feda10fd..1bc35ef926 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -108,7 +108,7 @@ sub mkvcbuild
 	  pqsignal.c mkdtemp.c qsort.c qsort_arg.c bsearch_arg.c quotes.c system.c
 	  strerror.c tar.c
 	  win32env.c win32error.c win32ntdll.c
-	  win32security.c win32setlocale.c win32stat.c);
+	  win32security.c win32stat.c);
 
 	push(@pgportfiles, 'strtof.c') if ($vsVersion < '14.00');
 
-- 
2.30.2

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
2 attachment(s)
Re: Windows default locale vs initdb

On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about
some warnings I needed to fix. That'll teach me to post a patch
tested with "ci-os-only: windows". Looking more closely at some error
messages that report GetLastError() where I'd mixed up %d and %lu, I
see also that I didn't quite follow existing conventions for wording
when reporting Windows error numbers, so I fixed that too.
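
To illustrate the format-specifier point: on Windows, GetLastError() returns a DWORD, which is a typedef for unsigned long, so %lu is the matching conversion and %d is not. Below is a minimal portable sketch of the "error code %lu" wording used in the attached patch. The names error_code_t and format_win_error are hypothetical stand-ins for illustration only, not anything in PostgreSQL or the Windows API.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for Windows' DWORD (a typedef for unsigned long
 * on that platform), so that this sketch compiles anywhere.
 */
typedef unsigned long error_code_t;

/*
 * Hypothetical helper: format "<what>: error code <n>" into buf.
 * %lu matches unsigned long; using %d here would be a type mismatch.
 * Returns the value of snprintf().
 */
static int
format_win_error(char *buf, size_t buflen, const char *what,
				 error_code_t code)
{
	return snprintf(buf, buflen, "%s: error code %lu", what, code);
}
```

In the real code the value would come straight from GetLastError() and be passed to pg_fatal() or ereport(), as in the patch.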

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of
win32_langinfo()... Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not
sure what to make of that. The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want. Hmph.
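
For readers who don't want to dig through the diff, here is a rough portable sketch of the only name conversion the second patch still needs for LC_MESSAGES: turning a BCP 47 name like "en-US" into the POSIX-style "en_US" used for message catalogs. It is not the patch's function. The names bcp47_to_posix and ascii_equal_ci are hypothetical, and it writes into a caller-supplied buffer instead of allocating. Like the patch, it replaces only the first hyphen, so a tag with a script subtag such as "sr-Latn-RS" comes out as "sr_Latn-RS".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: ASCII case-insensitive string equality. */
static int
ascii_equal_ci(const char *a, const char *b)
{
	while (*a && *b)
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;			/* equal only if both ended together */
}

/*
 * Hypothetical sketch: copy a BCP 47 locale name into dst, replacing the
 * first '-' with '_'.  "C" and "POSIX" (any case) are mapped to "C".
 * Returns 0 on success, or -1 if dst is too small.
 */
static int
bcp47_to_posix(const char *name, char *dst, size_t dstlen)
{
	size_t		len;
	char	   *hyphen;

	if (ascii_equal_ci(name, "c") || ascii_equal_ci(name, "posix"))
		name = "C";

	len = strlen(name);
	if (len >= dstlen)
		return -1;
	memcpy(dst, name, len + 1);	/* includes the terminating NUL */

	hyphen = strchr(dst, '-');
	if (hyphen)
		*hyphen = '_';
	return 0;
}
```

Calling bcp47_to_posix("en-US", buf, sizeof(buf)) leaves "en_US" in buf. The old English-word names such as "English_United States.1252" would pass through unchanged, which is exactly why the second patch assumes they no longer enter the system.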

Attachments:

v2-0001-Default-to-BCP-47-locale-in-initdb-on-Windows.patch (text/x-patch; charset=UTF-8)
From 95f2684150e2938f2e555d16bbed4295a6dad279 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v2 1/2] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, on Windows use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 ++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b656ca489f 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on English words,
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..3af08b7b99 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2007,6 +2011,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2022,10 +2027,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2054,6 +2079,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.35.1

v2-0002-Remove-support-for-old-Windows-locale-names.patch (text/x-patch; charset=US-ASCII)
From 1e0b75b4c8958397a8e660fa0b8759f1da78a753 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 08:53:08 +1200
Subject: [PATCH v2 2/2] Remove support for old Windows locale names.

We now use BCP 47 locale names by default and also advise those for
explicit use.  Remove support for munging the old unstable and
unsystematic English word-style locale names.

XXX When could we do this?
XXX How are you supposed to pg_upgrade a system with datcollate =
"English_Canada" to "en-CA"?
XXX There are more traces to remove in win32_langinfo()
---
 configure                         |   6 -
 configure.ac                      |   1 -
 doc/src/sgml/charset.sgml         |   2 +-
 src/backend/utils/adt/pg_locale.c | 235 ++----------------------------
 src/include/port/win32_port.h     |   9 --
 src/port/win32setlocale.c         | 193 ------------------------
 src/tools/msvc/Mkvcbuild.pm       |   2 +-
 7 files changed, 18 insertions(+), 430 deletions(-)
 delete mode 100644 src/port/win32setlocale.c

diff --git a/configure b/configure
index a4f4d321fb..5b7704352d 100755
--- a/configure
+++ b/configure
@@ -17174,12 +17174,6 @@ esac
  ;;
 esac
 
-  case " $LIBOBJS " in
-  *" win32setlocale.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS win32setlocale.$ac_objext"
- ;;
-esac
-
   case " $LIBOBJS " in
   *" win32stat.$ac_objext "* ) ;;
   *) LIBOBJS="$LIBOBJS win32stat.$ac_objext"
diff --git a/configure.ac b/configure.ac
index 5bd29a4d2f..358580e6d6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1995,7 +1995,6 @@ if test "$PORTNAME" = "win32"; then
   AC_LIBOBJ(win32error)
   AC_LIBOBJ(win32ntdll)
   AC_LIBOBJ(win32security)
-  AC_LIBOBJ(win32setlocale)
   AC_LIBOBJ(win32stat)
   AC_DEFINE([HAVE_SYMLINK], 1,
             [Define to 1 if you have the `symlink' function.])
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index b656ca489f..3c12f3f344 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -90,7 +90,7 @@ initdb --locale=sv_SE
     For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
     Windows also supports more verbose locale names based on English words,
     such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
-    but these are not recommended.
+    but these should not be used in PostgreSQL.
    </para>
 
    <para>
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 607a4b7340..044f62de2b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -118,7 +118,7 @@ static HTAB *collation_cache = NULL;
 
 
 #if defined(WIN32) && defined(LC_MESSAGES)
-static char *IsoLocaleName(const char *);	/* MSVC specific */
+static char *PosixLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
@@ -204,10 +204,7 @@ pg_perm_setlocale(int category, const char *locale)
 		case LC_MESSAGES:
 			envvar = "LC_MESSAGES";
 #ifdef WIN32
-			result = IsoLocaleName(locale);
-			if (result == NULL)
-				result = (char *) locale;
-			elog(DEBUG3, "IsoLocaleName() executed; locale: \"%s\"", result);
+			locale = PosixLocaleName(locale);
 #endif							/* WIN32 */
 			break;
 #endif							/* LC_MESSAGES */
@@ -905,218 +902,35 @@ cache_locale_time(void)
 
 #if defined(WIN32) && defined(LC_MESSAGES)
 /*
- * Convert a Windows setlocale() argument to a Unix-style one.
+ * Convert a Windows BCP 47 locale name to a POSIX one.
  *
  * Regardless of platform, we install message catalogs under a Unix-style
  * LL[_CC][.ENCODING][@VARIANT] naming convention.  Only LC_MESSAGES settings
  * following that style will elicit localized interface strings.
  *
- * Before Visual Studio 2012 (msvcr110.dll), Windows setlocale() accepted "C"
- * (but not "c") and strings of the form <Language>[_<Country>][.<CodePage>],
- * case-insensitive.  setlocale() returns the fully-qualified form; for
- * example, setlocale("thaI") returns "Thai_Thailand.874".  Internally,
- * setlocale() and _create_locale() select a "locale identifier"[1] and store
- * it in an undocumented _locale_t field.  From that LCID, we can retrieve the
- * ISO 639 language and the ISO 3166 country.  Character encoding does not
- * matter, because the server and client encodings govern that.
- *
- * Windows Vista introduced the "locale name" concept[2], closely following
- * RFC 4646.  Locale identifiers are now deprecated.  Starting with Visual
- * Studio 2012, setlocale() accepts locale names in addition to the strings it
- * accepted historically.  It does not standardize them; setlocale("Th-tH")
- * returns "Th-tH".  setlocale(category, "") still returns a traditional
- * string.  Furthermore, msvcr110.dll changed the undocumented _locale_t
- * content to carry locale names instead of locale identifiers.
- *
- * Visual Studio 2015 should still be able to do the same as Visual Studio
- * 2012, but the declaration of locale_name is missing in _locale_t, causing
- * this code compilation to fail, hence this falls back instead on to
- * enumerating all system locales by using EnumSystemLocalesEx to find the
- * required locale name.  If the input argument is in Unix-style then we can
- * get ISO Locale name directly by using GetLocaleInfoEx() with LCType as
- * LOCALE_SNAME.
- *
- * MinGW headers declare _create_locale(), but msvcrt.dll lacks that symbol in
- * releases before Windows 8. IsoLocaleName() always fails in a MinGW-built
- * postgres.exe, so only Unix-style values of the lc_messages GUC can elicit
- * localized messages. In particular, every lc_messages setting that initdb
- * can select automatically will yield only C-locale messages. XXX This could
- * be fixed by running the fully-qualified locale name through a lookup table.
- *
- * This function returns a pointer to a static buffer bearing the converted
- * name or NULL if conversion fails.
- *
- * [1] https://docs.microsoft.com/en-us/windows/win32/intl/locale-identifiers
- * [2] https://docs.microsoft.com/en-us/windows/win32/intl/locale-names
- */
-
-/*
- * Callback function for EnumSystemLocalesEx() in get_iso_localename().
- *
- * This function enumerates all system locales, searching for one that matches
- * an input with the format: <Language>[_<Country>], e.g.
- * English[_United States]
- *
- * The input is a three wchar_t array as an LPARAM. The first element is the
- * locale_name we want to match, the second element is an allocated buffer
- * where the Unix-style locale is copied if a match is found, and the third
- * element is the search status, 1 if a match was found, 0 otherwise.
+ * Historically, verbose, but unsystematic and unstable names like
+ * "Thai_Thailand.874" were supported, but now only BCP 47 input is expected.
+ * That means we just need to be able to convert "en-US" to "en_US".
  */
-static BOOL CALLBACK
-search_locale_enum(LPWSTR pStr, DWORD dwFlags, LPARAM lparam)
-{
-	wchar_t		test_locale[LOCALE_NAME_MAX_LENGTH];
-	wchar_t   **argv;
-
-	(void) (dwFlags);
-
-	argv = (wchar_t **) lparam;
-	*argv[2] = (wchar_t) 0;
-
-	memset(test_locale, 0, sizeof(test_locale));
-
-	/* Get the name of the <Language> in English */
-	if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHLANGUAGENAME,
-						test_locale, LOCALE_NAME_MAX_LENGTH))
-	{
-		/*
-		 * If the enumerated locale does not have a hyphen ("en") OR the
-		 * lc_message input does not have an underscore ("English"), we only
-		 * need to compare the <Language> tags.
-		 */
-		if (wcsrchr(pStr, '-') == NULL || wcsrchr(argv[0], '_') == NULL)
-		{
-			if (_wcsicmp(argv[0], test_locale) == 0)
-			{
-				wcscpy(argv[1], pStr);
-				*argv[2] = (wchar_t) 1;
-				return FALSE;
-			}
-		}
-
-		/*
-		 * We have to compare a full <Language>_<Country> tag, so we append
-		 * the underscore and name of the country/region in English, e.g.
-		 * "English_United States".
-		 */
-		else
-		{
-			size_t		len;
-
-			wcscat(test_locale, L"_");
-			len = wcslen(test_locale);
-			if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
-								test_locale + len,
-								LOCALE_NAME_MAX_LENGTH - len))
-			{
-				if (_wcsicmp(argv[0], test_locale) == 0)
-				{
-					wcscpy(argv[1], pStr);
-					*argv[2] = (wchar_t) 1;
-					return FALSE;
-				}
-			}
-		}
-	}
-
-	return TRUE;
-}
-
-/*
- * This function converts a Windows locale name to an ISO formatted version
- * for Visual Studio 2015 or greater.
- *
- * Returns NULL, if no valid conversion was found.
- */
-static char *
-get_iso_localename(const char *winlocname)
-{
-	wchar_t		wc_locale_name[LOCALE_NAME_MAX_LENGTH];
-	wchar_t		buffer[LOCALE_NAME_MAX_LENGTH];
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
-	char	   *period;
-	int			len;
-	int			ret_val;
-
-	/*
-	 * Valid locales have the following syntax:
-	 * <Language>[_<Country>[.<CodePage>]]
-	 *
-	 * GetLocaleInfoEx can only take locale name without code-page and for the
-	 * purpose of this API the code-page doesn't matter.
-	 */
-	period = strchr(winlocname, '.');
-	if (period != NULL)
-		len = period - winlocname;
-	else
-		len = pg_mbstrlen(winlocname);
-
-	memset(wc_locale_name, 0, sizeof(wc_locale_name));
-	memset(buffer, 0, sizeof(buffer));
-	MultiByteToWideChar(CP_ACP, 0, winlocname, len, wc_locale_name,
-						LOCALE_NAME_MAX_LENGTH);
-
-	/*
-	 * If the lc_messages is already a Unix-style string, we have a direct
-	 * match with LOCALE_SNAME, e.g. en-US, en_US.
-	 */
-	ret_val = GetLocaleInfoEx(wc_locale_name, LOCALE_SNAME, (LPWSTR) &buffer,
-							  LOCALE_NAME_MAX_LENGTH);
-	if (!ret_val)
-	{
-		/*
-		 * Search for a locale in the system that matches language and country
-		 * name.
-		 */
-		wchar_t    *argv[3];
-
-		argv[0] = wc_locale_name;
-		argv[1] = buffer;
-		argv[2] = (wchar_t *) &ret_val;
-		EnumSystemLocalesEx(search_locale_enum, LOCALE_WINDOWS, (LPARAM) argv,
-							NULL);
-	}
-
-	if (ret_val)
-	{
-		size_t		rc;
-		char	   *hyphen;
-
-		/* Locale names use only ASCII, any conversion locale suffices. */
-		rc = wchar2char(iso_lc_messages, buffer, sizeof(iso_lc_messages), NULL);
-		if (rc == -1 || rc == sizeof(iso_lc_messages))
-			return NULL;
-
-		/*
-		 * Simply replace the hyphen with an underscore.  See comments in
-		 * IsoLocaleName.
-		 */
-		hyphen = strchr(iso_lc_messages, '-');
-		if (hyphen)
-			*hyphen = '_';
-		return iso_lc_messages;
-	}
-
-	return NULL;
-}
-
 static char *
-IsoLocaleName(const char *winlocname)
+PosixLocaleName(const char *winlocname)
 {
-#if defined(_MSC_VER)
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char *hyphen;
 
 	if (pg_strcasecmp("c", winlocname) == 0 ||
 		pg_strcasecmp("posix", winlocname) == 0)
 	{
 		strcpy(iso_lc_messages, "C");
-		return iso_lc_messages;
 	}
 	else
-		return get_iso_localename(winlocname);
-
-#endif							/* defined(_MSC_VER) */
-	return NULL;				/* Not supported on this version of msvc/mingw */
+	{
+		strlcpy(iso_lc_messages, winlocname, sizeof(iso_lc_messages));
+		hyphen = strchr(iso_lc_messages, '-');
+		if (hyphen)
+			*hyphen = '_';
+	}
+	return pstrdup(iso_lc_messages);
 }
 #endif							/* WIN32 && LC_MESSAGES */
 
@@ -1680,33 +1494,16 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			ereport(ERROR,
 					(errmsg("could not load locale \"%s\"", collcollate)));
 #elif defined(WIN32)
-		/*
-		 * If we are targeting Windows Vista and above, we can ask for a name
-		 * given a collation name (earlier versions required a location code
-		 * that we don't have).
-		 */
 		NLSVERSIONINFOEX version = {sizeof(NLSVERSIONINFOEX)};
 		WCHAR		wide_collcollate[LOCALE_NAME_MAX_LENGTH];
 
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 							LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
-		{
-			/*
-			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
-			 * locale name like "English_United States.1252".  Until those
-			 * values can be prevented from entering the system, or 100%
-			 * reliably converted to the more useful tag format, tolerate the
-			 * resulting error and report that we have no version data.
-			 */
-			if (GetLastError() == ERROR_INVALID_PARAMETER)
-				return NULL;
-
 			ereport(ERROR,
 					(errmsg("could not get collation version for locale \"%s\": error code %lu",
 							collcollate,
 							GetLastError())));
-		}
 		collversion = psprintf("%ld.%ld,%ld.%ld",
 							   (version.dwNLSVersion >> 8) & 0xFFFF,
 							   version.dwNLSVersion & 0xFF,
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 5121c0c626..13ed144a8f 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -437,15 +437,6 @@ extern int	_pgstat64(const char *name, struct stat *buf);
 #undef setlocale
 #endif
 
-/*
- * Define our own wrapper macro around setlocale() to work around bugs in
- * Windows' native setlocale() function.
- */
-extern char *pgwin32_setlocale(int category, const char *locale);
-
-#define setlocale(a,b) pgwin32_setlocale(a,b)
-
-
 /* In backend/port/win32/signal.c */
 extern PGDLLIMPORT volatile int pg_signal_queue;
 extern PGDLLIMPORT int pg_signal_mask;
diff --git a/src/port/win32setlocale.c b/src/port/win32setlocale.c
deleted file mode 100644
index aadd09a4e9..0000000000
--- a/src/port/win32setlocale.c
+++ /dev/null
@@ -1,193 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32setlocale.c
- *		Wrapper to work around bugs in Windows setlocale() implementation
- *
- * Copyright (c) 2011-2022, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *	  src/port/win32setlocale.c
- *
- *
- * The setlocale() function in Windows is broken in two ways. First, it
- * has a problem with locale names that have a dot in the country name. For
- * example:
- *
- * "Chinese (Traditional)_Hong Kong S.A.R..950"
- *
- * For some reason, setlocale() doesn't accept that as argument, even though
- * setlocale(LC_ALL, NULL) returns exactly that. Fortunately, it accepts
- * various alternative names for such countries, so to work around the broken
- * setlocale() function, we map the troublemaking locale names to accepted
- * aliases, before calling setlocale().
- *
- * The second problem is that the locale name for "Norwegian (Bokm&aring;l)"
- * contains a non-ASCII character. That's problematic, because it's not clear
- * what encoding the locale name itself is supposed to be in, when you
- * haven't yet set a locale. Also, it causes problems when the cluster
- * contains databases with different encodings, as the locale name is stored
- * in the pg_database system catalog. To work around that, when setlocale()
- * returns that locale name, map it to a pure-ASCII alias for the same
- * locale.
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#undef setlocale
-
-struct locale_map
-{
-	/*
-	 * String in locale name to replace. Can be a single string (end is NULL),
-	 * or separate start and end strings. If two strings are given, the locale
-	 * name must contain both of them, and everything between them is
-	 * replaced. This is used for a poor-man's regexp search, allowing
-	 * replacement of "start.*end".
-	 */
-	const char *locale_name_start;
-	const char *locale_name_end;
-
-	const char *replacement;	/* string to replace the match with */
-};
-
-/*
- * Mappings applied before calling setlocale(), to the argument.
- */
-static const struct locale_map locale_map_argument[] = {
-	/*
-	 * "HKG" is listed here:
-	 * http://msdn.microsoft.com/en-us/library/cdax410z%28v=vs.71%29.aspx
-	 * (Country/Region Strings).
-	 *
-	 * "ARE" is the ISO-3166 three-letter code for U.A.E. It is not on the
-	 * above list, but seems to work anyway.
-	 */
-	{"Hong Kong S.A.R.", NULL, "HKG"},
-	{"U.A.E.", NULL, "ARE"},
-
-	/*
-	 * The ISO-3166 country code for Macau S.A.R. is MAC, but Windows doesn't
-	 * seem to recognize that. And Macau isn't listed in the table of accepted
-	 * abbreviations linked above. Fortunately, "ZHM" seems to be accepted as
-	 * an alias for "Chinese (Traditional)_Macau S.A.R..950". I'm not sure
-	 * where "ZHM" comes from, must be some legacy naming scheme. But hey, it
-	 * works.
-	 *
-	 * Note that unlike HKG and ARE, ZHM is an alias for the *whole* locale
-	 * name, not just the country part.
-	 *
-	 * Some versions of Windows spell it "Macau", others "Macao".
-	 */
-	{"Chinese (Traditional)_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese (Traditional)_Macao S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macao S.A.R..950", NULL, "ZHM"},
-	{NULL, NULL, NULL}
-};
-
-/*
- * Mappings applied after calling setlocale(), to its return value.
- */
-static const struct locale_map locale_map_result[] = {
-	/*
-	 * "Norwegian (Bokm&aring;l)" locale name contains the a-ring character.
-	 * Map it to a pure-ASCII alias.
-	 *
-	 * It's not clear what encoding setlocale() uses when it returns the
-	 * locale name, so to play it safe, we search for "Norwegian (Bok*l)".
-	 *
-	 * Just to make life even more complicated, some versions of Windows spell
-	 * the locale name without parentheses.  Translate that too.
-	 */
-	{"Norwegian (Bokm", "l)_Norway", "Norwegian_Norway"},
-	{"Norwegian Bokm", "l_Norway", "Norwegian_Norway"},
-	{NULL, NULL, NULL}
-};
-
-#define MAX_LOCALE_NAME_LEN		100
-
-static const char *
-map_locale(const struct locale_map *map, const char *locale)
-{
-	static char aliasbuf[MAX_LOCALE_NAME_LEN];
-	int			i;
-
-	/* Check if the locale name matches any of the problematic ones. */
-	for (i = 0; map[i].locale_name_start != NULL; i++)
-	{
-		const char *needle_start = map[i].locale_name_start;
-		const char *needle_end = map[i].locale_name_end;
-		const char *replacement = map[i].replacement;
-		char	   *match;
-		char	   *match_start = NULL;
-		char	   *match_end = NULL;
-
-		match = strstr(locale, needle_start);
-		if (match)
-		{
-			/*
-			 * Found a match for the first part. If this was a two-part
-			 * replacement, find the second part.
-			 */
-			match_start = match;
-			if (needle_end)
-			{
-				match = strstr(match_start + strlen(needle_start), needle_end);
-				if (match)
-					match_end = match + strlen(needle_end);
-				else
-					match_start = NULL;
-			}
-			else
-				match_end = match_start + strlen(needle_start);
-		}
-
-		if (match_start)
-		{
-			/* Found a match. Replace the matched string. */
-			int			matchpos = match_start - locale;
-			int			replacementlen = strlen(replacement);
-			char	   *rest = match_end;
-			int			restlen = strlen(rest);
-
-			/* check that the result fits in the static buffer */
-			if (matchpos + replacementlen + restlen + 1 > MAX_LOCALE_NAME_LEN)
-				return NULL;
-
-			memcpy(&aliasbuf[0], &locale[0], matchpos);
-			memcpy(&aliasbuf[matchpos], replacement, replacementlen);
-			/* includes null terminator */
-			memcpy(&aliasbuf[matchpos + replacementlen], rest, restlen + 1);
-
-			return aliasbuf;
-		}
-	}
-
-	/* no match, just return the original string */
-	return locale;
-}
-
-char *
-pgwin32_setlocale(int category, const char *locale)
-{
-	const char *argument;
-	char	   *result;
-
-	if (locale == NULL)
-		argument = NULL;
-	else
-		argument = map_locale(locale_map_argument, locale);
-
-	/* Call the real setlocale() function */
-	result = setlocale(category, argument);
-
-	/*
-	 * setlocale() is specified to return a "char *" that the caller is
-	 * forbidden to modify, so casting away the "const" is innocuous.
-	 */
-	if (result)
-		result = unconstify(char *, map_locale(locale_map_result, result));
-
-	return result;
-}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index e4feda10fd..1bc35ef926 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -108,7 +108,7 @@ sub mkvcbuild
 	  pqsignal.c mkdtemp.c qsort.c qsort_arg.c bsearch_arg.c quotes.c system.c
 	  strerror.c tar.c
 	  win32env.c win32error.c win32ntdll.c
-	  win32security.c win32setlocale.c win32stat.c);
+	  win32security.c win32stat.c);
 
 	push(@pgportfiles, 'strtof.c') if ($vsVersion < '14.00');
 
-- 
2.35.1

#12Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Thomas Munro (#10)
Re: Windows default locale vs initdb

On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

Now that museum-grade Windows has been defenestrated, we are free to
call GetUserDefaultLocaleName(). Here's a patch.

This LGTM.

I think we should also convert to POSIX format when making the
collname in your pg_import_system_collations() proposal, so that
COLLATE "en_US" works (= a SQL identifier), but that's another
thread[1]. I don't think we should do it in collcollate or
datcollate, which is a string for the OS to interpret.

That thread has been split [1], but that is how the current version behaves.
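
For illustration, the BCP 47 to POSIX conversion being discussed for collname could look roughly like the sketch below. The helper name bcp47_to_posix() and its signature are hypothetical and not taken from any patch in this thread. It assumes the input is a simple language-region tag and replaces only the first hyphen, so a tag carrying a script subtag such as "sr-Latn-RS" would not be fully converted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a BCP 47 tag such as "en-US" into dst as a
 * POSIX-style name such as "en_US".  Only the first hyphen is replaced.
 * Returns 0 on success, -1 if dst is too small to hold the result.
 */
int
bcp47_to_posix(const char *tag, char *dst, size_t dstlen)
{
	char	   *hyphen;
	size_t		len;

	assert(tag != NULL && dst != NULL);

	len = strlen(tag);
	if (len + 1 > dstlen)
		return -1;
	memcpy(dst, tag, len + 1);	/* includes the null terminator */

	hyphen = strchr(dst, '-');
	if (hyphen)
		*hyphen = '_';
	return 0;
}
```

The caller supplies the buffer, so nothing here depends on static storage or on a particular allocator. The string stored in collcollate or datcollate would stay in the form the OS expects and would not go through this conversion.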

With my garbage collector hat on, I would like to rip out all of the
support for traditional locale names, eventually. Deleting kludgy
code is easy and fun -- 0002 is a first swing at that -- but there
remains an important unanswered question. How should someone
pg_upgrade an "English_Canada.1252" cluster if we now reject that name?
We'd need to do a conversion to "en-CA", or somehow tell the user to.
Hmmmm.

Is there a safe way to do that in pg_upgrade or would we be forcing users
to pg_dump into the new cluster?

[1]: /messages/by-id/0050ec23-34d9-2765-9015-98c04f0e18ac@postgrespro.ru

Regards,

Juan José Santamaría Flecha

#13Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Thomas Munro (#11)
Re: Windows default locale vs initdb

On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

As for whether "accordingly" still applies, by the logic of
win32_langinfo()... Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not
sure what to make of that. The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want. Hmph.

Still, WIN1252 is not the wrong answer for what we are asking. Even if you
enable UTF-8 support [1], the system will use the current default Windows
ANSI code page (ACP) for the locale and UTF-8 for the code page.

[1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

Regards,

Juan José Santamaría Flecha

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Juan José Santamaría Flecha (#13)
3 attachment(s)
Re: Windows default locale vs initdb

On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

As for whether "accordingly" still applies, by the logic of
win32_langinfo()... Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not
sure what to make of that. The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want. Hmph.

Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means. Suppose we decided to
insist by adding a ".UTF-8" suffix to the name, as that page says we
can now that we're on Windows 10+, when building the default locale
name (see experimental 0002 patch, attached). It initially seemed to
have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed [1]:

SELECT 'i'::citext = 'İ'::citext AS t;
 t
 ---
- t
+ f
 (1 row)
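
For readers following along, here is a portable sketch of the suffix handling that the experimental 0002 patch plays with: append ".UTF-8" when building the default name, and cut the name at the first '.' before handing it to an API that wants a bare tag. The helper names append_utf8_suffix() and trim_encoding() are hypothetical and are not the patch's own code. Note that cutting at the first dot is only safe for BCP 47 style input; a traditional name such as "Chinese (Traditional)_Hong Kong S.A.R..950" contains dots in the country part.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<name>.UTF-8" in dst.
 * Returns 0 on success, -1 if the result does not fit in dst.
 */
int
append_utf8_suffix(const char *name, char *dst, size_t dstlen)
{
	int			n;

	assert(name != NULL && dst != NULL);

	n = snprintf(dst, dstlen, "%s.UTF-8", name);
	return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}

/*
 * Hypothetical helper: cut the name at the first '.', in place, leaving a
 * bare tag such as "en-US".  Names without a dot are left unchanged.
 */
void
trim_encoding(char *name)
{
	char	   *dot;

	assert(name != NULL);

	dot = strchr(name, '.');
	if (dot)
		*dot = '\0';
}
```

Using snprintf() with an explicit length check avoids having to size the allocation by hand.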

About the pg_upgrade problem, maybe it's OK ... existing old format
names should continue to work, but we can still remove the weird code
that does locale name tweaking, right? pg_upgraded databases should
contain fixed names (i.e. names that were already fixed by the old initdb,
so they should continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending
patches to CI here...

[1]: https://cirrus-ci.com/task/6423238052937728

Attachments:

v3-0001-Default-to-BCP-47-locale-in-initdb-on-Windows.patch (text/x-patch; charset=UTF-8)
From b007eb45e575956d5035f4152f72177abddc2762 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v3 1/3] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, on Windows use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 ++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b656ca489f 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on English words,
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..3af08b7b99 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2007,6 +2011,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2022,10 +2027,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2054,6 +2079,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.30.2

v3-0002-Default-to-UTF-8-in-initdb-on-Windows.patch (text/x-patch; charset=US-ASCII)
From 430fda564b3a36f346a672668e5181f43132368e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 20 Jul 2022 22:38:50 +1200
Subject: [PATCH v3 2/3] Default to UTF-8 in initdb on Windows.

---
 src/backend/utils/adt/pg_locale.c | 13 ++++++++++++-
 src/bin/initdb/initdb.c           |  4 +++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 607a4b7340..6242dc094d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1687,8 +1687,19 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		 */
 		NLSVERSIONINFOEX version = {sizeof(NLSVERSIONINFOEX)};
 		WCHAR		wide_collcollate[LOCALE_NAME_MAX_LENGTH];
+		char		copy_collcollate[LOCALE_NAME_MAX_LENGTH];
 
-		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
+		/* Trim off encoding, if there is one */
+		strlcpy(copy_collcollate, collcollate, sizeof(copy_collcollate));
+		for (char *p = copy_collcollate; *p; ++p)
+		{
+			if (*p == '.')
+			{
+				*p = 0;
+				break;
+			}
+		}
+		MultiByteToWideChar(CP_ACP, 0, copy_collcollate, -1, wide_collcollate,
 							LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
 		{
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3af08b7b99..1170b80d75 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2040,7 +2040,9 @@ check_locale_name(int category, const char *locale, char **canonname)
 								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
 			pg_fatal("failed to convert locale name: error code %lu",
 					 GetLastError());
-		locale_copy = pg_strdup(name);
+		/* default to UTF-8, no matter what the system default codepage is */
+		locale_copy = pg_malloc(strlen(name) + sizeof(".UTF-8") + 1);
+		sprintf(locale_copy, "%s.UTF-8", name);
 #else
 		/* use environment to find the default */
 		locale_copy = pg_strdup("");
-- 
2.30.2

v3-0003-Remove-support-for-old-Windows-locale-names.patch (text/x-patch; charset=US-ASCII)
From 5281778fbcd20155fd4613732c00e140f7745842 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 08:53:08 +1200
Subject: [PATCH v3 3/3] Remove support for old Windows locale names.

We now use BCP 47 locale names by default and also advise those for
explicit use.  Remove support for munging the old unstable and
unsystematic English word-style locale names.

If you explicitly provide a traditional value like "English_United
States.1252", it will still work, but we won't do any kind of cleanup on
the name.  This should be enough to work with pg_upgrade'd systems from
before we started using BCP 47.

XXX Is that true?
---
 configure                         |   6 -
 configure.ac                      |   1 -
 doc/src/sgml/charset.sgml         |   2 +-
 src/backend/utils/adt/pg_locale.c | 231 +++---------------------------
 src/include/port/win32_port.h     |   9 --
 src/port/win32setlocale.c         | 193 -------------------------
 src/tools/msvc/Mkvcbuild.pm       |   2 +-
 7 files changed, 19 insertions(+), 425 deletions(-)
 delete mode 100644 src/port/win32setlocale.c

diff --git a/configure b/configure
index 59fa82b8d7..d0da0c0a40 100755
--- a/configure
+++ b/configure
@@ -17180,12 +17180,6 @@ esac
  ;;
 esac
 
-  case " $LIBOBJS " in
-  *" win32setlocale.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS win32setlocale.$ac_objext"
- ;;
-esac
-
   case " $LIBOBJS " in
   *" win32stat.$ac_objext "* ) ;;
   *) LIBOBJS="$LIBOBJS win32stat.$ac_objext"
diff --git a/configure.ac b/configure.ac
index 612dabf698..c0906429be 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1996,7 +1996,6 @@ if test "$PORTNAME" = "win32"; then
   AC_LIBOBJ(win32error)
   AC_LIBOBJ(win32ntdll)
   AC_LIBOBJ(win32security)
-  AC_LIBOBJ(win32setlocale)
   AC_LIBOBJ(win32stat)
   AC_DEFINE([HAVE_SYMLINK], 1,
             [Define to 1 if you have the `symlink' function.])
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index b656ca489f..3c12f3f344 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -90,7 +90,7 @@ initdb --locale=sv_SE
     For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
     Windows also supports more verbose locale names based on English words,
     such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
-    but these are not recommended.
+    but these should not be used in PostgreSQL.
    </para>
 
    <para>
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6242dc094d..f7f51a7055 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -118,7 +118,7 @@ static HTAB *collation_cache = NULL;
 
 
 #if defined(WIN32) && defined(LC_MESSAGES)
-static char *IsoLocaleName(const char *);	/* MSVC specific */
+static char *PosixLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
@@ -204,10 +204,7 @@ pg_perm_setlocale(int category, const char *locale)
 		case LC_MESSAGES:
 			envvar = "LC_MESSAGES";
 #ifdef WIN32
-			result = IsoLocaleName(locale);
-			if (result == NULL)
-				result = (char *) locale;
-			elog(DEBUG3, "IsoLocaleName() executed; locale: \"%s\"", result);
+			locale = PosixLocaleName(locale);
 #endif							/* WIN32 */
 			break;
 #endif							/* LC_MESSAGES */
@@ -905,218 +902,35 @@ cache_locale_time(void)
 
 #if defined(WIN32) && defined(LC_MESSAGES)
 /*
- * Convert a Windows setlocale() argument to a Unix-style one.
+ * Convert a Windows BCP 47 locale name to a POSIX one.
  *
  * Regardless of platform, we install message catalogs under a Unix-style
  * LL[_CC][.ENCODING][@VARIANT] naming convention.  Only LC_MESSAGES settings
  * following that style will elicit localized interface strings.
  *
- * Before Visual Studio 2012 (msvcr110.dll), Windows setlocale() accepted "C"
- * (but not "c") and strings of the form <Language>[_<Country>][.<CodePage>],
- * case-insensitive.  setlocale() returns the fully-qualified form; for
- * example, setlocale("thaI") returns "Thai_Thailand.874".  Internally,
- * setlocale() and _create_locale() select a "locale identifier"[1] and store
- * it in an undocumented _locale_t field.  From that LCID, we can retrieve the
- * ISO 639 language and the ISO 3166 country.  Character encoding does not
- * matter, because the server and client encodings govern that.
- *
- * Windows Vista introduced the "locale name" concept[2], closely following
- * RFC 4646.  Locale identifiers are now deprecated.  Starting with Visual
- * Studio 2012, setlocale() accepts locale names in addition to the strings it
- * accepted historically.  It does not standardize them; setlocale("Th-tH")
- * returns "Th-tH".  setlocale(category, "") still returns a traditional
- * string.  Furthermore, msvcr110.dll changed the undocumented _locale_t
- * content to carry locale names instead of locale identifiers.
- *
- * Visual Studio 2015 should still be able to do the same as Visual Studio
- * 2012, but the declaration of locale_name is missing in _locale_t, causing
- * this code compilation to fail, hence this falls back instead on to
- * enumerating all system locales by using EnumSystemLocalesEx to find the
- * required locale name.  If the input argument is in Unix-style then we can
- * get ISO Locale name directly by using GetLocaleInfoEx() with LCType as
- * LOCALE_SNAME.
- *
- * MinGW headers declare _create_locale(), but msvcrt.dll lacks that symbol in
- * releases before Windows 8. IsoLocaleName() always fails in a MinGW-built
- * postgres.exe, so only Unix-style values of the lc_messages GUC can elicit
- * localized messages. In particular, every lc_messages setting that initdb
- * can select automatically will yield only C-locale messages. XXX This could
- * be fixed by running the fully-qualified locale name through a lookup table.
- *
- * This function returns a pointer to a static buffer bearing the converted
- * name or NULL if conversion fails.
- *
- * [1] https://docs.microsoft.com/en-us/windows/win32/intl/locale-identifiers
- * [2] https://docs.microsoft.com/en-us/windows/win32/intl/locale-names
- */
-
-/*
- * Callback function for EnumSystemLocalesEx() in get_iso_localename().
- *
- * This function enumerates all system locales, searching for one that matches
- * an input with the format: <Language>[_<Country>], e.g.
- * English[_United States]
- *
- * The input is a three wchar_t array as an LPARAM. The first element is the
- * locale_name we want to match, the second element is an allocated buffer
- * where the Unix-style locale is copied if a match is found, and the third
- * element is the search status, 1 if a match was found, 0 otherwise.
- */
-static BOOL CALLBACK
-search_locale_enum(LPWSTR pStr, DWORD dwFlags, LPARAM lparam)
-{
-	wchar_t		test_locale[LOCALE_NAME_MAX_LENGTH];
-	wchar_t   **argv;
-
-	(void) (dwFlags);
-
-	argv = (wchar_t **) lparam;
-	*argv[2] = (wchar_t) 0;
-
-	memset(test_locale, 0, sizeof(test_locale));
-
-	/* Get the name of the <Language> in English */
-	if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHLANGUAGENAME,
-						test_locale, LOCALE_NAME_MAX_LENGTH))
-	{
-		/*
-		 * If the enumerated locale does not have a hyphen ("en") OR the
-		 * lc_message input does not have an underscore ("English"), we only
-		 * need to compare the <Language> tags.
-		 */
-		if (wcsrchr(pStr, '-') == NULL || wcsrchr(argv[0], '_') == NULL)
-		{
-			if (_wcsicmp(argv[0], test_locale) == 0)
-			{
-				wcscpy(argv[1], pStr);
-				*argv[2] = (wchar_t) 1;
-				return FALSE;
-			}
-		}
-
-		/*
-		 * We have to compare a full <Language>_<Country> tag, so we append
-		 * the underscore and name of the country/region in English, e.g.
-		 * "English_United States".
-		 */
-		else
-		{
-			size_t		len;
-
-			wcscat(test_locale, L"_");
-			len = wcslen(test_locale);
-			if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
-								test_locale + len,
-								LOCALE_NAME_MAX_LENGTH - len))
-			{
-				if (_wcsicmp(argv[0], test_locale) == 0)
-				{
-					wcscpy(argv[1], pStr);
-					*argv[2] = (wchar_t) 1;
-					return FALSE;
-				}
-			}
-		}
-	}
-
-	return TRUE;
-}
-
-/*
- * This function converts a Windows locale name to an ISO formatted version
- * for Visual Studio 2015 or greater.
- *
- * Returns NULL, if no valid conversion was found.
+ * Historically, verbose, but unsystematic and unstable names like
+ * "Thai_Thailand.874" were supported, but now only BCP 47 input is expected.
+ * That means we just need to be able to convert "en-US" to "en_US".
  */
 static char *
-get_iso_localename(const char *winlocname)
+PosixLocaleName(const char *winlocname)
 {
-	wchar_t		wc_locale_name[LOCALE_NAME_MAX_LENGTH];
-	wchar_t		buffer[LOCALE_NAME_MAX_LENGTH];
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
-	char	   *period;
-	int			len;
-	int			ret_val;
-
-	/*
-	 * Valid locales have the following syntax:
-	 * <Language>[_<Country>[.<CodePage>]]
-	 *
-	 * GetLocaleInfoEx can only take locale name without code-page and for the
-	 * purpose of this API the code-page doesn't matter.
-	 */
-	period = strchr(winlocname, '.');
-	if (period != NULL)
-		len = period - winlocname;
-	else
-		len = pg_mbstrlen(winlocname);
-
-	memset(wc_locale_name, 0, sizeof(wc_locale_name));
-	memset(buffer, 0, sizeof(buffer));
-	MultiByteToWideChar(CP_ACP, 0, winlocname, len, wc_locale_name,
-						LOCALE_NAME_MAX_LENGTH);
-
-	/*
-	 * If the lc_messages is already a Unix-style string, we have a direct
-	 * match with LOCALE_SNAME, e.g. en-US, en_US.
-	 */
-	ret_val = GetLocaleInfoEx(wc_locale_name, LOCALE_SNAME, (LPWSTR) &buffer,
-							  LOCALE_NAME_MAX_LENGTH);
-	if (!ret_val)
-	{
-		/*
-		 * Search for a locale in the system that matches language and country
-		 * name.
-		 */
-		wchar_t    *argv[3];
-
-		argv[0] = wc_locale_name;
-		argv[1] = buffer;
-		argv[2] = (wchar_t *) &ret_val;
-		EnumSystemLocalesEx(search_locale_enum, LOCALE_WINDOWS, (LPARAM) argv,
-							NULL);
-	}
-
-	if (ret_val)
-	{
-		size_t		rc;
-		char	   *hyphen;
-
-		/* Locale names use only ASCII, any conversion locale suffices. */
-		rc = wchar2char(iso_lc_messages, buffer, sizeof(iso_lc_messages), NULL);
-		if (rc == -1 || rc == sizeof(iso_lc_messages))
-			return NULL;
-
-		/*
-		 * Simply replace the hyphen with an underscore.  See comments in
-		 * IsoLocaleName.
-		 */
-		hyphen = strchr(iso_lc_messages, '-');
-		if (hyphen)
-			*hyphen = '_';
-		return iso_lc_messages;
-	}
-
-	return NULL;
-}
-
-static char *
-IsoLocaleName(const char *winlocname)
-{
-#if defined(_MSC_VER)
-	static char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char iso_lc_messages[LOCALE_NAME_MAX_LENGTH];
+	char *hyphen;
 
 	if (pg_strcasecmp("c", winlocname) == 0 ||
 		pg_strcasecmp("posix", winlocname) == 0)
 	{
 		strcpy(iso_lc_messages, "C");
-		return iso_lc_messages;
 	}
 	else
-		return get_iso_localename(winlocname);
-
-#endif							/* defined(_MSC_VER) */
-	return NULL;				/* Not supported on this version of msvc/mingw */
+	{
+		strlcpy(iso_lc_messages, winlocname, sizeof(iso_lc_messages));
+		hyphen = strchr(iso_lc_messages, '-');
+		if (hyphen)
+			*hyphen = '_';
+	}
+	return pstrdup(iso_lc_messages);
 }
 #endif							/* WIN32 && LC_MESSAGES */
 
@@ -1680,11 +1494,6 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			ereport(ERROR,
 					(errmsg("could not load locale \"%s\"", collcollate)));
 #elif defined(WIN32)
-		/*
-		 * If we are targeting Windows Vista and above, we can ask for a name
-		 * given a collation name (earlier versions required a location code
-		 * that we don't have).
-		 */
 		NLSVERSIONINFOEX version = {sizeof(NLSVERSIONINFOEX)};
 		WCHAR		wide_collcollate[LOCALE_NAME_MAX_LENGTH];
 		char		copy_collcollate[LOCALE_NAME_MAX_LENGTH];
@@ -1703,13 +1512,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 							LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
 		{
-			/*
-			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
-			 * locale name like "English_United States.1252".  Until those
-			 * values can be prevented from entering the system, or 100%
-			 * reliably converted to the more useful tag format, tolerate the
-			 * resulting error and report that we have no version data.
-			 */
+			/* Old style locale names fail here, so ignore. */
 			if (GetLastError() == ERROR_INVALID_PARAMETER)
 				return NULL;
 
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 4de5bf3bf6..a569433811 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -445,15 +445,6 @@ extern int	_pgstat64(const char *name, struct stat *buf);
 #undef setlocale
 #endif
 
-/*
- * Define our own wrapper macro around setlocale() to work around bugs in
- * Windows' native setlocale() function.
- */
-extern char *pgwin32_setlocale(int category, const char *locale);
-
-#define setlocale(a,b) pgwin32_setlocale(a,b)
-
-
 /* In backend/port/win32/signal.c */
 extern PGDLLIMPORT volatile int pg_signal_queue;
 extern PGDLLIMPORT int pg_signal_mask;
diff --git a/src/port/win32setlocale.c b/src/port/win32setlocale.c
deleted file mode 100644
index aadd09a4e9..0000000000
--- a/src/port/win32setlocale.c
+++ /dev/null
@@ -1,193 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32setlocale.c
- *		Wrapper to work around bugs in Windows setlocale() implementation
- *
- * Copyright (c) 2011-2022, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *	  src/port/win32setlocale.c
- *
- *
- * The setlocale() function in Windows is broken in two ways. First, it
- * has a problem with locale names that have a dot in the country name. For
- * example:
- *
- * "Chinese (Traditional)_Hong Kong S.A.R..950"
- *
- * For some reason, setlocale() doesn't accept that as argument, even though
- * setlocale(LC_ALL, NULL) returns exactly that. Fortunately, it accepts
- * various alternative names for such countries, so to work around the broken
- * setlocale() function, we map the troublemaking locale names to accepted
- * aliases, before calling setlocale().
- *
- * The second problem is that the locale name for "Norwegian (Bokm&aring;l)"
- * contains a non-ASCII character. That's problematic, because it's not clear
- * what encoding the locale name itself is supposed to be in, when you
- * haven't yet set a locale. Also, it causes problems when the cluster
- * contains databases with different encodings, as the locale name is stored
- * in the pg_database system catalog. To work around that, when setlocale()
- * returns that locale name, map it to a pure-ASCII alias for the same
- * locale.
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#undef setlocale
-
-struct locale_map
-{
-	/*
-	 * String in locale name to replace. Can be a single string (end is NULL),
-	 * or separate start and end strings. If two strings are given, the locale
-	 * name must contain both of them, and everything between them is
-	 * replaced. This is used for a poor-man's regexp search, allowing
-	 * replacement of "start.*end".
-	 */
-	const char *locale_name_start;
-	const char *locale_name_end;
-
-	const char *replacement;	/* string to replace the match with */
-};
-
-/*
- * Mappings applied before calling setlocale(), to the argument.
- */
-static const struct locale_map locale_map_argument[] = {
-	/*
-	 * "HKG" is listed here:
-	 * http://msdn.microsoft.com/en-us/library/cdax410z%28v=vs.71%29.aspx
-	 * (Country/Region Strings).
-	 *
-	 * "ARE" is the ISO-3166 three-letter code for U.A.E. It is not on the
-	 * above list, but seems to work anyway.
-	 */
-	{"Hong Kong S.A.R.", NULL, "HKG"},
-	{"U.A.E.", NULL, "ARE"},
-
-	/*
-	 * The ISO-3166 country code for Macau S.A.R. is MAC, but Windows doesn't
-	 * seem to recognize that. And Macau isn't listed in the table of accepted
-	 * abbreviations linked above. Fortunately, "ZHM" seems to be accepted as
-	 * an alias for "Chinese (Traditional)_Macau S.A.R..950". I'm not sure
-	 * where "ZHM" comes from, must be some legacy naming scheme. But hey, it
-	 * works.
-	 *
-	 * Note that unlike HKG and ARE, ZHM is an alias for the *whole* locale
-	 * name, not just the country part.
-	 *
-	 * Some versions of Windows spell it "Macau", others "Macao".
-	 */
-	{"Chinese (Traditional)_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macau S.A.R..950", NULL, "ZHM"},
-	{"Chinese (Traditional)_Macao S.A.R..950", NULL, "ZHM"},
-	{"Chinese_Macao S.A.R..950", NULL, "ZHM"},
-	{NULL, NULL, NULL}
-};
-
-/*
- * Mappings applied after calling setlocale(), to its return value.
- */
-static const struct locale_map locale_map_result[] = {
-	/*
-	 * "Norwegian (Bokm&aring;l)" locale name contains the a-ring character.
-	 * Map it to a pure-ASCII alias.
-	 *
-	 * It's not clear what encoding setlocale() uses when it returns the
-	 * locale name, so to play it safe, we search for "Norwegian (Bok*l)".
-	 *
-	 * Just to make life even more complicated, some versions of Windows spell
-	 * the locale name without parentheses.  Translate that too.
-	 */
-	{"Norwegian (Bokm", "l)_Norway", "Norwegian_Norway"},
-	{"Norwegian Bokm", "l_Norway", "Norwegian_Norway"},
-	{NULL, NULL, NULL}
-};
-
-#define MAX_LOCALE_NAME_LEN		100
-
-static const char *
-map_locale(const struct locale_map *map, const char *locale)
-{
-	static char aliasbuf[MAX_LOCALE_NAME_LEN];
-	int			i;
-
-	/* Check if the locale name matches any of the problematic ones. */
-	for (i = 0; map[i].locale_name_start != NULL; i++)
-	{
-		const char *needle_start = map[i].locale_name_start;
-		const char *needle_end = map[i].locale_name_end;
-		const char *replacement = map[i].replacement;
-		char	   *match;
-		char	   *match_start = NULL;
-		char	   *match_end = NULL;
-
-		match = strstr(locale, needle_start);
-		if (match)
-		{
-			/*
-			 * Found a match for the first part. If this was a two-part
-			 * replacement, find the second part.
-			 */
-			match_start = match;
-			if (needle_end)
-			{
-				match = strstr(match_start + strlen(needle_start), needle_end);
-				if (match)
-					match_end = match + strlen(needle_end);
-				else
-					match_start = NULL;
-			}
-			else
-				match_end = match_start + strlen(needle_start);
-		}
-
-		if (match_start)
-		{
-			/* Found a match. Replace the matched string. */
-			int			matchpos = match_start - locale;
-			int			replacementlen = strlen(replacement);
-			char	   *rest = match_end;
-			int			restlen = strlen(rest);
-
-			/* check that the result fits in the static buffer */
-			if (matchpos + replacementlen + restlen + 1 > MAX_LOCALE_NAME_LEN)
-				return NULL;
-
-			memcpy(&aliasbuf[0], &locale[0], matchpos);
-			memcpy(&aliasbuf[matchpos], replacement, replacementlen);
-			/* includes null terminator */
-			memcpy(&aliasbuf[matchpos + replacementlen], rest, restlen + 1);
-
-			return aliasbuf;
-		}
-	}
-
-	/* no match, just return the original string */
-	return locale;
-}
-
-char *
-pgwin32_setlocale(int category, const char *locale)
-{
-	const char *argument;
-	char	   *result;
-
-	if (locale == NULL)
-		argument = NULL;
-	else
-		argument = map_locale(locale_map_argument, locale);
-
-	/* Call the real setlocale() function */
-	result = setlocale(category, argument);
-
-	/*
-	 * setlocale() is specified to return a "char *" that the caller is
-	 * forbidden to modify, so casting away the "const" is innocuous.
-	 */
-	if (result)
-		result = unconstify(char *, map_locale(locale_map_result, result));
-
-	return result;
-}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index cc7a908d10..fff378d2b3 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -109,7 +109,7 @@ sub mkvcbuild
 	  pqsignal.c mkdtemp.c qsort.c qsort_arg.c bsearch_arg.c quotes.c system.c
 	  strerror.c tar.c
 	  win32env.c win32error.c win32ntdll.c
-	  win32security.c win32setlocale.c win32stat.c);
+	  win32security.c win32stat.c);
 
 	push(@pgportfiles, 'strtof.c') if ($vsVersion < '14.00');
 
-- 
2.30.2

#15Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Thomas Munro (#14)
Re: Windows default locale vs initdb

On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

Still, WIN1252 is not the wrong answer for what we are asking. Even if
you enable UTF-8 support [1], the system will use the current default
Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means. Suppose we decided to
insist by adding a ".UTF-8" suffix to the name, as that page says we
can now that we're on Windows 10+, when building the default locale
name (see experimental 0002 patch, attached). It initially seemed to
have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Let me try to explain this using the "Beta: Use Unicode UTF-8 for
worldwide language support" option [1].

- Currently in a system with the language settings of "English_United
States" and that option disabled, when executing initdb you get:

The database cluster will be initialized with locale "English_United
States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR: character with byte sequence 0xc5 0x9f in encoding "UTF8" has no
equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8, and is caused by
the tr_tr locales being encoded in WIN1254. We can discuss this in another
thread, and I can propose a patch.
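
To make the byte sequence in that error message concrete, here is a minimal sketch that decodes a two-byte UTF-8 sequence by hand. The helper name decode_utf8_2byte is hypothetical and is not PostgreSQL code; it only illustrates that 0xc5 0x9f is U+015F (the "ş" in "şubat"), a character that WIN1252 cannot represent.

```c
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only (not part of PostgreSQL).
 *
 * Decode a two-byte UTF-8 sequence of the form 110xxxxx 10xxxxxx into a
 * Unicode code point.  Returns -1 if the bytes are not a well-formed,
 * shortest-form two-byte sequence.
 */
static long
decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
	long		cp;

	/* the lead byte must be 110xxxxx, the trail byte 10xxxxxx */
	if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
		return -1;

	/* five payload bits from the lead byte, six from the trail byte */
	cp = ((long) (b0 & 0x1F) << 6) | (long) (b1 & 0x3F);

	/* reject overlong encodings of code points below U+0080 */
	if (cp < 0x80)
		return -1;

	return cp;
}
```

Calling decode_utf8_2byte(0xC5, 0x9F) yields 0x15F, which can be printed with printf("U+%04lX\n", ...) if desired.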

- If we enable the UTF-8 support option, then the same test goes as:

The database cluster will be initialized with locale "English_United
States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
to_char
---------
şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR; What I want to show through this example is that Windows ACP is not
modified by setlocale(), it can only be done through the Windows registry
and only in recent releases.

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql
failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
t
---
- t
+ f
(1 row)

This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ

- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i

Latin_capital_dotted doesn't have the same lower value.

[1]: https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do

Regards,

Juan José Santamaría Flecha

#16Thomas Munro
thomas.munro@gmail.com
In reply to: Juan José Santamaría Flecha (#15)
Re: Windows default locale vs initdb

On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the
v3-0001 and v3-0003 patches might have a future. And it sounds like
we might need to investigate maybe defending ourselves against the ACP
being different than what we expect (ie not matching the database
encoding)? Did I understand correctly that you're looking into that?

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#16)
Re: Windows default locale vs initdb

On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the
v3-0001 and v3-0003 patches might have a future. And it sounds like
we might need to investigate maybe defending ourselves against the ACP
being different than what we expect (ie not matching the database
encoding)? Did I understand correctly that you're looking into that?

I'm going to withdraw this entry. The sooner we get something like
0001 into a release, the sooner the world will be rid of PostgreSQL
clusters initialised with the bad old locale names that the manual
very clearly tells you not to use for databases.... but I don't
understand this ACP/registry vs database encoding stuff and how it
relates to the use of BCP47 locale names, which puts me off changing
anything until we do.

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#17)
1 attachment(s)
Re: Windows default locale vs initdb

Another country has changed its name, and a Windows OS update has
again broken every PostgreSQL cluster in that whole country[1] (or at
least those that had accepted initdb's default choice of locale,
probably most). Let's get to the bottom of this, because otherwise it
is simply going to keep happening, causing administrative pain for a
lot of people.

Here is a rebase of the basic patch I proposed last time, and a
re-statement of what we know:

1. initdb chooses a default locale using a technique that gives you
an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"),
non-ASCII ("Norwegian (Bokmål)") string that we are warned we should
not store anywhere. We store it, and then later it is not recognised.
Instead we should select an IETF BCP 47 locale name, based on stable
ISO country and language codes, like "en-US", "tr-TR" etc. Here is
the patch to teach initdb to use that, unchanged from v3 except that I
tweaked the docs a bit.

2. In Windows 10+ it is now also possible to put ".UTF-8" on the end
of locale names. I couldn't figure out whether we should do that, and
what effect it has on ctypes -- apparently not the effect I expected
(see upthread). Was our UTF-8 support on Windows already broken, and
this new ".UTF-8" thing is just a new way to reach that brokenness?
Is it OK to continue to choose the "legacy" single byte encodings by
default on that OS, and consider that a separate topic for separate
research?

3. It is not clear to me how we should deal with pg_upgrade.
Eventually we want all of the old-school names to fade away, and
pg_upgrade would need to be part of that. Perhaps there is some API
that can be used to translate to the new canonical forms without us
having to maintain translation tables and other messiness in our tree.

4. Eventually we should probably ban non-ASCII characters from
entering the relevant catalogues (they are shared, so their encoding
is undefined except that it must be a superset of ASCII), and delete
all the old win32setlocale.c kludges, after we reach a point where
everyone should be using exclusively BCP 47.
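
As a rough illustration of point 4, a check along the following lines could reject names that are not pure ASCII before they are stored. This is only a sketch: locale_name_is_ascii is a hypothetical name, and nothing here is taken from the attached patch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only (not part of the patch).
 *
 * Return true if every byte of the locale name is 7-bit ASCII, so that the
 * string means the same thing in every ASCII-superset encoding.  A NULL
 * name is treated as not acceptable.
 */
static bool
locale_name_is_ascii(const char *name)
{
	const unsigned char *p;

	if (name == NULL)
		return false;

	for (p = (const unsigned char *) name; *p != '\0'; p++)
	{
		/* any byte with the high bit set is outside ASCII */
		if (*p > 0x7F)
			return false;
	}

	return true;
}
```

With this check, a BCP 47 name such as "en-US" passes, while an old-style name containing "Bokmål" fails regardless of which encoding the non-ASCII bytes happen to be in.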

[1]: /messages/by-id/18196-b10f93dfbde3d7db@postgresql.org

Attachments:

v4-0001-Default-to-IETF-BCP-47-locale-names-in-initdb-on-.patch (application/x-patch)
From d015005cca08bc1c7ae487392ed7b5a4cfa58748 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v4] Default to IETF BCP 47 locale names in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and not recommended for use in
databases and (2) they may contain non-ASCII characters, which we can't
put in our shared catalogs.  Since setlocale() returns such names, on
Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 74783d148f..9a2cd5c2d5 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on full names
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended because they are not stable across operating
+    system updates due to changes in geographical names, and may contain
+    non-ASCII characters which are not supported in PostgreSQL's shared
+    catalogs.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..021e847240 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2132,6 +2136,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2147,10 +2152,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2183,6 +2208,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.39.2

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#18)
Re: Windows default locale vs initdb

I clicked "Trigger" to get a Mingw test run of this, and it failed[1].
I see why: our function win32_langinfo() believes that it shouldn't
call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb:
error: could not find suitable encoding for locale "en-US"'. I think
it has fallback code that parses the ".1252" or whatever on the end of
the name, but "en-US" hasn't got one. I don't know the first thing
about Mingw but it looks like a declaration for that function arrived
6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the
problem and the tests pass[3]. As far as I know, we don't support any
Mingw but the very latest: it's not a target with real users who have
version requirements, it's just a developer [in]convenience, so if it
passes on CI and whatever MSYS version "fairywren" runs in the build
farm right now, that should be enough.
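
To illustrate why a suffix-parsing fallback cannot work for the new names, here is a minimal sketch of that kind of parsing. It is an assumption about the general approach, not a copy of win32_langinfo(), and locale_codepage_suffix is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch, for illustration only (not the code in chklocale.c).
 *
 * Return a pointer to the text after the last '.' in a locale name, e.g.
 * "1252" for "English_United States.1252".  Return NULL if the name has no
 * '.' or nothing follows it, which is the case for a plain BCP 47 name
 * such as "en-US".
 */
static const char *
locale_codepage_suffix(const char *locale)
{
	const char *dot;

	if (locale == NULL)
		return NULL;

	/* use the last dot: country names like "Hong Kong S.A.R." contain dots */
	dot = strrchr(locale, '.');
	if (dot == NULL || dot[1] == '\0')
		return NULL;

	return dot + 1;
}
```

For "en-US" this returns NULL, so a caller relying only on the suffix has no code page to go on and needs an API lookup such as GetLocaleInfoEx() instead.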

I could just do that in this patch, but I suppose that also means that
someone needs to go through pg_locale.c and other places that test
_MSC_VER not because they actually care about the compiler but because
they want to detect some crusty old Mingw version, and see what else
can be deleted as a result, possibly including a lot of fallback code.
It feels like a separate cleanup for a separate patch.

[1]: https://cirrus-ci.com/task/5301814774464512
[2]: https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3]: https://cirrus-ci.com/task/6558569718349824

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#19)
Re: Windows default locale vs initdb

Here is a thought that occurs to me, as I follow along with Jeff
Davis's evolving proposals for built-in collations and ctypes: What
would stop us from dropping support for the libc (sic) provider on
Windows? That may sound radical and likely to cause extra work for
people on upgrade, but how does that compare to the pain of keeping
this barely maintained code in the tree? Suppose the idea in this
thread goes ahead and we get people to transition to the modern locale
names: there is non-zero transitional/upgrade pain there too. How
delicious it would be to just nuke the whole thing from orbit, and
keep only cross-platform code that is maintained with enthusiasm by
active hackers.

That's probably a little extreme, but it's the direction my thoughts
start to go in when confronting the realisation that it's up to us
[Unix hackers making drive-by changes], no one is coming to help us
[from the Windows user community].

I've even heard others talk about dropping Windows completely, due to
the maintenance imbalance. This would be somewhat more fine grained.
(One could use a similar argument to drop non-NTFS filesystems and
turn on POSIX-mode file links, to end that other locus of struggle.)

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#20)
2 attachment(s)
Re: Windows default locale vs initdb

Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows; we just didn't
ever test with UTF-8 before Meson took over. It's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1]: /messages/by-id/CA+hUKGJZskvCh=Qm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA@mail.gmail.com
[2]: https://github.com/postgres/postgres/commit/cff4e5a3
[3]: /messages/by-id/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ@mail.gmail.com

Attachments:

v5-0001-MinGW-has-GetLocaleInfoEx.patch (text/x-patch; charset=US-ASCII)
From fb33b7eb5482bae31b70bb54dbe77325b543a89c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 20 Nov 2023 14:24:35 +1300
Subject: [PATCH v5 1/2] MinGW has GetLocaleInfoEx().

To use BCP 47 locale names like "en-US" without a suffix ".encoding", we
need to be able to call GetLocaleInfoEx() to look up the encoding.  That
was previously gated for MSVC only, but MinGW has had the function for
many years.  Remove that gating, because otherwise our MinGW build farm
animals would fail when a later commit switches to using the new names by
default.

There are probably other places where _MSC_VER is being used as a proxy
for detecting MinGW with an out-of-date idea about missing functions.

Discussion: https://postgr.es/m/CA%2BhUKGLsV3vTjPp7bOZBr3JTKp3Brkr9V0Qfmc7UvpWcmAQL4A%40mail.gmail.com
---
 src/port/chklocale.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 8cb81c8640e..a15b0d5349b 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -204,7 +204,6 @@ win32_langinfo(const char *ctype)
 	char	   *r = NULL;
 	char	   *codepage;
 
-#if defined(_MSC_VER)
 	uint32		cp;
 	WCHAR		wctype[LOCALE_NAME_MAX_LENGTH];
 
@@ -229,7 +228,6 @@ win32_langinfo(const char *ctype)
 		}
 	}
 	else
-#endif
 	{
 		/*
 		 * Locale format on Win32 is <Language>_<Country>.<CodePage>.  For
-- 
2.45.2

v5-0002-Default-to-IETF-BCP-47-locale-names-in-initdb-on-.patch (text/x-patch; charset=UTF-8)
From dc726a61aace86bda62687e3aa1411753ba3f1a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v5 2/2] Default to IETF BCP 47 locale names in initdb on
 Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and explicitly not recommended for
use in databases and (2) they may contain non-ASCII characters, which we
can't put in our shared catalogs.  Since setlocale() returns such names,
on Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Reviewed-by:
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..adb21eb0799 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on full names
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended because they are not stable across operating
+    system updates due to changes in geographical names, and may contain
+    non-ASCII characters which are not supported in PostgreSQL's shared
+    catalogs.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a0150..393232b6cec 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2132,6 +2136,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2147,10 +2152,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2183,6 +2208,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.45.2

#22Ertan Küçükoglu
ertan.kucukoglu@gmail.com
In reply to: Thomas Munro (#21)
Re: Windows default locale vs initdb

Hi,

I am a complete noob about PostgreSQL development.
I don't know about the PostgreSQL CI system.
I will be needing some help as to how to do the tests.
I have access to different Windows OSes (v10, Server 2022 mainly).
These systems can be set to English or Turkish locales if needed.
I can also add new Windows versions if needed.
I do not know how to use patch files. I am also not sure what tests I
should do.
Do I need to set up a Windows build system for PostgreSQL CI?
Will I download some files (EXE, etc) ready for testing? Copy them over an
existing installation for testing?

Thanks for your help.

Regards,
Ertan

On Mon, Jul 22, 2024 at 5:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1] /messages/by-id/CA+hUKGJZskvCh=Qm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA@mail.gmail.com
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] /messages/by-id/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ@mail.gmail.com

#23Zaid Shabbir
zaidshabbir@gmail.com
In reply to: Thomas Munro (#21)
Re: Windows default locale vs initdb

Hello Thomas,

Can you please list some of the use cases for the patch? Other than
Turkish, does this patch have an impact on other locales too?

Regards,
Zaid

On Mon, Jul 22, 2024 at 7:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1]
/messages/by-id/CA+hUKGJZskvCh=Qm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA@mail.gmail.com
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3]
/messages/by-id/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ@mail.gmail.com

#24Thomas Munro
thomas.munro@gmail.com
In reply to: Zaid Shabbir (#23)
Re: Windows default locale vs initdb

On Mon, Jul 22, 2024 at 8:38 PM Zaid Shabbir <zaidshabbir@gmail.com> wrote:

Can you please list some of the use cases for the patch? Other than Turkish, does this patch have an impact on other locales too?

Hi Zaid,

Yes, initdb.exe would use BCP47 codes by default for all languages.
Who knows which country will change its name next?

From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc.
The Windows manual says:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

It's pretty bad for our users when it happens and the Windows locale
name changes: a database cluster that suddenly can't start, and even
after you've figured out why and adjusted the references in
postgresql.conf, you still can't connect. There is also the problem
that some of the old full names have non-ASCII characters (Türkiye,
São Tomé and Príncipe, Curaçao, Côte d'Ivoire, Åland) which is bad at
least in theory because we use the string at times and in places where it
is not clear what encoding the name itself has.

I don't use Windows myself, I've just been watching this train wreck
replaying in a loop for long enough. Clearly it's going to take some
time to wean the user community off the unstable names, and it struck
me that the default is probably the main source of them in new
clusters, hence this patch.
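
To make the non-ASCII concern concrete, here is a minimal sketch in C of the kind of check that separates the two naming styles. `name_is_plain_ascii` is a hypothetical helper written for this illustration, not a function in the PostgreSQL tree, and the sample name is spelled as UTF-8 bytes only as an assumption about how such a name might arrive.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not PostgreSQL code: report whether a locale name
 * consists only of 7-bit ASCII bytes.  BCP 47 tags such as "tr-TR" always
 * pass; old-style names spelled with full country names may not.
 */
static bool
name_is_plain_ascii(const char *name)
{
	const unsigned char *p;

	for (p = (const unsigned char *) name; *p != '\0'; p++)
	{
		if (*p >= 0x80)
			return false;		/* byte outside 7-bit ASCII */
	}
	return true;
}
```

A name that fails this check cannot be interpreted without knowing which encoding it was written in, which is the ambiguity described above.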

#25Thomas Munro
thomas.munro@gmail.com
In reply to: Ertan Küçükoglu (#22)
Re: Windows default locale vs initdb

On Mon, Jul 22, 2024 at 8:04 PM Ertan Küçükoglu
<ertan.kucukoglu@gmail.com> wrote:

I am a complete noob about PostgreSQL development.
I don't know about the PostgreSQL CI system.
I will be needing some help as to how to do the tests.
I have access to different Windows OSes (v10, Server 2022 mainly).
These systems can be set to English or Turkish locales if needed.
I can also add new Windows versions if needed.
I do not know how to use patch files. I am also not sure what tests I should do.
Do I need to set up a Windows build system for PostgreSQL CI?
Will I download some files (EXE, etc) ready for testing? Copy them over an existing installation for testing?

Sorry, I didn't mean to put you on the spot :-) Yeah you'd need to
install a compiler, various libraries and tools to be able to build
from source with a patch. Unfortunately I'm not the best person to
explain how to do that on Windows as I don't use it. Honestly it
might be a bit too much new stuff to figure out at once just to test
this small patch. What I'd be hoping for is confirmation that there
are no weird unintended consequences or problems I'm not seeing since
I'm writing blind patches based on documentation only, but it's
probably too much to ask to figure out the whole development
environment and then go on an open ended expedition looking for
unknown problems.

#26Ertan Küçükoglu
ertan.kucukoglu@gmail.com
In reply to: Thomas Munro (#25)
Re: Windows default locale vs initdb

On Mon, 22 Jul 2024 at 14:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Sorry, I didn't mean to put you on the spot :-) Yeah you'd need to
install a compiler, various libraries and tools to be able to build
from source with a patch. Unfortunately I'm not the best person to
explain how to do that on Windows as I don't use it. Honestly it
might be a bit too much new stuff to figure out at once just to test
this small patch. What I'd be hoping for is confirmation that there
are no weird unintended consequences or problems I'm not seeing since
I'm writing blind patches based on documentation only, but it's
probably too much to ask to figure out the whole development
environment and then go on an open ended expedition looking for
unknown problems.

I already installed Visual Studio 2022 with C++ support as suggested in
https://www.postgresql.org/docs/current/install-windows-full.html
I cloned the source code onto the system.
But I cannot find any "src/tools/msvc" directory. It is missing.
The documentation states that I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in the
src\tools\msvc directory."
It seems I will need help setting up the build environment.

#27Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#21)
Re: Windows default locale vs initdb

On 2024-07-21 Su 10:51 PM, Thomas Munro wrote:

Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1] /messages/by-id/CA+hUKGJZskvCh=Qm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA@mail.gmail.com
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] /messages/by-id/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ@mail.gmail.com

I have an environment I can use for testing. But what exactly am I
testing? :-) Install a few "problem" language/region settings, switch
the system and ensure initdb runs ok?

Other than Turkish, which locales should I install?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#28Ertan Küçükoglu
ertan.kucukoglu@gmail.com
In reply to: Andrew Dunstan (#27)
Re: Windows default locale vs initdb

On Mon, 22 Jul 2024 at 16:44, Andrew Dunstan <andrew@dunslane.net> wrote:

I have an environment I can use for testing. But what exactly am I
testing? :-) Install a few "problem" language/region settings, switch
the system and ensure initdb runs ok?

Other than Turkish, which locales should I install?

Thomas earlier listed a few:
"From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc."

I am not sure whether all of them need testing, though.

Thanks & Regards,
Ertan

#29Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#27)
Re: Windows default locale vs initdb

On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:

I have an environment I can use for testing. But what exactly am I
testing? :-) Install a few "problem" language/region settings, switch
the system and ensure initdb runs ok?

I just want to know about any weird unexpected consequences of using
BCP47 locale names, before we change the default in v18. The only
concrete thing I found so far was that MinGW didn't like it, but I
provided a fix for that. It'd still be possible to initialise a new
cluster with the old style names if you really want to, but you'd have
to pass it in explicitly; I was wondering if that could be necessary
in some pg_upgrade scenario but I guess not, it just clobbers
template0's pg_database row with values from the source database, and
recreates everything else so I think it should be fine (?). I am a
little uneasy about the new names not having .encoding but there
doesn't seem to be an issue with that (such locales exist on Unix
too), and the OS still knows which encoding they use in that case.
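
As a rough illustration of the ".encoding" point, the sketch below separates names that carry an explicit suffix from names that leave the encoding to the OS. `locale_encoding_suffix` is a hypothetical helper for this example only; it does not perform the OS lookup (which the patch set does with GetLocaleInfoEx()), and it assumes the last '.' in the name introduces the suffix.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: return the text after the last
 * '.' in a locale name, or NULL if the name has no ".encoding" suffix and
 * the encoding therefore has to be looked up from the OS.
 */
static const char *
locale_encoding_suffix(const char *locale_name)
{
	const char *dot = strrchr(locale_name, '.');

	if (dot == NULL || dot[1] == '\0')
		return NULL;			/* e.g. "en-US": no explicit encoding */
	return dot + 1;				/* e.g. "1252" or "UTF-8" */
}
```

With this, "English_United States.1252" yields "1252", "en-US.UTF-8" yields "UTF-8", and plain "en-US" yields NULL.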

#30Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#29)
5 attachment(s)
Re: Windows default locale vs initdb

On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:

I have an environment I can use for testing. But what exactly am I
testing? :-) Install a few "problem" language/region settings, switch
the system and ensure initdb runs ok?

I thought a bit more about what to do with the messy .UTF-8 situation
on Windows, and I think I might see a way forward that harmonises the
code and behaviour with Unix, and deletes a lot of special case code.
But it's only theories + CI so far.

0001, 0002: As before, teach initdb.exe to choose eg "en-US" by default.

0003: Force people to choose locales that match the database
encoding, as we do on Unix. That is, forbid contradictory
combinations like --locale="English_United States.1252"
--encoding=UTF8, which are currently allowed (and the world is full of
such database clusters because that is how the EDB installer GUI makes
them). The only allowed combinations for American English should now
be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8"
--encoding="UTF8". You can still use the old names if you like, by
explicitly writing --locale="English_United States.1252", but the
encoding then has to be WIN1252. It's crazy to mix them up, let's ban
that.
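
The rule in 0003 can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_encoding`, `implied_encoding` and `combination_allowed` are hypothetical names, only two encodings are modelled, the OS default for suffix-less names is passed in rather than queried, and the SQL_ASCII exceptions that the real checks keep are left out.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical encoding IDs for this sketch; PostgreSQL has its own enum. */
typedef enum
{
	ENC_UNKNOWN = -1,
	ENC_WIN1252,
	ENC_UTF8
} sketch_encoding;

/* Work out which encoding a locale name implies. */
static sketch_encoding
implied_encoding(const char *locale, sketch_encoding os_default)
{
	const char *dot = strrchr(locale, '.');

	if (dot == NULL)
		return os_default;		/* "en-US": the OS decides (lookup not shown) */
	if (strcmp(dot + 1, "UTF-8") == 0)
		return ENC_UTF8;
	if (strcmp(dot + 1, "1252") == 0)
		return ENC_WIN1252;
	return ENC_UNKNOWN;			/* suffix this sketch does not model */
}

/* Accept only matching pairs; an unknown locale encoding trusts the user. */
static bool
combination_allowed(const char *locale, sketch_encoding requested,
					sketch_encoding os_default)
{
	sketch_encoding implied = implied_encoding(locale, os_default);

	return implied == ENC_UNKNOWN || implied == requested;
}
```

Under these assumptions, "en-US" with WIN1252 and "en-US.UTF-8" with UTF8 are accepted, while "English_United States.1252" with UTF8 is rejected.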

Obviously there is a pg_upgrade case to worry about there. We'd have
to "fix" the now illegal combinations, and I don't know exactly how
yet.

0004: Rip out the code that does extra wchar_t conversions for
collations. If I've understood correctly, we don't need them: if you
have a .UTF-8 locale then your encoding is UTF-8 and should be able to
use strcoll_l() directly. Right?
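
For readers who have not used the _l functions, a minimal standalone sketch of comparing strings under a named locale without any wchar_t conversion might look like this. `compare_in_locale` is a hypothetical wrapper, not code from the patch; the Windows branch assumes the CRT spellings _create_locale() and _strcoll_l(), and whether a ".UTF-8" locale gives the expected ordering there is exactly the open question above.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

/*
 * Hypothetical wrapper: compare two NUL-terminated strings under the named
 * locale without touching the process-wide setlocale() state.  Returns <0,
 * 0 or >0 like strcoll(); sets *ok to 0 if the locale cannot be created.
 */
static int
compare_in_locale(const char *locale_name, const char *a, const char *b,
				  int *ok)
{
	int			result;

#ifdef _WIN32
	_locale_t	loc = _create_locale(LC_COLLATE, locale_name);

	if (loc == NULL)
	{
		*ok = 0;
		return 0;
	}
	result = _strcoll_l(a, b, loc);
	_free_locale(loc);
#else
	/* newlocale() takes a category mask, not an LC_* category value */
	locale_t	loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t) 0);

	if (loc == (locale_t) 0)
	{
		*ok = 0;
		return 0;
	}
	result = strcoll_l(a, b, loc);
	freelocale(loc);
#endif
	*ok = 1;
	return result;
}
```

The bytes are handed to the C library as they are, so the locale's encoding has to match the encoding of the strings, which is what 0003 is meant to guarantee.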

0005: Something similar was being done for strftime(). And we might
as well use strftime_l() instead while we're here (part of general
movement to use _l functions and stop splattering setlocale() all over
the place, for the multithreaded future).
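
A similar standalone sketch for the time-formatting side, again hedged: `weekday_name` is a hypothetical helper, the Windows branch assumes the underscore-prefixed CRT names, and the real code in pg_locale.c caches many strings rather than formatting one at a time.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: write the full weekday name for wday (0 = Sunday)
 * in the named locale into dst.  Returns the length written, or 0 if the
 * locale cannot be created or the result does not fit.
 */
static size_t
weekday_name(const char *locale_name, int wday, char *dst, size_t dstlen)
{
	struct tm	tm = {0};
	size_t		len;

	tm.tm_wday = wday;
	tm.tm_mday = 1;				/* keep the date fields valid */

#ifdef _WIN32
	_locale_t	loc = _create_locale(LC_ALL, locale_name);

	if (loc == NULL)
		return 0;
	len = _strftime_l(dst, dstlen, "%A", &tm, loc);
	_free_locale(loc);
#else
	/* newlocale() takes a category mask such as LC_ALL_MASK */
	locale_t	loc = newlocale(LC_ALL_MASK, locale_name, (locale_t) 0);

	if (loc == (locale_t) 0)
		return 0;
	len = strftime_l(dst, dstlen, "%A", &tm, loc);
	freelocale(loc);
#endif
	return len;
}
```

No global locale state is saved or restored here, which is the simplification 0005 is after.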

These patches pass on CI. Do they give the expected results when used
on a real Windows system?

There are a few more places where we do wchar_t conversions that could
probably be stripped out too, if my assumptions are correct, and we
could dig further if the basic idea can be validated and people think
this is going in a good direction.

Attachments:

v6-0001-MinGW-has-GetLocaleInfoEx.patch (text/x-patch; charset=US-ASCII)
From 886815244ab43092562ae3118cd5588a2fad5bb2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 20 Nov 2023 14:24:35 +1300
Subject: [PATCH v6 1/5] MinGW has GetLocaleInfoEx().

To use BCP 47 locale names like "en-US" without a suffix ".encoding", we
need to be able to call GetLocaleInfoEx() to look up the encoding.  That
was previously gated for MSVC only, but MinGW has had the function for
many years.  Remove that gating, because otherwise our MinGW build farm
animals would fail when a later commit switches to using the new names by
default.

There are probably other places where _MSC_VER is being used as a proxy
for detecting MinGW with an out-of-date idea about missing functions.

Discussion: https://postgr.es/m/CA%2BhUKGLsV3vTjPp7bOZBr3JTKp3Brkr9V0Qfmc7UvpWcmAQL4A%40mail.gmail.com
---
 src/port/chklocale.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 8cb81c8640..a15b0d5349 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -204,7 +204,6 @@ win32_langinfo(const char *ctype)
 	char	   *r = NULL;
 	char	   *codepage;
 
-#if defined(_MSC_VER)
 	uint32		cp;
 	WCHAR		wctype[LOCALE_NAME_MAX_LENGTH];
 
@@ -229,7 +228,6 @@ win32_langinfo(const char *ctype)
 		}
 	}
 	else
-#endif
 	{
 		/*
 		 * Locale format on Win32 is <Language>_<Country>.<CodePage>.  For
-- 
2.39.2

v6-0002-Default-to-IETF-BCP-47-locale-names-in-initdb-on-.patch (text/x-patch; charset=UTF-8)
From 357751c04cdd3dc7dea1ee9409356d818af70d5d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v6 2/5] Default to IETF BCP 47 locale names in initdb on
 Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and explicitly not recommended for
use in databases and (2) they may contain non-ASCII characters, which we
can't put in our shared catalogs.  Since setlocale() returns such names,
on Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Reviewed-by:
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85..adb21eb079 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on full names
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended because they are not stable across operating
+    system updates due to changes in geographical names, and may contain
+    non-ASCII characters which are not supported in PostgreSQL's shared
+    catalogs.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a015..393232b6ce 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2132,6 +2136,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2147,10 +2152,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2183,6 +2208,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.39.2

v6-0003-Don-t-allow-UTF-8-with-non-UTF-8-locales-on-Windo.patch (text/x-patch; charset=US-ASCII)
From a4b0b0324900d12d487370a08b4ddba20552e230 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Aug 2024 10:23:05 +1200
Subject: [PATCH v6 3/5] Don't allow UTF-8 with non-UTF-8 locales on Windows.

Historically, we allowed contradictions such as:

initdb.exe --locale="French_France.1252" --encoding="UTF-8"

That's because Windows didn't support UTF-8 directly, and PostgreSQL
had to perform UTF-8 char -> wchar_t conversions at various places on
that OS, and still does.  Therefore it was never actually passing
UTF-8 text to operating system facilities.

In preparation for removing those code paths, harmonizing the code
and behavior with Unix builds, and adapting to modern Windows
interfaces, ban such contradictions.  Locale names should ideally be
specified as BCP 47 tags.  If UTF-8 is desired, the name should have
".UTF-8" on the end, but otherwise the traditional encoding of that
language is implied.  Now only the following are valid:

initdb.exe --locale="fr-FR" --encoding="WIN1252"
initdb.exe --locale="fr-FR.UTF-8" --encoding="UTF-8"
initdb.exe --locale="French_France.1252" --encoding="WIN1252"

(The last form is not recommended, but still accepted.)

XXX This will cause problems for clusters upgraded with pg_upgrade from
a system using locales with the wrong encoding.  We'll need a way to
translate to the correct modern locale names.

Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 src/backend/commands/dbcommands.c | 12 +-----------
 src/bin/initdb/initdb.c           | 11 -----------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 7026352bc9..566085fecc 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1555,11 +1555,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
  * 2. locale encoding = -1, which means that we couldn't determine the
  * locale's encoding and have to trust the user to get it right.
  *
- * 3. selected encoding is UTF8 and platform is win32. This is because
- * UTF8 is a pseudo codepage that is supported in all locales since it's
- * converted to UTF16 before being used.
- *
- * 4. selected encoding is SQL_ASCII, but only if you're a superuser. This
+ * 3. selected encoding is SQL_ASCII, but only if you're a superuser. This
  * is risky but we have historically allowed it --- notably, the
  * regression tests require it.
  *
@@ -1574,9 +1570,6 @@ check_encoding_locale_matches(int encoding, const char *collate, const char *cty
 	if (!(ctype_encoding == encoding ||
 		  ctype_encoding == PG_SQL_ASCII ||
 		  ctype_encoding == -1 ||
-#ifdef WIN32
-		  encoding == PG_UTF8 ||
-#endif
 		  (encoding == PG_SQL_ASCII && superuser())))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1589,9 +1582,6 @@ check_encoding_locale_matches(int encoding, const char *collate, const char *cty
 	if (!(collate_encoding == encoding ||
 		  collate_encoding == PG_SQL_ASCII ||
 		  collate_encoding == -1 ||
-#ifdef WIN32
-		  encoding == PG_UTF8 ||
-#endif
 		  (encoding == PG_SQL_ASCII && superuser())))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 393232b6ce..77bf815919 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2228,9 +2228,6 @@ check_locale_encoding(const char *locale, int user_enc)
 	if (!(locale_enc == user_enc ||
 		  locale_enc == PG_SQL_ASCII ||
 		  locale_enc == -1 ||
-#ifdef WIN32
-		  user_enc == PG_UTF8 ||
-#endif
 		  user_enc == PG_SQL_ASCII))
 	{
 		pg_log_error("encoding mismatch");
@@ -2695,13 +2692,6 @@ setup_locale_encoding(void)
 			 * Windows, UTF-8 works with any locale, so we can fall back to
 			 * UTF-8.
 			 */
-#ifdef WIN32
-			encodingid = PG_UTF8;
-			printf(_("Encoding \"%s\" implied by locale is not allowed as a server-side encoding.\n"
-					 "The default database encoding will be set to \"%s\" instead.\n"),
-				   pg_encoding_to_char(ctype_enc),
-				   pg_encoding_to_char(encodingid));
-#else
 			pg_log_error("locale \"%s\" requires unsupported encoding \"%s\"",
 						 lc_ctype, pg_encoding_to_char(ctype_enc));
 			pg_log_error_detail("Encoding \"%s\" is not allowed as a server-side encoding.",
@@ -2709,7 +2699,6 @@ setup_locale_encoding(void)
 			pg_log_error_hint("Rerun %s with a different locale selection.",
 							  progname);
 			exit(1);
-#endif
 		}
 		else
 		{
-- 
2.39.2

v6-0004-Collate-UTF-8-without-wchar_t-conversion-in-Windo.patch (text/x-patch; charset=US-ASCII)
From 5e8689b50db21fe5adfcee15f54524eefe64c492 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Aug 2024 10:36:15 +1200
Subject: [PATCH v6 4/5] Collate UTF-8 without wchar_t conversion in Windows.

Traditionally, Windows didn't support UTF-8 encoding in system
interfaces, and we had to convert to UTF-16 and use wcscoll_l().
Windows 10+ has UTF-8 support, and an earlier commit banned the use of
locales with encoding that doesn't match the database, so we can now
harmonize with the Unix code paths and just call strcoll_l().

Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 src/backend/utils/adt/pg_locale.c | 90 +------------------------------
 1 file changed, 1 insertion(+), 89 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index cd3661e727..4d3c3e4e75 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1804,78 +1804,6 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
-/*
- * pg_strncoll_libc_win32_utf8
- *
- * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
- * invoke wcscoll_l().
- */
-#ifdef WIN32
-static int
-pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
-							size_t len2, pg_locale_t locale)
-{
-	char		sbuf[TEXTBUFLEN];
-	char	   *buf = sbuf;
-	char	   *a1p,
-			   *a2p;
-	int			a1len = len1 * 2 + 2;
-	int			a2len = len2 * 2 + 2;
-	int			r;
-	int			result;
-
-	Assert(locale->provider == COLLPROVIDER_LIBC);
-	Assert(GetDatabaseEncoding() == PG_UTF8);
-#ifndef WIN32
-	Assert(false);
-#endif
-
-	if (a1len + a2len > TEXTBUFLEN)
-		buf = palloc(a1len + a2len);
-
-	a1p = buf;
-	a2p = buf + a1len;
-
-	/* API does not work for zero-length input */
-	if (len1 == 0)
-		r = 0;
-	else
-	{
-		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-								(LPWSTR) a1p, a1len / 2);
-		if (!r)
-			ereport(ERROR,
-					(errmsg("could not convert string to UTF-16: error code %lu",
-							GetLastError())));
-	}
-	((LPWSTR) a1p)[r] = 0;
-
-	if (len2 == 0)
-		r = 0;
-	else
-	{
-		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-								(LPWSTR) a2p, a2len / 2);
-		if (!r)
-			ereport(ERROR,
-					(errmsg("could not convert string to UTF-16: error code %lu",
-							GetLastError())));
-	}
-	((LPWSTR) a2p)[r] = 0;
-
-	errno = 0;
-	result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
-	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw headers */
-		ereport(ERROR,
-				(errmsg("could not compare Unicode strings: %m")));
-
-	if (buf != sbuf)
-		pfree(buf);
-
-	return result;
-}
-#endif							/* WIN32 */
-
 /*
  * pg_strcoll_libc
  *
@@ -1891,17 +1819,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	int			result;
 
 	Assert(locale->provider == COLLPROVIDER_LIBC);
-#ifdef WIN32
-	if (GetDatabaseEncoding() == PG_UTF8)
-	{
-		size_t		len1 = strlen(arg1);
-		size_t		len2 = strlen(arg2);
-
-		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
-	}
-	else
-#endif							/* WIN32 */
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+	result = strcoll_l(arg1, arg2, locale->info.lt);
 
 	return result;
 }
@@ -1925,12 +1843,6 @@ pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
 
 	Assert(locale->provider == COLLPROVIDER_LIBC);
 
-#ifdef WIN32
-	/* check for this case before doing the work for nul-termination */
-	if (GetDatabaseEncoding() == PG_UTF8)
-		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
-#endif							/* WIN32 */
-
 	if (bufsize1 + bufsize2 > TEXTBUFLEN)
 		buf = palloc(bufsize1 + bufsize2);
 
-- 
2.39.2

v6-0005-Format-times-without-wchar_t-conversion-in-Window.patch (text/x-patch; charset=US-ASCII)
From 05a451df747f219192fbf79d833ac50285048dbf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Aug 2024 11:33:22 +1200
Subject: [PATCH v6 5/5] Format times without wchar_t conversion in Windows.

Previously we allowed the locale to be set to something that used an
encoding that didn't match the database.  We have disallowed that now,
so we can use strftime() directly.  And if we're going to touch that
code, we might as well use strftime_l() instead and skip some ugly
save/restore of global state.

strftime_l() is from POSIX 2008.  All supported systems have it, though
Windows has a leading underscore.

For the CI MinGW cross-build warning check to pass, add -lucrt because
otherwise strftime_l() is not available.

Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 .cirrus.tasks.yml                 |   1 +
 src/backend/utils/adt/pg_locale.c | 147 ++++--------------------------
 src/include/port/win32_port.h     |   1 +
 3 files changed, 21 insertions(+), 128 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 1ce6c443a8..3bf81ed4af 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -753,6 +753,7 @@ task:
         --host=x86_64-w64-mingw32 \
         --enable-cassert \
         --without-icu \
+        LDFLAGS="-lucrt" \
         CC="ccache x86_64-w64-mingw32-gcc" \
         CXX="ccache x86_64-w64-mingw32-g++"
       make -s -j${BUILD_JOBS} clean
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 4d3c3e4e75..5e64470b58 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -174,6 +174,8 @@ static void icu_set_collation_attributes(UCollator *collator, const char *loc,
 										 UErrorCode *status);
 #endif
 
+static void report_newlocale_failure(const char *localename);
+
 /*
  * POSIX doesn't define _l-variants of these functions, but several systems
  * have them.  We provide our own replacements here.
@@ -732,65 +734,6 @@ PGLC_localeconv(void)
 	return &CurrentLocaleConv;
 }
 
-#ifdef WIN32
-/*
- * On Windows, strftime() returns its output in encoding CP_ACP (the default
- * operating system codepage for the computer), which is likely different
- * from SERVER_ENCODING.  This is especially important in Japanese versions
- * of Windows which will use SJIS encoding, which we don't support as a
- * server encoding.
- *
- * So, instead of using strftime(), use wcsftime() to return the value in
- * wide characters (internally UTF16) and then convert to UTF8, which we
- * know how to handle directly.
- *
- * Note that this only affects the calls to strftime() in this file, which are
- * used to get the locale-aware strings. Other parts of the backend use
- * pg_strftime(), which isn't locale-aware and does not need to be replaced.
- */
-static size_t
-strftime_win32(char *dst, size_t dstlen,
-			   const char *format, const struct tm *tm)
-{
-	size_t		len;
-	wchar_t		wformat[8];		/* formats used below need 3 chars */
-	wchar_t		wbuf[MAX_L10N_DATA];
-
-	/*
-	 * Get a wchar_t version of the format string.  We only actually use
-	 * plain-ASCII formats in this file, so we can say that they're UTF8.
-	 */
-	len = MultiByteToWideChar(CP_UTF8, 0, format, -1,
-							  wformat, lengthof(wformat));
-	if (len == 0)
-		elog(ERROR, "could not convert format string from UTF-8: error code %lu",
-			 GetLastError());
-
-	len = wcsftime(wbuf, MAX_L10N_DATA, wformat, tm);
-	if (len == 0)
-	{
-		/*
-		 * wcsftime failed, possibly because the result would not fit in
-		 * MAX_L10N_DATA.  Return 0 with the contents of dst unspecified.
-		 */
-		return 0;
-	}
-
-	len = WideCharToMultiByte(CP_UTF8, 0, wbuf, len, dst, dstlen - 1,
-							  NULL, NULL);
-	if (len == 0)
-		elog(ERROR, "could not convert string to UTF-8: error code %lu",
-			 GetLastError());
-
-	dst[len] = '\0';
-
-	return len;
-}
-
-/* redefine strftime() */
-#define strftime(a,b,c,d) strftime_win32(a,b,c,d)
-#endif							/* WIN32 */
-
 /*
  * Subroutine for cache_locale_time().
  * Convert the given string from encoding "encoding" to the database
@@ -829,10 +772,7 @@ cache_locale_time(void)
 	bool		strftimefail = false;
 	int			encoding;
 	int			i;
-	char	   *save_lc_time;
-#ifdef WIN32
-	char	   *save_lc_ctype;
-#endif
+	locale_t	locale;
 
 	/* did we do this already? */
 	if (CurrentLCTimeValid)
@@ -840,50 +780,24 @@ cache_locale_time(void)
 
 	elog(DEBUG3, "cache_locale_time() executed; locale: \"%s\"", locale_time);
 
-	/*
-	 * As in PGLC_localeconv(), it's critical that we not throw error while
-	 * libc's locale settings have nondefault values.  Hence, we just call
-	 * strftime() within the critical section, and then convert and save its
-	 * results afterwards.
-	 */
-
-	/* Save prevailing value of time locale */
-	save_lc_time = setlocale(LC_TIME, NULL);
-	if (!save_lc_time)
-		elog(ERROR, "setlocale(NULL) failed");
-	save_lc_time = pstrdup(save_lc_time);
-
 #ifdef WIN32
-
-	/*
-	 * On Windows, it appears that wcsftime() internally uses LC_CTYPE, so we
-	 * must set it here.  This code looks the same as what PGLC_localeconv()
-	 * does, but the underlying reason is different: this does NOT determine
-	 * the encoding we'll get back from strftime_win32().
-	 */
-
-	/* Save prevailing value of ctype locale */
-	save_lc_ctype = setlocale(LC_CTYPE, NULL);
-	if (!save_lc_ctype)
-		elog(ERROR, "setlocale(NULL) failed");
-	save_lc_ctype = pstrdup(save_lc_ctype);
-
-	/* use lc_time to set the ctype */
-	setlocale(LC_CTYPE, locale_time);
+	locale = _create_locale(LC_ALL, locale_time);
+#else
+	locale = newlocale(LC_ALL, locale_time, NULL);
 #endif
+	if (!locale)
+		report_newlocale_failure(locale_time);
 
-	setlocale(LC_TIME, locale_time);
-
-	/* We use times close to current time as data for strftime(). */
+	/* We use times close to current time as data for strftime_l(). */
 	timenow = time(NULL);
 	timeinfo = localtime(&timenow);
 
-	/* Store the strftime results in MAX_L10N_DATA-sized portions of buf[] */
+	/* Store the strftime_l results in MAX_L10N_DATA-sized portions of buf[] */
 	bufptr = buf;
 
 	/*
 	 * MAX_L10N_DATA is sufficient buffer space for every known locale, and
-	 * POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
+	 * POSIX defines no strftime_l() errors.  (Buffer space exhaustion is not an
 	 * error.)  An implementation might report errors (e.g. ENOMEM) by
 	 * returning 0 (or, less plausibly, a negative value) and setting errno.
 	 * Report errno just in case the implementation did that, but clear it in
@@ -895,10 +809,10 @@ cache_locale_time(void)
 	for (i = 0; i < 7; i++)
 	{
 		timeinfo->tm_wday = i;
-		if (strftime(bufptr, MAX_L10N_DATA, "%a", timeinfo) <= 0)
+		if (strftime_l(bufptr, MAX_L10N_DATA, "%a", timeinfo, locale) <= 0)
 			strftimefail = true;
 		bufptr += MAX_L10N_DATA;
-		if (strftime(bufptr, MAX_L10N_DATA, "%A", timeinfo) <= 0)
+		if (strftime_l(bufptr, MAX_L10N_DATA, "%A", timeinfo, locale) <= 0)
 			strftimefail = true;
 		bufptr += MAX_L10N_DATA;
 	}
@@ -908,39 +822,26 @@ cache_locale_time(void)
 	{
 		timeinfo->tm_mon = i;
 		timeinfo->tm_mday = 1;	/* make sure we don't have invalid date */
-		if (strftime(bufptr, MAX_L10N_DATA, "%b", timeinfo) <= 0)
+		if (strftime_l(bufptr, MAX_L10N_DATA, "%b", timeinfo, locale) <= 0)
 			strftimefail = true;
 		bufptr += MAX_L10N_DATA;
-		if (strftime(bufptr, MAX_L10N_DATA, "%B", timeinfo) <= 0)
+		if (strftime_l(bufptr, MAX_L10N_DATA, "%B", timeinfo, locale) <= 0)
 			strftimefail = true;
 		bufptr += MAX_L10N_DATA;
 	}
 
-	/*
-	 * Restore the prevailing locale settings; as in PGLC_localeconv(),
-	 * failure to do so is fatal.
-	 */
 #ifdef WIN32
-	if (!setlocale(LC_CTYPE, save_lc_ctype))
-		elog(FATAL, "failed to restore LC_CTYPE to \"%s\"", save_lc_ctype);
+	_free_locale(locale);
+#else
+	freelocale(locale);
 #endif
-	if (!setlocale(LC_TIME, save_lc_time))
-		elog(FATAL, "failed to restore LC_TIME to \"%s\"", save_lc_time);
 
 	/*
 	 * At this point we've done our best to clean up, and can throw errors, or
 	 * call functions that might throw errors, with a clean conscience.
 	 */
 	if (strftimefail)
-		elog(ERROR, "strftime() failed: %m");
-
-	/* Release the pstrdup'd locale names */
-	pfree(save_lc_time);
-#ifdef WIN32
-	pfree(save_lc_ctype);
-#endif
-
-#ifndef WIN32
+		elog(ERROR, "strftime_l() failed: %m");
 
 	/*
 	 * As in PGLC_localeconv(), we must convert strftime()'s output from the
@@ -951,16 +852,6 @@ cache_locale_time(void)
 	if (encoding < 0)
 		encoding = PG_SQL_ASCII;
 
-#else
-
-	/*
-	 * On Windows, strftime_win32() always returns UTF8 data, so convert from
-	 * that if necessary.
-	 */
-	encoding = PG_UTF8;
-
-#endif							/* WIN32 */
-
 	bufptr = buf;
 
 	/* localized days */
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 7ffe5891c6..87157a1095 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -450,6 +450,7 @@ extern int	_pglstat64(const char *name, struct stat *buf);
 #define isspace_l _isspace_l
 #define iswspace_l _iswspace_l
 #define strcoll_l _strcoll_l
+#define strftime_l _strftime_l
 #define strxfrm_l _strxfrm_l
 #define wcscoll_l _wcscoll_l
 
-- 
2.39.2
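
For readers following the patch above, here is a minimal standalone sketch of the technique it adopts: create a locale_t, pass it to strftime_l(), and free it, instead of saving and restoring the process-wide locale with setlocale(). It assumes a POSIX.1-2008 system that declares newlocale() and strftime_l() in <locale.h> and <time.h>. The helper name weekday_abbrev() is hypothetical and does not appear in the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the patch): write the abbreviated name
 * of weekday "wday" (0 = Sunday) into "out" using the named locale, without
 * changing the process-wide locale.  Returns 0 on success, -1 on failure.
 */
int
weekday_abbrev(const char *locale_name, int wday, char *out, size_t outlen)
{
	struct tm	tm;
	locale_t	loc;
	size_t		len;

	/* newlocale() takes a category mask such as LC_ALL_MASK, not LC_ALL */
	loc = newlocale(LC_ALL_MASK, locale_name, (locale_t) 0);
	if (loc == (locale_t) 0)
		return -1;

	/* Only tm_wday matters for "%a"; keep the other fields valid */
	memset(&tm, 0, sizeof(tm));
	tm.tm_mday = 1;
	tm.tm_wday = wday;

	/* The locale is an explicit argument, so no global state is modified */
	len = strftime_l(out, outlen, "%a", &tm, loc);

	freelocale(loc);
	return (len > 0) ? 0 : -1;
}
```

Because the locale object is local to the call, there is no window in which libc's settings are nondefault, which is why the patch can drop the save/restore code and the FATAL paths that guarded it. On Windows the patch maps the same calls to _create_locale(), _strftime_l() and _free_locale().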

#31Ertan Küçükoglu
ertan.kucukoglu@gmail.com
In reply to: Ertan Küçükoglu (#26)
Re: Windows default locale vs initdb

I have already installed Visual Studio 2022 with C++ support, as suggested in
https://www.postgresql.org/docs/current/install-windows-full.html
and cloned the source code onto the system.
However, I cannot find any "src/tools/msvc" directory; it is missing.
The documentation states that I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in the
src\tools\msvc directory."
It seems I will need help setting up the build environment.

I am willing to be a tester for Windows, provided I can get help setting
up the build environment.
It also seems the documentation needs an update, since I could not find the
necessary files.

Thanks & Regards,
Ertan

#32Andrew Dunstan
andrew@dunslane.net
In reply to: Ertan Küçükoglu (#31)
Re: Windows default locale vs initdb

On 2024-08-08 Th 4:08 AM, Ertan Küçükoglu wrote:

I have already installed Visual Studio 2022 with C++ support, as
suggested in
https://www.postgresql.org/docs/current/install-windows-full.html
and cloned the source code onto the system.
However, I cannot find any "src/tools/msvc" directory; it is missing.
The documentation states that I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in
the src\tools\msvc directory."
It seems I will need help setting up the build environment.

I am willing to be a tester for Windows, provided I can get help setting
up the build environment.
It also seems the documentation needs an update, since I could not find
the necessary files.

If you're trying to build the master branch those documents no longer
apply. You will need to build using meson, as documented here:
<https://www.postgresql.org/docs/17/install-meson.html>

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com