Chinese initdb on Windows

Started by Heikki Linnakangas almost 15 years ago, 4 messages
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
2 attachment(s)

On Windows, if you have the OS locale set to "Chinese (Simplified, PRC)",
initdb fails:

X:\>C:\pgsql-install\bin\initdb.exe -D data2
The files belonging to this database system will be owned by user "Heikki".
This user must also own the server process.

The database cluster will be initialized with locale Chinese (Simplified)_People's Republic of China.936.
initdb: locale Chinese (Simplified)_People's Republic of China.936 requires unsupported encoding GBK
Encoding GBK is not allowed as a server-side encoding.
Rerun initdb with a different locale selection.

The easy workaround for that is to specify --encoding=UTF-8, as UTF-8
can be used with any locale on Windows. How about doing that
automatically in initdb? Now that we have the smarts in psql to detect
current encoding from the environment and set client_encoding
accordingly, it Just Works. Attached is a patch for that.
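
As a rough illustration of the proposed fallback rule, here is a minimal standalone sketch. It is not the patch itself: the helper names `is_server_encoding` and `default_server_encoding` are hypothetical, and the hard-coded list of client-only encodings is an assumption made for the example (initdb asks pg_valid_server_encoding_id() instead). Any name not on the list is treated as a valid server encoding.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative list of encodings that PostgreSQL accepts only on the
 * client side. Hard-coded for this sketch; initdb itself relies on
 * pg_valid_server_encoding_id().
 */
static const char *const client_only_encodings[] = {
    "SJIS", "BIG5", "GBK", "UHC", "GB18030", "JOHAB", "SHIFT_JIS_2004"
};

/*
 * Hypothetical helper: returns 1 if 'enc' may be used as a server
 * encoding. Names that are not on the list above are assumed valid.
 */
static int
is_server_encoding(const char *enc)
{
    size_t i;
    size_t n = sizeof(client_only_encodings) / sizeof(client_only_encodings[0]);

    for (i = 0; i < n; i++)
    {
        if (strcmp(enc, client_only_encodings[i]) == 0)
            return 0;
    }
    return 1;
}

/*
 * Hypothetical helper: pick the default database encoding from the
 * encoding implied by the locale. Returns NULL when initdb would have
 * to stop and ask for a different locale.
 */
static const char *
default_server_encoding(const char *locale_enc, int on_windows)
{
    if (is_server_encoding(locale_enc))
        return locale_enc;

    /* On Windows, UTF-8 can be combined with any locale. */
    return on_windows ? "UTF8" : NULL;
}
```

With this sketch, `default_server_encoding("GBK", 1)` yields "UTF8", while the same call with `on_windows` set to 0 yields NULL, which corresponds to the existing error path.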

Once you get past that, however, there's another issue:

...
creating directory data2 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in data2/base/1 ... ok
initializing pg_authid ... FATAL: database locale is incompatible with operating system
DETAIL: The database was initialized with LC_COLLATE "Chinese (Simplified)_Peoples Republic of China.936", which is not recognized by setlocale().
HINT: Recreate the database with another locale or install the missing locale.
child process exited with exit code 1

The problem is probably the apostrophe in the locale name, although it
seems to be missing from the above error message. setlocale() has a
known problem with locale names that have dots in the country name, and
looks like it has similar issues with apostrophes.
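
One way to check whether a given locale name is affected is to hand it to setlocale() and see whether it returns NULL. The sketch below assumes only standard C; the function name `locale_is_accepted` is made up for this example, and it is similar in spirit to, but not a copy of, initdb's check_locale_name().

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if setlocale() accepts 'name' for
 * 'category', 0 otherwise. The previous setting is restored before
 * returning.
 */
static int
locale_is_accepted(int category, const char *name)
{
    const char *current;
    char       *saved;
    int         ok;

    current = setlocale(category, NULL);
    if (current == NULL)
        return 0;

    /* setlocale() may reuse its return buffer, so keep a private copy. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, current);

    /* A NULL result means the C runtime does not recognize the name. */
    ok = (setlocale(category, name) != NULL);

    /* Put things back the way we found them. */
    setlocale(category, saved);
    free(saved);
    return ok;
}
```

On an affected Windows system, a call such as `locale_is_accepted(LC_COLLATE, "Chinese (Simplified)_People's Republic of China.936")` would be expected to return 0, going by the failure shown above; the result depends on the C runtime in use.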

Fortunately, there are aliases for those problematic locales on Windows,
that don't have dots or apostrophes in the names. We did some testing in
EnterpriseDB of various locales on various versions of Windows, and came
up with the following mappings:

"*_Hong Kong S.A.R.*" -> "*_HKG.*"
"*_U.A.E.*" -> "*_ARE.*"
"*_People's Republic of China.*" -> "*_China.*"
"China_Macau S.A.R..950" -> "ZHM"

The first three mappings map the full country name to an abbreviation
that is also accepted by Windows' setlocale(). See
http://msdn.microsoft.com/en-us/library/cdax410z%28v=vs.71%29.aspx. ARE
is not on that list, but seems to work.

Macau is trickier. ZHM is not an abbreviation of the country, but of the
whole locale, so we can't replace just the country part. So this will
not work for "Finnish_Macau S.A.R..950", like the other mappings do.
Nevertheless, it works for the common case.
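
To make the difference between the two kinds of mapping concrete, here is a small standalone sketch. It is not the attached patch: the function name `map_windows_locale` is hypothetical, it writes into a caller-supplied buffer instead of modifying the string in place, and the two Macau spellings are the "Macau" variants used in the second attached patch.

```c
#include <stdio.h>
#include <string.h>

/* Country-name aliases, taken from the list of mappings above. */
static const char *const country_alias[][2] = {
    {"People's Republic of China", "China"},
    {"Hong Kong S.A.R.", "HKG"},
    {"U.A.E.", "ARE"}
};

/*
 * Hypothetical helper: write the mapped form of 'locale' into 'out'.
 * Returns 0 on success, -1 if 'out' is too small.
 */
static int
map_windows_locale(const char *locale, char *out, size_t outlen)
{
    size_t i;
    int    n;

    /* Macau: ZHM stands for the whole locale, so match the whole name. */
    if (strcmp(locale, "Chinese (Traditional)_Macau S.A.R..950") == 0 ||
        strcmp(locale, "Chinese_Macau S.A.R..950") == 0)
    {
        n = snprintf(out, outlen, "%s", "ZHM");
        return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
    }

    for (i = 0; i < sizeof(country_alias) / sizeof(country_alias[0]); i++)
    {
        const char *hit = strstr(locale, country_alias[i][0]);

        if (hit != NULL)
        {
            /* language prefix + alias + whatever followed the country */
            n = snprintf(out, outlen, "%.*s%s%s",
                         (int) (hit - locale), locale,
                         country_alias[i][1],
                         hit + strlen(country_alias[i][0]));
            return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
        }
    }

    /* No problematic country name: copy the locale unchanged. */
    n = snprintf(out, outlen, "%s", locale);
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

Because the first three aliases replace only the country part, any language prefix is preserved. "Finnish_Macau S.A.R..950" falls through unchanged, which is the limitation described above.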

Any objections to the 2nd attached patch, which adds the mapping of
those locale names on Windows?

I'm thinking it's not too late to do this in 9.1.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

initdb-fallback-to-utf8-on-windows.patch (text/x-diff)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 98e864d..6999a42 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2909,7 +2909,19 @@ main(int argc, char *argv[])
 		}
 		else if (!pg_valid_server_encoding_id(ctype_enc))
 		{
-			/* We recognized it, but it's not a legal server encoding */
+			/*
+			 * We recognized it, but it's not a legal server encoding.
+			 * On Windows, UTF-8 works with any locale, so we can fall back
+			 * to UTF-8.
+			 */
+#ifdef WIN32
+			printf(_("Encoding %s implied by locale is not allowed as a server-side encoding.\n"
+					 "The default database encoding has been set to %s instead.\n"),
+				   pg_encoding_to_char(ctype_enc),
+				   pg_encoding_to_char(PG_UTF8));
+			ctype_enc = PG_UTF8;
+			encodingid = encodingid_to_string(ctype_enc);
+#else
 			fprintf(stderr,
 					_("%s: locale %s requires unsupported encoding %s\n"),
 					progname, lc_ctype, pg_encoding_to_char(ctype_enc));
@@ -2918,6 +2930,7 @@ main(int argc, char *argv[])
 					"Rerun %s with a different locale selection.\n"),
 					pg_encoding_to_char(ctype_enc), progname);
 			exit(1);
+#endif
 		}
 		else
 		{
initdb-map-broken-windows-locales.patch (text/x-diff)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6999a42..754f65b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,8 @@ static int	locale_date_order(const char *locale);
 static bool check_locale_name(const char *locale);
 static bool check_locale_encoding(const char *locale, int encoding);
 static void setlocales(void);
+static void strreplace(char *str, char *needle, char *replacement);
+static char *localemap(char *locale);
 static void usage(const char *progname);
 
 #ifdef WIN32
@@ -2253,6 +2255,79 @@ check_locale_encoding(const char *locale, int user_enc)
 	return true;
 }
 
+/*
+ * Replace the first occurrence of 'needle' in 'str' with 'replacement'. The
+ * replacement is done in-place, so it must not be longer than 'needle'.
+ */
+static void
+strreplace(char *str, char *needle, char *replacement)
+{
+	char *s;
+
+	s = strstr(str, needle);
+	if (s != NULL)
+	{
+		int replacementlen = strlen(replacement);
+		char *rest = s + strlen(needle);
+
+		memcpy(s, replacement, replacementlen);
+		memmove(s + replacementlen, rest, strlen(rest) + 1);
+	}
+}
+
+/*
+ * Windows has a problem with locale names that have a dot or apostrophe in
+ * the country name. For example:
+ *
+ * "Chinese (Traditional)_Hong Kong S.A.R..950"
+ *
+ * For some reason, setlocale() doesn't accept that. Fortunately, Windows'
+ * setlocale() accepts various alternative names for such countries, so we
+ * map the full country names to accepted aliases.
+ *
+ * The returned string is always malloc'd - if no mapping is done it is
+ * just a malloc'd copy of the original.
+ */
+static char *
+localemap(char *locale)
+{
+	locale = xstrdup(locale);
+
+#ifdef WIN32
+	/*
+	 * Map the full country name to an abbreviation that setlocale() accepts
+	 * "China" and "HKG" are listed here:
+	 * http://msdn.microsoft.com/en-us/library/cdax410z%28v=vs.71%29.aspx
+	 * (Country/Region Strings).
+	 *
+	 * "ARE" is the ISO-3166 three-letter code for U.A.E. It is not on the
+	 * above list, but seems to work anyway.
+	 */
+	strreplace(locale, "People's Republic of China", "China");
+	strreplace(locale, "Hong Kong S.A.R.", "HKG");
+	strreplace(locale, "U.A.E.", "ARE");
+
+	/*
+	 * The ISO-3166 country code for Macau S.A.R. is MAC, but Windows doesn't
+	 * seem to recognize that. And Macau isn't listed in the table of
+	 * accepted abbreviations linked above.
+	 *
+	 * Fortunately, "ZHM" seems to be accepted as an alias for
+	 * "Chinese (Traditional)_Macau S.A.R..950", so we use that. Note that
+	 * unlike HKG and ARE, ZHM is an alias for the whole locale name,
+	 * not just the country part. I'm not sure where that "ZHM" comes from,
+	 * must be some legacy naming scheme. But hey, it works.
+	 *
+	 * Some versions of Windows spell it "Macau", others "Macao".
+	 */
+	strreplace(locale, "Chinese (Traditional)_Macau S.A.R..950", "ZHM");
+	strreplace(locale, "Chinese_Macau S.A.R..950", "ZHM");
+	strreplace(locale, "Chinese (Traditional)_Macao S.A.R..950", "ZHM");
+	strreplace(locale, "Chinese_Macao S.A.R..950", "ZHM");
+#endif
+
+	return locale;
+}
 
 /*
  * set up the locale variables
@@ -2285,25 +2360,25 @@ setlocales(void)
 	 */
 
 	if (strlen(lc_ctype) == 0 || !check_locale_name(lc_ctype))
-		lc_ctype = xstrdup(setlocale(LC_CTYPE, NULL));
+		lc_ctype = localemap(setlocale(LC_CTYPE, NULL));
 	if (strlen(lc_collate) == 0 || !check_locale_name(lc_collate))
-		lc_collate = xstrdup(setlocale(LC_COLLATE, NULL));
+		lc_collate = localemap(setlocale(LC_COLLATE, NULL));
 	if (strlen(lc_numeric) == 0 || !check_locale_name(lc_numeric))
-		lc_numeric = xstrdup(setlocale(LC_NUMERIC, NULL));
+		lc_numeric = localemap(setlocale(LC_NUMERIC, NULL));
 	if (strlen(lc_time) == 0 || !check_locale_name(lc_time))
-		lc_time = xstrdup(setlocale(LC_TIME, NULL));
+		lc_time = localemap(setlocale(LC_TIME, NULL));
 	if (strlen(lc_monetary) == 0 || !check_locale_name(lc_monetary))
-		lc_monetary = xstrdup(setlocale(LC_MONETARY, NULL));
+		lc_monetary = localemap(setlocale(LC_MONETARY, NULL));
 	if (strlen(lc_messages) == 0 || !check_locale_name(lc_messages))
 #if defined(LC_MESSAGES) && !defined(WIN32)
 	{
 		/* when available get the current locale setting */
-		lc_messages = xstrdup(setlocale(LC_MESSAGES, NULL));
+		lc_messages = localemap(setlocale(LC_MESSAGES, NULL));
 	}
 #else
 	{
 		/* when not available, get the CTYPE setting */
-		lc_messages = xstrdup(setlocale(LC_CTYPE, NULL));
+		lc_messages = localemap(setlocale(LC_CTYPE, NULL));
 	}
 #endif
 
#2 Dave Page
dpage@pgadmin.org
In reply to: Heikki Linnakangas (#1)
Re: Chinese initdb on Windows

On Mon, Mar 21, 2011 at 7:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Any objections to the 2nd attached patch, which adds the mapping of those
locale names on Windows?

I'm thinking it's not too late to do this in 9.1.

I've heard complaints a number of times from Chinese users who I
believe this would help.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Chinese initdb on Windows

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Any objections to the 2nd attached patch, which adds the mapping of
those locale names on Windows?

I think the added initdb message isn't following our style guidelines
--- it certainly doesn't match the adjacent existing message.  Other
than that quibble, ok here.

regards, tom lane

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: Chinese initdb on Windows

On 22.03.2011 01:06, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Any objections to the 2nd attached patch, which adds the mapping of
those locale names on Windows?

I think the added initdb message isn't following our style guidelines
--- it certainly doesn't match the adjacent existing message.  Other
than that quibble, ok here.

What you usually get is something like this:

...
The database cluster will be initialized with locale Lithuanian_Lithuania.1257.
The default database encoding has accordingly been set to WIN1257.
initdb: could not find suitable text search configuration for locale Lithuanian_Lithuania.1257
The default text search configuration will be set to "simple".

creating directory data2 ... ok
creating subdirectories ... ok
...

And when initdb falls back to UTF-8 with the patch you get:

The database cluster will be initialized with locale Chinese (Simplified)_China.936.
Encoding GBK implied by locale is not allowed as a server-side encoding.
The default database encoding has been set to UTF8 instead.
initdb: could not find suitable text search configuration for locale Chinese (Simplified)_China.936
The default text search configuration will be set to "simple".

creating directory data2 ... ok
creating subdirectories ... ok
...

The new message fits in nicely with the surrounding messages IMHO. Or
are you thinking that it should be more warning-like, similar to the
message about the missing text search configuration? Something like:

The database cluster will be initialized with locale Chinese (Simplified)_China.936.
initdb: encoding GBK implied by locale is not allowed as a server-side encoding.
The default database encoding has been set to UTF8 instead.
initdb: could not find suitable text search configuration for locale Chinese (Simplified)_China.936
The default text search configuration will be set to "simple".

creating directory data2 ... ok
creating subdirectories ... ok
...

That's fine with me as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com