FW: Win32 unicode vs ICU

Started by Magnus Haganderover 20 years ago21 messages
#1Magnus Hagander
mha@sollentuna.net
4 attachment(s)

I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus

Show quoted text

-----Original Message-----
From: Magnus Hagander
Sent: Sunday, July 31, 2005 2:09 PM
To: PostgreSQL-development
Cc: pgsql-hackers-win32@postgresql.org
Subject: Win32 unicode vs ICU

Hi!

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

The main downsides of ICU vs the native ones are:
* ICU does not accept win32 locale names. When doing
setlocale("sv_se"), for example, win32 will return this in
later calls as "Swedish_Sweden.1252". To get around this in
the ICU patch, I had to implement a lookup map that converts
it back to sv_se for ICU.

* ICU is yet another build and runtime dependency, and a
large one (comes in at 11Mb for the DLL files alone in the
win32 download)

I guess that the main upside of it is that we'd get
constistent behaviour - in case there are issues with either
ICU or win32 native they'd otherwise differ. And only one new
codepath. But we already live with the platform-inconsistency today...

Another upside is that it handles more encodings in ICU - my
native implementation does *only* UTF8 and relies on existing
functionality to deal with other encodings. It could of
course be extended if necessary, but from what I can tell
UTF8 is the big one.

I have attached both patches. For the native version, only
win32_utf8.patch is required. For the ICU version,
icu_win32.patch is needed and also the files
localemap.c,localemap.pl, iso639 and iso3166 needs to go in
src/backend/port/win32. (the localemap needs to be updated to
do a better-than-linear search, but I wanted to include an example)

Thoughts on the options?

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

(I have run simple tests in swedish locale and both behave
the same and correct, but I'm unsure of exactly how much
would be affected)

Finally, the win32 patch also changes the normal path to use
strncoll(). The comment above the function states that we'd
like to use strncoll but it's not available. Well, on win32
it is, so it should provide a speedup on win32. It is
currently not included in the ICU patch, but should probably
be included whichever path we'd chose.

//Magnus

Attachments:

win32_utf8.patchapplication/octet-stream; name=win32_utf8.patchDownload
Index: src/backend/utils/adt/oracle_compat.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/oracle_compat.c,v
retrieving revision 1.60
diff -c -r1.60 oracle_compat.c
*** src/backend/utils/adt/oracle_compat.c	7 May 2005 15:18:17 -0000	1.60
--- src/backend/utils/adt/oracle_compat.c	31 Jul 2005 11:11:28 -0000
***************
*** 149,154 ****
--- 149,209 ----
  #endif   /* USE_WIDE_UPPER_LOWER */
  
  
+ #ifdef WIN32
+ /* Win32 UTF8/UTF16 conversion helpers */
+ static wchar_t *win32_utf8texttoutf16(const text *txt)
+ {
+ 	int			nbytes = VARSIZE(txt) - VARHDRSZ;	
+ 	wchar_t    *result;
+ 	int         r;
+ 
+ 	if (nbytes < 0 ||
+ 			nbytes > (int) (INT_MAX / sizeof(wchar_t)) -1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 
+ 	/* Output workspace cannot have more codes than input bytes */
+ 	result = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
+ 
+ 	/* Do the conversion */
+ 	r = MultiByteToWideChar(CP_UTF8, 0, VARDATA(txt),  nbytes,
+ 				result, nbytes * sizeof(wchar_t));
+ 	if (!r)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+ 				 errmsg("invalid multibyte character for locale"),
+ 				 errhint("The server's LC_CTYPE locale is probably incompatible with the database encoding.")));
+ 
+ 	result[r] = 0;	
+ 	return result;
+ }
+ 
+ static text *win32_utf16toutf8text(const wchar_t *txt)
+ {
+ 	text		*result;
+ 	int			 nbytes;
+ 	int			 r;
+ 	
+ 	nbytes = WideCharToMultiByte(CP_UTF8, 0, txt, -1, NULL, 0, NULL, NULL);
+ 	if (nbytes == 0)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+ 				 errmsg("invalid utf16 character for locale: %lu",GetLastError())));
+ 
+ 	result = palloc(nbytes+VARHDRSZ);
+ 
+ 	r = WideCharToMultiByte(CP_UTF8, 0, txt, -1, VARDATA(result), nbytes, NULL, NULL);
+ 	if (r == 0)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+ 				 errmsg("invalid utf16 character for locale: %lu",GetLastError())));
+ 
+ 	VARATT_SIZEP(result) = nbytes + VARHDRSZ - 1; /* -1 = ignore null */
+ 	return result;
+ }
+ #endif /* WIN32 */
+ 
  /********************************************************************
   *
   * lower
***************
*** 179,184 ****
--- 234,251 ----
  		wchar_t    *workspace;
  		int			i;
  
+ #ifdef WIN32
+ 		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+ 		if (GetDatabaseEncoding() == PG_UTF8)
+ 		{
+ 			workspace = win32_utf8texttoutf16(string);
+ 			_wcslwr(workspace);
+ 			result = win32_utf16toutf8text(workspace);
+ 			pfree(workspace);
+ 			PG_RETURN_TEXT_P(result);
+ 		}
+ #endif
+ 		
  		workspace = texttowcs(string);
  
  		for (i = 0; workspace[i] != 0; i++)
***************
*** 245,250 ****
--- 312,329 ----
  		wchar_t    *workspace;
  		int			i;
  
+ #ifdef WIN32
+ 		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+ 		if (GetDatabaseEncoding() == PG_UTF8)
+ 		{
+ 			workspace = win32_utf8texttoutf16(string);
+ 			_wcsupr(workspace);
+ 			result = win32_utf16toutf8text(workspace);
+ 			pfree(workspace);
+ 			PG_RETURN_TEXT_P(result);
+ 		}
+ #endif
+ 
  		workspace = texttowcs(string);
  
  		for (i = 0; workspace[i] != 0; i++)
***************
*** 315,320 ****
--- 394,405 ----
  		int			wasalnum = 0;
  		int			i;
  
+ #ifdef WIN32
+ 		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+ 		if (GetDatabaseEncoding() == PG_UTF8)
+ 			workspace = win32_utf8texttoutf16(string);
+ 		else
+ #endif
  		workspace = texttowcs(string);
  
  		for (i = 0; workspace[i] != 0; i++)
***************
*** 326,331 ****
--- 411,422 ----
  			wasalnum = iswalnum(workspace[i]);
  		}
  
+ #ifdef WIN32
+ 		/* Map back to UTF-8 */
+ 		if (GetDatabaseEncoding() == PG_UTF8)
+ 			result = win32_utf16toutf8text(workspace);
+ 		else
+ #endif
  		result = wcstotext(workspace, i);
  
  		pfree(workspace);
Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.130
diff -c -r1.130 varlena.c
*** src/backend/utils/adt/varlena.c	29 Jul 2005 03:17:55 -0000	1.130
--- src/backend/utils/adt/varlena.c	31 Jul 2005 11:15:49 -0000
***************
*** 848,853 ****
--- 848,901 ----
  		char		a2buf[STACKBUFLEN];
  		char	   *a1p,
  				   *a2p;
+ #ifdef WIN32
+ 		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+ 		if (GetDatabaseEncoding() == PG_UTF8)
+ 		{
+ 			int a1len = STACKBUFLEN;
+ 			int a2len = STACKBUFLEN;
+ 
+ 			if (len1 >= STACKBUFLEN/2)
+ 			{
+ 				a1len = len1 * 2 + 2;
+ 				a1p = palloc(a1len);
+ 			}
+ 			else
+ 				a1p = a1buf;
+ 			
+ 			if (len2 >= STACKBUFLEN/2)
+ 			{
+ 				a2len = len2 * 2 + 2;
+ 				a2p = palloc(a2len);
+ 			}
+ 			else
+ 				a2p = a2buf;
+ 
+ 			if (!MultiByteToWideChar(CP_UTF8, 0, arg1, -1, (LPWSTR)a1p, a1len/2))
+ 				ereport(ERROR,
+ 						(errmsg("failed to convert string to UTF16: %lu",
+ 							GetLastError())));
+ 			if (!MultiByteToWideChar(CP_UTF8, 0, arg2, -1, (LPWSTR)a2p, a2len/2))
+ 				ereport(ERROR,
+ 						(errmsg("failed to convert string to UTF16: %lu",
+ 							GetLastError())));
+ 			
+ 			errno=0;
+ 			result = wcscoll((LPWSTR)a1p, (LPWSTR)a2p);
+ 			if (result == 2147483647) /* _NLSCMPERROR; missing from mingw headers */
+ 				ereport(ERROR,
+ 						(errmsg("failed to compare unicode strings: %i", errno)));
+ 			if (a1len != STACKBUFLEN)
+ 				pfree(a1p);
+ 			if (a2len != STACKBUFLEN)
+ 				pfree(a2p);
+ 			
+ 			return result;
+ 		}
+ 
+ 		/* Win32 has strncoll(), so use it there to avoid copying */
+ 		return _strncoll(arg1, arg2, (len1>len2)?len2:len1 );
+ #endif
  
  		if (len1 >= STACKBUFLEN)
  			a1p = (char *) palloc(len1 + 1);
icu_win32.patchapplication/octet-stream; name=icu_win32.patchDownload
Index: configure.in
===================================================================
RCS file: /projects/cvsroot/pgsql/configure.in,v
retrieving revision 1.417
diff -c -r1.417 configure.in
*** configure.in	6 Jul 2005 21:04:13 -0000	1.417
--- configure.in	28 Jul 2005 17:01:47 -0000
***************
*** 467,472 ****
--- 467,480 ----
  AC_MSG_RESULT([$with_openssl])
  AC_SUBST(with_openssl)
  
+ #
+ # ICU
+ #
+ AC_MSG_CHECKING([whether to build with ICU support])
+ PGAC_ARG_BOOL(with, icu, no, [  --with-icu              build with ICU support],
+               [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
+ AC_MSG_RESULT([$with_icu])
+ AC_SUBST(with_icu)
  
  #
  # Readline
***************
*** 674,679 ****
--- 682,698 ----
    fi
  fi
  
+ if test "$with_icu" = yes ; then
+   if test "$PORTNAME" != "win32"; then
+     AC_CHECK_LIB(icui18n, ucol_open_3_2, [], [AC_MSG_ERROR([library 'icui18n' is required for ICU])])
+     AC_CHECK_LIB(icuuc,  u_tolower_3_2, [], [AC_MSG_ERROR([library 'icuuc' is required for ICU])])
+     AC_CHECK_LIB(icudata, icudt32_dat, [], [AC_MSG_ERROR([library 'icudata' is required for ICU])])
+   else
+     AC_CHECK_LIB(icuin, ucol_open_3_2, [], [AC_MSG_ERROR([library 'icuin' is required for ICU])])
+     AC_CHECK_LIB(icuuc,  u_tolower_3_2, [], [AC_MSG_ERROR([library 'icuuc' is required for ICU])])
+   fi
+ fi
+   
  if test "$with_pam" = yes ; then
    AC_CHECK_LIB(pam,    pam_start, [], [AC_MSG_ERROR([library 'pam' is required for PAM])])
  fi
***************
*** 748,753 ****
--- 767,776 ----
    AC_CHECK_HEADER(openssl/err.h, [], [AC_MSG_ERROR([header file <openssl/err.h> is required for OpenSSL])])
  fi
  
+ if test "$with_icu" = yes ; then
+   AC_CHECK_HEADER(unicode/utypes.h, [], [AC_MSG_ERROR([header file <unicode/utypes.h> is required for ICU])])
+ fi
+  
  if test "$with_pam" = yes ; then
    AC_CHECK_HEADERS(security/pam_appl.h, [],
                     [AC_CHECK_HEADERS(pam/pam_appl.h, [],
Index: src/backend/port/win32/Makefile
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/port/win32/Makefile,v
retrieving revision 1.6
diff -c -r1.6 Makefile
*** src/backend/port/win32/Makefile	29 Aug 2004 00:38:03 -0000	1.6
--- src/backend/port/win32/Makefile	30 Jul 2005 17:52:47 -0000
***************
*** 12,29 ****
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = sema.o shmem.o timer.o socket.o signal.o security.o error.o
  
  all: SUBSYS.o
  
  SUBSYS.o: $(OBJS)
  	$(LD) $(LDREL) $(LDOUT) SUBSYS.o $(OBJS)
  
  depend dep:
  	$(CC) -MM $(CFLAGS) *.c >depend
  
  clean: 
! 	rm -f SUBSYS.o $(OBJS)
  
  ifeq (depend,$(wildcard depend))
  include depend
--- 12,34 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = sema.o shmem.o timer.o socket.o signal.o security.o error.o localemap.o
  
  all: SUBSYS.o
  
  SUBSYS.o: $(OBJS)
  	$(LD) $(LDREL) $(LDOUT) SUBSYS.o $(OBJS)
  
+ localemap.o: localemap.c localemap.h
+ 
+ localemap.h: iso639 iso3166 localemap.pl
+ 	$(PERL) localemap.pl > localemap.h
+ 
  depend dep:
  	$(CC) -MM $(CFLAGS) *.c >depend
  
  clean: 
! 	rm -f SUBSYS.o $(OBJS) localemap.h
  
  ifeq (depend,$(wildcard depend))
  include depend
Index: src/backend/utils/adt/oracle_compat.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/oracle_compat.c,v
retrieving revision 1.60
diff -c -r1.60 oracle_compat.c
*** src/backend/utils/adt/oracle_compat.c	7 May 2005 15:18:17 -0000	1.60
--- src/backend/utils/adt/oracle_compat.c	30 Jul 2005 14:59:29 -0000
***************
*** 32,37 ****
--- 32,43 ----
  #include "utils/pg_locale.h"
  #include "mb/pg_wchar.h"
  
+ #ifdef USE_ICU
+ #include <unicode/utypes.h>   /* Basic ICU data types */
+ #include <unicode/ucnv.h>     /* C   Converter API    */
+ #include <unicode/ustring.h>
+ #endif /* USE_ICU */
+ 
  
  /*
   * If the system provides the needed functions for wide-character manipulation
***************
*** 53,58 ****
--- 59,106 ----
  	   bool doltrim, bool dortrim);
  
  
+ #ifdef USE_ICU
+ static UConverter *conv = NULL;
+ 
+ static text *
+ UChartotext(const UChar *str, int ncodes)
+ {
+ 	text	   *result;
+ 	size_t		nbytes, resultsize;
+ 
+ 	UErrorCode status = U_ZERO_ERROR;
+ 
+ 	/* Overflow paranoia */
+ 	if (ncodes < 0 ||
+ 		ncodes > (int) ((INT_MAX - VARHDRSZ) / sizeof(UChar)) - 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 
+ 	/* Make workspace certainly large enough for result */
+ 	resultsize = UCNV_GET_MAX_BYTES_FOR_STRING(ncodes, ucnv_getMaxCharSize(conv));
+ 	result = (text *) palloc(resultsize + VARHDRSZ);
+ 
+ 	nbytes = ucnv_fromUChars(conv, VARDATA(result), resultsize,
+ 							 str, ncodes, &status);
+ 
+ 	if (U_FAILURE(status))
+ 	{
+ 		/* Invalid multibyte character encountered ... shouldn't happen */
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+ 				 errmsg("invalid multibyte character for locale")));
+ 	}
+ 
+ 	Assert(nbytes <= (size_t) (ncodes * sizeof(UChar)));
+ 
+ 	VARATT_SIZEP(result) = nbytes + VARHDRSZ;
+ 
+ 	return result;
+ }
+ 
+ 
+ #else
  #ifdef USE_WIDE_UPPER_LOWER
  
  /*
***************
*** 147,152 ****
--- 195,201 ----
  	return result;
  }
  #endif   /* USE_WIDE_UPPER_LOWER */
+ #endif   /* USE_ICU */
  
  
  /********************************************************************
***************
*** 166,171 ****
--- 215,286 ----
  Datum
  lower(PG_FUNCTION_ARGS)
  {
+ #ifdef USE_ICU
+ #define STACKBUFLEN		1024 / sizeof(UChar)
+ 	/* use ICU only when max encoding length > one */
+ 	if (pg_database_encoding_max_length() > 1)
+ 	{
+ 		text	   *string = PG_GETARG_TEXT_P(0);
+ 		text	   *result;
+ 		UChar       sourcebuf[STACKBUFLEN], destbuf[STACKBUFLEN];
+ 		UChar      *source, *dest;
+ 		int			buflen,
+ 					arglen = VARSIZE(string) - VARHDRSZ;
+ 		UErrorCode  status = U_ZERO_ERROR;
+ 
+ 		if (conv == NULL)
+ 		{
+ 			conv = ucnv_open(NULL, &status);
+ 			if (U_FAILURE(status))
+ 			{
+ 				ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: oracle_compat.c, could not get converter for \"%s\"", ucnv_getDefaultName())));
+ 			}
+ 		}
+ 
+ 		if (arglen >= STACKBUFLEN / sizeof(UChar))
+ 		{
+ 			buflen = arglen * sizeof(UChar) + 1;
+  			source = palloc(buflen);
+ 			dest = palloc(buflen);
+ 		}
+ 		else
+ 		{
+ 			buflen = STACKBUFLEN;
+ 			source = sourcebuf;
+ 			dest = destbuf;
+ 		}
+ 		// convert to UTF-16
+ 		ucnv_toUChars(conv, source, buflen, VARDATA(string), arglen, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not convert string")));
+ 		}
+ 		
+ 		// run desired function
+ 		buflen = u_strToLower(dest, buflen, source, -1, NULL, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not modify case")));
+ 		}
+ 
+ 		// and convert modified utf-16 string back to text
+ 		result = UChartotext(dest, buflen);
+ 
+ 		if (arglen >= STACKBUFLEN / sizeof(UChar))
+ 		{
+ 			pfree(source);
+ 			pfree(dest);
+ 		}
+ 		PG_RETURN_TEXT_P(result);
+ 	}
+ 	else
+ #else
  #ifdef USE_WIDE_UPPER_LOWER
  	/*
  	 *	Use wide char code only when max encoding length > 1 and ctype != C.
***************
*** 192,197 ****
--- 307,313 ----
  	}
  	else
  #endif   /* USE_WIDE_UPPER_LOWER */
+ #endif   /* USE_ICU */
  	{
  		text	   *string = PG_GETARG_TEXT_P_COPY(0);
  		char	   *ptr;
***************
*** 232,237 ****
--- 348,418 ----
  Datum
  upper(PG_FUNCTION_ARGS)
  {
+ #ifdef USE_ICU
+ 	/* use ICU only when max encoding length > one */
+ 	if (pg_database_encoding_max_length() > 1)
+ 	{
+ 		text	   *string = PG_GETARG_TEXT_P(0);
+ 		text	   *result;
+ 		UChar       sourcebuf[STACKBUFLEN], destbuf[STACKBUFLEN];
+ 		UChar      *source, *dest;
+ 		int32_t     buflen, arglen;
+ 
+ 		UErrorCode  status = U_ZERO_ERROR;
+ 
+ 		if (conv == NULL)
+ 		{
+ 			conv = ucnv_open(NULL, &status);
+ 			if (U_FAILURE(status))
+ 			{
+ 				ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not get converter for \"%s\"", ucnv_getDefaultName())));
+ 			}
+ 		}
+ 		arglen = VARSIZE(string) - VARHDRSZ;
+ 		if (arglen * sizeof(UChar) >= STACKBUFLEN)
+ 		{
+ 			buflen = arglen * sizeof(UChar) + 1;
+  			source = palloc(buflen);
+ 			dest = palloc(buflen);
+ 		}
+ 		else
+ 		{
+ 			buflen = STACKBUFLEN;
+ 			source = sourcebuf;
+ 			dest = destbuf;
+ 		}
+ 		// convert to UTF-16
+ 		ucnv_toUChars(conv, source, buflen, VARDATA(string), arglen, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not convert string")));
+ 		}
+ 
+ 		// run desired function
+ 		buflen = u_strToUpper(dest, buflen, source, -1, NULL, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not modify case")));
+ 		}
+ 
+ 		// and convert modified utf-16 string back to text
+ 		result = UChartotext(dest, buflen);
+ 
+ 		if (arglen * sizeof(UChar) >= STACKBUFLEN)
+ 		{
+ 			pfree(source);
+ 			pfree(dest);
+ 		}
+ 		PG_RETURN_TEXT_P(result);
+ 	}
+ 	else
+ #else
  #ifdef USE_WIDE_UPPER_LOWER
  	/*
  	 *	Use wide char code only when max encoding length > 1 and ctype != C.
***************
*** 258,263 ****
--- 439,445 ----
  	}
  	else
  #endif   /* USE_WIDE_UPPER_LOWER */
+ #endif   /* USE_ICU */
  	{
  		text	   *string = PG_GETARG_TEXT_P_COPY(0);
  		char	   *ptr;
***************
*** 301,306 ****
--- 483,553 ----
  Datum
  initcap(PG_FUNCTION_ARGS)
  {
+ #ifdef USE_ICU
+ 	/* use ICU only when max encoding length > one */
+ 	if (pg_database_encoding_max_length() > 1)
+ 	{
+ 		text	   *string = PG_GETARG_TEXT_P(0);
+ 		text	   *result;
+ 		UChar       sourcebuf[STACKBUFLEN], destbuf[STACKBUFLEN];
+ 		UChar      *source, *dest;
+ 		int32_t     buflen, arglen;
+ 
+ 		UErrorCode  status = U_ZERO_ERROR;
+ 
+ 		if (conv == NULL)
+ 		{
+ 			conv = ucnv_open(NULL, &status);
+ 			if (U_FAILURE(status))
+ 			{
+ 				ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not get converter for \"%s\"", ucnv_getDefaultName())));
+ 			}
+ 		}
+ 		arglen = VARSIZE(string) - VARHDRSZ;
+ 		if (arglen * sizeof(UChar) >= STACKBUFLEN)
+ 		{
+ 			buflen = arglen * sizeof(UChar) + 1;
+  			source = palloc(buflen);
+ 			dest = palloc(buflen);
+ 		}
+ 		else
+ 		{
+ 			buflen = STACKBUFLEN;
+ 			source = sourcebuf;
+ 			dest = destbuf;
+ 		}
+ 		// convert to UTF-16
+ 		ucnv_toUChars(conv, source, buflen, VARDATA(string), arglen, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not convert string")));
+ 		}
+ 
+ 		// run desired function
+ 		buflen = u_strToTitle(dest, buflen, source, -1, NULL, NULL, &status);
+ 		if (U_FAILURE(status))
+ 		{
+ 			ereport(ERROR,
+ 						(errcode(status),
+ 						 errmsg("ICU error: Could not modify case")));
+ 		}
+ 
+ 		// and convert modified utf-16 string back to text
+ 		result = UChartotext(dest, buflen);
+ 
+ 		if (arglen * sizeof(UChar) >= STACKBUFLEN)
+ 		{
+ 			pfree(source);
+ 			pfree(dest);
+ 		}
+ 		PG_RETURN_TEXT_P(result);
+ 	}
+ 	else
+ #else
  #ifdef USE_WIDE_UPPER_LOWER
  	/*
  	 *	Use wide char code only when max encoding length > 1 and ctype != C.
***************
*** 334,339 ****
--- 581,587 ----
  	}
  	else
  #endif   /* USE_WIDE_UPPER_LOWER */
+ #endif   /* USE_ICU */
  	{
  		text	   *string = PG_GETARG_TEXT_P_COPY(0);
  		int			wasalnum = 0;
Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.130
diff -c -r1.130 varlena.c
*** src/backend/utils/adt/varlena.c	29 Jul 2005 03:17:55 -0000	1.130
--- src/backend/utils/adt/varlena.c	30 Jul 2005 17:46:34 -0000
***************
*** 30,35 ****
--- 30,41 ----
  #include "utils/pg_locale.h"
  #include "regex/regex.h"
  
+ #ifdef USE_ICU
+ #include <unicode/utypes.h>   /* Basic ICU data types */
+ #include <unicode/ucnv.h>     /* C   Converter API    */
+ #include <unicode/ucol.h>
+ #include <unicode/uloc.h>
+ #endif /* USE_ICU */
  
  typedef struct varlena unknown;
  
***************
*** 844,874 ****
  
  	if (!lc_collate_is_c())
  	{
! 		char		a1buf[STACKBUFLEN];
! 		char		a2buf[STACKBUFLEN];
! 		char	   *a1p,
! 				   *a2p;
  
! 		if (len1 >= STACKBUFLEN)
! 			a1p = (char *) palloc(len1 + 1);
! 		else
! 			a1p = a1buf;
! 		if (len2 >= STACKBUFLEN)
! 			a2p = (char *) palloc(len2 + 1);
  		else
! 			a2p = a2buf;
  
! 		memcpy(a1p, arg1, len1);
! 		a1p[len1] = '\0';
! 		memcpy(a2p, arg2, len2);
! 		a2p[len2] = '\0';
! 
! 		result = strcoll(a1p, a2p);
! 
! 		if (len1 >= STACKBUFLEN)
! 			pfree(a1p);
! 		if (len2 >= STACKBUFLEN)
! 			pfree(a2p);
  	}
  	else
  	{
--- 850,986 ----
  
  	if (!lc_collate_is_c())
  	{
! #ifdef USE_ICU
! #define USTACKBUFLEN		STACKBUFLEN / sizeof(UChar)
! 		if (pg_database_encoding_max_length() > 1)
! 		{
! 			UChar		a1buf[USTACKBUFLEN],
! 		         		a2buf[USTACKBUFLEN];
! 			int			a1len = USTACKBUFLEN,
! 			  			a2len = USTACKBUFLEN;	
! 			UChar	   *a1p,
! 			           *a2p;
! 
! 			static UCollator * collator = NULL;
! 			static UConverter * conv = NULL;
! 			UErrorCode  status = U_ZERO_ERROR;
  
! 			if (conv == NULL)
! 			{
! 				conv = ucnv_open(NULL, &status);
! 				if (U_FAILURE(status) || conv == NULL)
! 				{
! 					ereport(ERROR,
! 							(errcode(status),
! 							 errmsg("ICU error: varlena.c, could not get converter for \"%s\"", ucnv_getDefaultName())));
! 				}
! 			}
! 
! 			/* We keep a static collator "forever", since it is hard
! 			 * coded into the database cluster at initdb time
! 			 * anyway. Create it first time we get here. */
! 			if (collator == NULL)
! 			{
! 				/* Expect LC_COLLATE to be set to something that ICU
! 				 * will understand. This is quite probable, since ICU
! 				 * does a lot of heuristics with this argument. I'd
! 				 * rather set this in xlog.c, but it seems ICU forgets
! 				 * it??? */
! #ifndef WIN32
! 				uloc_setDefault(setlocale(LC_COLLATE, NULL), &status);
! #else
! 				/* Win32 locale names are completely different from what ICU expects, so 
! 				   we need to do some conversion */
! 				uloc_setDefault(pgwin32_localemap(setlocale(LC_COLLATE, NULL)), &status);
! #endif
! 				if(U_FAILURE(status))
! 				{
! 					ereport(WARNING,
! 							(errcode(status),
! 							 errmsg("ICU Error: varlena.c, could not set default lc_collate")));
! 				}
! 				collator = ucol_open(NULL, &status);
! 				if (U_FAILURE(status))
! 				{
! 					ereport(WARNING,
! 							(errcode(status),
! 							 errmsg("ICU Error: varlena.c, could not open collator")));
! 				}
! 			}
! 			
! 			if (len1 >= USTACKBUFLEN / sizeof(UChar))
! 			{
! 				a1len = len1 * sizeof(UChar) + 2;
! 				a1p = (UChar *) palloc(a1len);	
! 			}
! 			else
! 				a1p = a1buf;
! 
! 			if (len2 >= USTACKBUFLEN / sizeof(UChar))
! 			{
! 				a2len = len2 * sizeof(UChar) + 2;
! 				a2p = (UChar *) palloc(a2len);
! 			}
! 			else
! 				a2p = a2buf;
! 
! 			ucnv_toUChars(conv, a1p, a1len, arg1, len1, &status);
! 			if(U_FAILURE(status))
! 			{
! 				ereport(WARNING,
! 						(errcode(status),
! 						 errmsg("ICU Error: varlena.c, could not convert to UChars")));
! 			}
! 			ucnv_toUChars(conv, a2p, a2len, arg2, len2, &status);
! 			if(U_FAILURE(status))
! 			{
! 				ereport(WARNING,
! 						(errcode(status),
! 						 errmsg("ICU Error: varlena.c, could not convert to UChars")));
! 			}
! 
! 			result = ucol_strcoll(collator, a1p, -1, a2p, -1);
! 			if(U_FAILURE(status))
! 			{
! 				ereport(WARNING,
! 						(errcode(status),
! 						 errmsg("ICU Error: varlena.c, could not collate")));
! 			}
! 
! 			if (len1 * sizeof(UChar) >= USTACKBUFLEN)
! 				pfree(a1p);
! 			if (len2 * sizeof(UChar) >= USTACKBUFLEN)
! 				pfree(a2p);
! 		}
  		else
! #endif /* USE_ICU */
! 		{
! 			char		a1buf[STACKBUFLEN];
! 			char		a2buf[STACKBUFLEN];
! 			char	   *a1p,
! 					   *a2p;
  
! 			if (len1 >= STACKBUFLEN)
! 				a1p = (char *) palloc(len1 + 1);
! 			else
! 				a1p = a1buf;
! 			if (len2 >= STACKBUFLEN)
! 				a2p = (char *) palloc(len2 + 1);
! 			else
! 				a2p = a2buf;
! 			
! 			memcpy(a1p, arg1, len1);
! 			a1p[len1] = '\0';
! 			memcpy(a2p, arg2, len2);
! 			a2p[len2] = '\0';
! 			
! 			result = strcoll(a1p, a2p);
! 			
! 			if (len1 >= STACKBUFLEN)
! 				pfree(a1p);
! 			if (len2 >= STACKBUFLEN)
! 				pfree(a2p);
! 		}
  	}
  	else
  	{
Index: src/backend/utils/mb/encnames.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/encnames.c,v
retrieving revision 1.25
diff -c -r1.25 encnames.c
*** src/backend/utils/mb/encnames.c	14 Mar 2005 18:31:20 -0000	1.25
--- src/backend/utils/mb/encnames.c	28 Jul 2005 17:16:52 -0000
***************
*** 375,380 ****
--- 375,496 ----
  	}
  };
  
+ #ifdef USE_ICU
+ /*
+  * Try to map most internal character encodings to the proper and
+  * preferred IANA string. Use this in mbutils.c to feed ICU info about
+  * the database's character encoding.
+  *
+  * Palle Girgensohn, 2005
+  */
+ 
+ pg_enc2name pg_enc2iananame_tbl[] =
+ {
+ 	{
+ 		"US-ASCII", PG_SQL_ASCII
+ 	},
+ 	{
+ 		"EUC-JP", PG_EUC_JP
+ 	},
+ 	{
+ 		"GB2312", PG_EUC_CN
+ 	},
+ 	{
+ 		"EUC-KR", PG_EUC_KR
+ 	},
+ 	{
+ 		"ISO-2022-CN", PG_EUC_TW
+ 	},
+ 	{
+ 		"KS_C_5601-1987", PG_JOHAB  /* either KS_C_5601-1987 or ISO-2022-KR ??? */
+ 	},
+ 	{
+ 		"UTF-8", PG_UTF8
+ 	},
+ 	{
+ 		"MULE_INTERNAL", PG_MULE_INTERNAL  /* is not for real */
+ 	},
+ 	{
+ 		"ISO-8859-1", PG_LATIN1
+ 	},
+ 	{
+ 		"ISO-8859-2", PG_LATIN2
+ 	},
+ 	{
+ 		"ISO-8859-3", PG_LATIN3
+ 	},
+ 	{
+ 		"ISO-8859-4", PG_LATIN4
+ 	},
+ 	{
+ 		"ISO-8859-9", PG_LATIN5
+ 	},
+ 	{
+ 		"ISO-8859-10", PG_LATIN6
+ 	},
+ 	{
+ 		"ISO-8859-13", PG_LATIN7
+ 	},
+ 	{
+ 		"ISO-8859-14", PG_LATIN8
+ 	},
+ 	{
+ 		"ISO-8859-15", PG_LATIN9
+ 	},
+ 	{
+ 		"ISO-8859-16", PG_LATIN10
+ 	},
+ 	{
+ 		"windows-1256", PG_WIN1256
+ 	},
+ 	{
+ 		"windows-1258", PG_WIN1258
+ 	},
+ 	{
+ 		"windows-874", PG_WIN874
+ 	},
+ 	{
+ 		"KOI8-R", PG_KOI8R
+ 	},
+ 	{
+ 		"windows-1251", PG_WIN1251
+ 	},
+ 	{
+ 		"IBM866", PG_WIN866
+ 	},
+ 	{
+ 		"ISO-8859-5", PG_ISO_8859_5
+ 	},
+ 	{
+ 		"ISO-8859-6", PG_ISO_8859_6
+ 	},
+ 	{
+ 		"ISO-8859-7", PG_ISO_8859_7
+ 	},
+ 	{
+ 		"ISO-8859-8", PG_ISO_8859_8
+ 	},
+ 	{
+ 		"windows-1250", PG_WIN1250
+ 	},
+ 	{
+ 		"Shift_JIS", PG_SJIS
+ 	},
+ 	{
+ 		"Big5", PG_BIG5
+ 	},
+ 	{
+ 		"GBK", PG_GBK
+ 	},
+ 	{
+ 		"cp949", PG_UHC
+ 	},
+ 	{
+ 		"GB18030", PG_GB18030
+ 	}
+ };
+ #endif /* USE_ICU */
+ 
  /* ----------
   * Encoding checks, for error returns -1 else encoding id
   * ----------
Index: src/backend/utils/mb/mbutils.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/mbutils.c,v
retrieving revision 1.50
diff -c -r1.50 mbutils.c
*** src/backend/utils/mb/mbutils.c	10 Jul 2005 21:13:59 -0000	1.50
--- src/backend/utils/mb/mbutils.c	28 Jul 2005 16:36:54 -0000
***************
*** 15,20 ****
--- 15,23 ----
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  #include "catalog/namespace.h"
+ #ifdef USE_ICU
+ #include <unicode/ucnv.h>
+ #endif /* USE_ICU */
  
  /*
   * We handle for actual FE and BE encoding setting encoding-identificator
***************
*** 576,581 ****
--- 579,587 ----
  
  	DatabaseEncoding = &pg_enc2name_tbl[encoding];
  	Assert(DatabaseEncoding->encoding == encoding);
+ #ifdef USE_ICU
+ 	ucnv_setDefaultName((&pg_enc2iananame_tbl[encoding])->name);
+ #endif
  }
  
  void
Index: src/include/pg_config.h.in
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/pg_config.h.in,v
retrieving revision 1.86
diff -c -r1.86 pg_config.h.in
*** src/include/pg_config.h.in	1 Jul 2005 18:17:31 -0000	1.86
--- src/include/pg_config.h.in	28 Jul 2005 16:36:54 -0000
***************
*** 642,647 ****
--- 642,650 ----
  /* Define to build with (Open)SSL support. (--with-openssl) */
  #undef USE_SSL
  
+ /* Define to build with ICU support. (--with-icu) */
+ #undef USE_ICU
+ 
  /* Define to select SysV-style semaphores. */
  #undef USE_SYSV_SEMAPHORES
  
Index: src/include/mb/pg_wchar.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/mb/pg_wchar.h,v
retrieving revision 1.59
diff -c -r1.59 pg_wchar.h
*** src/include/mb/pg_wchar.h	15 Jun 2005 00:15:08 -0000	1.59
--- src/include/mb/pg_wchar.h	28 Jul 2005 16:36:54 -0000
***************
*** 235,240 ****
--- 235,241 ----
  } pg_enc2name;
  
  extern pg_enc2name pg_enc2name_tbl[];
+ extern pg_enc2name pg_enc2iananame_tbl[];
  
  extern pg_encname *pg_char_to_encname_struct(const char *name);
  
Index: src/include/port/win32.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/port/win32.h,v
retrieving revision 1.46
diff -c -r1.46 win32.h
*** src/include/port/win32.h	16 Jun 2005 17:53:54 -0000	1.46
--- src/include/port/win32.h	30 Jul 2005 17:39:36 -0000
***************
*** 256,258 ****
--- 256,261 ----
  
  /* in backend/port/win32/error.c */
  extern void _dosmaperr(unsigned long);
+ 
+ /* in backend/port/win32/localemap.c */
+ extern char *pgwin32_localemap(char *winlocale);
localemap.plapplication/octet-stream; name=localemap.plDownload
localemap.capplication/octet-stream; name=localemap.cDownload
#2Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#1)
Re: [PATCHES] FW: Win32 unicode vs ICU

Folks, we need to address the questions asked in this email and get it
into CVS soon.

---------------------------------------------------------------------------

Magnus Hagander wrote:

I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus

-----Original Message-----
From: Magnus Hagander
Sent: Sunday, July 31, 2005 2:09 PM
To: PostgreSQL-development
Cc: pgsql-hackers-win32@postgresql.org
Subject: Win32 unicode vs ICU

Hi!

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

The main downsides of ICU vs the native ones are:
* ICU does not accept win32 locale names. When doing
setlocale("sv_se"), for example, win32 will return this in
later calls as "Swedish_Sweden.1252". To get around this in
the ICU patch, I had to implement a lookup map that converts
it back to sv_se for ICU.

* ICU is yet another build and runtime dependency, and a
large one (comes in at 11Mb for the DLL files alone in the
win32 download)

I guess that the main upside of it is that we'd get
constistent behaviour - in case there are issues with either
ICU or win32 native they'd otherwise differ. And only one new
codepath. But we already live with the platform-inconsistency today...

Another upside is that it handles more encodings in ICU - my
native implementation does *only* UTF8 and relies on existing
functionality to deal with other encodings. It could of
course be extended if necessary, but from what I can tell
UTF8 is the big one.

I have attached both patches. For the native version, only
win32_utf8.patch is required. For the ICU version,
icu_win32.patch is needed and also the files
localemap.c,localemap.pl, iso639 and iso3166 needs to go in
src/backend/port/win32. (the localemap needs to be updated to
do a better-than-linear search, but I wanted to include an example)

Thoughts on the options?

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

(I have run simple tests in swedish locale and both behave
the same and correct, but I'm unsure of exactly how much
would be affected)

Finally, the win32 patch also changes the normal path to use
strncoll(). The comment above the function states that we'd
like to use strncoll but it's not available. Well, on win32
it is, so it should provide a speedup on win32. It is
currently not included in the ICU patch, but should probably
be included whichever path we'd chose.

//Magnus

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#1)
Re: Win32 unicode vs ICU

[ moving to -hackers for wider discussion ]

"Magnus Hagander" <mha@sollentuna.net> wrote in
http://archives.postgresql.org/pgsql-patches/2005-08/msg00039.php

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

We need to figure out what we're going to do about this. Given where
we are in the release cycle, I am pretty strongly tempted to just apply
the smaller patch (just map utf8/utf16 using Windows native functions)
for PG 8.1.

I think that ICU would be interesting as the base for a much larger
patch that gets us away from depending on libc's locale support at all
(in particular, getting rid of the "one locale per database" problem).
But it seems like a heck of a big dependency to incur for any lesser goal.

I feel it makes sense to apply the smaller patch in any case, so that
there's a Win32 solution not requiring ICU (ie, I can't see an argument
for doing (2) rather than (3)).

Comments?

Also,

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

There is a strxfrm() call in src/backend/utils/adt/selfuncs.c,
which probably needs to be looked at too.

regards, tom lane

#4Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#3)
Re: Win32 unicode vs ICU

On Sat, Aug 20, 2005 at 12:17:47PM -0400, Tom Lane wrote:

I think that ICU would be interesting as the base for a much larger
patch that gets us away from depending on libc's locale support at all
(in particular, getting rid of the "one locale per database" problem).
But it seems like a heck of a big dependency to incur for any lesser goal.

There is a locale project from the Gnome guys, with an eye towards a
wider audience. The announcement, which states the goals of the
project, is here:

http://mail.gnome.org/archives/locale-list/2005-August/msg00000.html

The project website is at http://live.gnome.org/LocaleProject

The big problem with this is that the license is likely to be LGPL, so
there's probably not much code we could use. OTOH, it's possible that
we could borrow some ideas from them. In particular, they are based
mostly on the Common Locale Data Repository,
http://www.unicode.org/cldr/

However, this thread on their list, which is about the license they will
choose, hints that rewriting the whole CLDR handling from scratch would
be very painful:

http://mail.gnome.org/archives/locale-list/2005-August/msg00004.html

This is precisely the reason they are using LGPL: they do not want to
have to rewrite it all, which they would were they to choose a license
like BSD. (Personally I think this is folly -- someone else will have
to rewrite it again with a BSD license sometime, and then the value of
their work would be decreased.)

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"A wizard is never late, Frodo Baggins, nor is he early.
He arrives precisely when he means to." (Gandalf, en LoTR FoTR)

#5Palle Girgensohn
girgen@pingpong.net
In reply to: Tom Lane (#3)
Re: Win32 unicode vs ICU

--On lördag, augusti 20, 2005 12.17.47 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

[ moving to -hackers for wider discussion ]

"Magnus Hagander" <mha@sollentuna.net> wrote in
http://archives.postgresql.org/pgsql-patches/2005-08/msg00039.php

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

We need to figure out what we're going to do about this. Given where
we are in the release cycle, I am pretty strongly tempted to just apply
the smaller patch (just map utf8/utf16 using Windows native functions)
for PG 8.1.

I think that ICU would be interesting as the base for a much larger
patch that gets us away from depending on libc's locale support at all
(in particular, getting rid of the "one locale per database" problem).
But it seems like a heck of a big dependency to incur for any lesser goal.

I feel it makes sense to apply the smaller patch in any case, so that
there's a Win32 solution not requiring ICU (ie, I can't see an argument
for doing (2) rather than (3)).

Comments?

I don't mind either way, but while Win32 will work with Magnus' patch,
FreeBSD won't; it needs the ICU patch to work. OTH, I maintain the FreeBSD
port where I already have the patch as an ("experiemental") option. Not
every FreeBSD user uses the ports system, though.

So, it is a question whether FreeBSD's unicode support is important or not,
I guess? Win32 will work both ways.

/Palle

#6Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Palle Girgensohn (#5)
Re: Win32 unicode vs ICU

Palle Girgensohn wrote:

I feel it makes sense to apply the smaller patch in any case, so that
there's a Win32 solution not requiring ICU (ie, I can't see an argument
for doing (2) rather than (3)).

Comments?

I don't mind either way, but while Win32 will work with Magnus' patch,
FreeBSD won't; it needs the ICU patch to work. OTH, I maintain the FreeBSD
port where I already have the patch as an ("experiemental") option. Not
every FreeBSD user uses the ports system, though.

So, it is a question whether FreeBSD's unicode support is important or not,
I guess? Win32 will work both ways.

How is FreeBSD's Unicode support broken? I was not aware of that.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7Palle Girgensohn
girgen@pingpong.net
In reply to: Bruce Momjian (#6)
Re: Win32 unicode vs ICU

--On måndag, augusti 22, 2005 09.19.58 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

Palle Girgensohn wrote:

I feel it makes sense to apply the smaller patch in any case, so that
there's a Win32 solution not requiring ICU (ie, I can't see an argument
for doing (2) rather than (3)).

Comments?

I don't mind either way, but while Win32 will work with Magnus' patch,
FreeBSD won't; it needs the ICU patch to work. OTH, I maintain the
FreeBSD port where I already have the patch as an ("experiemental")
option. Not every FreeBSD user uses the ports system, though.

So, it is a question whether FreeBSD's unicode support is important or
not, I guess? Win32 will work both ways.

How is FreeBSD's Unicode support broken? I was not aware of that.

FreeBSD has no unicode collation support. Hence the need for ICU.

/Palle

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Palle Girgensohn (#7)
Re: Win32 unicode vs ICU

Palle Girgensohn <girgen@pingpong.net> writes:

<pgman@candle.pha.pa.us> wrote:

How is FreeBSD's Unicode support broken? I was not aware of that.

FreeBSD has no unicode collation support. Hence the need for ICU.

Well, this obviously doesn't bother anyone who uses FreeBSD, so it need
not bother us either. I do not feel a need to take on ICU in order to
implement features that are not present anywhere else on the platform.

regards, tom lane

#9Palle Girgensohn
girgen@pingpong.net
In reply to: Tom Lane (#8)
Re: Win32 unicode vs ICU

--On måndag, augusti 22, 2005 10.12.11 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

Palle Girgensohn <girgen@pingpong.net> writes:

<pgman@candle.pha.pa.us> wrote:

How is FreeBSD's Unicode support broken? I was not aware of that.

FreeBSD has no unicode collation support. Hence the need for ICU.

Well, this obviously doesn't bother anyone who uses FreeBSD, so it need
not bother us either. I do not feel a need to take on ICU in order to
implement features that are not present anywhere else on the platform.

It bothered me enough to patch postgresql. :) And I use it with Java,
which has working unicode support, soo... Oh well, I can live with that -
I'll maintain my patch locally for the time beeing, if that's what's
required.

/Palle

#10Magnus Hagander
mha@sollentuna.net
In reply to: Palle Girgensohn (#9)
Re: Win32 unicode vs ICU

.. back home again after a couple of days ..

<snip>

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

<snip>

I feel it makes sense to apply the smaller patch in any case,
so that there's a Win32 solution not requiring ICU (ie, I
can't see an argument for doing (2) rather than (3)).

Comments?

Sounds reasonable to me - couldn't really find a reasonable argument for
(2), but it was an option :-)

Though as it seems to be needed on FreeBSD as well, we should definitly
look at (3) as a long-term option. Considering Palle has been running it
in production for quite a while now IIRC, the ICU part should be fairly
stable. but see below...

And anohter question - my native patch touches the same

functions as

the ICU patch. Can somebody who knows the internals confirm or deny
that these are all the required locations, or do we need to modify
more?

There is a strxfrm() call in
src/backend/utils/adt/selfuncs.c, which probably needs to be
looked at too.

Ok. Will look into that. Do you have a hint as to how to test that?
Considering I've been unable to show incorrect function without donig
it, and apparantly so has Palle (with the ICU patch) running it in
production for a long time, I clearly don't know how to provoke a
failure with the current code.

Which brings up another point - there are clearly no regression tests
for this (considering we missed the unicode stuff early in the 8.0
cycle). Is there a reasonable way to add something along this line that
could be used, or will that be too complex? (I guess we can never test
completely as it depends on the OS locale handling, but some?) I'm
thinking it might be too complex as we'd need a new initdb with a
different encoding, but perhaps it's worth it anyway?

//Magnus

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#10)
Re: Win32 unicode vs ICU

"Magnus Hagander" <mha@sollentuna.net> writes:

There is a strxfrm() call in
src/backend/utils/adt/selfuncs.c, which probably needs to be
looked at too.

Ok. Will look into that. Do you have a hint as to how to test that?

Any problems would manifest as a bogus interpolation between histogram
elements for a scalar-inequality selectivity estimate in a text column.
For instance, if you insert all 676 2-letter combinations AA, AB, AC,
..., ZY, ZZ into a text column, ANALYZE, and then try cases like
"EXPLAIN SELECT * FROM tab WHERE col < 'QW'", ideally the row estimate
should be pretty nearly dead on. Being pure-ASCII this test would
probably still work in a broken Unicode context, but if you did a
similar experiment with 26 non-ASCII characters it would be likely to
come out with silly results. You could increase the obviousness of the
bad result by reducing the statistics target, since the silliness will
be bounded by the histogram bin size.

(Just looking at it again, the code in convert_string_to_scalar is
pretty bogus for multibyte encodings in any case. Possibly we need to
rethink the whole approach.)

Which brings up another point - there are clearly no regression tests
for this (considering we missed the unicode stuff early in the 8.0
cycle).

src/test/locale? src/test/mb? I've never used either, but they're
there ...

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: Win32 unicode vs ICU

I wrote:

(Just looking at it again, the code in convert_string_to_scalar is
pretty bogus for multibyte encodings in any case. Possibly we need to
rethink the whole approach.)

After studying this some more, I think the code is really so bogus for
any non-ASCII situation that it's probably not worth worrying about
too much. It's effectively assuming that the output of strxfrm() is
still in an ASCII-superset encoding ... but I don't see anything in
strxfrm's API that guarantees any such thing.

As long as strxfrm() doesn't fail completely for Windows Unicode,
I'd recommend just leaving this alone. As previously noted, the
worst that can happen is an estimation error that's bounded by the
histogram bin size anyhow.

regards, tom lane

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#10)
Re: Win32 unicode vs ICU

I looked over the proposed patch a bit and found some problems --- in
particular, if I read M$'s documentation about MultiByteToWideChar
correctly, they chose an API that fails for zero-length input, and
so you gotta program around that. Also, varstr_cmp() cannot assume
it gets null-terminated input.

I cannot test the attached revised patch; please check it out.

regards, tom lane

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#1)
Re: FW: Win32 unicode vs ICU

This has been saved for the 8.2 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Magnus Hagander wrote:

I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus

-----Original Message-----
From: Magnus Hagander
Sent: Sunday, July 31, 2005 2:09 PM
To: PostgreSQL-development
Cc: pgsql-hackers-win32@postgresql.org
Subject: Win32 unicode vs ICU

Hi!

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

The main downsides of ICU vs the native ones are:
* ICU does not accept win32 locale names. When doing
setlocale("sv_se"), for example, win32 will return this in
later calls as "Swedish_Sweden.1252". To get around this in
the ICU patch, I had to implement a lookup map that converts
it back to sv_se for ICU.

* ICU is yet another build and runtime dependency, and a
large one (comes in at 11Mb for the DLL files alone in the
win32 download)

I guess that the main upside of it is that we'd get
constistent behaviour - in case there are issues with either
ICU or win32 native they'd otherwise differ. And only one new
codepath. But we already live with the platform-inconsistency today...

Another upside is that it handles more encodings in ICU - my
native implementation does *only* UTF8 and relies on existing
functionality to deal with other encodings. It could of
course be extended if necessary, but from what I can tell
UTF8 is the big one.

I have attached both patches. For the native version, only
win32_utf8.patch is required. For the ICU version,
icu_win32.patch is needed and also the files
localemap.c,localemap.pl, iso639 and iso3166 needs to go in
src/backend/port/win32. (the localemap needs to be updated to
do a better-than-linear search, but I wanted to include an example)

Thoughts on the options?

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

(I have run simple tests in swedish locale and both behave
the same and correct, but I'm unsure of exactly how much
would be affected)

Finally, the win32 patch also changes the normal path to use
strncoll(). The comment above the function states that we'd
like to use strncoll but it's not available. Well, on win32
it is, so it should provide a speedup on win32. It is
currently not included in the ICU patch, but should probably
be included whichever path we'd chose.

//Magnus

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#15Kevin McArthur
postgresql-list@stormtide.ca
In reply to: Magnus Hagander (#10)
FreeBSD ICU was Win32 unicode vs ICU

I was reviewing this thread about its lack of collation support in freebsd.

As some of you may or may not know the PHP project is also currently working
heavily on unicode support. (For PHP6)

I had the chance to ask Andrei Zmievski of the php project about their
support for unicode. The key items are as follows.

<StormTide> with the new unicode support, is there any support for unicode
collation
<andrei> StormTide, there will be
<StormTide> is it imported by the platform or custom done for php
<StormTide> (cuz freebsd seeems to have issues with its collation support)
<andrei> StormTide, not OS-dependent
<andrei> StormTide, uses CLDR

Should the postgresql project also be looking at CLDR for cross-platform
unicode support?

http://www.unicode.org/cldr/

Kevin McArthur
Digifonica Canada

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Magnus Hagander" <mha@sollentuna.net>
Cc: <pgsql-hackers@postgresql.org>; "Palle Girgensohn" <girgen@pingpong.net>
Sent: Tuesday, August 23, 2005 9:03 AM
Subject: Re: [HACKERS] Win32 unicode vs ICU

Show quoted text

I wrote:

(Just looking at it again, the code in convert_string_to_scalar is
pretty bogus for multibyte encodings in any case. Possibly we need to
rethink the whole approach.)

After studying this some more, I think the code is really so bogus for
any non-ASCII situation that it's probably not worth worrying about
too much. It's effectively assuming that the output of strxfrm() is
still in an ASCII-superset encoding ... but I don't see anything in
strxfrm's API that guarantees any such thing.

As long as strxfrm() doesn't fail completely for Windows Unicode,
I'd recommend just leaving this alone. As previously noted, the
worst that can happen is an estimation error that's bounded by the
histogram bin size anyhow.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

#16John Hansen
john@geeknet.com.au
In reply to: Kevin McArthur (#15)
Re: FreeBSD ICU was Win32 unicode vs ICU

Kevin McArthur Wrote:

Should the postgresql project also be looking at CLDR for
cross-platform unicode support?

Afaict, from the ICU website, ICU too uses CLDR.
Why reinvent the wheel?

... John

#17Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#1)
Re: FW: Win32 unicode vs ICU

Is this patch moving toward competion?

---------------------------------------------------------------------------

Magnus Hagander wrote:

I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus

-----Original Message-----
From: Magnus Hagander
Sent: Sunday, July 31, 2005 2:09 PM
To: PostgreSQL-development
Cc: pgsql-hackers-win32@postgresql.org
Subject: Win32 unicode vs ICU

Hi!

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

The main downsides of ICU vs the native ones are:
* ICU does not accept win32 locale names. When doing
setlocale("sv_se"), for example, win32 will return this in
later calls as "Swedish_Sweden.1252". To get around this in
the ICU patch, I had to implement a lookup map that converts
it back to sv_se for ICU.

* ICU is yet another build and runtime dependency, and a
large one (comes in at 11Mb for the DLL files alone in the
win32 download)

I guess that the main upside of it is that we'd get
constistent behaviour - in case there are issues with either
ICU or win32 native they'd otherwise differ. And only one new
codepath. But we already live with the platform-inconsistency today...

Another upside is that it handles more encodings in ICU - my
native implementation does *only* UTF8 and relies on existing
functionality to deal with other encodings. It could of
course be extended if necessary, but from what I can tell
UTF8 is the big one.

I have attached both patches. For the native version, only
win32_utf8.patch is required. For the ICU version,
icu_win32.patch is needed and also the files
localemap.c,localemap.pl, iso639 and iso3166 needs to go in
src/backend/port/win32. (the localemap needs to be updated to
do a better-than-linear search, but I wanted to include an example)

Thoughts on the options?

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

(I have run simple tests in swedish locale and both behave
the same and correct, but I'm unsure of exactly how much
would be affected)

Finally, the win32 patch also changes the normal path to use
strncoll(). The comment above the function states that we'd
like to use strncoll but it's not available. Well, on win32
it is, so it should provide a speedup on win32. It is
currently not included in the ICU patch, but should probably
be included whichever path we'd chose.

//Magnus

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com

+ If your life is a hard drive, Christ can be your backup. +

#18Roman Neuhauser
neuhauser@sigpipe.cz
In reply to: John Hansen (#16)
Re: FreeBSD ICU was Win32 unicode vs ICU

# john@geeknet.com.au / 2005-08-25 04:53:47 +1000:

Kevin McArthur Wrote:

Should the postgresql project also be looking at CLDR for
cross-platform unicode support?

Afaict, from the ICU website, ICU too uses CLDR.
Why reinvent the wheel?

Just a thought: PHP is under a BSD-like (Apache-style) license.
Does that mean they're going to rewrite ICU under their license?

--
How many Vietnam vets does it take to screw in a light bulb?
You don't know, man. You don't KNOW.
Cause you weren't THERE. http://bash.org/?255991

#19Martijn van Oosterhout
kleptog@svana.org
In reply to: Roman Neuhauser (#18)
Re: FreeBSD ICU was Win32 unicode vs ICU

On Sun, Mar 26, 2006 at 02:13:28PM +0200, Roman Neuhauser wrote:

Afaict, from the ICU website, ICU too uses CLDR.
Why reinvent the wheel?

Just a thought: PHP is under a BSD-like (Apache-style) license.
Does that mean they're going to rewrite ICU under their license?

Near as I can tell, ICU is under a BSD like licence.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html

Perhaps they are compatable.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#20Roman Neuhauser
neuhauser@sigpipe.cz
In reply to: Martijn van Oosterhout (#19)
Re: FreeBSD ICU was Win32 unicode vs ICU

# kleptog@svana.org / 2006-03-26 15:53:24 +0200:

On Sun, Mar 26, 2006 at 02:13:28PM +0200, Roman Neuhauser wrote:

Afaict, from the ICU website, ICU too uses CLDR.
Why reinvent the wheel?

Just a thought: PHP is under a BSD-like (Apache-style) license.
Does that mean they're going to rewrite ICU under their license?

Near as I can tell, ICU is under a BSD like licence.

I got confused by the Gnome LocaleProject reference earlier in the
thread, sorry for the noise.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html

Perhaps they are compatable.

They should be, as the ICU license is basically the MIT license
tightened to have the same set of requirements as the new BSD
license.

--
How many Vietnam vets does it take to screw in a light bulb?
You don't know, man. You don't KNOW.
Cause you weren't THERE. http://bash.org/?255991

#21Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#1)
Re: FW: Win32 unicode vs ICU

Added to TODO.detail.

---------------------------------------------------------------------------

Magnus Hagander wrote:

I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus

-----Original Message-----
From: Magnus Hagander
Sent: Sunday, July 31, 2005 2:09 PM
To: PostgreSQL-development
Cc: pgsql-hackers-win32@postgresql.org
Subject: Win32 unicode vs ICU

Hi!

I've been working with Palles ICU patch to make it work on
win32, and I believe I have it done. While doing it I noticed
that ICU basically converts to UTF16 and back - I previously
thought it worked on UTF8 strings. Based on this I also tried
out an implementation for the win32-unicode problem that does
*not* require ICU. It uses the win32 native functions to map
to utf16 and back, and then to process the text there. And I
got through with much less code than the ICU version, while
doing the same thing.

I am unsure of how to proceed. As I see it there are three paths:
1) Use native win32 functionality only on win32
2) Use ICU functionality only on win32
3) Allow both ICU and native functionality, compile time
switch --with-icu (same as unix with the ICU patch)

The main downsides of ICU vs the native ones are:
* ICU does not accept win32 locale names. When doing
setlocale("sv_se"), for example, win32 will return this in
later calls as "Swedish_Sweden.1252". To get around this in
the ICU patch, I had to implement a lookup map that converts
it back to sv_se for ICU.

* ICU is yet another build and runtime dependency, and a
large one (comes in at 11Mb for the DLL files alone in the
win32 download)

I guess that the main upside of it is that we'd get
constistent behaviour - in case there are issues with either
ICU or win32 native they'd otherwise differ. And only one new
codepath. But we already live with the platform-inconsistency today...

Another upside is that it handles more encodings in ICU - my
native implementation does *only* UTF8 and relies on existing
functionality to deal with other encodings. It could of
course be extended if necessary, but from what I can tell
UTF8 is the big one.

I have attached both patches. For the native version, only
win32_utf8.patch is required. For the ICU version,
icu_win32.patch is needed and also the files
localemap.c,localemap.pl, iso639 and iso3166 needs to go in
src/backend/port/win32. (the localemap needs to be updated to
do a better-than-linear search, but I wanted to include an example)

Thoughts on the options?

And anohter question - my native patch touches the same
functions as the ICU patch. Can somebody who knows the
internals confirm or deny that these are all the required
locations, or do we need to modify more?

(I have run simple tests in swedish locale and both behave
the same and correct, but I'm unsure of exactly how much
would be affected)

Finally, the win32 patch also changes the normal path to use
strncoll(). The comment above the function states that we'd
like to use strncoll but it's not available. Well, on win32
it is, so it should provide a speedup on win32. It is
currently not included in the ICU patch, but should probably
be included whichever path we'd chose.

//Magnus

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +