patch: utf8_to_unicode (trivial)

Started by Joseph Adamsover 15 years ago14 messages
#1Joseph Adams
joeyadams3.14159@gmail.com
1 attachment(s)

In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 ,
but no corresponding utf8_to_unicode . However, there is a static
function called utf2ucs that does what utf8_to_unicode would do.

I'd like this function to be available because the JSON code needs to
convert UTF-8 to and from Unicode codepoints, and I'm currently using
a separate UTF-8 to codepoint function for that.

This patch renames utf2ucs to utf8_to_unicode and makes it public. It
also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that
it's equivalent to the one in wchar.c .

This is a patch against CVS HEAD for application. It compiles and
tests successfully.

Comments? Thanks,

Joey Adams

Attachments:

utf8-to-unicode.patchapplication/octet-stream; name=utf8-to-unicode.patchDownload
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index e164866..1a2a24d 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -453,6 +453,34 @@ unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
 	return utf8string;
 }
 
+/*
+ * Convert a UTF-8 character to a Unicode code point.
+ * This is a one-character version of pg_utf2wchar_with_len.
+ *
+ * c must point to a large enough string for the character's length.
+ */
+pg_wchar
+utf8_to_unicode(const unsigned char *c)
+{
+	if ((*c & 0x80) == 0)
+		return (pg_wchar) c[0];
+	else if ((*c & 0xe0) == 0xc0)
+		return (pg_wchar) (((c[0] & 0x1f) << 6) |
+						   (c[1] & 0x3f));
+	else if ((*c & 0xf0) == 0xe0)
+		return (pg_wchar) (((c[0] & 0x0f) << 12) |
+						   ((c[1] & 0x3f) << 6) |
+						   (c[2] & 0x3f));
+	else if ((*c & 0xf8) == 0xf0)
+		return (pg_wchar) (((c[0] & 0x07) << 18) |
+						   ((c[1] & 0x3f) << 12) |
+						   ((c[2] & 0x3f) << 6) |
+						   (c[3] & 0x3f));
+	else
+		/* that is an invalid code on purpose */
+		return 0xffffffff;
+}
+
 
 /*
  * Return the byte length of a UTF8 character pointed to by s
@@ -462,8 +490,8 @@ unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
  * We return "1" for any leading byte that is either flat-out illegal or
  * indicates a length larger than we support.
  *
- * pg_utf2wchar_with_len(), utf2ucs(), pg_utf8_islegal(), and perhaps
- * other places would need to be fixed to change this.
+ * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(),
+ * and perhaps other places would need to be fixed to change this.
  */
 int
 pg_utf_mblen(const unsigned char *s)
@@ -632,36 +660,10 @@ ucs_wcwidth(pg_wchar ucs)
 		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
 }
 
-static pg_wchar
-utf2ucs(const unsigned char *c)
-{
-	/*
-	 * one char version of pg_utf2wchar_with_len. no control here, c must
-	 * point to a large enough string
-	 */
-	if ((*c & 0x80) == 0)
-		return (pg_wchar) c[0];
-	else if ((*c & 0xe0) == 0xc0)
-		return (pg_wchar) (((c[0] & 0x1f) << 6) |
-						   (c[1] & 0x3f));
-	else if ((*c & 0xf0) == 0xe0)
-		return (pg_wchar) (((c[0] & 0x0f) << 12) |
-						   ((c[1] & 0x3f) << 6) |
-						   (c[2] & 0x3f));
-	else if ((*c & 0xf8) == 0xf0)
-		return (pg_wchar) (((c[0] & 0x07) << 18) |
-						   ((c[1] & 0x3f) << 12) |
-						   ((c[2] & 0x3f) << 6) |
-						   (c[3] & 0x3f));
-	else
-		/* that is an invalid code on purpose */
-		return 0xffffffff;
-}
-
 static int
 pg_utf_dsplen(const unsigned char *s)
 {
-	return ucs_wcwidth(utf2ucs(s));
+	return ucs_wcwidth(utf8_to_unicode(s));
 }
 
 /*
diff --git a/src/bin/psql/mbprint.c b/src/bin/psql/mbprint.c
index b1263ea..4c8b1d8 100644
--- a/src/bin/psql/mbprint.c
+++ b/src/bin/psql/mbprint.c
@@ -63,7 +63,7 @@ utf2ucs(const unsigned char *c)
 						   ((c[1] & 0x3f) << 6) |
 						   (c[2] & 0x3f));
 	}
-	else if ((*c & 0xf0) == 0xf0)
+	else if ((*c & 0xf8) == 0xf0)
 	{
 		return (pg_wchar) (((c[0] & 0x07) << 18) |
 						   ((c[1] & 0x3f) << 12) |
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 817f9aa..ea753a6 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -412,6 +412,7 @@ extern int	pg_valid_client_encoding(const char *name);
 extern int	pg_valid_server_encoding(const char *name);
 
 extern unsigned char *unicode_to_utf8(pg_wchar c, unsigned char *utf8string);
+extern pg_wchar utf8_to_unicode(const unsigned char *utf8char);
 extern int	pg_utf_mblen(const unsigned char *);
 extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len,
 						  int src_encoding,
#2Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Joseph Adams (#1)
1 attachment(s)
Re: patch: utf8_to_unicode (trivial)

On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 ,
but no corresponding utf8_to_unicode .  However, there is a static
function called utf2ucs that does what utf8_to_unicode would do.

I'd like this function to be available because the JSON code needs to
convert UTF-8 to and from Unicode codepoints, and I'm currently using
a separate UTF-8 to codepoint function for that.

This patch renames utf2ucs to utf8_to_unicode and makes it public.  It
also fixes the version of utf2ucs in  src/bin/psql/mbprint.c so that
it's equivalent to the one in wchar.c .

This is a patch against CVS HEAD for application.  It compiles and
tests successfully.

Comments?  Thanks,

I feel obliged to respond this since I'm supposed to be covering your
GSoC project while Magnus is on vacation, but I actually know very
little about this topic.  What's undeniable, however, is that the
coding in the two versions of utf8ucs() in the tree right now don't
match.  src/backend/utils/mb/wchar.c has:

       else if ((*c & 0xf8) == 0xf0)

while src/bin/psql/mbprint.c, which is otherwise identical, has:

       else if ((*c & 0xf0) == 0xf0)

I'm inclined to believe that your patch is right to think that the
former version is correct, because it used to match the latter version
until Tom Lane changed it in 2007, and I suspect he simply failed to
update both copies.  But I'd like someone who actually understands
what this code is doing to confirm that.

http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php

I suspect we need to not only fix this, but back-patch it at least to
8.2, which is the first release where there are two copies of this
function.  I am not sure whether earlier releases need to be changed,
or not.  But again, someone who understands the issues better than I
do needs to weigh in here.

In terms of making this function non-static, I'm inclined to think
that a better approach would be to move it to src/port.  That gets rid
of the need to have two copies in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port
. It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.

I'm not sure whether I like the old patch better or the new one.

Joey Adams

Attachments:

utf8-to-unicode-port.patchapplication/octet-stream; name=utf8-to-unicode-port.patchDownload
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 4b98c8b..d1c75db 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -462,8 +462,8 @@ unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
  * We return "1" for any leading byte that is either flat-out illegal or
  * indicates a length larger than we support.
  *
- * pg_utf2wchar_with_len(), utf2ucs(), pg_utf8_islegal(), and perhaps
- * other places would need to be fixed to change this.
+ * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(),
+ * and perhaps other places would need to be fixed to change this.
  */
 int
 pg_utf_mblen(const unsigned char *s)
@@ -632,36 +632,10 @@ ucs_wcwidth(pg_wchar ucs)
 		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
 }
 
-static pg_wchar
-utf2ucs(const unsigned char *c)
-{
-	/*
-	 * one char version of pg_utf2wchar_with_len. no control here, c must
-	 * point to a large enough string
-	 */
-	if ((*c & 0x80) == 0)
-		return (pg_wchar) c[0];
-	else if ((*c & 0xe0) == 0xc0)
-		return (pg_wchar) (((c[0] & 0x1f) << 6) |
-						   (c[1] & 0x3f));
-	else if ((*c & 0xf0) == 0xe0)
-		return (pg_wchar) (((c[0] & 0x0f) << 12) |
-						   ((c[1] & 0x3f) << 6) |
-						   (c[2] & 0x3f));
-	else if ((*c & 0xf8) == 0xf0)
-		return (pg_wchar) (((c[0] & 0x07) << 18) |
-						   ((c[1] & 0x3f) << 12) |
-						   ((c[2] & 0x3f) << 6) |
-						   (c[3] & 0x3f));
-	else
-		/* that is an invalid code on purpose */
-		return 0xffffffff;
-}
-
 static int
 pg_utf_dsplen(const unsigned char *s)
 {
-	return ucs_wcwidth(utf2ucs(s));
+	return ucs_wcwidth(utf8_to_unicode(s));
 }
 
 /*
diff --git a/src/bin/psql/mbprint.c b/src/bin/psql/mbprint.c
index ad66ed8..96f76ff 100644
--- a/src/bin/psql/mbprint.c
+++ b/src/bin/psql/mbprint.c
@@ -43,41 +43,6 @@ pg_get_utf8_id(void)
 #define PG_UTF8		pg_get_utf8_id()
 
 
-static pg_wchar
-utf2ucs(const unsigned char *c)
-{
-	/*
-	 * one char version of pg_utf2wchar_with_len. no control here, c must
-	 * point to a large enough string
-	 */
-	if ((*c & 0x80) == 0)
-		return (pg_wchar) c[0];
-	else if ((*c & 0xe0) == 0xc0)
-	{
-		return (pg_wchar) (((c[0] & 0x1f) << 6) |
-						   (c[1] & 0x3f));
-	}
-	else if ((*c & 0xf0) == 0xe0)
-	{
-		return (pg_wchar) (((c[0] & 0x0f) << 12) |
-						   ((c[1] & 0x3f) << 6) |
-						   (c[2] & 0x3f));
-	}
-	else if ((*c & 0xf0) == 0xf0)
-	{
-		return (pg_wchar) (((c[0] & 0x07) << 18) |
-						   ((c[1] & 0x3f) << 12) |
-						   ((c[2] & 0x3f) << 6) |
-						   (c[3] & 0x3f));
-	}
-	else
-	{
-		/* that is an invalid code on purpose */
-		return 0xffffffff;
-	}
-}
-
-
 /*
  * Unicode 3.1 compliant validation : for each category, it checks the
  * combination of each byte to make sure it maps to a valid range. It also
@@ -354,7 +319,7 @@ pg_wcsformat(unsigned char *pwcs, size_t len, int encoding,
 		else if (w < 0)			/* Non-ascii control char */
 		{
 			if (encoding == PG_UTF8)
-				sprintf((char *) ptr, "\\u%04X", utf2ucs(pwcs));
+				sprintf((char *) ptr, "\\u%04X", utf8_to_unicode(pwcs));
 			else
 			{
 				/*
diff --git a/src/include/port.h b/src/include/port.h
index 291a3e7..9a3614f 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -439,4 +439,7 @@ extern void qsort_arg(void *base, size_t nel, size_t elsize,
 /* port/chklocale.c */
 extern int	pg_get_encoding_from_locale(const char *ctype);
 
+/* port/utf8.c */
+extern unsigned int utf8_to_unicode(const unsigned char *utf8char);
+
 #endif   /* PG_PORT_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 9ef9491..2622ca4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -31,7 +31,7 @@ override CPPFLAGS := -I$(top_builddir)/src/port -DFRONTEND $(CPPFLAGS)
 LIBS += $(PTHREAD_LIBS)
 
 OBJS = $(LIBOBJS) chklocale.o dirmod.o exec.o noblock.o path.o \
-	pgsleep.o pgstrcasecmp.o qsort.o qsort_arg.o sprompt.o thread.o
+	pgsleep.o pgstrcasecmp.o qsort.o qsort_arg.o sprompt.o thread.o utf8.o
 ifneq (,$(filter $(PORTNAME),cygwin win32))
 OBJS += pipe.o
 endif
diff --git a/src/port/utf8.c b/src/port/utf8.c
new file mode 100644
index 0000000..33d38ef
--- /dev/null
+++ b/src/port/utf8.c
@@ -0,0 +1,45 @@
+/*-------------------------------------------------------------------------
+ *
+ * utf8.c
+ *	  The utf8_to_unicode function
+ *
+ * This lives in src/port instead of src/backend/utils/mb so that it
+ * doesn't have to be duplicated in src/bin/psql/mprint.c .
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Convert a UTF-8 character to a Unicode code point.
+ * This is a one-character version of pg_utf2wchar_with_len.
+ *
+ * c must point to a large enough string for the character's length.
+ */
+unsigned int
+utf8_to_unicode(const unsigned char *c)
+{
+	if ((*c & 0x80) == 0)
+		return (unsigned int) c[0];
+	else if ((*c & 0xe0) == 0xc0)
+		return (unsigned int) (((c[0] & 0x1f) << 6) |
+							   (c[1] & 0x3f));
+	else if ((*c & 0xf0) == 0xe0)
+		return (unsigned int) (((c[0] & 0x0f) << 12) |
+							   ((c[1] & 0x3f) << 6) |
+							   (c[2] & 0x3f));
+	else if ((*c & 0xf8) == 0xf0)
+		return (unsigned int) (((c[0] & 0x07) << 18) |
+							   ((c[1] & 0x3f) << 12) |
+							   ((c[2] & 0x3f) << 6) |
+							   (c[3] & 0x3f));
+	else
+		/* that is an invalid code on purpose */
+		return 0xffffffff;
+}
#3Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#2)
Re: patch: utf8_to_unicode (trivial)

On Fri, Aug 13, 2010 at 3:12 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port
.  It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.

Well, right now, in addition to having two copies of utf2ucs(), we
have two declarations of pg_wchar, one in src/bin/psql/mbprint.c and
the other in src/include/mb/pg_wchar.h; so both existing copies of the
function are able to use that typedef. It seems like we might want to
move the typedef to the same place as the declaration of the renamed
utf2ucs(), but I'm not quite sure where that should be. The only
header in src/port is pthread-win32.h, and we're sure not going to put
it there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#4Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#3)
Re: patch: utf8_to_unicode (trivial)

Excerpts from Robert Haas's message of vie ago 13 12:00:32 -0400 2010:

On Fri, Aug 13, 2010 at 3:12 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port
.  It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.

Well, right now, in addition to having two copies of utf2ucs(), we
have two declarations of pg_wchar, one in src/bin/psql/mbprint.c and
the other in src/include/mb/pg_wchar.h; so both existing copies of the
function are able to use that typedef. It seems like we might want to
move the typedef to the same place as the declaration of the renamed
utf2ucs(), but I'm not quite sure where that should be. The only
header in src/port is pthread-win32.h, and we're sure not going to put
it there.

src/include/port.h?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#4)
Re: patch: utf8_to_unicode (trivial)

On Fri, Aug 13, 2010 at 12:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

src/include/port.h?

Oh, hey, look at that. Any thought on what to about the fact that our
two existing copies of utf2ucs() don't match? (one tests against 0xf8
where the other against 0xf0)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#6Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#5)
Re: patch: utf8_to_unicode (trivial)

Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010:

On Fri, Aug 13, 2010 at 12:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

src/include/port.h?

Oh, hey, look at that. Any thought on what to about the fact that our
two existing copies of utf2ucs() don't match? (one tests against 0xf8
where the other against 0xf0)

I'm not sure why it's masking 0xf8 instead of 0xf0. It seems like c &
0xf8 == 0xf8 signals start of a 5-byte sequence which is not valid per
RFC 3629, according to wikipedia:
http://en.wikipedia.org/wiki/UTF-8#Description

(Moreover, 0xf5 to 0xf7 signal start of a 4-byte sequence for codepoints
that apparently are not supposed to be valid).

So apparently it's good that the code returns an invalid code in those
cases, i.e. wchar.c is right and mbprint is wrong.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: patch: utf8_to_unicode (trivial)

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010:

Oh, hey, look at that. Any thought on what to about the fact that our
two existing copies of utf2ucs() don't match? (one tests against 0xf8
where the other against 0xf0)

I'm not sure why it's masking 0xf8 instead of 0xf0.

Because it wants to verify that this is in fact a 4-byte UTF8 code.
Compare the code (and comments) for pg_utf_mblen.

AFAICS the version in mbprint.c is flat out wrong, and the only reason
nobody's noticed is that it should never get passed a more-than-4-byte
sequence anyway.

regards, tom lane

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: patch: utf8_to_unicode (trivial)

On Fri, Aug 13, 2010 at 1:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010:

Oh, hey, look at that.  Any thought on what to about the fact that our
two existing copies of utf2ucs() don't match?  (one tests against 0xf8
where the other against 0xf0)

I'm not sure why it's masking 0xf8 instead of 0xf0.

Because it wants to verify that this is in fact a 4-byte UTF8 code.
Compare the code (and comments) for pg_utf_mblen.

AFAICS the version in mbprint.c is flat out wrong, and the only reason
nobody's noticed is that it should never get passed a more-than-4-byte
sequence anyway.

Should we fix it, then, and if so how far should we go back?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: patch: utf8_to_unicode (trivial)

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 13, 2010 at 1:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAICS the version in mbprint.c is flat out wrong, and the only reason
nobody's noticed is that it should never get passed a more-than-4-byte
sequence anyway.

Should we fix it, then, and if so how far should we go back?

Yes, I'd vote for back-patching, at least to all versions in which the
wchar.c version looks like that.

regards, tom lane

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joseph Adams (#2)
Re: patch: utf8_to_unicode (trivial)

Joseph Adams <joeyadams3.14159@gmail.com> writes:

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port
. It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.

I'm not sure whether I like the old patch better or the new one.

FWIW, I *don't* like this version, specifically because it fails to
utilize the pg_wchar datatype. The function in question is neither big
enough nor mutable enough that it's urgent to not duplicate it between
the backend and psql, so I don't see much value in moving it to src/port.

I think the correct things to do are to apply the original patch (modulo
a stylistic change, namely put the new function where the old one was to
miminize the size of the diff), and to back-patch the bug fix in mbprint.c.

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: patch: utf8_to_unicode (trivial)

On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joseph Adams <joeyadams3.14159@gmail.com> writes:

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port
.  It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.

I'm not sure whether I like the old patch better or the new one.

FWIW, I *don't* like this version, specifically because it fails to
utilize the pg_wchar datatype.  The function in question is neither big
enough nor mutable enough that it's urgent to not duplicate it between
the backend and psql, so I don't see much value in moving it to src/port.

Well, we'd better at least add a comment noting that the two versions
should match. But I think it would be better to unify them. However,
in the back-branches, I'd just fix the incorrect copy.

YMMV.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: patch: utf8_to_unicode (trivial)

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I *don't* like this version, specifically because it fails to
utilize the pg_wchar datatype. �The function in question is neither big
enough nor mutable enough that it's urgent to not duplicate it between
the backend and psql, so I don't see much value in moving it to src/port.

Well, we'd better at least add a comment noting that the two versions
should match. But I think it would be better to unify them. However,
in the back-branches, I'd just fix the incorrect copy.

Yeah, I did the latter part already because I figured it was
uncontroversial. What to do in HEAD is still under debate.

As for "the two versions should match", the only way they'd be likely to
diverge would be if the requirements change on one end or the other.
It's not unreasonable to suppose, for example, that we might want the
backend's version to start throwing an elog instead of just returning
-1 for a bad character. It would be a lot harder to do that if we've
pushed the code into src/port.

regards, tom lane

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: patch: utf8_to_unicode (trivial)

On Sun, Aug 15, 2010 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I *don't* like this version, specifically because it fails to
utilize the pg_wchar datatype.  The function in question is neither big
enough nor mutable enough that it's urgent to not duplicate it between
the backend and psql, so I don't see much value in moving it to src/port.

Well, we'd better at least add a comment noting that the two versions
should match.  But I think it would be better to unify them.  However,
in the back-branches, I'd just fix the incorrect copy.

Yeah, I did the latter part already because I figured it was
uncontroversial.  What to do in HEAD is still under debate.

As for "the two versions should match", the only way they'd be likely to
diverge would be if the requirements change on one end or the other.
It's not unreasonable to suppose, for example, that we might want the
backend's version to start throwing an elog instead of just returning
-1 for a bad character.  It would be a lot harder to do that if we've
pushed the code into src/port.

Not really. You'd just write a wrapper to call the version in
src/port and then elog if it returned -1. Unless -1 is actually a
valid result, I guess.

Anyway, it's not really important enough to me to have a protracted
argument about it. Let's wait and see if anyone else has an opinion,
and perhaps a consensus will emerge.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: patch: utf8_to_unicode (trivial)

Robert Haas <robertmhaas@gmail.com> writes:

Anyway, it's not really important enough to me to have a protracted
argument about it. Let's wait and see if anyone else has an opinion,
and perhaps a consensus will emerge.

Well, nobody else seems to care, so I went ahead and committed the
shorter form of the patch, ie just rename & export the function.

regards, tom lane