Unicode escapes in literals
I would like to add an escape mechanism to PostgreSQL for entering
arbitrary Unicode characters into string literals. We currently have
only two options. One is entering the character directly via the
keyboard or cut-and-paste, which is difficult for a number of reasons,
such as when the font doesn't have the character. The other is entering
the UTF8-encoded bytes using E'...' strings, which is hardly usable.
SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]
where xxxx is the hexadecimal Unicode codepoint. So this is pretty much
just another variant on what the E'...' syntax does.
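
To make the \xxxx form concrete, here is a minimal standalone sketch, not taken from any PostgreSQL source, of turning four hexadecimal digits into a code point and then into UTF-8 bytes. The function names hex_digit_value, parse_hex4 and encode_utf8 are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Parse exactly four hex digits starting at s into *codepoint.
 * Returns 1 on success, 0 if one of the four characters is not a hex
 * digit (this includes hitting the end of the string early).
 */
int parse_hex4(const char *s, uint32_t *codepoint)
{
    uint32_t value = 0;

    for (int i = 0; i < 4; i++)
    {
        int d = hex_digit_value(s[i]);

        if (d < 0)
            return 0;
        value = (value << 4) | (uint32_t) d;
    }
    *codepoint = value;
    return 1;
}

/*
 * Encode a code point (assumed <= 0x10FFFF) as UTF-8.
 * out must have room for 4 bytes; returns the number of bytes written.
 */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | ((cp >> 18) & 0x07));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

For example, the digits 0441 parse to code point U+0441, which encodes as the two bytes 0xD1 0x81.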
The trick is that since we have user-definable encoding conversion
routines, we can't convert the Unicode codepoint to the server encoding
in the scanner stage. I imagine there are two ways to address this:
1. Only support this syntax when the server encoding is UTF8. This
would probably cover most use cases anyway. We could have limited
support for characters in the ASCII range for all server encodings.
2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.
I'd be happy to start with UTF8 support only. Other ideas?
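
Option 1 amounts to a small acceptance rule, which could be sketched roughly as follows. The function name escape_codepoint_allowed and the server_is_utf8 flag are hypothetical stand-ins for whatever the scanner would actually consult. The upper bound of 0x10FFFF is the Unicode maximum, not something proposed above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of option 1: with a UTF8 server encoding any code point up to
 * the Unicode maximum is accepted; with any other server encoding only
 * the ASCII range is accepted.
 */
bool escape_codepoint_allowed(uint32_t cp, bool server_is_utf8)
{
    if (cp > 0x10FFFF)
        return false;       /* beyond the Unicode code space */
    if (server_is_utf8)
        return true;
    return cp <= 0x7F;      /* limited ASCII-only support elsewhere */
}
```

Under this rule, U+0441 would be accepted in a UTF8 database and rejected in, say, a LATIN1 one, while U+0061 would be accepted in both.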
Peter Eisentraut <peter_e@gmx.net> writes:
SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]
Man that's ugly. Why the ampersand? How do you propose to distinguish
this from a perfectly legitimate use of the & operator?
2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.
Uh, why do you think that? The function could surely be stable, even
immutable if you grant that a database's encoding can't change.
regards, tom lane
Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]
Man that's ugly. Why the ampersand?
Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.
How do you propose to distinguish
this from a perfectly legitimate use of the & operator?
Well, technically, there is going to be some conflict, but the practical
impact should be minimal because:
- There are no spaces allowed between U&' . We typically suggest spaces
around binary operators.
- Naming a column "u" might not be terribly common.
- Binary-and with an undecorated string literal is not very common.
Of course, I have no data for these assertions. An inquiry on -general
might give more insight.
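
The "no spaces" point can be illustrated with a tiny lookahead check. This is only a sketch of the idea, not how a flex-based scanner would express it, and the function name starts_unicode_literal is invented here.

```c
#include <stdbool.h>

/*
 * Return true if s begins a Unicode-escape string literal, that is,
 * u or U immediately followed by & and a single quote, with no spaces
 * in between. Short-circuit evaluation keeps this safe for short input.
 */
bool starts_unicode_literal(const char *s)
{
    return (s[0] == 'u' || s[0] == 'U')
        && s[1] == '&'
        && s[2] == '\'';
}
```

With this rule, u&'x' is read as one literal, whereas u & 'x' remains a binary-and between a column named u and an ordinary string.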
2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.
Uh, why do you think that? The function could surely be stable, even
immutable if you grant that a database's encoding can't change.
Yeah, true, that would work.
There are some other disadvantages for making a function call. You
couldn't use that kind of literal in any other place where the parser
calls for a string constant: role names, tablespace locations,
passwords, copy delimiters, enum values, function body, file names.
There is also a related feature for Unicode escapes in identifiers, and
it might be good to keep the door open on that.
We could do a dual approach: Convert in the scanner when server encoding
is UTF8, and pass on as function call otherwise. Surely ugly though.
Or pass it on as a separate token type to the analyze phase, but that is
a lot more work.
Others: What use cases do you envision, and what requirements would they
create for this feature?
Peter Eisentraut <peter_e@gmx.net> writes:
There are some other disadvantages for making a function call. You
couldn't use that kind of literal in any other place where the parser
calls for a string constant: role names, tablespace locations,
passwords, copy delimiters, enum values, function body, file names.
Good point. I'm okay with supporting the feature only when database
encoding is UTF8.
regards, tom lane
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
Man that's ugly. Why the ampersand?
Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.
Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?
A
--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/
Andrew Sullivan <ajs@commandprompt.com> writes:
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.
Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?
They could hardly ignore the conflict with the operator interpretation
for +. The committee has now cut themselves off from ever having a
standard operator named &, but I suppose they didn't think ahead to that.
regards, tom lane
I wrote:
SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]
Here is an in-progress patch for this. It still needs updates in the
psql scanner and possibly other scanners. But the server-side
functionality works.
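
As a quick illustration of the rule the patch applies to the UESCAPE character (no hexadecimal digit, plus sign, single quote, double quote, or whitespace), here is a standalone sketch. The name valid_uescape_char is made up, and the standard isspace() is used as a stand-in for the scanner's own whitespace test, which may differ in detail.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Return true if c is acceptable as a UESCAPE character under the rule
 * described in the patch's documentation.
 */
bool valid_uescape_char(unsigned char c)
{
    if (isxdigit(c))
        return false;   /* would be ambiguous with the code point digits */
    if (c == '+' || c == '\'' || c == '"')
        return false;   /* plus sign and both quote characters */
    if (isspace(c))
        return false;   /* stand-in for the scanner's whitespace check */
    return true;
}
```

So the default backslash and the '!' used in the documentation example pass, while 'a', '+', and a space do not.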
Attachment: uescape.diff (text/plain)
Index: doc/src/sgml/syntax.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.123
diff -u -3 -p -c -r1.123 syntax.sgml
*** doc/src/sgml/syntax.sgml 26 Jun 2008 22:24:42 -0000 1.123
--- doc/src/sgml/syntax.sgml 27 Oct 2008 16:54:26 -0000
*************** UPDATE "my_table" SET "a" = 5;
*** 190,195 ****
--- 190,247 ----
</para>
<para>
+ A variant of quoted identifiers allows including escaped Unicode
+ characters identified by their code points. This variant starts
+ with <literal>U&</literal> (upper or lower case U followed by
+ ampersand) immediately before the opening double quote, without
+ any spaces in between, for example <literal>U&"foo"</literal>.
+ (Note that this creates an ambiguity with the
+ operator <literal>&</literal>. Use spaces around the operator to
+ avoid this problem.) Inside the quotes, Unicode characters can be
+ specified in escaped form by writing a backslash followed by the
+ four-digit hexadecimal code point number or alternatively a
+ backslash followed by a plus sign followed by a six-digit
+ hexadecimal code point number. For example, the
+ identifier <literal>"data"</literal> could be written as
+ <programlisting>
+ U&"d\0061t\0061"
+ </programlisting>
+ or equivalently
+ <programlisting>
+ U&"d\+000061t\+000061"
+ </programlisting>
+ The following less trivial example writes the Russian
+ word <quote>slon</quote> (elephant) in Cyrillic letters:
+ <programlisting>
+ U&"\0441\043B\043E\043D"
+ </programlisting>
+ </para>
+
+ <para>
+ If a different escape character than backslash is desired, it can
+ be specified using the <literal>UESCAPE</literal> clause after the
+ string, for example:
+ <programlisting>
+ U&"d!0061t!0061" UESCAPE '!'
+ </programlisting>
+ The escape character can be any single character other than a
+ hexadecimal digit, the plus sign, a single quote, a double quote,
+ or a whitespace character. Note that the escape character is
+ written in single quotes, not double quotes.
+ </para>
+
+ <para>
+ To include the escape character in the identifier literally, write
+ it twice.
+ </para>
+
+ <para>
+ The Unicode escape syntax works only when the server encoding is
+ UTF8. When other server encodings are used, only code points in
+ the ASCII range (up to <literal>\007F</literal>) can be specified.
+ </para>
+
+ <para>
Quoting an identifier also makes it case-sensitive, whereas
unquoted names are always folded to lower case. For example, the
identifiers <literal>FOO</literal>, <literal>foo</literal>, and
*************** UPDATE "my_table" SET "a" = 5;
*** 245,251 ****
write two adjacent single quotes, e.g.
<literal>'Dianne''s horse'</literal>.
Note that this is <emphasis>not</> the same as a double-quote
! character (<literal>"</>).
</para>
<para>
--- 297,303 ----
write two adjacent single quotes, e.g.
<literal>'Dianne''s horse'</literal>.
Note that this is <emphasis>not</> the same as a double-quote
! character (<literal>"</>). <!-- font-lock sanity: " -->
</para>
<para>
*************** SELECT 'foo' 'bar';
*** 269,282 ****
by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
following the standard.)
</para>
- <para>
<indexterm>
<primary>escape string syntax</primary>
</indexterm>
<indexterm>
<primary>backslash escapes</primary>
</indexterm>
<productname>PostgreSQL</productname> also accepts <quote>escape</>
string constants, which are an extension to the SQL standard.
An escape string constant is specified by writing the letter
--- 321,339 ----
by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
following the standard.)
</para>
+ </sect3>
+
+ <sect3 id="sql-syntax-strings-escape">
+ <title>String Constants with C-Style Escapes</title>
<indexterm>
<primary>escape string syntax</primary>
</indexterm>
<indexterm>
<primary>backslash escapes</primary>
</indexterm>
+
+ <para>
<productname>PostgreSQL</productname> also accepts <quote>escape</>
string constants, which are an extension to the SQL standard.
An escape string constant is specified by writing the letter
*************** SELECT 'foo' 'bar';
*** 287,293 ****
Within an escape string, a backslash character (<literal>\</>) begins a
C-like <firstterm>backslash escape</> sequence, in which the combination
of backslash and following character(s) represent a special byte
! value:
<table id="sql-backslash-table">
<title>Backslash Escape Sequences</title>
--- 344,351 ----
Within an escape string, a backslash character (<literal>\</>) begins a
C-like <firstterm>backslash escape</> sequence, in which the combination
of backslash and following character(s) represent a special byte
! value, shown in <xref linkend="sql-backslash-table">.
! </para>
<table id="sql-backslash-table">
<title>Backslash Escape Sequences</title>
*************** SELECT 'foo' 'bar';
*** 341,354 ****
</tgroup>
</table>
! It is your responsibility that the byte sequences you create are
! valid characters in the server character set encoding. Any other
character following a backslash is taken literally. Thus, to
include a backslash character, write two backslashes (<literal>\\</>).
Also, a single quote can be included in an escape string by writing
<literal>\'</literal>, in addition to the normal way of <literal>''</>.
</para>
<caution>
<para>
If the configuration parameter
--- 399,422 ----
</tgroup>
</table>
! <para>
! Any other
character following a backslash is taken literally. Thus, to
include a backslash character, write two backslashes (<literal>\\</>).
Also, a single quote can be included in an escape string by writing
<literal>\'</literal>, in addition to the normal way of <literal>''</>.
</para>
+ <para>
+ It is your responsibility that the byte sequences you create are
+ valid characters in the server character set encoding. When the
+ server encoding is UTF-8, then the alternative Unicode escape
+ syntax, explained in <xref linkend="sql-syntax-strings-uescape">,
+ should be used instead. (The alternative would be doing the
+ UTF-8 encoding by hand and writing out the bytes, which would be
+ very cumbersome.)
+ </para>
+
<caution>
<para>
If the configuration parameter
*************** SELECT 'foo' 'bar';
*** 379,384 ****
--- 447,509 ----
</para>
</sect3>
+ <sect3 id="sql-syntax-strings-uescape">
+ <title>String Constants with Unicode Escapes</title>
+
+ <para>
+ <productname>PostgreSQL</productname> also supports another type
+ of escape syntax for strings that allows specifying arbitrary
+ Unicode characters by code point. A Unicode escape string
+ constant starts with <literal>U&</literal> (upper or lower case
+ letter U followed by ampersand) immediately before the opening
+ quote, without any spaces in between, for
+ example <literal>U&'foo'</literal>. (Note that this creates an
+ ambiguity with the operator <literal>&</literal>. Use spaces
+ around the operator to avoid this problem.) Inside the quotes,
+ Unicode characters can be specified in escaped form by writing a
+ backslash followed by the four-digit hexadecimal code point
+ number or alternatively a backslash followed by a plus sign
+ followed by a six-digit hexadecimal code point number. For
+ example, the string <literal>'data'</literal> could be written as
+ <programlisting>
+ U&'d\0061t\0061'
+ </programlisting>
+ or equivalently
+ <programlisting>
+ U&'d\+000061t\+000061'
+ </programlisting>
+ The following less trivial example writes the Russian
+ word <quote>slon</quote> (elephant) in Cyrillic letters:
+ <programlisting>
+ U&'\0441\043B\043E\043D'
+ </programlisting>
+ </para>
+
+ <para>
+ If a different escape character than backslash is desired, it can
+ be specified using the <literal>UESCAPE</literal> clause after
+ the string, for example:
+ <programlisting>
+ U&'d!0061t!0061' UESCAPE '!'
+ </programlisting>
+ The escape character can be any single character other than a
+ hexadecimal digit, the plus sign, a single quote, a double quote,
+ or a whitespace character.
+ </para>
+
+ <para>
+ The Unicode escape syntax works only when the server encoding is
+ UTF8. When other server encodings are used, only code points in
+ the ASCII range (up to <literal>\007F</literal>) can be
+ specified.
+ </para>
+
+ <para>
+ To include the escape character in the string literally, write it
+ twice.
+ </para>
+ </sect3>
+
<sect3 id="sql-syntax-dollar-quoting">
<title>Dollar-Quoted String Constants</title>
Index: src/backend/parser/scan.l
===================================================================
RCS file: /cvsroot/pgsql/src/backend/parser/scan.l,v
retrieving revision 1.146
diff -u -3 -p -c -r1.146 scan.l
*** src/backend/parser/scan.l 1 Sep 2008 20:42:45 -0000 1.146
--- src/backend/parser/scan.l 27 Oct 2008 16:54:27 -0000
*************** static int literalalloc; /* current all
*** 76,81 ****
--- 76,82 ----
static void addlit(char *ytext, int yleng);
static void addlitchar(unsigned char ychar);
static char *litbufdup(void);
+ static char *litbuf_udeescape(unsigned char escape);
#define lexer_errposition() scanner_errposition(yylloc)
*************** static unsigned char unescape_single_cha
*** 125,130 ****
--- 126,133 ----
* <xq> standard quoted strings
* <xe> extended quoted strings (support backslash escape sequences)
* <xdolq> $foo$ quoted strings
+ * <xui> quoted identifier with Unicode escapes
+ * <xus> quoted string with Unicode escapes
*/
%x xb
*************** static unsigned char unescape_single_cha
*** 134,139 ****
--- 137,144 ----
%x xe
%x xq
%x xdolq
+ %x xui
+ %x xus
/*
* In order to make the world safe for Windows and Mac clients as well as
*************** xdstop {dquote}
*** 244,249 ****
--- 249,273 ----
xddouble {dquote}{dquote}
xdinside [^"]+
+ /* Unicode escapes */
+ uescape [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+ /* error rule to avoid backup */
+ uescapefail ("-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU])
+
+ /* Quoted identifier with Unicode escapes */
+ xuistart [uU]&{dquote}
+ xuistop1 {dquote}{whitespace}*{uescapefail}?
+ xuistop2 {dquote}{whitespace}*{uescape}
+
+ /* Quoted string with Unicode escapes */
+ xusstart [uU]&{quote}
+ xusstop1 {quote}{whitespace}*{uescapefail}?
+ xusstop2 {quote}{whitespace}*{uescape}
+
+ /* error rule to avoid backup */
+ xufailed [uU]&
+
+
/* C-style comments
*
* The "extended comment" syntax closely resembles allowable operator syntax.
*************** other .
*** 444,449 ****
--- 468,478 ----
BEGIN(xe);
startlit();
}
+ {xusstart} {
+ SET_YYLLOC();
+ BEGIN(xus);
+ startlit();
+ }
<xq,xe>{quotestop} |
<xq,xe>{quotefail} {
yyless(1);
*************** other .
*** 456,465 ****
yylval.str = litbufdup();
return SCONST;
}
! <xq,xe>{xqdouble} {
addlitchar('\'');
}
! <xq>{xqinside} {
addlit(yytext, yyleng);
}
<xe>{xeinside} {
--- 485,506 ----
yylval.str = litbufdup();
return SCONST;
}
! <xus>{xusstop1} {
! /* throw back all but the quote */
! yyless(1);
! BEGIN(INITIAL);
! yylval.str = litbuf_udeescape('\\');
! return SCONST;
! }
! <xus>{xusstop2} {
! BEGIN(INITIAL);
! yylval.str = litbuf_udeescape(yytext[yyleng-2]);
! return SCONST;
! }
! <xq,xe,xus>{xqdouble} {
addlitchar('\'');
}
! <xq,xus>{xqinside} {
addlit(yytext, yyleng);
}
<xe>{xeinside} {
*************** other .
*** 496,509 ****
if (IS_HIGHBIT_SET(c))
saw_high_bit = true;
}
! <xq,xe>{quotecontinue} {
/* ignore */
}
<xe>. {
/* This is only needed for \ just before EOF */
addlitchar(yytext[0]);
}
! <xq,xe><<EOF>> { yyerror("unterminated quoted string"); }
{dolqdelim} {
SET_YYLLOC();
--- 537,550 ----
if (IS_HIGHBIT_SET(c))
saw_high_bit = true;
}
! <xq,xe,xus>{quotecontinue} {
/* ignore */
}
<xe>. {
/* This is only needed for \ just before EOF */
addlitchar(yytext[0]);
}
! <xq,xe,xus><<EOF>> { yyerror("unterminated quoted string"); }
{dolqdelim} {
SET_YYLLOC();
*************** other .
*** 553,558 ****
--- 594,604 ----
BEGIN(xd);
startlit();
}
+ {xuistart} {
+ SET_YYLLOC();
+ BEGIN(xui);
+ startlit();
+ }
<xd>{xdstop} {
char *ident;
*************** other .
*** 565,577 ****
yylval.str = ident;
return IDENT;
}
! <xd>{xddouble} {
addlitchar('"');
}
! <xd>{xdinside} {
addlit(yytext, yyleng);
}
! <xd><<EOF>> { yyerror("unterminated quoted identifier"); }
{typecast} {
SET_YYLLOC();
--- 611,656 ----
yylval.str = ident;
return IDENT;
}
! <xui>{xuistop1} {
! char *ident;
!
! BEGIN(INITIAL);
! if (literallen == 0)
! yyerror("zero-length delimited identifier");
! ident = litbuf_udeescape('\\');
! if (literallen >= NAMEDATALEN)
! truncate_identifier(ident, literallen, true);
! yylval.str = ident;
! /* throw back all but the quote */
! yyless(1);
! return IDENT;
! }
! <xui>{xuistop2} {
! char *ident;
!
! BEGIN(INITIAL);
! if (literallen == 0)
! yyerror("zero-length delimited identifier");
! ident = litbuf_udeescape(yytext[yyleng - 2]);
! if (literallen >= NAMEDATALEN)
! truncate_identifier(ident, literallen, true);
! yylval.str = ident;
! return IDENT;
! }
! <xd,xui>{xddouble} {
addlitchar('"');
}
! <xd,xui>{xdinside} {
addlit(yytext, yyleng);
}
! <xd,xui><<EOF>> { yyerror("unterminated quoted identifier"); }
!
! {xufailed} {
! /* throw back all but the initial u/U */
! yyless(1);
! /* and treat it as {other} */
! return yytext[0];
! }
{typecast} {
SET_YYLLOC();
*************** litbufdup(void)
*** 908,913 ****
--- 987,1082 ----
return new;
}
+ static int
+ hexval(unsigned char c)
+ {
+ if (c >= '0' && c <= '9')
+ return c - '0';
+ if (c >= 'a' && c <= 'f')
+ return c - 'a' + 0xA;
+ if (c >= 'A' && c <= 'F')
+ return c - 'A' + 0xA;
+ elog(ERROR, "invalid hexadecimal digit");
+ return 0; /* not reached */
+ }
+
+ static void
+ check_unicode_value(pg_wchar c, char * loc)
+ {
+ if (GetDatabaseEncoding() == PG_UTF8)
+ return;
+
+ if (c > 0x7F)
+ {
+ yylloc += (char *) loc - literalbuf + 3; /* 3 for U&" */
+ yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+ }
+ }
+
+ static char *
+ litbuf_udeescape(unsigned char escape)
+ {
+ char *new;
+ char *in, *out;
+
+ if (isxdigit(escape)
+ || escape == '+'
+ || escape == '\''
+ || escape == '"'
+ || scanner_isspace(escape))
+ yyerror("invalid Unicode escape character");
+
+ /*
+ * This relies on the subtle assumption that a UTF-8 expansion
+ * cannot be longer than its escaped representation.
+ */
+ new = palloc(literallen + 1);
+
+ in = literalbuf;
+ out = new;
+ while (*in)
+ {
+ if (in[0] == escape)
+ {
+ if (in[1] == escape)
+ {
+ *out++ = escape;
+ in += 2;
+ }
+ else if (isxdigit(in[1]) && isxdigit(in[2]) && isxdigit(in[3]) && isxdigit(in[4]))
+ {
+ pg_wchar unicode = hexval(in[1]) * 16*16*16 + hexval(in[2]) * 16*16 + hexval(in[3]) * 16 + hexval(in[4]);
+ check_unicode_value(unicode, in);
+ unicode_to_utf8(unicode, (unsigned char *) out);
+ in += 5;
+ out += pg_mblen(out);
+ }
+ else if (in[1] == '+'
+ && isxdigit(in[2]) && isxdigit(in[3])
+ && isxdigit(in[4]) && isxdigit(in[5])
+ && isxdigit(in[6]) && isxdigit(in[7]))
+ {
+ pg_wchar unicode = hexval(in[2]) * 16*16*16*16*16 + hexval(in[3]) * 16*16*16*16 + hexval(in[4]) * 16*16*16
+ + hexval(in[5]) * 16*16 + hexval(in[6]) * 16 + hexval(in[7]);
+ check_unicode_value(unicode, in);
+ unicode_to_utf8(unicode, (unsigned char *) out);
+ in += 8;
+ out += pg_mblen(out);
+ }
+ else
+ {
+ yylloc += in - literalbuf + 3; /* 3 for U&" */
+ yyerror("invalid Unicode escape value");
+ }
+ }
+ else
+ *out++ = *in++;
+ }
+
+ *out = '\0';
+ pg_verifymbstr(new, out - new, false);
+ return new;
+ }
static unsigned char
unescape_single_char(unsigned char c)
Index: src/backend/utils/adt/xml.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/xml.c,v
retrieving revision 1.79
diff -u -3 -p -c -r1.79 xml.c
*** src/backend/utils/adt/xml.c 14 Oct 2008 17:12:33 -0000 1.79
--- src/backend/utils/adt/xml.c 27 Oct 2008 16:54:27 -0000
*************** unicode_to_sqlchar(pg_wchar c)
*** 1497,1524 ****
{
static unsigned char utf8string[5]; /* need trailing zero */
! if (c <= 0x7F)
! {
! utf8string[0] = c;
! }
! else if (c <= 0x7FF)
! {
! utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
! utf8string[1] = 0x80 | (c & 0x3F);
! }
! else if (c <= 0xFFFF)
! {
! utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
! utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
! utf8string[2] = 0x80 | (c & 0x3F);
! }
! else
! {
! utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
! utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
! utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
! utf8string[3] = 0x80 | (c & 0x3F);
! }
return (char *) pg_do_encoding_conversion(utf8string,
pg_mblen((char *) utf8string),
--- 1497,1503 ----
{
static unsigned char utf8string[5]; /* need trailing zero */
! unicode_to_utf8(c, utf8string);
return (char *) pg_do_encoding_conversion(utf8string,
pg_mblen((char *) utf8string),
Index: src/backend/utils/mb/wchar.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.66
diff -u -3 -p -c -r1.66 wchar.c
*** src/backend/utils/mb/wchar.c 15 Nov 2007 21:14:40 -0000 1.66
--- src/backend/utils/mb/wchar.c 27 Oct 2008 16:54:27 -0000
*************** pg_utf2wchar_with_len(const unsigned cha
*** 419,424 ****
--- 419,459 ----
return cnt;
}
+
+ /*
+ * Map a Unicode codepoint to UTF-8. utf8string must have 4 bytes of
+ * space allocated.
+ */
+ unsigned char *
+ unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
+ {
+ if (c <= 0x7F)
+ {
+ utf8string[0] = c;
+ }
+ else if (c <= 0x7FF)
+ {
+ utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
+ utf8string[1] = 0x80 | (c & 0x3F);
+ }
+ else if (c <= 0xFFFF)
+ {
+ utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
+ utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
+ utf8string[2] = 0x80 | (c & 0x3F);
+ }
+ else
+ {
+ utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
+ utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
+ utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
+ utf8string[3] = 0x80 | (c & 0x3F);
+ }
+
+ return utf8string;
+ }
+
+
/*
* Return the byte length of a UTF8 character pointed to by s
*
Index: src/include/mb/pg_wchar.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/mb/pg_wchar.h,v
retrieving revision 1.79
diff -u -3 -p -c -r1.79 pg_wchar.h
*** src/include/mb/pg_wchar.h 18 Jun 2008 18:42:54 -0000 1.79
--- src/include/mb/pg_wchar.h 27 Oct 2008 16:54:27 -0000
*************** extern const char *GetDatabaseEncodingNa
*** 380,385 ****
--- 380,386 ----
extern int pg_valid_client_encoding(const char *name);
extern int pg_valid_server_encoding(const char *name);
+ extern unsigned char *unicode_to_utf8(pg_wchar c, unsigned char *utf8string);
extern int pg_utf_mblen(const unsigned char *);
extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len,
int src_encoding,