XML with invalid chars

Started by Andrew Dunstanover 14 years ago13 messages
#1Andrew Dunstan
andrew@dunslane.net

I came across this today, while helping a customer. The following will
happily create a piece of XML with an embedded ^A:

select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
as "" or equivalent in XML version 1.1, and not as a 0x01 byte
(see <http://en.wikipedia.org/wiki/XML#Valid_characters&gt;)

ISTM this is something we should definitely try to fix ASAP, even if we
probably can't backpatch the fix.

(Interestingly, the software than runs my PostgreSQL blog, Serendipity,
appears to have a similar bug, at least in the version Devrim is using,
having cheerfully embedded a ^L in its RSS feed a few days ago, thus
causing planet.postgresql.org to blow up.)

cheers

andrew

#2Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#1)
Re: XML with invalid chars

On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:

I came across this today, while helping a customer. The following will
happily create a piece of XML with an embedded ^A:

select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
as "&#x01;" or equivalent in XML version 1.1, and not as a 0x01 byte
(see <http://en.wikipedia.org/wiki/XML#Valid_characters&gt;)

ISTM this is something we should definitely try to fix ASAP, even if we
probably can't backpatch the fix.

+1. Given that such a datum breaks dump+reload, it seems risky to do nothing at
all in the back branches.

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#1)
Re: XML with invalid chars

On mån, 2011-04-25 at 19:25 -0400, Andrew Dunstan wrote:

I came across this today, while helping a customer. The following
will
happily create a piece of XML with an embedded ^A:

select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but
only
as "&#x01;" or equivalent in XML version 1.1, and not as a 0x01 byte
(see <http://en.wikipedia.org/wiki/XML#Valid_characters&gt;)

The best place to fix this might be in escape_xml() in xml.c. Since we
don't support XML 1.1 yet, just reject all invalid characters there
according to XML 1.0.

Relevant bits from the SQL standard:

i)
Let CS be the character set of SQLT. Let XMLVRAW be the result of mapping SQLV to Unicode
using the implementation-defined mapping of character strings of CS to Unicode. If any Unicode
code point in XMLVRAW does not represent a valid XML character, then an exception condition
is raised: SQL/XML mapping error — invalid XML character.

[This is what you'd add.]

ii)
Let XMLV be XMLVRAW, with each instance of “&” (U+0026) replaced by “&amp;”, each
instance of “<” (U+003C) replaced by “&lt;”, each instance of “>” (U+003E) replaced by “&gt;”,
and each instance of Carriage Return (U+000D) replaced by “&#x0d;”.

[This is what it already does.]

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#2)
1 attachment(s)
Re: XML with invalid chars

On 04/26/2011 05:11 PM, Noah Misch wrote:

On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:

I came across this today, while helping a customer. The following will
happily create a piece of XML with an embedded ^A:

select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
as "&#x01;" or equivalent in XML version 1.1, and not as a 0x01 byte
(see<http://en.wikipedia.org/wiki/XML#Valid_characters&gt;)

ISTM this is something we should definitely try to fix ASAP, even if we
probably can't backpatch the fix.

+1. Given that such a datum breaks dump+reload, it seems risky to do nothing at
all in the back branches.

Here's a patch along the lines suggested by Peter.

I'm not sure what to do about the back branches and cases where data is
already in databases. This is fairly ugly. Suggestions welcome.

cheers

andrew

Attachments:

xmlchars.patchtext/x-patch; name=xmlchars.patchDownload
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 1844,1850 **** escape_xml(const char *str)
--- 1844,1865 ----
  			case '\r':
  				appendStringInfoString(&buf, "&#x0d;");
  				break;
+ 			case '\n':
+ 			case '\t':
+ 				appendStringInfoCharMacro(&buf, *p);
+ 				break;
  			default:
+ 				/* 
+ 				 * Any control char we haven't already explicitly handled
+ 				 * (i.e. TAB, NL and CR)is an error. 
+ 				 * If we ever support XML 1.1 they will be allowed,
+ 				 * but will have to be escaped.
+ 				 */
+ 				if (*p < '\x20')
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ 							 errmsg("character out of range"),
+ 							 errdetail("XML does not support control characters.")));
  				appendStringInfoCharMacro(&buf, *p);
  				break;
  		}
#5Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#4)
Re: XML with invalid chars

On Wed, Apr 27, 2011 at 03:05:30PM -0400, Andrew Dunstan wrote:

On 04/26/2011 05:11 PM, Noah Misch wrote:

On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:

I came across this today, while helping a customer. The following will
happily create a piece of XML with an embedded ^A:

select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
as "&#x01;" or equivalent in XML version 1.1, and not as a 0x01 byte
(see<http://en.wikipedia.org/wiki/XML#Valid_characters&gt;)

ISTM this is something we should definitely try to fix ASAP, even if we
probably can't backpatch the fix.

+1. Given that such a datum breaks dump+reload, it seems risky to do nothing at
all in the back branches.

Here's a patch along the lines suggested by Peter.

I'm not sure what to do about the back branches and cases where data is
already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 1844,1850 **** escape_xml(const char *str)
--- 1844,1865 ----
case '\r':
appendStringInfoString(&buf, "&#x0d;");
break;
+ 			case '\n':
+ 			case '\t':
+ 				appendStringInfoCharMacro(&buf, *p);
+ 				break;
default:
+ 				/* 
+ 				 * Any control char we haven't already explicitly handled
+ 				 * (i.e. TAB, NL and CR)is an error. 
+ 				 * If we ever support XML 1.1 they will be allowed,
+ 				 * but will have to be escaped.
+ 				 */
+ 				if (*p < '\x20')

This needs to be an unsigned comparison. On my system, "char" is signed, so
"SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

The XML character set forbids more than just control characters; see
http://www.w3.org/TR/xml/#charsets. We also ought to reject, for example,
"SELECT xmlelement(name foo, null, E'\ufffe')".

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

Show quoted text
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ 							 errmsg("character out of range"),
+ 							 errdetail("XML does not support control characters.")));
appendStringInfoCharMacro(&buf, *p);
break;
}
#6Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#5)
Re: XML with invalid chars

On 04/27/2011 05:30 PM, Noah Misch wrote:

I'm not sure what to do about the back branches and cases where data is
already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

Yeah, we'll have to do something like that. What a blasted mess,

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

No, I think we need to be strict ourselves.

+ if (*p< '\x20')

This needs to be an unsigned comparison. On my system, "char" is signed, so
"SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

Good point. Perhaps we'd be better off using iscntrl(*p).

The XML character set forbids more than just control characters; see
http://www.w3.org/TR/xml/#charsets. We also ought to reject, for example,
"SELECT xmlelement(name foo, null, E'\ufffe')".

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

Maybe there are, but I'd want lots of convincing that we should do that
at this stage. Maybe for 9.2. I think we can plug the holes fairly
simply for xmlpi and xmlcomment, and catch the attributes by moving this
check up into map_sql_value_to_xml_value().

This is a significant data integrity bug, much along the same lines as
the invalidly encoded data holes we plugged a release or two back. I'm
amazed we haven't hit it till now, but we're sure to see more of it -
XML use with Postgres is growing substantially, I believe.

cheers

andrew

#7Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#6)
Re: XML with invalid chars

On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:

On 04/27/2011 05:30 PM, Noah Misch wrote:

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

No, I think we need to be strict ourselves.

Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at
least, do so when linked to a libxml2 that neglects to do so itself?

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

Maybe there are, but I'd want lots of convincing that we should do that
at this stage. Maybe for 9.2. I think we can plug the holes fairly
simply for xmlpi and xmlcomment, and catch the attributes by moving this
check up into map_sql_value_to_xml_value().

I don't have much convincing to offer -- hunting down the holes seem fine, too.

Thanks,
nm

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#5)
Re: XML with invalid chars

On 04/27/2011 05:30 PM, Noah Misch wrote:

I'm not sure what to do about the back branches and cases where data is
already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

Here's a draft. We'd need to come up with slightly modified versions for
older versions of Postgres that don't sport array_agg() and unnest()

cheers

andrew

create function cleanup_xml_table
(schema_name text,table_name text, columns text[])
returns void
language plpgsql as
$func$

declare
cmd text;
cond text;
sep text := '';
alt text := '';
col text;
forbidden text := $$[\x1-\x8\xB\xC\xE-\x1F]$$;
begin
cmd := 'update ' || quote_ident(schema_name) || '.' ||
quote_ident(table_name) || ' set ';
for col in select unnest(columns)
loop
cmd := cmd || sep;
cond := cond || alt;
sep := ', ';
alt := ' or ';
cmd := cmd || quote_ident(col) || '=' ||
'regexp_replace(' || quote_ident(col) , || '::text, ' ||
quote_literal(forbiden) || ', $$$$, $$g$$)::xml';
cond := cond || quote_ident(col) || '::text ~ ' ||
quote_literal(forbidden);
end loop;
cmd := cmd || ' where ' || cond;
execute cmd;
return;
end;

$func$;

select cleanup_xml_table(table_schema,table_name, cols)
from
(select table_schema::text,
table_name::text,
array_agg(column_name::text) as cols
from information_schema.columns
where data_type = 'xml'
and is_updatable = 'yes'
group by table_schema, table_name) xmltabs;

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#7)
1 attachment(s)
Re: XML with invalid chars

On 04/27/2011 11:41 PM, Noah Misch wrote:

On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:

On 04/27/2011 05:30 PM, Noah Misch wrote:

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

No, I think we need to be strict ourselves.

Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at
least, do so when linked to a libxml2 that neglects to do so itself?

Yep.

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

Maybe there are, but I'd want lots of convincing that we should do that
at this stage. Maybe for 9.2. I think we can plug the holes fairly
simply for xmlpi and xmlcomment, and catch the attributes by moving this
check up into map_sql_value_to_xml_value().

I don't have much convincing to offer -- hunting down the holes seem fine, too.

I think I've done that. Here's the patch I have now. It looks like we
can catch pretty much everything by putting checks in four places, which
isn't too bad.

Please review and try to break.

cheers

andrew

Attachments:

xmlchars2.patchtext/x-patch; name=xmlchars2.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index ee82d46..12cfd56 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, StringInfo result,
 #define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance"
 #define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml"
 
+/* forbidden C0 control chars */
+#define FORBIDDEN_C0  \
+	"\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \
+	"\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F"
+
+static inline void
+check_forbidden_c0(char * str)
+{
+	if (strpbrk(str,FORBIDDEN_C0) != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("character out of range"),
+				 errdetail("XML does not support control characters.")));
+}
 
 #ifdef USE_LIBXML
 
@@ -411,6 +425,8 @@ xmlcomment(PG_FUNCTION_ARGS)
 	appendStringInfoText(&buf, arg);
 	appendStringInfo(&buf, "-->");
 
+	check_forbidden_c0(buf.data);
+	
 	PG_RETURN_XML_P(stringinfo_to_xmltype(&buf));
 #else
 	NO_XML_SUPPORT();
@@ -718,6 +734,8 @@ xmlpi(char *target, text *arg, bool arg_is_null, bool *result_is_null)
 	}
 	appendStringInfoString(&buf, "?>");
 
+	check_forbidden_c0(buf.data);
+	
 	result = stringinfo_to_xmltype(&buf);
 	pfree(buf.data);
 	return result;
@@ -1184,6 +1202,9 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
 										   encoding,
 										   PG_UTF8);
 
+	/* check for illegal XML chars */
+	check_forbidden_c0((char *) utf8string);
+
 	/* Start up libxml and its parser (no-ops if already done) */
 	pg_xml_init();
 	xmlInitParser();
@@ -1804,6 +1825,9 @@ map_sql_value_to_xml_value(Datum value, Oid type, bool xml_escape_strings)
 		getTypeOutputInfo(type, &typeOut, &isvarlena);
 		str = OidOutputFunctionCall(typeOut, value);
 
+		/* check for illegal XML chars */
+		check_forbidden_c0(str);
+
 		/* ... exactly as-is for XML, and when escaping is not wanted */
 		if (type == XMLOID || !xml_escape_strings)
 			return str;
#10Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#9)
Re: XML with invalid chars

On Sun, May 08, 2011 at 06:25:27PM -0400, Andrew Dunstan wrote:

On 04/27/2011 11:41 PM, Noah Misch wrote:

On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:

On 04/27/2011 05:30 PM, Noah Misch wrote:

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

No, I think we need to be strict ourselves.

Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at
least, do so when linked to a libxml2 that neglects to do so itself?

Yep.

I see you've gone with doing it unconditionally. I'd lean toward testing the
library in pg_xml_init and setting a flag indicating whether we need the extra
pass. However, a later patch can always optimize that.

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

Maybe there are, but I'd want lots of convincing that we should do that
at this stage. Maybe for 9.2. I think we can plug the holes fairly
simply for xmlpi and xmlcomment, and catch the attributes by moving this
check up into map_sql_value_to_xml_value().

I don't have much convincing to offer -- hunting down the holes seem fine, too.

I think I've done that. Here's the patch I have now. It looks like we
can catch pretty much everything by putting checks in four places, which
isn't too bad.

Please review and try to break.

Here are the test cases I tried:

-- caught successfully
SELECT E'\x01'::xml;
SELECT xmlcomment(E'\x01');
SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
SELECT xmlelement(name foo, NULL, E'\x01');
SELECT xmlforest(E'\x01' AS foo);
SELECT xmlpi(name foo, E'\x01');
SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');

-- not caught
SELECT xmlroot('<root/>', version E'\x01');
SELECT xmlcomment(E'\ufffe');

-- not directly related, but also wrongly accepted
SELECT xmlroot('<root/>', version ' ');
SELECT xmlroot('<root/>', version 'foo');

Offhand, I don't find libxml2's handling of XML declarations particularly
consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
document) accepts '<?xml version="foo"?>' but rejects '<?xml version=" "?>'.
Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
accepts anything, even control characters. The XML 1.0 standard is stricter:
the version must match ^1\.[0-9]+$. We might want to tighten this at the same
time.

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index ee82d46..12cfd56 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, StringInfo result,
#define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance"
#define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml"
+/* forbidden C0 control chars */
+#define FORBIDDEN_C0  \
+	"\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \
+	"\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F"
+
+static inline void
+check_forbidden_c0(char * str)
+{
+	if (strpbrk(str,FORBIDDEN_C0) != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("character out of range"),
+				 errdetail("XML does not support control characters.")));

This would be an errhint, I think. However, the message seems to emphasize
the wrong thing. XML 1.0 defines a lexical production called Char that
includes various Unicode character ranges. Control characters as we know them
happen to not fall in any of those ranges. The characters aren't unsupported
in the sense of being missing features; the language simply forbids them.

libxml2's error message for this case is "PCDATA invalid Char value 1"
(assuming \x01). Mentioning PCDATA seems redundant, since no other context
offers greater freedom. How about:

ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid XML 1.0 Char \\U%08x", char_val)));

nm

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#10)
Re: XML with invalid chars

On 05/09/2011 11:25 PM, Noah Misch wrote:

I see you've gone with doing it unconditionally. I'd lean toward testing the
library in pg_xml_init and setting a flag indicating whether we need the extra
pass. However, a later patch can always optimize that.

I wasn't terribly keen on the idea, but we can look at it again later.

Please review and try to break.

Here are the test cases I tried:

-- caught successfully
SELECT E'\x01'::xml;
SELECT xmlcomment(E'\x01');
SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
SELECT xmlelement(name foo, NULL, E'\x01');
SELECT xmlforest(E'\x01' AS foo);
SELECT xmlpi(name foo, E'\x01');
SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');

-- not caught
SELECT xmlroot('<root/>', version E'\x01');

That's an easy fix.

SELECT xmlcomment(E'\ufffe');

That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?

-- not directly related, but also wrongly accepted
SELECT xmlroot('<root/>', version ' ');
SELECT xmlroot('<root/>', version 'foo');

Offhand, I don't find libxml2's handling of XML declarations particularly
consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
accepts anything, even control characters. The XML 1.0 standard is stricter:
the version must match ^1\.[0-9]+$. We might want to tighten this at the same
time.

We can add some stuff to check the version strings. Doesn't seem
terribly difficult.

libxml2's error message for this case is "PCDATA invalid Char value 1"
(assuming \x01). Mentioning PCDATA seems redundant, since no other context
offers greater freedom. How about:

ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid XML 1.0 Char \\U%08x", char_val)));

That would also mean processing the string a unicode char at a time. So
maybe that's what we need to do.

Thanks for the input.

cheers

andrew

#12Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#11)
Re: XML with invalid chars

On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote:

On 05/09/2011 11:25 PM, Noah Misch wrote:

SELECT xmlcomment(E'\ufffe');

That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?

My thinking was that we should only make this flag day for xml-type users if
we're going to fix it all the way.

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#12)
Re: XML with invalid chars

On 05/11/2011 07:00 PM, Noah Misch wrote:

On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote:

On 05/09/2011 11:25 PM, Noah Misch wrote:

SELECT xmlcomment(E'\ufffe');

That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?

My thinking was that we should only make this flag day for xml-type users if
we're going to fix it all the way.

Fair enough.

cheers

andrew