XML validation of whitespace values

Started by Tim Kaneabout 12 years ago3 messagesgeneral
Jump to latest
#1Tim Kane
tim.kane@gmail.com

HI all,

I’ve had an interesting problem trying to perform an UPDATE based on the
results of processing an xpath against a field of type xml.
Specifically, my query would fail with:

ERROR: could not parse XML document
DETAIL: line 1: Start tag expected, '<' not found

I thought this strange, since the xml type is (I believe) supposed to
guarantee that the value is well formed (wether it’s a document or content).

After some amount of digging to locate the rogue entry, it turns out the
field value is actually a single whitespace character – which apparently is
well formed..

Thus I’m left with the following situation…

clone=# select xml_is_well_formed(' ');
xml_is_well_formed
--------------------
t
(1 row)

clone=# select xpath_exists (‘//test', ' ');
ERROR: could not parse XML document
DETAIL: line 1: Start tag expected, '<' not found

clone=# create temp table xml_test (document xml);
CREATE TABLE
clone=# insert into xml_test values (' ');
INSERT 0 1
clone=# select xml_is_well_formed(document::text) from xml_test;
xml_is_well_formed
--------------------
t
(1 row)

clone=# select xpath_exists ('test', document) from xml_test;
ERROR: could not parse XML document
DETAIL: line 1: Start tag expected, '<' not found

This is confusing, since… if the xml is well formed – it probably shouldn’t
cause the parser to error. And if it isn’t well formed, then the
instantiation of the value as an xml type should fail.

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Tim Kane (#1)
Re: XML validation of whitespace values

On 3/14/14, 11:12 AM, Tim Kane wrote:

clone=# select xml_is_well_formed(' ');
xml_is_well_formed
--------------------
t
(1 row)

clone=# select xpath_exists (‘//test', ' ');
ERROR: could not parse XML document
DETAIL: line 1: Start tag expected, '<' not found

There are several issues at work here:

- contrib/xml2 has a slightly different notion of what is an OK xml
value than the built-in xml type.

- A string consisting of whitespace is well-formed XML content, but not
a well-formed XML document. Compare xmlparse(document ' ') vs
xmlparse(content ' '). contrib/xml2 (apparently) follows the latter
interpretation.

- xpath queries only work properly on XML documents.

It might be better if you wrote your code without using contrib/xml2,
and instead caught any parse exceptions in, say, plpgsql code.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Tim Kane (#1)
Re: XML validation of whitespace values

Tim Kane wrote

clone=# create temp table xml_test (document xml);
CREATE TABLE

If you know you need to use xpath on this content then you should do one of
the following:

SELECT CASE WHEN document IS DOCUMENT THEN xpath(...) ELSE
default_value_for_missing_data END;

CREATE TABLE xml_test (
document xml NOT NULL CHECK (document IS DOCUMENT)
);

For better and worse in the name of "simplicity" both xml and json(b?) types
are usable for both the embedded content and the entire document. Unless
you think you have a good reason for a single column to represent both
content and documents you should probably constrain stored data to be
whichever you are expecting. This is the root of your confusion since, as
Peter said, xpathing only operates on documents - not content fragments.

"The second argument must be a well formed XML document. In particular, it
must have a single root node element."

Unfortunately the decision to treat documents and content as the same types
means this cannot be discovered at parse-time but is data dependent and thus
a run-time error. By putting in the constraint you ensure that any use of
said column in an xpath query will either always succeed or always fail.

Furthermore:

"xml_is_well_formed does the former if the xmloption configuration parameter
is set to DOCUMENT, or the latter if it is set to CONTENT" - the default for
xmloption is CONTENT...

Both quotes above come from:

http://www.postgresql.org/docs/9.3/interactive/functions-xml.html

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/XML-validation-of-whitespace-values-tp5796092p5803594.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general