review: xml_is_well_formed
Hello
I looked on patch
https://commitfest.postgresql.org/action/patch_view?id=334 .This patch
moves function xml_is_well_formed from contrib xm2 to core.
* Is the patch in context diff format?
yes
* Does it apply cleanly to the current CVS HEAD?
yes
* Does it include reasonable tests, necessary doc patches, etc?
yes
* Does the patch actually implement that?
yes
* Do we want that?
yes
* Do we already have it?
yes - simplified version in core
* Does it follow SQL spec, or the community-agreed behavior?
no - I didn't find any resources about conformance with SQL spec, but
it has same behave like original contrib function
* Does it include pg_dump support (if applicable)?
not related
* Are there dangers?
no
*Are there any assertion failures or crashes?
not found
I have a few issues:
* broken regress test (fedora 13 - xmllint: using libxml version 20707)
postgres=# SELECT xml_is_well_formed('<pg:foo
xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>');
xml_is_well_formed
--------------------
f
(1 row)
this xml is broken - but in regress tests is ok
[pavel@pavel-stehule ~]$ xmllint xxx
xxx:1: parser error : error parsing attribute name
<pg:foo xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)
it is probably wrong result - is it ok??
* I don't understand to this fragment
PG_TRY();
+ {
+ size_t count;
+ xmlChar *version = NULL;
+ int standalone = -1;
+.
+ res_code = parse_xml_decl(string, &count, &version,
NULL, &standalone);
+ if (res_code != 0)
+ xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+ "invalid XML
content: invalid XML declaration",
+ res_code);
+.
+ doc = xmlNewDoc(version);
+ doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+ doc->standalone = 1;
why? This function can raise exception when declaration is wrong. It
is wrong - I think, this function should returns false instead
exception.
Regards
Pavel Stehule
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
--------
PostgreSQL 9.1devel on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.4.4 20100630 (Red Hat 4.4.4-10),
64-bit
(1 row)
Hi Pavel,
Thanks for taking the time to review my patch. Attached is a new version
addressing your concerns.
On 29/07/10 14:21, Pavel Stehule wrote:
I have a few issues:
* broken regress test (fedora 13 - xmllint: using libxml version 20707)postgres=# SELECT xml_is_well_formed('<pg:foo
xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>');
xml_is_well_formed
--------------------
f
(1 row)this xml is broken - but in regress tests is ok
[pavel@pavel-stehule ~]$ xmllint xxx
xxx:1: parser error : error parsing attribute name
<pg:foo xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>
This is a problem that was observered recently by Thom Brown - the
commit fest webapp adds the semicolon after the quote. If you look at
the attachment I sent in a email client you'll find there is no
semicolon. The patch attached here will also not have the semicolon.
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
* I don't understand to this fragment
PG_TRY(); + { + size_t count; + xmlChar *version = NULL; + int standalone = -1; +. + res_code = parse_xml_decl(string,&count,&version, NULL,&standalone); + if (res_code != 0) + xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT, + "invalid XML content: invalid XML declaration", + res_code); +. + doc = xmlNewDoc(version); + doc->encoding = xmlStrdup((const xmlChar *) "UTF-8"); + doc->standalone = 1;why? This function can raise exception when declaration is wrong. It
is wrong - I think, this function should returns false instead
exception.
You're quite right, I should be returning false here and not causing an
exception. I have corrected this in the attached patch.
Regards,
--
Mike Fowler
Registered Linux user: 379787
Attachments:
xml_is_well_formed-4.patchtext/x-diff; name=xml_is_well_formed-4.patchDownload+258-25
Hello
2010/7/30 Mike Fowler <mike@mlfowler.com>:
Hi Pavel,
Thanks for taking the time to review my patch. Attached is a new version
addressing your concerns.On 29/07/10 14:21, Pavel Stehule wrote:
I have a few issues:
* broken regress test (fedora 13 - xmllint: using libxml version 20707)
ok - main regress test is ok now, next I checked a contrib test for xml2
inside contrib/xml2 make installcheck, and there is a problem
SET client_min_messages = warning;
\set ECHO none
+ psql:pgxml.sql:10: ERROR: could not find function
"xml_is_well_formed" in file "/usr/local/pgsql.xwf/lib/pgxml.so"
+ psql:pgxml.sql:15: ERROR: could not find function
"xml_is_well_formed" in file "/usr/local/pgsql.xwf/lib/pgxml.so"
RESET client_min_messages;
select query_to_xml('select 1 as x',true,false,'');
you have to remove declaration from pgxml.sql.in and
uninstall_pgxml.sql, and other related files in contrib/xml2/ regress
test
postgres=# SELECT xml_is_well_formed('<pg:foo
xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>');
xml_is_well_formed
--------------------
f
(1 row)this xml is broken - but in regress tests is ok
[pavel@pavel-stehule ~]$ xmllint xxx
xxx:1: parser error : error parsing attribute name
<pg:foo xmlns:pg="http://postgresql.org/stuff";>bar</pg:foo>This is a problem that was observered recently by Thom Brown - the commit
fest webapp adds the semicolon after the quote. If you look at the
attachment I sent in a email client you'll find there is no semicolon. The
patch attached here will also not have the semicolon.
ok - attached patch is correct, ... please, can you remove a broken patch?
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
It interesting for me - is somewhere some documentation about it?
My colleagues speak some else :)
http://zvon.org/comp/r/tut-XML.html#Pages~MinimalQ20XnumberQ20XofQ20Xelements
http://www.w3.org/TR/REC-xml/#sec-prolog-dtd
I am not a specialist on XML - so just don't know
* I don't understand to this fragment
PG_TRY(); + { + size_t count; + xmlChar *version = NULL; + int standalone = -1; +. + res_code = parse_xml_decl(string,&count,&version, NULL,&standalone); + if (res_code != 0) + xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT, + "invalid XML content: invalid XML declaration", + res_code); +. + doc = xmlNewDoc(version); + doc->encoding = xmlStrdup((const xmlChar *) "UTF-8"); + doc->standalone = 1;why? This function can raise exception when declaration is wrong. It
is wrong - I think, this function should returns false instead
exception.You're quite right, I should be returning false here and not causing an
exception. I have corrected this in the attached patch.
ok
Show quoted text
Regards,
--
Mike Fowler
Registered Linux user: 379787
On fre, 2010-07-30 at 12:50 +0100, Mike Fowler wrote:
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
Are you speaking of XML content fragments that SQL/XML defines?
Well-formedness should probably only allow XML documents.
On Sat, Jul 31, 2010 at 8:10 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
On fre, 2010-07-30 at 12:50 +0100, Mike Fowler wrote:
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
Are you speaking of XML content fragments that SQL/XML defines?
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
2010/7/31 Robert Haas <robertmhaas@gmail.com>:
On Sat, Jul 31, 2010 at 8:10 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
On fre, 2010-07-30 at 12:50 +0100, Mike Fowler wrote:
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
Are you speaking of XML content fragments that SQL/XML defines?
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.
I agree with this idea - so I am able to do:
postgres=# select 'xxx'::xml;
xml
-----
xxx
(1 row)
I have not any suggestions now - so I'll change flag to "ready to commit"
Regards
Pavel Stehule
Show quoted text
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
2010/8/2 Pavel Stehule <pavel.stehule@gmail.com>:
2010/7/31 Robert Haas <robertmhaas@gmail.com>:
On Sat, Jul 31, 2010 at 8:10 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
On fre, 2010-07-30 at 12:50 +0100, Mike Fowler wrote:
* xml_is_well_formed returns true for simple text
postgres=# SELECT xml_is_well_formed('ssss');
xml_is_well_formed
--------------------
t
(1 row)it is probably wrong result - is it ok??
Yes this is OK, pure text is valid XML content.
Are you speaking of XML content fragments that SQL/XML defines?
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.I agree with this idea - so I am able to do:
postgres=# select 'xxx'::xml;
xml
-----
xxx
(1 row)I have not any suggestions now - so I'll change flag to "ready to commit"
sorry - contrib module should be a fixed
patch attached
Show quoted text
Regards
Pavel Stehule
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Attachments:
xml2contrib.patchtext/x-patch; charset=US-ASCII; name=xml2contrib.patchDownload+4-10
On 02/08/10 07:46, Pavel Stehule wrote:
I have not any suggestions now - so I'll change flag to "ready to commit"
sorry - contrib module should be a fixed
patch attached
Thanks Pavel, you saved me some time!
Regards,
--
Mike Fowler
Registered Linux user: 379787
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.
Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()
I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.
Hello
2010/8/3 Peter Eisentraut <peter_e@gmx.net>:
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.
yes, it is little bit curious - but it can be just documented. Now, I
don't think, so we need more functions.
Regards
Pavel
On 03/08/10 16:15, Peter Eisentraut wrote:
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.
What about making the function sensitive to the XML OPTION, such that:
test=# SET xmloption TO DOCUMENT;
SET
text=# SELECT xml_is_well_formed('foo');
xml_is_well_formed
--------------------
f
(1 row)
test=# SET xmloption TO CONTENT;
SET
text=# SELECT xml_is_well_formed('foo');
xml_is_well_formed
--------------------
t
(1 row)
with the inverse for DOCUMENTS? To me this makes the most sense as it makes the function behave much more like the other xml functions.
--
Mike Fowler
Registered Linux user: 379787
On Fri, Aug 6, 2010 at 4:28 AM, Mike Fowler <mike@mlfowler.com> wrote:
On 03/08/10 16:15, Peter Eisentraut wrote:
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
Well-formedness should probably only allow XML documents.
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.What about making the function sensitive to the XML OPTION, such that:
test=# SET xmloption TO DOCUMENT;
SET
text=# SELECT xml_is_well_formed('foo');xml_is_well_formed
--------------------
f
(1 row)
That will make using this function a huge hassle, won't it? Functions
that do different things depending on GUC settings are usually
troublesome. Having three functions would be more sensible if we need
all three behaviors, but I don't see why we do.
Or perhaps it could return a string instead of a boolean: content,
document, or NULL if it's neither.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 06/08/10 12:31, Robert Haas wrote:
Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.What about making the function sensitive to the XML OPTION, such that:
test=# SET xmloption TO DOCUMENT;
SET
text=# SELECT xml_is_well_formed('foo');xml_is_well_formed
--------------------
f
(1 row)That will make using this function a huge hassle, won't it? Functions
that do different things depending on GUC settings are usually
troublesome. Having three functions would be more sensible if we need
all three behaviors, but I don't see why we do.Or perhaps it could return a string instead of a boolean: content,
document, or NULL if it's neither.
I like the sound of that. In fact this helps workaround the IS DOCUMENT
and IS CONTENT limitations such that you can you can select only
content, only documents or both is you use IS NOT NULL.
Unless anyone sees a reason that this function needs to remain a boolean
function, I'll rework the patch over the weekend.
Regards,
--
Mike Fowler
Registered Linux user: 379787
On fre, 2010-08-06 at 07:31 -0400, Robert Haas wrote:
What about making the function sensitive to the XML OPTION, such
that:
test=# SET xmloption TO DOCUMENT;
SET
text=# SELECT xml_is_well_formed('foo');xml_is_well_formed
--------------------
f
(1 row)That will make using this function a huge hassle, won't it? Functions
that do different things depending on GUC settings are usually
troublesome. Having three functions would be more sensible if we need
all three behaviors, but I don't see why we do.
Upthread you opined that this function should essentially indicate
whether a cast to type xml would succeed, and observing the xmloption is
an essential part of that. I had assumed the original patch actually
did that.
On fre, 2010-08-06 at 14:43 +0100, Mike Fowler wrote:
Or perhaps it could return a string instead of a boolean: content,
document, or NULL if it's neither.I like the sound of that. In fact this helps workaround the IS
DOCUMENT
and IS CONTENT limitations such that you can you can select only
content, only documents or both is you use IS NOT NULL.Unless anyone sees a reason that this function needs to remain a
boolean function, I'll rework the patch over the weekend.
What is the actual use case for this function? Is the above behavior
actually useful?
One reason to stick with boolean is backward compatibility.
On Fri, Aug 6, 2010 at 4:52 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On fre, 2010-08-06 at 07:31 -0400, Robert Haas wrote:
What about making the function sensitive to the XML OPTION, such
that:
test=# SET xmloption TO DOCUMENT;
SET
text=# SELECT xml_is_well_formed('foo');xml_is_well_formed
--------------------
f
(1 row)That will make using this function a huge hassle, won't it? Functions
that do different things depending on GUC settings are usually
troublesome. Having three functions would be more sensible if we need
all three behaviors, but I don't see why we do.Upthread you opined that this function should essentially indicate
whether a cast to type xml would succeed, and observing the xmloption is
an essential part of that. I had assumed the original patch actually
did that.
Oh.
I didn't realize the casting behavior was already dependent on the GUC. Yuck.
Maybe we should following the GUC after all, then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 06/08/10 21:55, Peter Eisentraut wrote:
On fre, 2010-08-06 at 14:43 +0100, Mike Fowler wrote:
Or perhaps it could return a string instead of a boolean: content,
document, or NULL if it's neither.I like the sound of that. In fact this helps workaround the IS
DOCUMENT
and IS CONTENT limitations such that you can you can select only
content, only documents or both is you use IS NOT NULL.Unless anyone sees a reason that this function needs to remain a
boolean function, I'll rework the patch over the weekend.What is the actual use case for this function? Is the above behavior
actually useful?
The idea is to be able to filter a table that contains XML in TEXT that
might not be well formed. Knowing that you're only dealing with well
formed XML prevents you blowing up when you attempt the cast.
One reason to stick with boolean is backward compatibility.
To be honest I'm happiest with returning a boolean, even if there is
some confusion over content only being valid. Though changing the return
value to DOCUMENT/CONTENT/NULL makes things a touch more explicit, the
same results can be achieved by simply running:
SELECT data::xml FROM mixed WHERE xml_is_well_formed(data) AND data::xml
IS DOCUMENT;
Regards,
--
Mike Fowler
Registered Linux user: 379787
Peter Eisentraut <peter_e@gmx.net> writes:
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.
Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()
I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.
I think I agree with the later discussion that xml_is_well_formed()
should tell you whether a cast to xml would succeed (and hence it has to
pay attention to XMLOPTION). However, it seems to also make sense to
provide the other two functions suggested here, both to satify people
who know XML and to offer tests that will tell you whether
XMLPARSE ( { DOCUMENT | CONTENT } value ) will succeed.
Merging the three cases into one function doesn't seem like a win
for either compatibility or usability.
regards, tom lane
On Sun, Aug 8, 2010 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On lör, 2010-07-31 at 13:40 -0400, Robert Haas wrote:
I think the point of this function is to determine whether a cast to
xml will throw an error. The behavior should probably match exactly
whatever test would be applied there.Maybe there should be
xml_is_well_formed()
xml_is_well_formed_document()
xml_is_well_formed_content()I agree that consistency with SQL/XML is desirable, but for someone
coming from the outside, the unqualified claim that 'foo' is well-formed
XML might sound suspicious.I think I agree with the later discussion that xml_is_well_formed()
should tell you whether a cast to xml would succeed (and hence it has to
pay attention to XMLOPTION). However, it seems to also make sense to
provide the other two functions suggested here, both to satify people
who know XML and to offer tests that will tell you whether
XMLPARSE ( { DOCUMENT | CONTENT } value ) will succeed.Merging the three cases into one function doesn't seem like a win
for either compatibility or usability.
+1. I didn't realize how funky the XMLOPTION stuff was at the start
of this discussion; I think your analysis here is spot-on.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On lör, 2010-08-07 at 16:47 +0100, Mike Fowler wrote:
To be honest I'm happiest with returning a boolean, even if there is
some confusion over content only being valid. Though changing the
return
value to DOCUMENT/CONTENT/NULL makes things a touch more explicit,
the
same results can be achieved by simply running:SELECT data::xml FROM mixed WHERE xml_is_well_formed(data) AND
data::xml IS DOCUMENT;
Note that this wouldn't necessarily work because it is not guaranteed
that the well-formedness test is executed before the cast to xml. SQL
doesn't short-circuit left to right. (A CASE expression could work.)