xpath processing brain dead

Started by Andrew Dunstanalmost 17 years ago48 messages
#1Andrew Dunstan
andrew@dunslane.net

Andrew Gierth was just pointing out to me how badly broken our XPath
processing is.

For fear of passing an ill formed fragment of xml to the processor, we
strip the xml declaration if any and surround what's left with '<x>" and
'</x>' and prepend '/x' to the supposed xpath. This is just horrible. It
will break for every xpath expression that doesn't begin with a '/' and
probably for many that do.

This whole thing is a mess, and I suspect the only fix for now is to
undo all the mangling of both the xml and the xpath expression. If the
programmer passes an ill formed piece of xml to the processor that is
their lookout, but I think we should ensure that we give back correct
results on well formed input.

The only good piece of news is that the xpath procedures in contrib/xml2
don't apparently suffer these faults.

cheers

andrew

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: xpath processing brain dead

Andrew Dunstan <andrew@dunslane.net> writes:

For fear of passing an ill formed fragment of xml to the processor, we
strip the xml declaration if any and surround what's left with '<x>" and
'</x>' and prepend '/x' to the supposed xpath. This is just horrible.

I seem to recall having complained about that at the time, but I didn't
(and don't) know enough about xpath to do any better.

This whole thing is a mess, and I suspect the only fix for now is to
undo all the mangling of both the xml and the xpath expression.

I don't think we should change the behavior if it's just to arrive at
another less-than-desirable behavior. Whacking semantics around afresh
with each release does not endear us to users. If we know how to fix it
right, great; but if we can't then we should keep compatibility with 8.3
until we can.

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: xpath processing brain dead

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

For fear of passing an ill formed fragment of xml to the processor, we
strip the xml declaration if any and surround what's left with '<x>" and
'</x>' and prepend '/x' to the supposed xpath. This is just horrible.

I seem to recall having complained about that at the time, but I didn't
(and don't) know enough about xpath to do any better.

Well, a few of us do. I guess I took my eye off the ball a bit back when
we were putting this into 8.3.

This whole thing is a mess, and I suspect the only fix for now is to
undo all the mangling of both the xml and the xpath expression.

I don't think we should change the behavior if it's just to arrive at
another less-than-desirable behavior. Whacking semantics around afresh
with each release does not endear us to users. If we know how to fix it
right, great; but if we can't then we should keep compatibility with 8.3
until we can.

Honestly, this is a bug, pure and simple. There really can't be an
argument about that. For the stable branch, we could make the following
changes that should result in a Pareto improvement (nothing gets worse
while some things get better):

* only do the xml transformation if the xml is known not to be be
well formed
* if we need to mangle the xpath expression due to doing the xml
transformation, then unless the xpath expression begins with a
'/', prepend it with '/x//'. (two slashes corresponds to the
descendent axis in xpath - in effect it stands for any number of
descendent elements).

But that's just a holding operation. For 8.4 we should stop this
nonsense and simply say that it is up to the programmer to ensure that
the xml passed to the processor is well formed.

The thing that is so very bad about this is that if the programmer *has*
made sure that the inputs are correct, s/he can still end up with broken
results. If we're going to try to fix bad inputs, we must make damn sure
that we don't break on correct inputs as a result. However, I can't off
hand think of a lightning way to fix bad inputs that doesn't carry some
danger to good inputs. Until someone comes up with something tolerably
bulletproof, I suggest that we simply say that it is the programmer's
responsibility.

cheers

andrew

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#1)
Re: xpath processing brain dead

Andrew Dunstan wrote:

For fear of passing an ill formed fragment of xml to the processor, we
strip the xml declaration if any and surround what's left with '<x>" and
'</x>' and prepend '/x' to the supposed xpath. This is just horrible. It
will break for every xpath expression that doesn't begin with a '/' and
probably for many that do.

This whole thing is a mess, and I suspect the only fix for now is to
undo all the mangling of both the xml and the xpath expression. If the
programmer passes an ill formed piece of xml to the processor that is
their lookout, but I think we should ensure that we give back correct
results on well formed input.

It's not about ill-formed pieces, it is about (well-formed) content
fragments that are not full documents (exactly one root element).
libxml2 doesn't support xpath on content fragments.

The tradeoff here is either supporting no xpath at all on content
fragments or supporting some xpath on both kinds of xml data. Whoever
made the initial implementation of this (Nikolay, Cc) decided for the
latter, but I can't say I'm happy with it either. I'd be OK with
changing it, but I haven't had time to get to it, unfortunately.

See also <http://wiki.postgresql.org/wiki/XPath_Todo#XPath&gt;.

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#4)
Re: xpath processing brain dead

Peter Eisentraut wrote:

Andrew Dunstan wrote:

For fear of passing an ill formed fragment of xml to the processor,
we strip the xml declaration if any and surround what's left with
'<x>" and '</x>' and prepend '/x' to the supposed xpath. This is just
horrible. It will break for every xpath expression that doesn't begin
with a '/' and probably for many that do.

This whole thing is a mess, and I suspect the only fix for now is to
undo all the mangling of both the xml and the xpath expression. If
the programmer passes an ill formed piece of xml to the processor
that is their lookout, but I think we should ensure that we give back
correct results on well formed input.

It's not about ill-formed pieces, it is about (well-formed) content
fragments that are not full documents (exactly one root element).
libxml2 doesn't support xpath on content fragments.

The tradeoff here is either supporting no xpath at all on content
fragments or supporting some xpath on both kinds of xml data. Whoever
made the initial implementation of this (Nikolay, Cc) decided for the
latter, but I can't say I'm happy with it either. I'd be OK with
changing it, but I haven't had time to get to it, unfortunately.

See also <http://wiki.postgresql.org/wiki/XPath_Todo#XPath&gt;.

I don't think it is our responsibility to remedy the deficiencies of
libxml2, especially if it involves breaking the processing of valid
xpath expressions on well formed XML.

If someone can come up with a correct scheme for handling fragments, I'm
all ears, but otherwise I think a) we should rip this out of 8.4 and b)
we should try to make 8.3 slightly less broken at least, along the lines
of my earlier suggestion.

cheers

andrew

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: xpath processing brain dead

Andrew Dunstan <andrew@dunslane.net> writes:

I don't think it is our responsibility to remedy the deficiencies of
libxml2, especially if it involves breaking the processing of valid
xpath expressions on well formed XML.

If someone can come up with a correct scheme for handling fragments, I'm
all ears, but otherwise I think a) we should rip this out of 8.4 and b)
we should try to make 8.3 slightly less broken at least, along the lines
of my earlier suggestion.

I'm okay with removing this for 8.4, but I think it's too late to
change the behavior of 8.3.

regards, tom lane

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: xpath processing brain dead

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I don't think it is our responsibility to remedy the deficiencies of
libxml2, especially if it involves breaking the processing of valid
xpath expressions on well formed XML.

If someone can come up with a correct scheme for handling fragments, I'm
all ears, but otherwise I think a) we should rip this out of 8.4 and b)
we should try to make 8.3 slightly less broken at least, along the lines
of my earlier suggestion.

I'm okay with removing this for 8.4, but I think it's too late to
change the behavior of 8.3.

It's never too late to fix a bug. The current behaviour is just plain
nonsense ... if your xpath expression is 'foo/bar' it ends up searching
for '/xfoo/bar' which can't possibly be right. Surely we can at least
fix that.

cheers

andrew

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: xpath processing brain dead

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I'm okay with removing this for 8.4, but I think it's too late to
change the behavior of 8.3.

It's never too late to fix a bug.

When the proposed fix involves breaking cases that used to behave
usefully, you need a much stronger argument than that.

regards, tom lane

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: xpath processing brain dead

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I'm okay with removing this for 8.4, but I think it's too late to
change the behavior of 8.3.

It's never too late to fix a bug.

When the proposed fix involves breaking cases that used to behave
usefully, you need a much stronger argument than that.

What I have proposed for 8.3 should not break a single case that
currently behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful"
behaviour (FSVO "useful"), but should be done anyway on correctness grounds.

cheers

andrew

#10Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#9)
Re: xpath processing brain dead

What I have proposed for 8.3 should not break a single case that currently
behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful" behaviour
(FSVO "useful"), but should be done anyway on correctness grounds.

I dunno, aren't XML document fragments sort of a pretty common case?

...Robert

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#10)
Re: xpath processing brain dead

Robert Haas wrote:

What I have proposed for 8.3 should not break a single case that currently
behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful" behaviour
(FSVO "useful"), but should be done anyway on correctness grounds.

I dunno, aren't XML document fragments sort of a pretty common case?

They are. So what? How does that contradict either of the statements
made above?

Programmers using libxml2 are used to handling this problem. Why must
postgres alone use a totally brain dead and utterly incorrect wrapping
to solve a problem that everyone else leaves up to the programmer to handle?

People in this thread are not concentrating on the fact that what we do
now can break correct input. That makes it an unquestionable bug,
IMNSHO. There seems to be agreement about what to do for 8.4, so we seem
to be arguing about what to do for 8.3. Are you *really* arguing that
prepending the xpath expression with '/x' in all cases should be allowed
to continue? If you are I can only assume your use of xml/xpath is
probably fairly light.

cheers

andrew

#12Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#11)
Re: xpath processing brain dead

On Thu, Feb 26, 2009 at 3:18 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

I dunno, aren't XML document fragments sort of a pretty common case?

They are. So what? How does that contradict either of the statements made
above?

Programmers using libxml2 are used to handling this problem. Why must
postgres alone use a totally brain dead and utterly incorrect wrapping to
solve a problem that everyone else leaves up to the programmer to handle?

I can't think of any reason, especially not when you put it that way. :-)

People in this thread are not concentrating on the fact that what we do now
can break correct input. That makes it an unquestionable bug, IMNSHO. There
seems to be agreement about what to do for 8.4, so we seem to be arguing
about what to do for 8.3. Are you *really* arguing that prepending the xpath
expression with '/x' in all cases should be allowed to continue? If you are
I can only assume your use of xml/xpath is probably fairly light.

Mine is very light indeed. But the change you're proposing seems like
it could CONCEIVABLY break a working application that counts on
PostgreSQL's particular flavor of misbehavior, and therefore it seems
questionable to put that into a stable branch. The fact that the
application perhaps should not have been written that way to begin
with is neither here nor there. We don't want people to be afraid of
upgrading to the latest point release when we fix, say, a backend
crash or a data corruption bug.

...Robert

#13Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#12)
Re: xpath processing brain dead

On Thu, 2009-02-26 at 15:37 -0500, Robert Haas wrote:

Mine is very light indeed. But the change you're proposing seems like
it could CONCEIVABLY break a working application that counts on
PostgreSQL's particular flavor of misbehavior,

That is what release notes are for.

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#12)
Re: xpath processing brain dead

Robert Haas wrote:

But the change you're proposing seems like
it could CONCEIVABLY break a working application that counts on
PostgreSQL's particular flavor of misbehavior, and therefore it seems
questionable to put that into a stable branch. The fact that the
application perhaps should not have been written that way to begin
with is neither here nor there. We don't want people to be afraid of
upgrading to the latest point release when we fix, say, a backend
crash or a data corruption bug.

Let me reiterate the changes that I propose, and again challenge you to
provide a working counter-example if you believe it will break anything
not currently broken, even cases involving fragments.

First, I propose that we abandon this mangling, if, and only if, the xml
is in fact a well formed XML document. Since the whole point of the
mangling is to handle situations where the XML is not a well formed
document, that seems fairly straight-forward. If this change were to
upset any user, it must be because they are relying on undisputably
incorrect results.

Second, I propose that, in the remaining cases, where we do mangle the
XML, if the xpath expression does not begin with a '/', instead of
prepending it with '/x/, which can not possibly be correct under any
circumstance, we prepend it with '/x//' which has some possibility of
giving correct results.

IOW, these proposals will expand the number of correct results from the
code, without contributing any new incorrect cases. These changes are
*very* conservative, as is usual when we fix things on stable branches.

cheers

andrew

#15Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#14)
Re: xpath processing brain dead

I wrote:

Second, I propose that, in the remaining cases, where we do mangle the
XML, if the xpath expression does not begin with a '/', instead of
prepending it with '/x/, which can not possibly be correct under any
circumstance, we prepend it with '/x//' which has some possibility of
giving correct results.

^^^
'/x'

cheers

andrew

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#14)
Re: xpath processing brain dead

Andrew Dunstan <andrew@dunslane.net> writes:

First, I propose that we abandon this mangling, if, and only if, the xml
is in fact a well formed XML document. Since the whole point of the
mangling is to handle situations where the XML is not a well formed
document, that seems fairly straight-forward. If this change were to
upset any user, it must be because they are relying on undisputably
incorrect results.

Second, I propose that, in the remaining cases, where we do mangle the
XML, if the xpath expression does not begin with a '/', instead of
prepending it with '/x/, which can not possibly be correct under any
circumstance, we prepend it with '/x//' which has some possibility of
giving correct results.

Hmm, does this proposal require adding a test of well-formed-ness to
a code path that doesn't currently have one? If so, is that likely
to contribute any noticeable slowdown?

I can't offhand see an objection to this other than possible performance
impact.

regards, tom lane

#17Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#14)
Re: xpath processing brain dead

On Thu, Feb 26, 2009 at 3:54 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

Robert Haas wrote:

But the change you're proposing seems like
it could CONCEIVABLY break a working application that counts on
PostgreSQL's particular flavor of misbehavior, and therefore it seems
questionable to put that into a stable branch.  The fact that the
application perhaps should not have been written that way to begin
with is neither here nor there.  We don't want people to be afraid of
upgrading to the latest point release when we fix, say, a backend
crash or a data corruption bug.

Let me reiterate the changes that I propose, and again challenge you to
provide a working counter-example if you believe it will break anything not
currently broken, even cases involving fragments.

First, I propose that we abandon this mangling, if, and only if, the xml is
in fact a well formed XML document. Since the whole point of the mangling is
to handle situations where the XML is not a well formed document, that seems
fairly straight-forward. If this change were to upset any user, it must be
because they are relying on undisputably incorrect results.

Second, I propose that, in the remaining cases, where we do mangle the XML,
if the xpath expression does not begin with a '/', instead of prepending it
with '/x/, which can not possibly be correct under any circumstance, we
prepend it with '/x//' which has some possibility of giving correct results.

IOW, these proposals will expand the number of correct results from the
code, without contributing any new incorrect cases. These changes are *very*
conservative, as is usual when we fix things on stable branches.

cheers

andrew

You are right.

Nuff said,

...Robert

#18James Pye
lists@jwp.name
In reply to: Peter Eisentraut (#4)
Re: xpath processing brain dead

On Feb 26, 2009, at 9:37 AM, Peter Eisentraut wrote:

It's not about ill-formed pieces, it is about (well-formed) content
fragments that are not full documents (exactly one root element).
libxml2 doesn't support xpath on content fragments.

exslt:node-set() to the rescue? Or is that/equivalent functionality
not easily accessed at the C-level with libxml2?

http://www.exslt.org/exsl/functions/node-set/index.html

#19Andrew Dunstan
andrew@dunslane.net
In reply to: James Pye (#18)
Re: xpath processing brain dead

James Pye wrote:

On Feb 26, 2009, at 9:37 AM, Peter Eisentraut wrote:

It's not about ill-formed pieces, it is about (well-formed) content
fragments that are not full documents (exactly one root element).
libxml2 doesn't support xpath on content fragments.

exslt:node-set() to the rescue? Or is that/equivalent functionality
not easily accessed at the C-level with libxml2?

http://www.exslt.org/exsl/functions/node-set/index.html

A node-set isn't a document. In any case, this functionality doesn't
appear to be in libxml2, it's in libxslt according to the reference you
provided.

cheers

andrew

#20James Pye
lists@jwp.name
In reply to: Andrew Dunstan (#19)
Re: xpath processing brain dead

On Feb 26, 2009, at 5:41 PM, Andrew Dunstan wrote:

http://www.exslt.org/exsl/functions/node-set/index.html

A node-set isn't a document.

yes.. :)

I guess what I'm saying is that if:

tinman=# SELECT XML'<foo/><bar/>';
xml
--------------
<foo/><bar/>
(1 row)

is considered to be valid per *SQL-XML*, then it should probably be
treated as a node-set in the context of xpath, not mangled with
<x>...</x>..

Certainly, I would expect an implicit "node-set() call" long before
wrapping the fragment in <x>...</x> and prefixing my xpath query.

However, I find the validity of the above, XML'<foo/><bar/>', a bit
disturbing to begin with. :P

In any case, this functionality doesn't appear to be in libxml2,
it's in libxslt according to the reference you provided.

I think that's *just* referencing the list of xslt implementations
that the extension function is known to be available in.. I doubt that
means to imply that the function or equivalent functionality is not
available in libxml2 itself. I'd wager that equivalent functionality
could be implemented if it weren't directly/already available.. =)

#21Andrew Dunstan
andrew@dunslane.net
In reply to: James Pye (#20)
Re: xpath processing brain dead

James Pye wrote:

On Feb 26, 2009, at 5:41 PM, Andrew Dunstan wrote:

http://www.exslt.org/exsl/functions/node-set/index.html

A node-set isn't a document.

yes.. :)

I guess what I'm saying is that if:

tinman=# SELECT XML'<foo/><bar/>';
xml
--------------
<foo/><bar/>
(1 row)

is considered to be valid per *SQL-XML*, then it should probably be
treated as a node-set in the context of xpath, not mangled with
<x>...</x>..

Certainly, I would expect an implicit "node-set() call" long before
wrapping the fragment in <x>...</x> and prefixing my xpath query.

However, I find the validity of the above, XML'<foo/><bar/>', a bit
disturbing to begin with. :P

In any case, this functionality doesn't appear to be in libxml2,
it's in libxslt according to the reference you provided.

I think that's *just* referencing the list of xslt implementations
that the extension function is known to be available in.. I doubt that
means to imply that the function or equivalent functionality is not
available in libxml2 itself. I'd wager that equivalent functionality
could be implemented if it weren't directly/already available.. =)

James,

can you point me at any call in libxml2 which will evaluate an xpath
expression in the context of a nodeset instead of a document? Quite
apart from anything else, xpath requires there to be a (single) context
node (see http://www.w3.org/TR/xpath20/#context ). For a doc, we set
that node to the document node, but what would it be for a node-set or a
fragment? If we can't get over that hurdle we're screwed in pursuing
your line of thought.

cheers

andrew

#22Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#16)
Re: xpath processing brain dead

Tom Lane wrote:

Hmm, does this proposal require adding a test of well-formed-ness to
a code path that doesn't currently have one? If so, is that likely
to contribute any noticeable slowdown?

I can't offhand see an objection to this other than possible performance
impact.

Yeah, testing the well-formedness might cost a bit. We could
short-circuit the test by applying some comparatively fast heuristic tests.

Or we could decide that we'll just fix the xpath prefix part for 8.3 and
keep the wrapping. I don't want to spend a huge effort on fixing
something I regard as fundamentally broken.

I'll do some tests to see what the cost of extra xml parsing might be.

cheers

andrew

We

#23Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Robert Haas (#10)
Re: xpath processing brain dead

On Thu, 2009-02-26 at 13:51 -0500, Robert Haas wrote:

What I have proposed for 8.3 should not break a single case that currently
behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful" behaviour
(FSVO "useful"), but should be done anyway on correctness grounds.

I dunno, aren't XML document fragments sort of a pretty common case?

I'd rather argue that xml datatype should not even accept anything but
complete xml documents. Same as int field does not accept int[].

Or maybe we rather need separate xmldocument and xmlforest/xmlfragments
types in next releases and leave the "base" xml as it is but deprecated
due to inability to fix it without breaking backwards compatibility.

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#24Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#23)
Re: xpath processing brain dead

Hannu Krosing wrote:

On Thu, 2009-02-26 at 13:51 -0500, Robert Haas wrote:

What I have proposed for 8.3 should not break a single case that currently
behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful" behaviour
(FSVO "useful"), but should be done anyway on correctness grounds.

I dunno, aren't XML document fragments sort of a pretty common case?

I'd rather argue that xml datatype should not even accept anything but
complete xml documents. Same as int field does not accept int[].

Or maybe we rather need separate xmldocument and xmlforest/xmlfragments
types in next releases and leave the "base" xml as it is but deprecated
due to inability to fix it without breaking backwards compatibility.

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

You can also force validation as a document by saying SET XML OPTION
DOCUMENT;

cheers

andrew

#25Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#22)
1 attachment(s)
Re: xpath processing brain dead

Andrew Dunstan wrote:

Tom Lane wrote:

Hmm, does this proposal require adding a test of well-formed-ness to
a code path that doesn't currently have one? If so, is that likely
to contribute any noticeable slowdown?

I can't offhand see an objection to this other than possible performance
impact.

Yeah, testing the well-formedness might cost a bit. We could
short-circuit the test by applying some comparatively fast heuristic
tests.

Or we could decide that we'll just fix the xpath prefix part for 8.3
and keep the wrapping. I don't want to spend a huge effort on fixing
something I regard as fundamentally broken.

I'll do some tests to see what the cost of extra xml parsing might be.

The extra cost appears to be fairly negligible.

regression=# create table xpathtest3 as select xmlconcat(xmlelement(name
unique1, unique1), '\n\t',xmlelement(name unique2, unique2),
'\n\t',xmlelement(name two, two), '\n\t',xmlelement(name four,
four),'\n\t',xmlelement(name ten,ten),'\n\t',xmlelement(name
twenty,twenty),'\n\t',xmlelement(name
hundred,hundred),'\n\t',xmlelement(name
thousand,thousand),'\n\t',xmlelement(name
twothusand,twothousand),'\n\t',xmlelement(name
fivethous,fivethous),'\n\t',xmlelement(name
tenthous,tenthous),'\n\t',xmlelement(name
odd,odd),'\n\t',xmlelement(name even,even),'\n\t',xmlelement(name
stringu1,stringu1),'\n\t',xmlelement(name
stringu2,stringu2),'\n\t',xmlelement(name string4,string4),'\n') from tenk1;

regression=# select count(*) from (select
xpath('//two[text()="0"]/text()',xmlconcat) as elems from xpathtest3,
generate_series(1,10) ) x ;
count
--------
100000
(1 row)

Time: 27.722 ms

Proposed patch for 8.3 attached. (Note: it only reparses in the
non-document case)

cheers

andrew

Attachments:

xpath.patchtext/x-patch; charset=iso-8859-1; name=xpath.patchDownload
Index: src/backend/utils/adt/xml.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/xml.c,v
retrieving revision 1.68.2.6
diff -c -r1.68.2.6 xml.c
*** src/backend/utils/adt/xml.c	10 Nov 2008 18:02:27 -0000	1.68.2.6
--- src/backend/utils/adt/xml.c	27 Feb 2009 20:59:28 -0000
***************
*** 3320,3360 ****
  
  	xml_init();
  
! 	/*
! 	 * To handle both documents and fragments, regardless of the fact whether
! 	 * the XML datum has a single root (XML well-formedness), we wrap the XML
! 	 * datum in a dummy element (<x>...</x>) and extend the XPath expression
! 	 * accordingly.  To do it, throw away the XML prolog, if any.
! 	 */
! 	if (len >= 5 &&
! 		xmlStrncmp((xmlChar *) datastr, (xmlChar *) "<?xml", 5) == 0)
! 	{
! 		i = 5;
! 		while (i < len &&
! 			   !(datastr[i - 1] == '?' && datastr[i] == '>'))
! 			i++;
! 
! 		if (i == len)
! 			xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
! 						"could not parse XML data");
  
! 		++i;
  
! 		datastr += i;
! 		len -= i;
! 	}
! 
! 	string = (xmlChar *) palloc((len + 8) * sizeof(xmlChar));
! 	memcpy(string, "<x>", 3);
! 	memcpy(string + 3, datastr, len);
! 	memcpy(string + 3 + len, "</x>", 5);
! 	len += 7;
! 
! 	xpath_expr = (xmlChar *) palloc((xpath_len + 3) * sizeof(xmlChar));
! 	memcpy(xpath_expr, "/x", 2);
! 	memcpy(xpath_expr + 2, VARDATA(xpath_expr_text), xpath_len);
! 	xpath_expr[xpath_len + 2] = '\0';
! 	xpath_len += 2;
  
  	xmlInitParser();
  
--- 3320,3332 ----
  
  	xml_init();
  
! 	string = (xmlChar *) palloc((len + 8) * sizeof(xmlChar));
  
! 	xpath_expr = (xmlChar *) palloc((xpath_len + 5) * sizeof(xmlChar));
  
! 	memcpy (string, datastr, len);
! 	string[len] = '\0';
! 	
  
  	xmlInitParser();
  
***************
*** 3367,3375 ****
  		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
  					"could not allocate parser context");
  	doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
! 	if (doc == NULL)
! 		xml_ereport(ERROR, ERRCODE_INVALID_XML_DOCUMENT,
! 					"could not parse XML data");
  	xpathctx = xmlXPathNewContext(doc);
  	if (xpathctx == NULL)
  		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
--- 3339,3410 ----
  		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
  					"could not allocate parser context");
  	doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
! 
! 	if (doc == NULL || xmlDocGetRootElement(doc) == NULL)
! 	{
! 
! 		/*
! 		 * In case we have a fragment rather than a well-formed XML document,
! 		 * which has a single root (XML well-formedness), we try again after
! 		 * transforming the xml by stripping away the XML prolog, if any, and
! 		 * wrapping the remainder in a dummy element (<x>...</x>),
! 		 * and later extending the XPath expression accordingly. 
! 		 */
! 		if (len >= 5 &&
! 			xmlStrncmp((xmlChar *) datastr, (xmlChar *) "<?xml", 5) == 0)
! 		{
! 			i = 5;
! 			while (i < len &&
! 				   !(datastr[i - 1] == '?' && datastr[i] == '>'))
! 				i++;
! 			
! 			if (i == len)
! 				xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
! 							"could not parse XML data");
! 			
! 			++i;
! 			
! 			datastr += i;
! 			len -= i;
! 		}
! 
! 		memcpy(string, "<x>", 3);
! 		memcpy(string + 3, datastr, len);
! 		memcpy(string + 3 + len, "</x>", 5);
! 		len += 7;
! 
! 		doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
! 
!  		if (doc == NULL)
! 			xml_ereport(ERROR, ERRCODE_INVALID_XML_DOCUMENT,
! 						"could not parse XML data");
! 
! 		if (*VARDATA(xpath_expr_text) == '/')
! 		{
! 			memcpy(xpath_expr, "/x", 2);
! 			memcpy(xpath_expr + 2, VARDATA(xpath_expr_text), xpath_len);
! 			xpath_expr[xpath_len + 2] = '\0';
! 			xpath_len += 2;
! 		}
! 		else
! 		{
! 			memcpy(xpath_expr, "/x//", 4);
! 			memcpy(xpath_expr + 4, VARDATA(xpath_expr_text), xpath_len);
! 			xpath_expr[xpath_len + 4] = '\0';
! 			xpath_len += 4;
! 		}
! 
! 	}
! 	else
! 	{
! 		/* 
! 		 * if we didn't need to mangle the XML, we don't need to mangle the
! 		 * xpath either.
! 		 */
! 		memcpy(xpath_expr, VARDATA(xpath_expr_text), xpath_len);
! 		xpath_expr[xpath_len] = '\0';
! 	}
! 
  	xpathctx = xmlXPathNewContext(doc);
  	if (xpathctx == NULL)
  		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#25)
Re: xpath processing brain dead

Andrew Dunstan <andrew@dunslane.net> writes:

I'll do some tests to see what the cost of extra xml parsing might be.

The extra cost appears to be fairly negligible.

Uh, you didn't actually show a comparison of before and after?
What it looks like to me is that this approach is free or nearly so for
well-formed documents, but doubles the parsing cost for forests.
Which is likely to annoy anyone who's really depending on the
capability.

Also,

! if (*VARDATA(xpath_expr_text) == '/')

This is risking a core dump if the xpath expr is of zero length. You
need something like

if (xpath_len > 0 && *VARDATA(xpath_expr_text) == '/')

It would also be a good idea if the allocation of string and xpath_expr
had a comment about why it's allocating extra space (something like "see
hacks below for use of this extra space" would be sufficient).

regards, tom lane

#27Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#26)
Re: xpath processing brain dead

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I'll do some tests to see what the cost of extra xml parsing might be.

The extra cost appears to be fairly negligible.

Uh, you didn't actually show a comparison of before and after?
What it looks like to me is that this approach is free or nearly so for
well-formed documents, but doubles the parsing cost for forests.
Which is likely to annoy anyone who's really depending on the
capability.

The difference is lost in the noise.

Without fix:

Time: 24.619 ms
Time: 24.245 ms
Time: 25.179 ms

With fix:

Time: 24.084 ms
Time: 21.980 ms
Time: 23.765 ms

The test is done on 10,000 short fragments each parsed 10 times (or 20
times with the fix).

I'll test again on some longer fragments since you don't seem convinced.

Also,

! if (*VARDATA(xpath_expr_text) == '/')

This is risking a core dump if the xpath expr is of zero length. You
need something like

if (xpath_len > 0 && *VARDATA(xpath_expr_text) == '/')

OK.

It would also be a good idea if the allocation of string and xpath_expr
had a comment about why it's allocating extra space (something like "see
hacks below for use of this extra space" would be sufficient).

OK.

cheers

andrew

#28Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andrew Dunstan (#24)
Re: xpath processing brain dead

On Fri, 2009-02-27 at 16:37 -0500, Andrew Dunstan wrote:

Hannu Krosing wrote:

On Thu, 2009-02-26 at 13:51 -0500, Robert Haas wrote:

What I have proposed for 8.3 should not break a single case that currently
behaves usefully. If anyone has a counter-example please show it.

What I have proposed for 8.4 possibly would break current "useful" behaviour
(FSVO "useful"), but should be done anyway on correctness grounds.

I dunno, aren't XML document fragments sort of a pretty common case?

I'd rather argue that xml datatype should not even accept anything but
complete xml documents. Same as int field does not accept int[].

Or maybe we rather need separate xmldocument and xmlforest/xmlfragments
types in next releases and leave the "base" xml as it is but deprecated
due to inability to fix it without breaking backwards compatibility.

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Does tha standard require that the same field type must store both
documents and fragments ?

You can also force validation as a document by saying SET XML OPTION
DOCUMENT;

cheers

andrew

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#29James Pye
lists@jwp.name
In reply to: Andrew Dunstan (#21)
Re: xpath processing brain dead

On Feb 26, 2009, at 7:03 PM, Andrew Dunstan wrote:

can you point me at any call in libxml2 which will evaluate an xpath
expression in the context of a nodeset instead of a document?

No, I can't. node-sets are XPath objects not xmlNode objects, so I
don't think it would be as simple as modifying:

xml.c:xpath() {
...
xpathctx->node = xmlDocGetRootElement(doc);

with the result of xmlXPathNewNodeSet..

[snip other questions]

My *guess* would be that if we were to use a node-set instead, we'd
still have to prefix the XPath query. In this case, with a function
call to an xpath extension function that creates the NodeSet from the
content fragment(s?) of the document created by xml_parse(ie, more or
less, a re-implementation of exsl:node-set() tailored for our use-
case). Well, that or force the user to call it explicitly. Possible or
not--wrt using a content fragment/document as the context node, I find
this less desirable than the current mangling, so I'm becoming quite
indifferent. :)

#30Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#28)
Re: xpath processing brain dead

Hannu Krosing wrote:

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Not in the case of xpath, no.

There is a very complete standard for the Xpath data model, and we need
to adhere to it.

cheers

andrew

#31James Pye
lists@jwp.name
In reply to: James Pye (#29)
2 attachment(s)
Re: xpath processing brain dead

sigh.. I got curious. :P

On Feb 27, 2009, at 7:19 PM, James Pye wrote:

Well, that or force the user to call it explicitly.

Attached is the patch that I used to get the results below..
This is just a proof of concept, so it's quite lacking. Notably, it
doesn't even try to identify well-formed documents.

Purpose/idea being, give the user access to the poorly-formed document
as a node-set via the "fragment" function instead of mangling the
xpath and xml:

postgres=# SELECT xpath('fragment()//*', 'bleh<foo/><bar/>'::xml);
xpath
-------
{}
(1 row)

postgres=# SELECT xpath('fragment()//*', 'bleh<meh><sub/></meh><foo/

<bar/>'::xml);

xpath
----------
{<sub/>}
(1 row)

postgres=# SELECT xpath('fragment()/*', 'bleh<meh><sub/></meh><foo/

<bar/>'::xml);

xpath
----------
{<sub/>}
(1 row)

postgres=# SELECT xpath('fragment()', 'bleh<meh><sub/></meh><foo/><bar/

'::xml);

xpath
------------------------
{bleh,"<meh>
<sub/>
</meh>",<foo/>,<bar/>}
(1 row)

postgres=# SELECT xpath('/*', 'bleh<meh><sub/></meh><foo/><bar/>'::xml);
xpath
-------
{}
(1 row)

postgres=# SELECT xpath('fragment()[local-name()="foo"]/@att',
'bleh<meh><sub/></meh><foo att="sometin"/><bar/>'::xml);
xpath
-----------
{sometin}
(1 row)

postgres=# SELECT xpath('fragment()[local-name()="meh"]/*',
'bleh<meh><sub/></meh><foo att="sometin"/><bar/>'::xml);
xpath
----------
{<sub/>}
(1 row)

postgres=# SELECT xpath('fragment()[local-name()="meh" or local-
name()="bar"]', 'bleh<meh><sub/></meh><foo att="sometin"/><bar/>'::xml);
xpath
-----------------
{"<meh>
<sub/>
</meh>",<bar/>}
(1 row)

postgres=# SELECT xpath('fragment()[local-name()="bar"]',
'bleh<meh><sub/></meh><foo att="sometin"/><bar/>'::xml);
xpath
----------
{<bar/>}
(1 row)

postgres=# SELECT xpath('fragment()[@*]',
'bleh<meh><sub/></meh>othertext<foo att="sometin"/><bar/>'::xml);
xpath
----------------------------
{"<foo att=\"sometin\"/>"}
(1 row)

Can't say that I've ever been thrilled with using node-sets, but
*shrug*.

I'm sleepy now..

Attachments:

git.diffapplication/octet-stream; name=git.diff; x-unix-mode=0644Download
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 5fa5f0b..2dbab27 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -110,6 +110,8 @@ static int parse_xml_decl(const xmlChar * str, size_t *lenp,
 			   xmlChar ** version, xmlChar ** encoding, int *standalone);
 static bool print_xml_decl(StringInfo buf, const xmlChar * version,
 			   pg_enc encoding, int standalone);
+static xmlDocPtr xml_parse_with_nodes(text *data, XmlOptionType xmloption_arg,
+		  bool preserve_whitespace, xmlChar * encoding, xmlNodePtr *chunks);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
 		  bool preserve_whitespace, xmlChar * encoding);
 static text *xml_xmlnodetoxmltype(xmlNodePtr cur);
@@ -1125,8 +1127,8 @@ print_xml_decl(StringInfo buf, const xmlChar * version,
  * yet do not use SAX - see xmlreader.c)
  */
 static xmlDocPtr
-xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
-		  xmlChar * encoding)
+xml_parse_with_nodes(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
+		  xmlChar * encoding, xmlNodePtr *chunks)
 {
 	int32		len;
 	xmlChar    *string;
@@ -1185,7 +1187,7 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
 								res_code);
 
 		res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
-											   utf8string + count, NULL);
+											   utf8string + count, chunks);
 		if (res_code != 0)
 			xml_ereport(ERROR, ERRCODE_INVALID_XML_CONTENT,
 						"invalid XML content");
@@ -1200,6 +1202,15 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
 	return doc;
 }
 
+static xmlDocPtr
+xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
+		  xmlChar * encoding)
+{
+	return xml_parse_with_nodes(
+		data, xmloption_arg, preserve_whitespace, encoding, NULL
+	);
+}
+
 
 /*
  * xmlChar<->text conversions
@@ -3174,6 +3185,30 @@ xml_xmlnodetoxmltype(xmlNodePtr cur)
 }
 #endif
 
+/*
+ * Provide access to the document's content fragment.
+ */
+static void
+xpathfun_fragment(xmlXPathParserContextPtr ctxt, int nargs)
+{
+	xmlNodeSetPtr nset = NULL;
+	xmlNodePtr cur;
+
+	if (nargs != 0)
+	{
+	   xmlXPathSetArityError(ctxt);
+		return;
+	}
+
+	cur = ctxt->context->node;
+	while (cur != NULL)
+	{
+		nset = xmlXPathNodeSetMerge(nset, xmlXPathNodeSetCreate(cur));
+		cur = cur->next;
+	}
+
+	valuePush(ctxt, xmlXPathWrapNodeSet(nset));
+}
 
 /*
  * Evaluate XPath expression and return array of XML values.
@@ -3193,15 +3228,12 @@ xpath(PG_FUNCTION_ARGS)
 	xmltype    *data = PG_GETARG_XML_P(1);
 	ArrayType  *namespaces = PG_GETARG_ARRAYTYPE_P(2);
 	ArrayBuildState *astate = NULL;
-	xmlParserCtxtPtr ctxt;
 	xmlDocPtr	doc;
+	xmlNodePtr	chunks;
 	xmlXPathContextPtr xpathctx;
 	xmlXPathCompExprPtr xpathcomp;
 	xmlXPathObjectPtr xpathobj;
-	char	   *datastr;
-	int32		len;
 	int32		xpath_len;
-	xmlChar    *string;
 	xmlChar    *xpath_expr;
 	int			i;
 	int			res_nitems;
@@ -3248,51 +3280,12 @@ xpath(PG_FUNCTION_ARGS)
 		ns_count = 0;
 	}
 
-	datastr = VARDATA(data);
-	len = VARSIZE(data) - VARHDRSZ;
-	xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
-	if (xpath_len == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_EXCEPTION),
-				 errmsg("empty XPath expression")));
-
 	xml_init();
 
-	/*
-	 * To handle both documents and fragments, regardless of the fact whether
-	 * the XML datum has a single root (XML well-formedness), we wrap the XML
-	 * datum in a dummy element (<x>...</x>) and extend the XPath expression
-	 * accordingly.  To do it, throw away the XML prolog, if any.
-	 */
-	if (len >= 5 &&
-		xmlStrncmp((xmlChar *) datastr, (xmlChar *) "<?xml", 5) == 0)
-	{
-		i = 5;
-		while (i < len &&
-			   !(datastr[i - 1] == '?' && datastr[i] == '>'))
-			i++;
-
-		if (i == len)
-			xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
-						"could not parse XML data");
-
-		++i;
-
-		datastr += i;
-		len -= i;
-	}
-
-	string = (xmlChar *) palloc((len + 8) * sizeof(xmlChar));
-	memcpy(string, "<x>", 3);
-	memcpy(string + 3, datastr, len);
-	memcpy(string + 3 + len, "</x>", 5);
-	len += 7;
-
-	xpath_expr = (xmlChar *) palloc((xpath_len + 3) * sizeof(xmlChar));
-	memcpy(xpath_expr, "/x", 2);
-	memcpy(xpath_expr + 2, VARDATA(xpath_expr_text), xpath_len);
-	xpath_expr[xpath_len + 2] = '\0';
-	xpath_len += 2;
+	xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
+	xpath_expr = (xmlChar *) palloc((xpath_len + 1) * sizeof(xmlChar));
+	memcpy(xpath_expr, VARDATA(xpath_expr_text), xpath_len);
+	xpath_expr[xpath_len] = '\0';
 
 	xmlInitParser();
 
@@ -3300,19 +3293,19 @@ xpath(PG_FUNCTION_ARGS)
 	 * redundant XML parsing (two parsings for the same value during one
 	 * command execution are possible)
 	 */
-	ctxt = xmlNewParserCtxt();
-	if (ctxt == NULL)
-		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
-					"could not allocate parser context");
-	doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+	doc = xml_parse_with_nodes(data, xmloption, true, NULL, &chunks);
 	if (doc == NULL)
 		xml_ereport(ERROR, ERRCODE_INVALID_XML_DOCUMENT,
 					"could not parse XML data");
+
 	xpathctx = xmlXPathNewContext(doc);
 	if (xpathctx == NULL)
 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
 					"could not allocate XPath context");
-	xpathctx->node = xmlDocGetRootElement(doc);
+
+	xmlXPathRegisterFunc(xpathctx, (xmlChar *) "fragment", (xmlXPathFunction) xpathfun_fragment);
+
+	xpathctx->node = chunks;
 	if (xpathctx->node == NULL)
 		xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
 					"could not find root XML element");
@@ -3376,7 +3369,6 @@ xpath(PG_FUNCTION_ARGS)
 	xmlXPathFreeObject(xpathobj);
 	xmlXPathFreeContext(xpathctx);
 	xmlFreeDoc(doc);
-	xmlFreeParserCtxt(ctxt);
 
 	if (res_nitems == 0)
 		PG_RETURN_ARRAYTYPE_P(construct_empty_array(XMLOID));
xmlc_file.diffapplication/octet-stream; name=xmlc_file.diff; x-unix-mode=0644Download
--- src/backend/utils/adt/xml.c	2009-02-28 02:12:16.000000000 -0700
+++ ../git/src/backend/utils/adt/xml.c	2009-02-28 01:35:14.000000000 -0700
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/utils/adt/xml.c,v 1.83 2009/01/07 13:44:37 tgl Exp $
+ * $PostgreSQL$
  *
  *-------------------------------------------------------------------------
  */
@@ -110,6 +110,8 @@
 			   xmlChar ** version, xmlChar ** encoding, int *standalone);
 static bool print_xml_decl(StringInfo buf, const xmlChar * version,
 			   pg_enc encoding, int standalone);
+static xmlDocPtr xml_parse_with_nodes(text *data, XmlOptionType xmloption_arg,
+		  bool preserve_whitespace, xmlChar * encoding, xmlNodePtr *chunks);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
 		  bool preserve_whitespace, xmlChar * encoding);
 static text *xml_xmlnodetoxmltype(xmlNodePtr cur);
@@ -1125,8 +1127,8 @@
  * yet do not use SAX - see xmlreader.c)
  */
 static xmlDocPtr
-xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
-		  xmlChar * encoding)
+xml_parse_with_nodes(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
+		  xmlChar * encoding, xmlNodePtr *chunks)
 {
 	int32		len;
 	xmlChar    *string;
@@ -1185,7 +1187,7 @@
 								res_code);
 
 		res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
-											   utf8string + count, NULL);
+											   utf8string + count, chunks);
 		if (res_code != 0)
 			xml_ereport(ERROR, ERRCODE_INVALID_XML_CONTENT,
 						"invalid XML content");
@@ -1200,6 +1202,15 @@
 	return doc;
 }
 
+static xmlDocPtr
+xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
+		  xmlChar * encoding)
+{
+	return xml_parse_with_nodes(
+		data, xmloption_arg, preserve_whitespace, encoding, NULL
+	);
+}
+
 
 /*
  * xmlChar<->text conversions
@@ -3174,6 +3185,30 @@
 }
 #endif
 
+/*
+ * Provide access to the document's content fragment.
+ */
+static void
+xpathfun_fragment(xmlXPathParserContextPtr ctxt, int nargs)
+{
+	xmlNodeSetPtr nset = NULL;
+	xmlNodePtr cur;
+
+	if (nargs != 0)
+	{
+	   xmlXPathSetArityError(ctxt);
+		return;
+	}
+
+	cur = ctxt->context->node;
+	while (cur != NULL)
+	{
+		nset = xmlXPathNodeSetMerge(nset, xmlXPathNodeSetCreate(cur));
+		cur = cur->next;
+	}
+
+	valuePush(ctxt, xmlXPathWrapNodeSet(nset));
+}
 
 /*
  * Evaluate XPath expression and return array of XML values.
@@ -3193,15 +3228,12 @@
 	xmltype    *data = PG_GETARG_XML_P(1);
 	ArrayType  *namespaces = PG_GETARG_ARRAYTYPE_P(2);
 	ArrayBuildState *astate = NULL;
-	xmlParserCtxtPtr ctxt;
 	xmlDocPtr	doc;
+	xmlNodePtr	chunks;
 	xmlXPathContextPtr xpathctx;
 	xmlXPathCompExprPtr xpathcomp;
 	xmlXPathObjectPtr xpathobj;
-	char	   *datastr;
-	int32		len;
 	int32		xpath_len;
-	xmlChar    *string;
 	xmlChar    *xpath_expr;
 	int			i;
 	int			res_nitems;
@@ -3248,51 +3280,12 @@
 		ns_count = 0;
 	}
 
-	datastr = VARDATA(data);
-	len = VARSIZE(data) - VARHDRSZ;
-	xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
-	if (xpath_len == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_EXCEPTION),
-				 errmsg("empty XPath expression")));
-
 	xml_init();
 
-	/*
-	 * To handle both documents and fragments, regardless of the fact whether
-	 * the XML datum has a single root (XML well-formedness), we wrap the XML
-	 * datum in a dummy element (<x>...</x>) and extend the XPath expression
-	 * accordingly.  To do it, throw away the XML prolog, if any.
-	 */
-	if (len >= 5 &&
-		xmlStrncmp((xmlChar *) datastr, (xmlChar *) "<?xml", 5) == 0)
-	{
-		i = 5;
-		while (i < len &&
-			   !(datastr[i - 1] == '?' && datastr[i] == '>'))
-			i++;
-
-		if (i == len)
-			xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
-						"could not parse XML data");
-
-		++i;
-
-		datastr += i;
-		len -= i;
-	}
-
-	string = (xmlChar *) palloc((len + 8) * sizeof(xmlChar));
-	memcpy(string, "<x>", 3);
-	memcpy(string + 3, datastr, len);
-	memcpy(string + 3 + len, "</x>", 5);
-	len += 7;
-
-	xpath_expr = (xmlChar *) palloc((xpath_len + 3) * sizeof(xmlChar));
-	memcpy(xpath_expr, "/x", 2);
-	memcpy(xpath_expr + 2, VARDATA(xpath_expr_text), xpath_len);
-	xpath_expr[xpath_len + 2] = '\0';
-	xpath_len += 2;
+	xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
+	xpath_expr = (xmlChar *) palloc((xpath_len + 1) * sizeof(xmlChar));
+	memcpy(xpath_expr, VARDATA(xpath_expr_text), xpath_len);
+	xpath_expr[xpath_len] = '\0';
 
 	xmlInitParser();
 
@@ -3300,19 +3293,19 @@
 	 * redundant XML parsing (two parsings for the same value during one
 	 * command execution are possible)
 	 */
-	ctxt = xmlNewParserCtxt();
-	if (ctxt == NULL)
-		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
-					"could not allocate parser context");
-	doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+	doc = xml_parse_with_nodes(data, xmloption, true, NULL, &chunks);
 	if (doc == NULL)
 		xml_ereport(ERROR, ERRCODE_INVALID_XML_DOCUMENT,
 					"could not parse XML data");
+
 	xpathctx = xmlXPathNewContext(doc);
 	if (xpathctx == NULL)
 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
 					"could not allocate XPath context");
-	xpathctx->node = xmlDocGetRootElement(doc);
+
+	xmlXPathRegisterFunc(xpathctx, (xmlChar *) "fragment", (xmlXPathFunction) xpathfun_fragment);
+
+	xpathctx->node = chunks;
 	if (xpathctx->node == NULL)
 		xml_ereport(ERROR, ERRCODE_INTERNAL_ERROR,
 					"could not find root XML element");
@@ -3376,7 +3369,6 @@
 	xmlXPathFreeObject(xpathobj);
 	xmlXPathFreeContext(xpathctx);
 	xmlFreeDoc(doc);
-	xmlFreeParserCtxt(ctxt);
 
 	if (res_nitems == 0)
 		PG_RETURN_ARRAYTYPE_P(construct_empty_array(XMLOID));
#32Andrew Dunstan
andrew@dunslane.net
In reply to: James Pye (#31)
Re: xpath processing brain dead

James Pye wrote:

sigh.. I got curious. :P

On Feb 27, 2009, at 7:19 PM, James Pye wrote:

Well, that or force the user to call it explicitly.

Attached is the patch that I used to get the results below..
This is just a proof of concept, so it's quite lacking. Notably, it
doesn't even try to identify well-formed documents.

This is entirely out of the question for 8.3, as it's a significant
change of behaviour.

I'd also want to see this usage blessed by some xpath guru ... I'm not
sure it meets the standard's requirements, but I could be wrong.

And it seems to me much better to provide the facility as a separate
function e.g. xpath_fragment() (if at all) rather than by adding on a
non-standard xpath function, but that's just a first impression.

Nice piece of work, though.

cheers

andrew

#33James Pye
lists@jwp.name
In reply to: Andrew Dunstan (#32)
Re: xpath processing brain dead

On Feb 28, 2009, at 7:53 AM, Andrew Dunstan wrote:

This is entirely out of the question for 8.3, as it's a significant
change of behaviour.

Yep. Even with implicit prefixing, the semantics are very different.

What got me thinking about it was this:

http://archives.postgresql.org/pgsql-bugs/2008-07/msg00058.php

If it's desirable to avoid prefixing, what options remain?

(At least I find it desirable to avoid prefixing =)

I'd also want to see this usage blessed by some xpath guru ... I'm
not sure it meets the standard's requirements, but I could be wrong.

Oh, the context node question you raised? I think it would be easy to
expect that the standard is expecting a well-formed document to query
against in the first place, so I *do* think it's a very valid concern.

http://www.w3.org/TR/xpath
http://www.w3.org/TR/xpath#data-model
http://www.w3.org/TR/xpath#infoset

Curious, if we constructed an actual document fragment node from the
node list and set it as the document's root, would that be enough to
satisfy any requirements? It does appear to talk about nodes quite
generally.

In the current case, we're shaving the corners of the square peg so it
will fit in the round hole. In fragment()'s case, it seems we would be
trying to circumvent the round hole altogether..

I don't really like either way. :P

#34Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#27)
Re: xpath processing brain dead

I wrote:

I'll test again on some longer fragments since you don't seem convinced.

I set up a test with a much larger XML fragment - over 1Mb - basically
it's the English source of the SVN Turtle book.

The result is that the extra parsing cost is still pretty much unmeasurable:

regression=# select avg(length(foo)) from (select
repeat(xpath('//title',src)::text,i) as foo from xpathtest4,
generate_series(1,100) as i ) x;
avg
----------------------
1309869.000000000000
(1 row)

Without fix:
Time: 5695.930 ms
Time: 4855.837 ms
Time: 5453.044 ms

With fix:
Time: 5232.810 ms
Time: 5272.375 ms
Time: 5528.434 ms

So I'm going to go ahead and commit this change for 8.3, with Tom's
suggested ammendments.

cheers

andrew

#35Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andrew Dunstan (#30)
Re: xpath processing brain dead

On Fri, 2009-02-27 at 22:55 -0500, Andrew Dunstan wrote:

Hannu Krosing wrote:

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Not in the case of xpath, no.

single xml document is a sub-case of xmlforest, so xmlforest should be
allowed as return type, no ?

There is a very complete standard for the Xpath data model,
and we need to adhere to it.

Is declaring a single all-covering "xml" data type the best or even the
only way to adhere ?

cheers

andrew

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#36Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#28)
Re: xpath processing brain dead

Hannu Krosing wrote:

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Does tha standard require that the same field type must store both
documents and fragments ?

Yes, the standard very explicitly provides for one XML type which need
not be an XML document. We have no choice about that.

cheers

andrew

#37Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andrew Dunstan (#36)
Re: xpath processing brain dead

On Sun, 2009-03-01 at 10:13 -0500, Andrew Dunstan wrote:

Hannu Krosing wrote:

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Does tha standard require that the same field type must store both
documents and fragments ?

Yes, the standard very explicitly provides for one XML type which need
not be an XML document. We have no choice about that.

What is it then ? A sequence of XML elements ?

Which standard does postgreSQL XML type need to confirm to - general XML
DB, Xpath or some other XML ?

XML Path Language (XPath) Version 1.0
--------------------------------------

starts with this Abstract:

"XPath is a language for addressing parts of an XML document, designed
to be used by both XSLT and XPointer."

So I think that using Xpath on anything else than "XML document" is
invalid and results are undefined.

XML 1.0 and XML 1.1
-------------------

Also, both XML 1.0 and XML 1.1 standards are about a thing called an
"XML document", so I don't see anything there, which would make us
accept anything else as being "XML".

And SQL 2008, Part 14: XML-Related Specifications (SQL/XML)
-----------------------------------------------------------
Says:

"SQL defines a predefined data type named by the following <key word>:
XML.
...
The data types XML(DOCUMENT(UNTYPED)), XML(DOCUMENT(ANY)),
XML(DOCUMENT(XMLSCHEMA)), XML(CONTENT(UNTYPED)), XML(CONTENT(ANY)),
XML(CONTENT(XMLSCHEMA)), and XML(SEQUENCE) are referred to as the XML
types. Values of XML types are called XML values."

So while the type itself could be called XML, there are several
subtypes, like Document, Content and Sequence

Could the XML(SEQUENCE) better be represented as an array
of xml documents aka. xml[] , and maybe CONTENT could be done as
xmlelement[] where xmlelement can be any single XML element, including
CDATA and plain text ?

cheers

andrew

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#38Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#37)
Re: xpath processing brain dead

Hannu Krosing wrote:

On Sun, 2009-03-01 at 10:13 -0500, Andrew Dunstan wrote:

Hannu Krosing wrote:

Some of the functions, including some specified in the standard, produce
fragments. That's why we have the 'IS DOCUMENT' test.

But then you could use xmlfragments as the functions return type, no ?

Does tha standard require that the same field type must store both
documents and fragments ?

Yes, the standard very explicitly provides for one XML type which need
not be an XML document. We have no choice about that.

What is it then ? A sequence of XML elements ?

In its typically oblique way, the 2003 draft says this:

NOTE 1 � An XML root information item is similar to an XML document
information item, with the following modifications: an
XML root information item does not have a [document element]
property, a [base URI] property, a [character encoding scheme] property,
or an [all declarations processed] property; and the [children]
property permits more than one XML element information item.

An SQL/XML information item is either an XML root information item
or one of the following (defined in
Subclause 3.1.3, �Definitions provided in Part 14�): an XML
attribute information item, an XML character
information item, an XML comment information item, an XML document
type declaration information item,
an XML element information item, an XML namespace information item,
an XML notation information item,
an XML processing instruction information item, an XML unexpanded
entity reference information item, or
an XML unparsed entity information item.

An XML value is either the null value, or a collection of SQL/XML
information items, consisting of exactly
one XML root information item, plus any other SQL/XML information
items that can be reached recursively
by traversing the properties of the SQL/XML information items.

Which standard does postgreSQL XML type need to confirm to - general XML
DB, Xpath or some other XML ?

I think the XML type needs to conform to the SQL/XML spec. However, we
are trying to apply XPath, which has a different data model, to that
type - hence the impedance mismatch.

I think that the best we can do (for 8.4, having fixed 8.3 as best we
can without adversely changing behaviour) is to throw the responsibility
for ensuring that the XML passed to the function is an XML document back
on the programmer. Anything else, especially any mangling of the XPath
expression, presents a very real danger of breaking on correct input.

cheers

andrew

#39Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#21)
Re: xpath processing brain dead

Andrew Dunstan wrote:

can you point me at any call in libxml2 which will evaluate an xpath
expression in the context of a nodeset instead of a document? Quite
apart from anything else, xpath requires there to be a (single) context
node (see http://www.w3.org/TR/xpath20/#context ). For a doc, we set
that node to the document node, but what would it be for a node-set or a
fragment? If we can't get over that hurdle we're screwed in pursuing
your line of thought.

Which may hint at the fact that running xpath on content fragments is
ill-defined to begin with?!?

#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Andrew Dunstan (#38)
Re: xpath processing brain dead

On Sun, 2009-03-01 at 18:22 -0500, Andrew Dunstan wrote:

I think the XML type needs to conform to the SQL/XML spec. However, we
are trying to apply XPath, which has a different data model, to that
type - hence the impedance mismatch.

I think that the best we can do (for 8.4, having fixed 8.3 as best we
can without adversely changing behaviour) is to throw the
responsibility
for ensuring that the XML passed to the function is an XML document
back on the programmer. Anything else, especially any mangling of the
XPath
expression, presents a very real danger of breaking on correct input.

Can we provide a single function to bridge the gap between fragment and
document? It will be clearer to do this than to see various forms of
appending/munging, even if that function is a simple wrapper around an
append.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#41Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#39)
Re: xpath processing brain dead

Peter Eisentraut wrote:

Andrew Dunstan wrote:

can you point me at any call in libxml2 which will evaluate an xpath
expression in the context of a nodeset instead of a document? Quite
apart from anything else, xpath requires there to be a (single)
context node (see http://www.w3.org/TR/xpath20/#context ). For a doc,
we set that node to the document node, but what would it be for a
node-set or a fragment? If we can't get over that hurdle we're
screwed in pursuing your line of thought.

Which may hint at the fact that running xpath on content fragments is
ill-defined to begin with?!?

Right. But that's no excuse for what we have been doing, which was
demonstrably providing false results on good input.

cheers

andrew

#42Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#40)
Re: xpath processing brain dead

Simon Riggs wrote:

On Sun, 2009-03-01 at 18:22 -0500, Andrew Dunstan wrote:

I think the XML type needs to conform to the SQL/XML spec. However, we
are trying to apply XPath, which has a different data model, to that
type - hence the impedance mismatch.

I think that the best we can do (for 8.4, having fixed 8.3 as best we
can without adversely changing behaviour) is to throw the
responsibility
for ensuring that the XML passed to the function is an XML document
back on the programmer. Anything else, especially any mangling of the
XPath
expression, presents a very real danger of breaking on correct input.

Can we provide a single function to bridge the gap between fragment and
document? It will be clearer to do this than to see various forms of
appending/munging, even if that function is a simple wrapper around an
append.

I have no objection to providing an *extra* function that explicitly
wraps non-documents and prefixes the xpath expression in that case, and
is documented to have limitations. But I don't think we can provide a
single function that always "does the right thing", especially when that
is so ill-defined in the case of fragments.

cheers

andrew

#43Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andrew Dunstan (#42)
Re: xpath processing brain dead

On Mon, 2009-03-02 at 07:54 -0500, Andrew Dunstan wrote:

Simon Riggs wrote:

On Sun, 2009-03-01 at 18:22 -0500, Andrew Dunstan wrote:

I think the XML type needs to conform to the SQL/XML spec. However, we
are trying to apply XPath, which has a different data model, to that
type - hence the impedance mismatch.

I think that the best we can do (for 8.4, having fixed 8.3 as best we
can without adversely changing behaviour) is to throw the
responsibility
for ensuring that the XML passed to the function is an XML document
back on the programmer. Anything else, especially any mangling of the
XPath
expression, presents a very real danger of breaking on correct input.

Can we provide a single function to bridge the gap between fragment and
document? It will be clearer to do this than to see various forms of
appending/munging, even if that function is a simple wrapper around an
append.

I have no objection to providing an *extra* function that explicitly
wraps non-documents and prefixes the xpath expression in that case, and
is documented to have limitations. But I don't think we can provide a
single function that always "does the right thing", especially when that
is so ill-defined in the case of fragments.

Is it just that in you _can't_ use Xpath on fragments, and you _need_ to
pass full documents to Xpath ?

At least this is my reading of Xpath standard.

cheers

andrew

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#44Peter Eisentraut
peter_e@gmx.net
In reply to: Hannu Krosing (#43)
Re: xpath processing brain dead

Hannu Krosing wrote:

Is it just that in you _can't_ use Xpath on fragments, and you _need_ to
pass full documents to Xpath ?

At least this is my reading of Xpath standard.

It is easy to read the XPath standard that way, because the concept of
fragments is not defined outside of SQL/XML, and is therefore unknown to
the XPath standard. The question at hand is rather whether we can
usefully adapt it.

#45Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#43)
Re: xpath processing brain dead

Hannu Krosing wrote:

Is it just that in you _can't_ use Xpath on fragments, and you _need_ to
pass full documents to Xpath ?

At least this is my reading of Xpath standard.

I think that's possibly overstating it., unless I have missed something
(W3 standards are sometimes not much more clear than the SQL standards ;-( )

For instance, there's this, that implies at least that the tree might
not be a document:

A "/" at the beginning of a path expression is an abbreviation for
the initial step fn:root(self::node()) treat as document-node()/
(however, if the "/" is the entire path expression, the trailing "/"
is omitted from the expansion.) The effect of this initial step is
to begin the path at the root node of the tree that contains the
context node. If the context item is not a node, a type error is
raised [err:XPTY0020]. At evaluation time, if the root node above
the context node is not a document node, a dynamic error is raised
[err:XPDY0050].

The problem is that we certainly do have to provide a context node (the
standard is clear about that), and unless we want to convert a
non-document to a node-set as James suggested and then apply the xpath
expression to each node in the node-set, we have no way of sanely
specifying the context node.

cheers

andrew

#46Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Peter Eisentraut (#44)
Re: xpath processing brain dead

On Mon, 2009-03-02 at 15:25 +0200, Peter Eisentraut wrote:

Hannu Krosing wrote:

Is it just that in you _can't_ use Xpath on fragments, and you _need_ to
pass full documents to Xpath ?

At least this is my reading of Xpath standard.

It is easy to read the XPath standard that way, because the concept of
fragments is not defined outside of SQL/XML, and is therefore unknown to
the XPath standard.

How is the opposite - Does SQL/XML specify Xpath usage for XML(SEQUENCE)
and XML(CONTENT) ?

The question at hand is rather whether we can
usefully adapt it.

This sounds like trying to adapt integer arithmetic to
lists-of-integers.

Even for simple things like addition, there are several ways of doing it

[1,2,3] + [1,1,1] = [1,2,3,1,1,1]
[1,2,3] + [1,1,1] = [2,3,4]
[1,2,3] + [1,1,1] = [[1,2,3],[1,1,1]]

all seem possible and "logical"

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#47Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#45)
Re: xpath processing brain dead

Andrew Dunstan wrote:

Hannu Krosing wrote:

Is it just that in you _can't_ use Xpath on fragments, and you _need_ to
pass full documents to Xpath ?
At least this is my reading of Xpath standard.

I think that's possibly overstating it., unless I have missed
something (W3 standards are sometimes not much more clear than the SQL
standards ;-( )

For instance, there's this, that implies at least that the tree might
not be a document:

A "/" at the beginning of a path expression is an abbreviation for
the initial step fn:root(self::node()) treat as document-node()/
(however, if the "/" is the entire path expression, the trailing "/"
is omitted from the expansion.) The effect of this initial step is
to begin the path at the root node of the tree that contains the
context node. If the context item is not a node, a type error is
raised [err:XPTY0020]. At evaluation time, if the root node above
the context node is not a document node, a dynamic error is raised
[err:XPDY0050].

The problem is that we certainly do have to provide a context node
(the standard is clear about that), and unless we want to convert a
non-document to a node-set as James suggested and then apply the xpath
expression to each node in the node-set, we have no way of sanely
specifying the context node.

No-one has come up with an answer to this, so I propose to remove the
hackery. That leaves the question of what to do when the xml is not a
well formed document ... raise an error?

cheers

andrew

#48James Pye
lists@jwp.name
In reply to: Andrew Dunstan (#47)
Re: xpath processing brain dead

On Mar 20, 2009, at 8:56 AM, Andrew Dunstan wrote:

Andrew Dunstan wrote:

A "/" at the beginning of a path expression is an abbreviation for
the initial step fn:root(self::node()) treat as document-node()/
(however, if the "/" is the entire path expression, the trailing
"/"
is omitted from the expansion.) The effect of this initial step is
to begin the path at the root node of the tree that contains the
context node. If the context item is not a node, a type error is
raised [err:XPTY0020]. At evaluation time, if the root node above
the context node is not a document node, a dynamic error is raised
[err:XPDY0050].

The problem is that we certainly do have to provide a context node
(the standard is clear about that), and unless we want to convert a
non-document to a node-set as James suggested and then apply the
xpath expression to each node in the node-set, we have no way of
sanely specifying the context node.

libxml2 only supports xpath1. Why are you referencing xpath20? And, if
SQL-XML requires an xpath20 conforming xpath() function, we have
bigger problems than "'/x' + xpath_str". ;)

Although, this is not terribly important to me as:

No-one has come up with an answer to this,

I don't feel fragment()/node-set() is a good idea from a usability
standpoint alone. I only mentioned it because that's how I've always
worked with fragments in the xslt1 world.. Mere curiosity drove most
of the remaining interest I had in it.

so I propose to remove the hackery.

I think this would probably be best for the core xpath() function.

However, it may be wise to relocate the munging functionality into
another function so users don't have invent their own when they feel
so inclined to work with fragments.

raise an error?

+1, "xpath requires a well-formed document"