Native XML

Started by Antonin Houska over 15 years ago | 38 messages | hackers

ah@cybertec.at

over 15 years ago

Hello,
I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
(By not having brought this up earlier I'm taking the chance that the
effort will be wasted, but that's not something you should worry about.)

The code is available here:
https://github.com/ahouska/postgres/commit/bde3d3ab05915e91a0d831a8877c2fed792693c7

If you are interested in my suggestions, I recommend starting with the
test (it needs to be executed standalone; pg_regress is not aware of it
yet):

src/test/regress/sql/xmlnode.sql
src/test/expected/xmlnode.out

In short, 'xmlnode' is a structured type that stores an XML
document in the form of a tree, as opposed to plain text.
Parsing is only performed on insert or update (for update it would also
make sense to implement functions that add/remove nodes at the low
level, w/o dumping & parsing).

Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
independent from any 3rd party code.
The binary (parsed) XML node is a single chunk of memory, independent of
the address where it was allocated.
The parser does not yet fully conform to the XML standard and some functionality
is still missing (DTD, PI, etc., see comments in the code if you're
interested in details).
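
To illustrate the "single chunk of memory, independent of address" idea, here is a minimal sketch in Python. It is not taken from the patch: the node layout, `pack_tree` and `read_node` are assumptions made up for this example. The point is only that children are referenced by offsets from the start of the buffer, so the bytes can be copied anywhere without fixing up pointers.

```python
import struct

# Hypothetical node layout (an assumption for illustration, not the patch's format):
#   uint16 name length, uint16 child count,
#   one uint32 offset per child (relative to the start of the buffer),
#   then the UTF-8 bytes of the element name.
HEADER = struct.Struct("<HH")
OFFSET = struct.Struct("<I")


def pack_tree(node, buf):
    """Append node = (name, [children]) to buf and return its offset in buf."""
    name, children = node
    # Children are written first so that their offsets are known.
    child_offsets = [pack_tree(child, buf) for child in children]
    start = len(buf)
    encoded = name.encode("utf-8")
    buf.extend(HEADER.pack(len(encoded), len(child_offsets)))
    for off in child_offsets:
        buf.extend(OFFSET.pack(off))
    buf.extend(encoded)
    return start


def read_node(buf, offset):
    """Decode the node stored at offset back into (name, [children])."""
    name_len, n_children = HEADER.unpack_from(buf, offset)
    pos = offset + HEADER.size
    children = []
    for _ in range(n_children):
        (child_off,) = OFFSET.unpack_from(buf, pos)
        children.append(read_node(buf, child_off))
        pos += OFFSET.size
    name = bytes(buf[pos:pos + name_len]).decode("utf-8")
    return (name, children)


tree = ("doc", [("a", []), ("b", [("c", [])])])
buf = bytearray()
root = pack_tree(tree, buf)

# A byte-for-byte copy is still a valid tree: only offsets are stored,
# never absolute addresses.
copy = bytes(buf)
```

Reading the copy with `read_node(copy, root)` gives back the same tree as reading the original buffer, which is the property a value needs if it is to be written to disk and read back at a different address.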

The 'xquery()' function evaluates XMLPath expressions (so far just simple ones)
and for each document returns a set of matching nodes/subtrees.
'xmlpath' is a parsed XMLPath (i.e. the expression + some metadata). It
helps to avoid repeated parsing of the XMLPath expressions by the
xquery() function.
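
The benefit of a pre-parsed path can be sketched as follows. This is a toy in Python, not the patch's code: `ParsedPath`, `parse_path` and `evaluate` are hypothetical names, trees are plain `(name, [children])` tuples, and only absolute paths made of element names are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedPath:
    """Hypothetical stand-in for an 'xmlpath' value: source text plus parsed steps."""
    source: str
    steps: tuple


def parse_path(expr):
    """Parse an absolute path such as '/doc/item' into element-name steps."""
    if not expr.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    steps = tuple(expr.split("/")[1:])
    if any(step == "" for step in steps):
        raise ValueError("empty step in path: %r" % expr)
    return ParsedPath(expr, steps)


def evaluate(path, root):
    """Return every node matched by path; a node is (name, [children])."""
    matches = [root] if root[0] == path.steps[0] else []
    for step in path.steps[1:]:
        matches = [child
                   for node in matches
                   for child in node[1]
                   if child[0] == step]
    return matches


docs = [
    ("doc", [("item", []), ("item", [("note", [])]), ("other", [])]),
    ("doc", [("other", [])]),
]

path = parse_path("/doc/item")                 # parsed once ...
results = [evaluate(path, d) for d in docs]    # ... reused for every document
```

The expression is parsed a single time and the resulting value is applied to each document, which is the repeated work the 'xmlpath' type is meant to avoid.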

I don't pretend that I invented this concept: DB2, Oracle and
probably some other commercial databases have had it for years.
Even though the mission of Postgres is not as simple as copying features
from other DBMSs, I think structured XML makes sense as such.
It allows for better integration of relational and XML data - especially
joining relational columns with XML node sets.

In the future, interesting features could be based on it. For example,
an XML node/subtree can be located quickly within an xmlnode value and as
such it could be indexed (even though the existing indexes / access
methods might not be appropriate for that).

When reviewing my code, please focus on the ideas rather than the code
quality :-) I'm aware that some refactoring will have to be done if
this subproject goes on.

Thanks in advance for any feedback,
Tony.

Josh Berkus

josh@agliodbs.com

over 15 years ago

In reply to: Antonin Houska (#1)

Re: Native XML

On 2/26/11 3:40 PM, Anton wrote:

I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
(By not having brought this up earlier I'm taking the chance that the
effort will be wasted, but that's not something you should worry about.)

Nah, just if you don't get any feedback, bring it up again in June when
9.2 development officially starts.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Antonin Houska (#1)

Re: Native XML

Anton <antonin.houska@gmail.com> writes:

I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
...
Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
independent from any 3rd party code.

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea. How big a chunk of code do you think it'd be
by the time you complete the missing features?

regards, tom lane

Andrew Dunstan

andrew@dunslane.net

over 15 years ago

In reply to: Tom Lane (#3)

Re: Native XML

On 02/27/2011 10:45 AM, Tom Lane wrote:

Anton <antonin.houska@gmail.com> writes:

I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
...
Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
independent from any 3rd party code.

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea. How big a chunk of code do you think it'd be
by the time you complete the missing features?

TBH, by the time it does all the things that libxml2, and libxslt, which
depends on it, do for us, I think it will be huge. Do we really want to
be maintaining a complete xpath and xslt implementation? I think that's
likely to be a waste of our scarce resources.

I use Postgres' XML functionality a lot, so I'm all in favor of
improving it, but rolling our own doesn't seem like the best way to go.

As for the pain, we seem to be over the worst of it, AFAICT. It would be
nice to move the remaining pieces of the xml2 contrib module into the core.

cheers

andrew

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Andrew Dunstan (#4)

Re: Native XML

Andrew Dunstan <andrew@dunslane.net> writes:

On 02/27/2011 10:45 AM, Tom Lane wrote:

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea. How big a chunk of code do you think it'd be
by the time you complete the missing features?

TBH, by the time it does all the things that libxml2, and libxslt, which
depends on it, do for us, I think it will be huge. Do we really want to
be maintaining a complete xpath and xslt implementation? I think that's
likely to be a waste of our scarce resources.

Well, that's why I asked --- if it's going to be a huge chunk of code,
then I agree this is the wrong path to pursue. However, I do feel that
libxml pretty well sucks, so if we could replace it with a relatively
small amount of code, that might be the right thing to do.

I use Postgres' XML functionality a lot, so I'm all in favor of
improving it, but rolling our own doesn't seem like the best way to go.

As for the pain, we seem to be over the worst of it, AFAICT.

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need. See the open bugs
on the TODO list.

regards, tom lane

David E. Wheeler

david@kineticode.com

over 15 years ago

In reply to: Tom Lane (#5)

Re: Native XML

On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:

Well, that's why I asked --- if it's going to be a huge chunk of code,
then I agree this is the wrong path to pursue. However, I do feel that
libxml pretty well sucks, so if we could replace it with a relatively
small amount of code, that might be the right thing to do.

I think that XML parsers must be hard to get really right, because of all those I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it does the work really well.

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need. See the open bugs
on the TODO list.

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

http://github.com/theory/explanation/

Is this something I need to worry about?

Best,

David

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: David E. Wheeler (#6)

Re: Native XML

"David E. Wheeler" <david@kineticode.com> writes:

On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need. See the open bugs
on the TODO list.

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

Well, if you're only using cases that work, you don't need to worry.

regards, tom lane

Mike Fowler

mike@mlfowler.com

over 15 years ago

In reply to: David E. Wheeler (#6)

Re: Native XML

On 27/02/11 19:37, David E. Wheeler wrote:

On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:

Well, that's why I asked --- if it's going to be a huge chunk of code,
then I agree this is the wrong path to pursue. However, I do feel that
libxml pretty well sucks, so if we could replace it with a relatively
small amount of code, that might be the right thing to do.

I think that XML parsers must be hard to get really right, because of all those I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it does the work really well.

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need. See the open bugs
on the TODO list.

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

http://github.com/theory/explanation/

Is this something I need to worry about?

I don't believe that XPath is "fundamentally broken", but I think Tom
may have meant xslt. When reviewing a recent patch to xml2/xslt I found
a few bugs in the way we were using libxslt, as well as a bug in the
library itself (see
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

However if Tom does mean that xpath is the culprit, it may be with the
way the libxml2 library works. It's a very messy singleton. If I'm
wrong, I'm sure I'll be corrected!

Regards,
--
Mike Fowler
Registered Linux user: 379787

David E. Wheeler

david@kineticode.com

over 15 years ago

In reply to: Tom Lane (#7)

Re: Native XML

On Feb 27, 2011, at 11:43 AM, Tom Lane wrote:

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

Well, if you're only using cases that work, you don't need to worry.

Okay then.

David

#10

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Mike Fowler (#8)

Re: Native XML

Mike Fowler <mike@mlfowler.com> writes:

I don't believe that XPath is "fundamentally broken", but I think Tom
may have meant xslt. When reviewing a recent patch to xml2/xslt I found
a few bugs in the way we were using libxslt, as well as a bug in the
library itself (see
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

Most of the other stuff on the TODO list looks like it just requires
application of round tuits, although some of it seems to me to reinforce
the thesis that libxml/libxslt don't do quite what we need.

regards, tom lane

#11

Antonin Houska

ah@cybertec.at

over 15 years ago

In reply to: Tom Lane (#10)

Fwd: Re: Native XML

Sorry for resending, I forgot to add 'pgsql-hackers' to CC.

-------- Original Message --------
Subject: Re: [HACKERS] Native XML
Date: Sun, 27 Feb 2011 23:18:03 +0100
From: Anton <antonin.houska@gmail.com>
To: Tom Lane <tgl@sss.pgh.pa.us>

On 02/27/2011 04:45 PM, Tom Lane wrote:

Anton <antonin.houska@gmail.com> writes:

I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
...
Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
independent from any 3rd party code.

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea. How big a chunk of code do you think it'd be
by the time you complete the missing features?

regards, tom lane

Right, no dependency, everything coded from scratch.
For the initial stable version, my plan is to make the parser conform to
the standard as much as possible and the same for XMLPath / XMLQuery.
(In all cases the question is which version of the standard to start at.)

Integration of SQL & XML data in queries is my primary interest. I
didn't really think to re-implement XSLT. For those who really need to
use XSLT functionality at the database level, can't the API be left for
optional installation?

Also I'm not sure if document validation is necessary for the initial
version - I still see a related item on the current TODO list.

Sincerely,
Tony

#12

Peter Eisentraut

peter_e@gmx.net

over 15 years ago

In reply to: Tom Lane (#3)

Re: Native XML

On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea.

This doesn't replace the existing xml functionality, so it won't help
getting rid of libxml.

#13

Andrew Dunstan

andrew@dunslane.net

over 15 years ago

In reply to: Tom Lane (#10)

Re: Native XML

On 02/27/2011 03:06 PM, Tom Lane wrote:

Mike Fowler <mike@mlfowler.com> writes:

I don't believe that XPath is "fundamentally broken", but I think Tom
may have meant xslt. When reviewing a recent patch to xml2/xslt I found
a few bugs in the way we were using libxslt, as well as a bug in the
library itself (see
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

I'd forgotten about this. But as ugly as it is, I don't think it's
libxml2's fault.

cheers

andrew

#14

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Andrew Dunstan (#13)

Re: Native XML

Andrew Dunstan <andrew@dunslane.net> writes:

On 02/27/2011 03:06 PM, Tom Lane wrote:

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

I'd forgotten about this. But as ugly as it is, I don't think it's
libxml2's fault.

Well, strictly speaking it's libxslt's fault, no? But AFAIK those two
things are a package.

regards, tom lane

#15

Andrew Dunstan

andrew@dunslane.net

over 15 years ago

In reply to: Tom Lane (#14)

Re: Native XML

On 02/27/2011 10:07 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 02/27/2011 03:06 PM, Tom Lane wrote:

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

I'd forgotten about this. But as ugly as it is, I don't think it's
libxml2's fault.

Well, strictly speaking it's libxslt's fault, no? But AFAIK those two
things are a package.

No, I think the xpath implementation is from libxml2. But in any case, I
think the problem is in the whole design of the xpath_table function,
and not in the library used for running the xpath queries, i.e. it's our
fault, and not the libraries'. (mutters about workmen and tools)

cheers

andrew

#16

Antonin Houska

ah@cybertec.at

about 15 years ago

In reply to: Peter Eisentraut (#12)

Re: Native XML

On 02/27/2011 11:57 PM, Peter Eisentraut wrote:

On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea.

This doesn't replace the existing xml functionality, so it won't help
getting rid of libxml.

Right, what I published on github.com doesn't replace the libxml2
functionality and I didn't say it does at this moment. The idea is to
design (or rather start designing) a low-level XML API on which SQL/XML
functionality can be based. As long as XSLT can be considered a sort of
separate topic, Postgres uses a very small subset of what libxml2
offers and thus it might not be that difficult to implement the same
level of functionality in a new way.

In addition, I think that using a low-level API that the Postgres
development team fully controls would speed up enhancements of the XML
functionality in the future. When I thought of implementing some
functionality listed on the official TODO, I was a little bit
discouraged by the workarounds that need to be added in order to deal
with libxml2 memory management. Also, parsing the document each time it's
accessed (which involves parser initialization and finalization) is
neither convenient nor, ultimately, efficient.

A question is, of course, whether a potential new implementation must
necessarily replace the existing one, immediately or at all. What I
published is implemented as a new data type and thus pg_type.h and
pg_proc.h are the only files where something needs to be merged. From
a technical point of view, the new type can easily co-exist with the existing one.

This however raises the question of whether such co-existence (whether temporary
or permanent) would be acceptable for users, i.e. whether it would cause
significant confusion. That's something I'm not able to answer.

#17

Andrew Dunstan

andrew@dunslane.net

about 15 years ago

In reply to: Antonin Houska (#16)

Re: Native XML

On 02/28/2011 04:25 AM, Anton wrote:

On 02/27/2011 11:57 PM, Peter Eisentraut wrote:

On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea.

This doesn't replace the existing xml functionality, so it won't help
getting rid of libxml.

Right, what I published on github.com doesn't replace the libxml2
functionality and I didn't say it does at this moment. The idea is to
design (or rather start designing) a low-level XML API on which SQL/XML
functionality can be based. As long as XSLT can be considered a sort of
separate topic, Postgres uses a very small subset of what libxml2
offers and thus it might not be that difficult to implement the same
level of functionality in a new way.

In addition, I think that using a low-level API that the Postgres
development team fully controls would speed up enhancements of the XML
functionality in the future. When I thought of implementing some
functionality listed on the official TODO, I was a little bit
discouraged by the workarounds that need to be added in order to deal
with libxml2 memory management. Also, parsing the document each time it's
accessed (which involves parser initialization and finalization) is
neither convenient nor, ultimately, efficient.

A question is, of course, whether a potential new implementation must
necessarily replace the existing one, immediately or at all. What I
published is implemented as a new data type and thus pg_type.h and
pg_proc.h are the only files where something needs to be merged. From
a technical point of view, the new type can easily co-exist with the existing one.

This however raises the question of whether such co-existence (whether temporary
or permanent) would be acceptable for users, i.e. whether it would cause
significant confusion. That's something I'm not able to answer.

The only reason we need the XML stuff in core at all and not in a
separate module is because of the odd syntax requirements of SQL/XML.
But those operators work on the xml type, and not on any new type you
might invent.

Which TODO items were you trying to implement? And what were the blockers?

We really can't just consider XSLT, and more importantly XPath, as
separate topics. Any alternative XML implementation that doesn't include
XPath is going to be unacceptably incomplete, IMNSHO.

cheers

andrew

#18

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Andrew Dunstan (#17)

Re: Native XML

Andrew Dunstan <andrew@dunslane.net> writes:

On 02/28/2011 04:25 AM, Anton wrote:

A question is, of course, whether a potential new implementation must
necessarily replace the existing one, immediately or at all. What I
published is implemented as a new data type and thus pg_type.h and
pg_proc.h are the only files where something needs to be merged. From
a technical point of view, the new type can easily co-exist with the existing one.

This however raises the question of whether such co-existence (whether temporary
or permanent) would be acceptable for users, i.e. whether it would cause
significant confusion. That's something I'm not able to answer.

The only reason we need the XML stuff in core at all and not in a
separate module is because of the odd syntax requirements of SQL/XML.
But those operators work on the xml type, and not on any new type you
might invent.

Well, in principle we could allow them to work on both, just the same
way that (for instance) "+" is a standardized operator but works on more
than one datatype. But I agree that the prospect of two parallel types
with essentially duplicate functionality isn't pleasing at all.

I think a reasonable path forwards for this work would be to develop and
extend the non-libxml-based type as an extension, outside of core, with
the idea that it might replace the core implementation if it ever gets
complete enough. The main thing that that would imply that you might
not bother with otherwise is an ability to deal with existing
plain-text-style stored values. This doesn't seem terribly hard to do
IMO --- one easy way would be to insert an initial zero byte in all
new-style values as a flag to distinguish them from old-style. The
forced parsing that would occur to deal with an old-style value would be
akin to detoasting and could be hidden in the same access macros.
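
The flag-byte idea can be sketched roughly like this. It is a toy in Python under stated assumptions, not PostgreSQL code: `parse_text`, `to_new_style` and `get_parsed` are hypothetical, and the "parsed" form is just a tagged tuple standing in for a real tree. The only thing it relies on is that XML text cannot begin with a zero byte, so a leading zero can mark a new-style value.

```python
NEW_STYLE_FLAG = 0  # assumed marker byte; XML text cannot start with NUL


def parse_text(xml_text):
    """Placeholder for a real parser: returns a hypothetical parsed form."""
    return ("parsed", xml_text)


def to_new_style(parsed):
    """Serialize a parsed value with the leading flag byte (toy encoding)."""
    return bytes([NEW_STYLE_FLAG]) + parsed[1].encode("utf-8")


def get_parsed(stored):
    """Access-macro analogue: return a parsed value for either storage format."""
    if stored[:1] == bytes([NEW_STYLE_FLAG]):
        # New-style value: already structured, only needs decoding.
        return ("parsed", stored[1:].decode("utf-8"))
    # Old-style plain text: force a parse on access.
    return parse_text(stored.decode("utf-8"))


old_value = b"<doc><a/></doc>"                           # stored as plain text
new_value = to_new_style(parse_text("<doc><a/></doc>"))  # stored with the flag byte
```

Callers only ever go through `get_parsed`, so they do not need to know which format a stored value uses; the cost of the forced parse is paid only for old-style values.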

We really can't just consider XSLT, and more importantly XPath, as
separate topics. Any alternative XML implementation that doesn't include
XPath is going to be unacceptably incomplete, IMNSHO.

Agreed. The single most pressing problem we've got with XML right now
is the poor state of the XPath extensions in contrib/xml2. If we don't
see a meaningful step forward in that area, a new implementation of the
xml datatype isn't likely to win acceptance.

regards, tom lane

#19

Robert Haas

robertmhaas@gmail.com

about 15 years ago

In reply to: Andrew Dunstan (#15)

Re: Native XML

On Sun, Feb 27, 2011 at 10:20 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

No, I think the xpath implementation is from libxml2. But in any case, I
think the problem is in the whole design of the xpath_table function, and
not in the library used for running the xpath queries, i.e. it's our fault,
and not the libraries'. (mutters about workmen and tools)

Yeah, I think the problem is that we picked a poor definition for the
xpath_table() function. That poor definition will be equally capable
of causing us headaches on top of any other implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Andrew Dunstan

andrew@dunslane.net

about 15 years ago

In reply to: Tom Lane (#18)

Re: Native XML

On 02/28/2011 10:30 AM, Tom Lane wrote:

The single most pressing problem we've got with XML right now
is the poor state of the XPath extensions in contrib/xml2. If we don't
see a meaningful step forward in that area, a new implementation of the
xml datatype isn't likely to win acceptance.

xpath_table is severely broken by design IMNSHO. We need a new design,
but I'm reluctant to work on that until someone does LATERAL, because a
replacement would be much nicer to design with it than without it.

But I don't believe replacing the underlying XML/XPath implementation
would help us fix it at all.

cheers

andrew

#21

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Andrew Dunstan (#20)

#22

Robert Haas

robertmhaas@gmail.com

about 15 years ago