extract text from XML

Started by Chris Pacejoover 9 years ago2 messages
#1Chris Pacejo
chris@pacejo.net

Hi, I have found a basic use case which is supported by the xml2 module,
but is unsupported by the new XML API.

It is not possible to correctly extract text (either from text nodes or
attribute values) which contains the characters '<', '&', or '>'.
xpath() (correctly) returns XML text nodes for queries targeting these
node types, and there is no inverse to xmlelement(). For example:

=> select (xpath('/a/text()', xmlelement(name a, '<&>')))[1]::text;
xpath
-----------
&lt;&amp;&gt;
(1 row)

Again, not a bug; but there is no way to specify my desired intent. The
xml2 module does provide such a function, xpath_string:

=> select xpath_string(xmlelement(name a, '<&>')::text, '/a/text()');
xpath_string
--------------
<&>
(1 row)

One workaround is to return the node's text value by serializing the XML
value, and textually replacing those three entities with the characters
they represent, but this relies on the xpath() function not generating
other entities.

(My use case is importing data in XML format, and processing with
Postgres into a relational format.)

Perhaps a function xpath_value(text, xml) -> text[] would close the gap?
(I did search and no such function seems to exist currently, outside
xml2.)

Thanks,
Chris

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tobias Bussmann
t.bussmann@gmx.net
In reply to: Chris Pacejo (#1)
Re: extract text from XML

I have found a basic use case which is supported by the xml2 module,
but is unsupported by the new XML API.
It is not possible to correctly extract text

Indeed. I came accross this shortcomming some months ago myself but still manage an item on my ToDo list to report it here as the deprecation notice at https://www.postgresql.org/docs/devel/static/xml2.html#AEN180625 asks for. Done, thanks ;)

I did some archive-browsing on that topic. The issue (if you want to call it that way) was introduced by an patch to ensure xpath() always returns xml, applied for 9.2 after some discussion: /messages/by-id/201106291934.23089.rsmogura@softperience.eu and is since then known: /messages/by-id/1409795403248-5817667.post@n5.nabble.com The new behaviour was later reported as a bug and discussed again: /messages/by-id/CAAY5AM1L83y79rtOZAUJioREO6n4=XAFKcGu6qO3hCZE1yJytg@mail.gmail.com

Anyhow - (un)escaping functions to support the text<->xml conversion are often talked about but still seem only to be found in xml2 module. Seeing a xmltable implementing patch here recently, these functions would be another step to make the contrib module obsolete, finally.

Perhaps a function xpath_value(text, xml) -> text[] would close the gap?

such an design, resembling the xml2 behaviour, would certainly fit the need, imho.

regards
Tobias

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers