[BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

Started by Jim Jonesover 1 year ago6 messageshackers
Jump to latest
#1Jim Jones
jim.jones@uni-muenster.de

Hi,

While testing a feature reported by Pavel in this thread[1] I realized
that elements containing whitespaces between them won't be indented with
XMLSERIALIZE( ... INDENT)

SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);

  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>         +
 
(1 row)

SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
INDENT);
        xmlserialize        
----------------------------
 <foo> <bar>42</bar> </foo>+
 
(1 row)

Other products have a different approach[2]

Perhaps simply setting xmltotext_with_options' parameter "perserve_whitespace" to false when XMLSERIALIZE(.. INDENT) would do the trick.

doc = xml_parse(data, xmloption_arg, !indent ? true : false,
GetDatabaseEncoding(),
&parsed_xmloptiontype, &content_nodes,
(Node *) &escontext);

(diff attached)

SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
INDENT);
xmlserialize
-----------------
<foo> +
<bar>42</bar>+
</foo> +

(1 row)

If this is indeed the way to go I can update the regression tests accordingly.

Best,

--
Jim

1 - /messages/by-id/cbd68a31-9776-4742-9c09-4344a4c5e6dc@uni-muenster.de
2 - https://dbfiddle.uk/zdKnfsqX

Attachments:

remove-blanknodes-in-xmlserialize-indent.difftext/x-patch; charset=UTF-8; name=remove-blanknodes-in-xmlserialize-indent.diffDownload+8-2
#2Jim Jones
jim.jones@uni-muenster.de
In reply to: Jim Jones (#1)
Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

On 28.08.24 10:19, Jim Jones wrote:

Hi,

While testing a feature reported by Pavel in this thread[1] I realized
that elements containing whitespaces between them won't be indented with
XMLSERIALIZE( ... INDENT)

SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);

  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>         +
 
(1 row)

SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
INDENT);
        xmlserialize        
----------------------------
 <foo> <bar>42</bar> </foo>+
 
(1 row)

Other products have a different approach[2]

Perhaps simply setting xmltotext_with_options' parameter "perserve_whitespace" to false when XMLSERIALIZE(.. INDENT) would do the trick.

doc = xml_parse(data, xmloption_arg, !indent ? true : false,
GetDatabaseEncoding(),
&parsed_xmloptiontype, &content_nodes,
(Node *) &escontext);

(diff attached)

SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
INDENT);
xmlserialize
-----------------
<foo> +
<bar>42</bar>+
</foo> +

(1 row)

If this is indeed the way to go I can update the regression tests accordingly.

Best,

Just created a CF entry for this: https://commitfest.postgresql.org/49/5217/
v1 attached includes regression tests.

--
Jim

Attachments:

v1-0001-Bug-fix-for-XMLSERIALIZE-.INDENT-for-xml-containi.patchtext/x-patch; charset=UTF-8; name=v1-0001-Bug-fix-for-XMLSERIALIZE-.INDENT-for-xml-containi.patchDownload+60-3
#3Jim Jones
jim.jones@uni-muenster.de
In reply to: Jim Jones (#1)
Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

On 28.08.24 10:19, Jim Jones wrote:

Hi,

While testing a feature reported by Pavel in this thread[1] I realized
that elements containing whitespaces between them won't be indented with
XMLSERIALIZE( ... INDENT)

mmh... xmlDocContentDumpOutput seems to add a trailing newline in the
end of a document by default, making the serialization of the same xml
string with DOCUMENT and CONTENT different:

-- postgres v16

SELECT xmlserialize(CONTENT '<foo><bar>42</bar></foo>' AS text INDENT);
  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>
(1 row)

SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);
  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>         +
 
(1 row)

I do recall a discussion along these lines some time ago, but I just
can't find it now. Does anyone know if this is the expected behaviour?
Or should we in this case consider something like this in
xmltotext_with_options()?

result = cstring_to_text_with_len((const char *) xmlBufferContent(buf),
xmlBufferLength(buf) - 1);

--
Jim

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Jones (#3)
Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

Jim Jones <jim.jones@uni-muenster.de> writes:

mmh... xmlDocContentDumpOutput seems to add a trailing newline in the
end of a document by default, making the serialization of the same xml
string with DOCUMENT and CONTENT different:

Does seem a bit inconsistent.

Or should we in this case consider something like this in
xmltotext_with_options()?
result = cstring_to_text_with_len((const char *) xmlBufferContent(buf),
xmlBufferLength(buf) - 1);

I think it'd be quite foolish to assume that every extant and future
version of libxml2 will share this glitch. Probably should use
logic more like pg_strip_crlf(), although we can't use that directly.

Would it ever be the case that trailing whitespace would be valid
data? In a bit of testing, it seems like that could be true in
CONTENT mode but not DOCUMENT mode.

regards, tom lane

#5Jim Jones
jim.jones@uni-muenster.de
In reply to: Tom Lane (#4)
Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

Hi Tom

On 06.09.24 18:34, Tom Lane wrote:

I think it'd be quite foolish to assume that every extant and future
version of libxml2 will share this glitch. Probably should use
logic more like pg_strip_crlf(), although we can't use that directly.

Makes sense. I Introduced this logic in the end of
xmltotext_with_options() in case it was called with INDENT and DOCUMENT
type xml string.

SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);
  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>
(1 row)

The regression tests were updated accordingly - see patch v2-0002.

Would it ever be the case that trailing whitespace would be valid
data? In a bit of testing, it seems like that could be true in
CONTENT mode but not DOCUMENT mode.

Yes, in case of CONTENT it is valid data and it will be preserved, as
CONTENT can be pretty much anything.

SELECT xmlserialize(CONTENT E'<foo><bar>42</bar></foo>\n\n\t\t\t' AS
text INDENT);
       xmlserialize       
--------------------------
 <foo>                   +
   <bar>42</bar>         +
 </foo>                  +
                         +
                         
(1 row)

With DOCUMENT it is superfluous and should be removed after indentation.
IIRC there's an xmlSaveToBuffer option called XML_SAVE_WSNONSIG that can
be used to preserve it.

Thanks

Best, Jim

Attachments:

v2-0002-Bug-fix-remove-default-trailing-newline-from-XMLS.patchtext/x-patch; charset=UTF-8; name=v2-0002-Bug-fix-remove-default-trailing-newline-from-XMLS.patchDownload+29-26
v2-0001-Bug-fix-for-XMLSERIALIZE-.INDENT-for-xml-containi.patchtext/x-patch; charset=UTF-8; name=v2-0001-Bug-fix-for-XMLSERIALIZE-.INDENT-for-xml-containi.patchDownload+60-3
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Jones (#5)
Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

Jim Jones <jim.jones@uni-muenster.de> writes:

[ xmlserialize patches ]

Pushed with minor editorialization. Notably, I got rid of scribbling
on xmlBufferContent's buffer --- I don't know how likely that is to
upset libxml2, but it seems like a fairly bad idea given that they
declare the result as "const xmlChar*". Casting away the const is
poor form in any case.

regards, tom lane