Fix XML handling with DOCTYPE
Hi all,
I'm investigating the issue I reported here:
/messages/by-id/153478795159.1302.9617586466368699403@wrigleys.postgresql.org
As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML. At this time, XML with a DOCTYPE declaration
is not accepted with this setting even though it is considered valid XML.
I'd like to work on a patch to address this issue and make it work as
advertised.
I traced the source of the error to line ~1500 in
/src/backend/utils/adt/xml.c
res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL);
It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't
work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE
element makes the XML not well-balanced. From:
http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory
This function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser
error code otherwise
I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed. That function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser
error code otherwise. In case recover is set to 1, the nodelist will not be
empty even if the parsed chunk is not well balanced, assuming the parsing
succeeded to some extent.
I haven't yet tested whether this parses the data with a DOCTYPE
successfully. If it does, I don't think it would be difficult to update the
check on res_code to not fail. I'm making another assumption that libxml
returns a distinct error code that differentiates this case from other
errors, but I couldn't find those codes quickly. The current check is this:
if (res_code != 0 || xmlerrcxt->err_occurred)
Does this sound reasonable? Have I missed some major aspect? If this is
on the right track I can work on creating a patch to move this forward.
Thanks,
*Ryan Lambert*
RustProof Labs
www.rustprooflabs.com
On 03/16/19 16:10, Ryan Lambert wrote:
As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML. At this time, XML with a DOCTYPE declaration
is not accepted with this setting even though it is considered valid XML.
Hello Ryan,
A patch for your issue is currently registered in the 2019-03 commitfest [1].
If it attracts somebody to review it before the end of the month, it might
make it into PG v12.
It is the xml-content-2006-2.patch found on the email thread [2]. (The other
patch found there contains associated documentation fixes, and also needs to
be reviewed.)
Further conversation should probably be on that email thread so that it
stays associated with the commitfest entry.
Thanks for your interest in the issue!
Regards,
Chapman Flack
[1]: https://commitfest.postgresql.org/22/1872/
[2]: /messages/by-id/5C81F8C0.6090901@anastigmatix.net
Ryan Lambert <ryan@rustprooflabs.com> writes:
I'm investigating the issue I reported here:
/messages/by-id/153478795159.1302.9617586466368699403@wrigleys.postgresql.org
I'd like to work on a patch to address this issue and make it work as
advertised.
Good idea, because it doesn't seem like anybody else cares ...
I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed.
TBH, our experience with libxml has not been so positive that I'd think
adding dependencies on new parts of its API would be a good plan.
Experimenting with different inputs, it seems like removing the
"<!DOCTYPE ...>" tag is enough to make it work. So what I'm wondering
about is writing something like parse_xml_decl() to skip over that.
Bear in mind though that I know next to zip about XML. There may be
some good reason why we don't want to strip off the !DOCTYPE part
from what libxml sees.
regards, tom lane
Chapman Flack <chap@anastigmatix.net> writes:
A patch for your issue is currently registered in the 2019-03 commitfest[1].
Oh! I apologize for saying nobody was working on this issue.
Taking a very quick look at your patch, though, I dunno --- it seems like
it adds a boatload of new assumptions about libxml's data structures and
error-reporting behavior. I'm sure it works for you, but will it work
across a wide range of libxml versions?
What do you think of the idea I just posted about parsing off the DOCTYPE
thing for ourselves, and not letting libxml see it?
regards, tom lane
On 03/16/19 16:42, Tom Lane wrote:
Ryan Lambert <ryan@rustprooflabs.com> writes:
I'm investigating the issue I reported here:
/messages/by-id/153478795159.1302.9617586466368699403@wrigleys.postgresql.org
I'd like to work on a patch to address this issue and make it work as
advertised.
Good idea, because it doesn't seem like anybody else cares ...
ahem
On 03/16/19 16:55, Tom Lane wrote:
What do you think of the idea I just posted about parsing off the DOCTYPE
thing for ourselves, and not letting libxml see it?
The principled way of doing that would be to pre-parse to find a DOCTYPE,
and if there is one, leave it there and parse the input as we do for
'document'. Per XML, if there is a DOCTYPE, the document must satisfy
the 'document' syntax requirements, and per SQL/XML:2006-and-later,
'content' is a proper superset of 'document', so if we were asked for
'content' and can successfully parse it as 'document', we're good,
and if we see a DOCTYPE and yet it incurs a parse error as 'document',
well, that's what needed to happen.
The DOCTYPE can appear arbitrarily far in, but the only things that
can precede it are the XML decl, whitespace, XML comments, and XML
processing instructions. None of those things nest, so the preceding
stuff makes a regular language, and a regular expression that matches
any amount of that stuff ending in <!DOCTYPE would be enough to detect
that the parse should be shunted off to get 'document' treatment.
The patch I submitted essentially relies on libxml to do that same
parsing up to that same point and detect the error, so it would add
no upfront cost in the majority of cases that aren't this corner one.
But keeping a little compiled regex around and testing the input with that
would hardly break the bank, either.
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
On 03/16/19 16:55, Tom Lane wrote:
What do you think of the idea I just posted about parsing off the DOCTYPE
thing for ourselves, and not letting libxml see it?
The principled way of doing that would be to pre-parse to find a DOCTYPE,
and if there is one, leave it there and parse the input as we do for
'document'. Per XML, if there is a DOCTYPE, the document must satisfy
the 'document' syntax requirements, and per SQL/XML:2006-and-later,
'content' is a proper superset of 'document', so if we were asked for
'content' and can successfully parse it as 'document', we're good,
and if we see a DOCTYPE and yet it incurs a parse error as 'document',
well, that's what needed to happen.
Hm, so, maybe just
(1) always try to parse as document. If successful, we're done.
(2) otherwise, if allowed by xmloption, try to parse using our
current logic for the CONTENT case.
This avoids adding any new assumptions about how libxml acts,
which is what I was hoping to achieve.
One interesting question is which error to report if both (1) and (2)
fail.
regards, tom lane
On 03/16/19 17:21, Tom Lane wrote:
Chapman Flack <chap@anastigmatix.net> writes:
On 03/16/19 16:55, Tom Lane wrote:
What do you think of the idea I just posted about parsing off the DOCTYPE
thing for ourselves, and not letting libxml see it?
The principled way of doing that would be to pre-parse to find a DOCTYPE,
and if there is one, leave it there and parse the input as we do for
'document'. Per XML, if there is a DOCTYPE, the document must satisfy
the 'document' syntax requirements, and per SQL/XML:2006-and-later,
'content' is a proper superset of 'document', so if we were asked for
'content' and can successfully parse it as 'document', we're good,
and if we see a DOCTYPE and yet it incurs a parse error as 'document',
well, that's what needed to happen.
Hm, so, maybe just
(1) always try to parse as document. If successful, we're done.
(2) otherwise, if allowed by xmloption, try to parse using our
current logic for the CONTENT case.
What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).
I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.
So the current structure is:
restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    parse as content
    failed?
      because DOCTYPE? restart as if document
      else fail
and a pre-scan structure could be very similar:
restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    pre-scan finds DOCTYPE?
      restart as if document
    else parse as content, or fail
The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.
Regards,
-Chap
Thank you both! I had glanced at that item in the commitfest but didn't
notice it would fix this issue.
I'll try to test/review this before the end of the month, much better than
starting from scratch myself. A quick glance at the patch looks logical
and looks like it should work for my use case.
Thanks,
Ryan Lambert
On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <chap@anastigmatix.net> wrote:
On 03/16/19 18:33, Chapman Flack wrote:
The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm.
If the available regexp functions want to start by munging the entire input
into a pg_wchar array, then it may be better to implement the pre-scan as
open code, the same way parse_xml_decl() is already implemented.
Given that parse_xml_decl() already covers the first optional thing that
can precede the doctype, the remaining scan routine would only need to
recognize comments, PIs, and whitespace. That would be pretty straightforward.
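Sketched in C, such an open-coded scan might look like this (illustrative
only, not proposed patch code; it assumes the caller has already consumed
any XML declaration, e.g. via parse_xml_decl()):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative sketch: with any XML declaration already consumed, skip
 * whitespace, comments, and processing instructions, and report whether
 * the next construct is a DOCTYPE.  On malformed input we just answer
 * "no" and let libxml produce the error.
 */
static bool
doctype_follows(const char *p)
{
    for (;;)
    {
        if (isspace((unsigned char) *p))
            p++;
        else if (strncmp(p, "<!--", 4) == 0)
        {
            const char *end = strstr(p + 4, "-->");

            if (end == NULL)
                return false;   /* unterminated comment */
            p = end + 3;
        }
        else if (strncmp(p, "<?", 2) == 0)
        {
            const char *end = strstr(p + 2, "?>");

            if (end == NULL)
                return false;   /* unterminated PI */
            p = end + 2;
        }
        else
            return strncmp(p, "<!DOCTYPE", 9) == 0;
    }
}
```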
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
On 03/16/19 17:21, Tom Lane wrote:
Hm, so, maybe just
(1) always try to parse as document. If successful, we're done.
(2) otherwise, if allowed by xmloption, try to parse using our
current logic for the CONTENT case.
What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).
That doesn't seem particularly likely to me: based on what's been
said here, I'd expect parsing with the wrong expectation to usually
fail near the start of the input. In any case, the other patch
also requires repeat parsing, no? It's just doing that in a different
set of cases.
The reason I'm pressing you for a simpler patch is that dump/reload
failures are pretty bad, so ideally we'd find a fix that we're
comfortable with back-patching into the released branches.
Personally I would never dare to back-patch the proposed patch:
it's too complex, so it's not real clear that it doesn't have unwanted
side-effects, and it's not at all certain that there aren't libxml
version dependencies in it. (Maybe another committer with more
familiarity with libxml would evaluate the risks differently, but
I doubt it.) But I think that something close to what I sketched
above would pass muster as safe-to-backpatch.
regards, tom lane
On 03/17/19 11:45, Tom Lane wrote:
Chapman Flack <chap@anastigmatix.net> writes:
On 03/16/19 17:21, Tom Lane wrote:
(1) always try to parse as document. If successful, we're done.
(2) otherwise, if allowed by xmloption, try to parse using our
current logic for the CONTENT case.
What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).
That doesn't seem particularly likely to me: based on what's been
said here, I'd expect parsing with the wrong expectation to usually
fail near the start of the input. In any case, the other patch
also requires repeat parsing, no? It's just doing that in a different
set of cases.
I'll do up a version with the open-coded prescan I proposed last night.
Whether parsing with the wrong expectation is likely to fail near the
start of the input depends on both the input and the expectation. If
your expectation is DOCUMENT and the input is CONTENT, it's possible
for the determining difference to be something that follows the first
element, and a first element can be (and often is) nearly all of the input.
What I was doing in the patch is the reverse: parsing with the expectation
of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
element to precede a DTD, so that approach can be expected to fail fast
if the other branch needs to be taken.
But a quick pre-scan for the same thing would have the same property,
without the libxml dependencies that bother you here. Watch this space.
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
What I was doing in the patch is the reverse: parsing with the expectation
of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
element to precede a DTD, so that approach can be expected to fail fast
if the other branch needs to be taken.
Ah, right. I don't have any problem with trying the CONTENT approach
before the DOCUMENT approach rather than vice-versa. What I was concerned
about was adding a lot of assumptions about exactly how libxml would
report the failure. IMO a maximally-safe patch would just rearrange
things we're already doing without adding new things.
But a quick pre-scan for the same thing would have the same property,
without the libxml dependencies that bother you here. Watch this space.
Do we need a pre-scan at all?
regards, tom lane
On 03/17/19 13:16, Tom Lane wrote:
Chapman Flack <chap@anastigmatix.net> writes:
What I was doing in the patch is the reverse: parsing with the expectation
of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
element to precede a DTD, so that approach can be expected to fail fast
if the other branch needs to be taken.
Ah, right. I don't have any problem with trying the CONTENT approach
before the DOCUMENT approach rather than vice-versa. What I was concerned
about was adding a lot of assumptions about exactly how libxml would
report the failure. IMO a maximally-safe patch would just rearrange
things we're already doing without adding new things.
But a quick pre-scan for the same thing would have the same property,
without the libxml dependencies that bother you here. Watch this space.
Do we need a pre-scan at all?
Without it, we double the time to a failure result in every case that
should actually fail, as well as in this one corner case that we want to
see succeed, and the question you posed earlier about which error message
to return becomes thornier.
If the query asked for CONTENT, any error result should be one you could get
when parsing as CONTENT. If we switch and try parsing as DOCUMENT _because
the input is claiming to have the form of a DOCUMENT_, then it's defensible
to return errors explaining why it's not a DOCUMENT ... but not in the
general case of just throwing DOCUMENT at it any time CONTENT parse fails.
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
On 03/17/19 13:16, Tom Lane wrote:
Do we need a pre-scan at all?
Without it, we double the time to a failure result in every case that
should actually fail, as well as in this one corner case that we want to
see succeed, and the question you posed earlier about which error message
to return becomes thornier.
I have absolutely zero concern about whether it takes twice as long to
detect bad input; nobody should be sending bad input if they're concerned
about performance. (The costs of the ensuing transaction abort would
likely dwarf xml_in's runtime in any case.) Besides, with what we're
talking about doing here,
(1) the extra runtime is consumed only in cases that would fail up to now,
so nobody can complain about a performance regression;
(2) doing a pre-scan *would* be a performance regression for cases that
work today; not a large one we hope, but still...
The error message issue is indeed a concern, but I don't see why it's
complicated if you agree that
If the query asked for CONTENT, any error result should be one you could get
when parsing as CONTENT.
That just requires us to save the first error message and be sure to issue
that one not the DOCUMENT one, no? That's what we'd want to do from a
backwards-compatibility standpoint anyhow, since that's the error message
wording you'd get with today's code.
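For illustration, that save-and-reissue flow might be structured like this
minimal sketch, with stub parsers standing in for the real libxml-based
ones (all names here are hypothetical, not the actual xml.c API):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stub parsers standing in for the real libxml-based ones. */
static bool
parse_as_content(const char *input, char *errbuf, size_t errlen)
{
    if (strstr(input, "<!DOCTYPE") != NULL)
    {
        snprintf(errbuf, errlen, "invalid XML content");
        return false;
    }
    return true;
}

static bool
parse_as_document(const char *input, char *errbuf, size_t errlen)
{
    if (strstr(input, "<bad") != NULL)
    {
        snprintf(errbuf, errlen, "invalid XML document");
        return false;
    }
    return true;
}

/*
 * Try CONTENT first; on failure, remember that error and retry as
 * DOCUMENT.  If both fail, report the saved CONTENT error so the
 * message wording matches what today's code would produce.
 */
static bool
parse_content_with_fallback(const char *input, char *report, size_t replen)
{
    char        content_err[128];
    char        doc_err[128];

    if (parse_as_content(input, content_err, sizeof(content_err)))
        return true;
    if (parse_as_document(input, doc_err, sizeof(doc_err)))
        return true;
    snprintf(report, replen, "%s", content_err);
    return false;
}
```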
regards, tom lane
On 03/17/19 15:06, Tom Lane wrote:
The error message issue is indeed a concern, but I don't see why it's
complicated if you agree that
If the query asked for CONTENT, any error result should be one you could get
when parsing as CONTENT.
That just requires us to save the first error message and be sure to issue
that one not the DOCUMENT one, no?
I confess I haven't looked hard yet at how to do that. The way errors come
out of libxml is more involved than "here's a message", so there's a choice
of (a) trying to copy off that struct in a way that's sure to survive
re-executing the parser, and then using the copy, or (b) generating a message
right away from the structured information and saving that. I guess (b) might
not be too bad; (a) might not be too bad either, or it might, and could slide
right back into the kind of libxml-behavior assumptions you're wanting to
avoid.
Meanwhile, here is a patch on the lines I proposed earlier, with a
pre-check. Any performance hit that it could entail (which I'd really
expect to be de minimis, though I haven't benchmarked) ought to be
compensated by the strlen I changed to strnlen in parse_xml_decl (as
there's really no need to run off and count the whole rest of the input
just to know if 1, 2, 3, or 4 bytes are available to decode a UTF-8 char).
... and, yes, I know that could be an independent patch, and then the
performance effect here should be measured from there. But it was near
what I was doing anyway, so I included it here.
Attaching both still-outstanding patches (this one and docfix) so the
CF app doesn't lose one.
Regards,
-Chap
There might be too many different email threads on this with patches,
but in case it went under the radar, xml-content-2006-3.patch appeared
in my previous message on this thread:
/messages/by-id/5C8ECAA4.3090301@anastigmatix.net
It is based on a simple pre-check of the prefix of the input, determining
which form of parse to apply. That may or may not be simpler than parse-
once-save-error-parse-again-report-first-error, but IMV it's a more direct
solution and clearer (the logic is clearly about "how do I determine the way
this input should be parsed?" which is the problem on the table, rather
than "how should I save and regurgitate this libxml error?" which turns the
problem on the table to a different one).
I decided, for a first point of reference, to wear the green eyeshade and
write a pre-check that exactly implements the applicable rules. That gives
a starting point for simplifications that are probably safe.
For example, a bunch of lines at the end have to do with verifying the
content inside of a processing-instruction, after finding where it ends.
We could reasonably decide that, for the purpose of skipping it, knowing
where it ends is enough, as libxml will parse it next and report any errors
anyway.
That would slightly violate my intention of sending input to (the parser
that wasn't asked for) /only/ when it's completely clear (from the prefix
we've seen) that that's where it should go. The relaxed version could do
that in completely-clear cases and cases with an invalid PI ahead of what
looks like a DTD. But you'd pretty much expect both parsers to produce
the same message for a bad PI anyway.
That made me just want to try it now, and--surprise!--the messages from
libxml are not the same. So maybe I would lean to keeping the green-eyeshade
rules in the test, if you can stomach them, but I would understand taking
them out.
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
I decided, for a first point of reference, to wear the green eyeshade and
write a pre-check that exactly implements the applicable rules. That gives
a starting point for simplifications that are probably safe.
For example, a bunch of lines at the end have to do with verifying the
content inside of a processing-instruction, after finding where it ends.
We could reasonably decide that, for the purpose of skipping it, knowing
where it ends is enough, as libxml will parse it next and report any errors
anyway.
Yeah, I did not like that code too much, particularly not all the magic
Unicode-code-point numbers. I removed that, made some other changes to
bring the patch more in line with PG coding style, and pushed it.
That made me just want to try it now, and--surprise!--the messages from
libxml are not the same. So maybe I would lean to keeping the green-eyeshade
rules in the test, if you can stomach them, but I would understand taking
them out.
I doubt anyone will care too much about whether error messages for bad
XML input are exactly like what they were before; and even if someone
does, I doubt that these extra tests would be enough to ensure that
the messages don't change. You're not really validating that the input
is something that libxml would accept, unless its processing of XML PIs
is far stupider than I would expect it to be.
regards, tom lane
On 03/23/19 16:59, Tom Lane wrote:
Unicode-code-point numbers. I removed that, made some other changes to
bring the patch more in line with PG coding style, and pushed it.
Thanks! It looks good. I'm content with the extra PI checking being gone.
The magic Unicode-code-point numbers come straight from the XML standard;
I couldn't make that stuff up. :)
You're not really validating that the input
is something that libxml would accept, unless its processing of XML PIs
is far stupider than I would expect it to be.
Out of curiosity, what further processing would you expect libxml to do?
XML parsers are supposed to be transparent PI-preservers, except in the
rare case of seeing a PI that actually means something to the embedding
application, which isn't going to be the case for a database simply
implementing an XML data type.
The standard literally requires that the target must be a NAME, and
can't match [Xx][Mm][Ll], and if there's whitespace and anything after
that, there can't be an embedded ?> ... and that's it.
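In ASCII-only terms, the target check might be sketched like this (an
illustration only; real XML Names also admit many non-ASCII characters,
which this version ignores):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>            /* strncasecmp */

/*
 * Sketch (ASCII-only) of the PI target rules: the target must be a
 * Name and must not match [Xx][Mm][Ll].  The body after the target is
 * otherwise unrestricted up to the closing "?>".
 */
static bool
pi_target_ok(const char *target, size_t len)
{
    size_t      i;

    if (len == 0)
        return false;
    if (!isalpha((unsigned char) target[0]) &&
        target[0] != '_' && target[0] != ':')
        return false;
    for (i = 1; i < len; i++)
    {
        unsigned char c = (unsigned char) target[i];

        if (!isalnum(c) && c != '_' && c != ':' && c != '.' && c != '-')
            return false;
    }
    if (len == 3 && strncasecmp(target, "xml", 3) == 0)
        return false;
    return true;
}
```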
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
On 03/23/19 16:59, Tom Lane wrote:
You're not really validating that the input
is something that libxml would accept, unless its processing of XML PIs
is far stupider than I would expect it to be.
Out of curiosity, what further processing would you expect libxml to do?
Hm, I'd have thought it'd try to parse the arguments to some extent,
but maybe not. Does everybody reimplement attribute parsing for
themselves when using PIs?
regards, tom lane