Regression with large XML data input

Started by Michael Paquier9 months ago36 messageshackers

michael@paquier.xyz

9 months ago

Hi all,
(Adding in CC Tom and Eric, as committer and author.)

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

The problem is introduced by the change from
xmlParseBalancedChunkMemory() to xmlNewNode() +
xmlParseInNodeContext() in xml_parse(), to avoid an issue in
xmlParseBalancedChunkMemory() in the range of libxml2 2.13.0-2.13.2
for a bug that has already been fixed upstream, where we use a
temporary root node for the case where parse_as_document is false.

If the input XML data is large enough, one gets a failure at the top
of the latest branches, and it worked properly before. Here is a
short test case (courtesy of a colleague, case that I've modified
slightly):
CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
BEGIN
BEGIN
INSERT INTO xmldata (id, message) VALUES
( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) );
RAISE NOTICE 'insert 40MB successful';
EXCEPTION WHEN OTHERS THEN RAISE NOTICE 'Error insert 40MB: %', SQLERRM;
END;
END $$;

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue. A quick POC is
attached. It fails one case in check-world with SERIALIZE because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(), still the regression is gone, and I am
wondering if there is not a better solution to be able to dodge the
original problem and still accept this case. One good thing is that
xmlParseBalancedChunkMemory() is able to return a list of nodes, that
we need for this code path of xml_parse(). So perhaps one solution
would be the addition of a code path with
xmlParseBalancedChunkMemory() depending on the options we want to
process, keeping the code path with the fake "content-root" for the
XML SERIALIZE case.

The patch in question has been applied first to 6082b3d5d3d1 on HEAD
impacting v18~, then it has been backpatched down to all stable
branches, like f68d6aabb7e2, introducing the regression in all the
stable branches since the minor releases done in August 2024, as of:
12.20, 13.16, 14.13, 15.8, 16.4.

Thoughts or comments?
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Michael Paquier (#1)

Re: Regression with large XML data input

Michael Paquier <michael@paquier.xyz> writes:

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

Bleah.

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue.

Yeah, just reverting these commits might be an acceptable answer,
since the main point was to work around a bleeding-edge bug:

* Early 2.13.x releases of libxml2 contain a bug that causes
xmlParseBalancedChunkMemory to return the wrong status value in some
cases. This breaks our regression tests. While that bug is now fixed
upstream and will probably never be seen in any production-oriented
distro, it is currently a problem on some more-bleeding-edge-friendly
platforms.

Presumably that problem is now gone, a year later. The other point
about

* xmlParseBalancedChunkMemory is considered to depend on libxml2's
semi-deprecated SAX1 APIs, and will go away when and if they do.

is still hypothetical I think. But we might want to keep this bit:

While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
since that code path is not going to use it.

regards, tom lane

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Tom Lane (#2)

Re: Regression with large XML data input

On Wed, Jul 23, 2025 at 11:28:38PM -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue.

Yeah, just reverting these commits might be an acceptable answer,
since the main point was to work around a bleeding-edge bug:

Still it is not possible to do exactly that on all the branches
because of the business with XMLSERIALIZE that requires some options
for xmlParseInNodeContext(), is it?

* Early 2.13.x releases of libxml2 contain a bug that causes
xmlParseBalancedChunkMemory to return the wrong status value in some
cases. This breaks our regression tests. While that bug is now fixed
upstream and will probably never be seen in any production-oriented
distro, it is currently a problem on some more-bleeding-edge-friendly
platforms.

Presumably that problem is now gone, a year later. The other point
about

I would probably agree that it does not seem worth caring for this
range in the early 2.13 series. I didn't mention it upthread but all
my tests were with debian GID's libxml2 which seems to be a 2.12.7
flavor with some 2.9.14 pieces, based on what apt is telling me. I
did not test with a different version from upstream, but I'm pretty
sure that's the same story.

* xmlParseBalancedChunkMemory is considered to depend on libxml2's
semi-deprecated SAX1 APIs, and will go away when and if they do.

is still hypothetical I think. But we might want to keep this bit:

Worth mentioning upstream 4f329dc52490, I guess, added to the 2.14
branch:
parser: Implement xmlCtxtParseContent

This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.

With all our stable branches, I am not sure if this should be
considered, but that seems worth keeping in mind.

While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
since that code path is not going to use it.

Are you planning to look at that for the next minor release? It would
take me a couple of hours to dig into all that, and I am sure that I
am going to need your stamp or Erik's to avoid doing a stupid thing.
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Michael Paquier (#3)

Re: Regression with large XML data input

Michael Paquier <michael@paquier.xyz> writes:

Are you planning to look at that for the next minor release? It would
take me a couple of hours to dig into all that, and I am sure that I
am going to need your stamp or Erik's to avoid doing a stupid thing.

It was my commit, so my responsibility, so I'll deal with it.
Planning to wait for Erik's input though.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Tom Lane (#2)

Re: Regression with large XML data input

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

Bleah.

The supplied test case hides important details in the error message.
If you get rid of the exception block so that the error is reported
in full, what you see is

regression=# CREATE TEMP TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
CREATE TABLE
regression=# DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
regression$# BEGIN
regression$# INSERT INTO xmldata (id, message) VALUES
regression$# ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) );
regression$# END $$;
ERROR: invalid XML content
DETAIL: line 1: internal error: Huge input lookup
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
^
CONTEXT: SQL statement "INSERT INTO xmldata (id, message) VALUES
( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) )"
PL/pgSQL function inline_code_block line 3 at SQL statement
regression=#

That is, what we are hitting is libxml2's internal protections
against processing "too large" input. I am not really sure
why the other coding failed to hit this same thing, but I wonder
if we shouldn't leave well enough alone. See commits 2197d0622
and f2743a7d7, where we tried to enable such cases and then
decided it was too risky. I'm afraid now that our prior coding
might have allowed billion-laugh-like cases to be reachable.

regards, tom lane

Erik Wienhold

ewie@ewie.name

9 months ago

In reply to: Tom Lane (#5)

Re: Regression with large XML data input

On 2025-07-24 20:10 +0200, Tom Lane wrote:

The supplied test case hides important details in the error message.
If you get rid of the exception block so that the error is reported
in full, what you see is

regression=# CREATE TEMP TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
CREATE TABLE
regression=# DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
regression$# BEGIN
regression$# INSERT INTO xmldata (id, message) VALUES
regression$# ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) );
regression$# END $$;
ERROR: invalid XML content
DETAIL: line 1: internal error: Huge input lookup
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
^
CONTEXT: SQL statement "INSERT INTO xmldata (id, message) VALUES
( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) )"
PL/pgSQL function inline_code_block line 3 at SQL statement
regression=#

That is, what we are hitting is libxml2's internal protections
against processing "too large" input. I am not really sure
why the other coding failed to hit this same thing, but I wonder
if we shouldn't leave well enough alone. See commits 2197d0622
and f2743a7d7, where we tried to enable such cases and then
decided it was too risky. I'm afraid now that our prior coding
might have allowed billion-laugh-like cases to be reachable.

I was just looking into Michael's fix when I saw your message. The fix
works on libxml2 2.14.5. But on 2.13.8 xmlParseBalancedChunkMemory
returns XML_ERR_RESOURCE_LIMIT and I get this error:

ERROR: invalid XML content
DETAIL: line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Not sure how to set XML_PARSE_HUGE without an xmlParserCtxtPtr at hand,
though. Also haven't looked into why 2.14.5 is not subject to that
resource limit. But as you've already noted, that code was heavily
refactored in 2.13 [1]/messages/by-id/716736.1720376901@sss.pgh.pa.us.

[1]: /messages/by-id/716736.1720376901@sss.pgh.pa.us

--
Erik Wienhold

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Tom Lane (#5)

Re: Regression with large XML data input

Michael Paquier <michael@paquier.xyz> writes:

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

BTW, further testing shows that the same failure occurs at
f68d6aabb7e2^. So AFAICS, the answer as to why the behavior
changed there is that it didn't.

regards, tom lane

Erik Wienhold

ewie@ewie.name

9 months ago

In reply to: Michael Paquier (#1)

Re: Regression with large XML data input

On 2025-07-24 05:12 +0200, Michael Paquier wrote:

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue. A quick POC is
attached. It fails one case in check-world with SERIALIZE because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(), still the regression is gone, and I am
wondering if there is not a better solution to be able to dodge the
original problem and still accept this case.

The whitespace can be preserved by setting xmlKeepBlanksDefault before
parsing. See attached v2. That function is deprecated, though. But
libxml2 uses thread-local globals, so it should be safe. Other than
that, I see no other way to set XML_PARSE_NOBLANKS with
xmlParseBalancedChunkMemory.

[1]: https://gitlab.gnome.org/GNOME/libxml2/-/blob/408bd0e18e6ddba5d18e51d52da0f7b3ca1b4421/parserInternals.c#L2833

--
Erik Wienhold

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Tom Lane (#7)

Re: Regression with large XML data input

I wrote:

BTW, further testing shows that the same failure occurs at
f68d6aabb7e2^. So AFAICS, the answer as to why the behavior
changed there is that it didn't.

Oh, wait ... the plot thickens! The above statement is true
when testing on my Mac with libxml2 2.13.8 from MacPorts.
With either HEAD or f68d6aabb7e2^, I get errors similar to
what Erik just showed:

ERROR: invalid XML content
DETAIL: line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

However, when testing on RHEL8 with libxml2 2.9.7, indeed
I get "Huge input lookup" with our current code but no
failure with f68d6aabb7e2^.

The way I interpret these results is that in older libxml2 versions,
xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
that does exist in newer versions. So even if we were to do some kind
of reversion, it would only prevent the error in libxml2 versions that
lack that check. And in those versions we'd probably be exposing
ourselves to resource-exhaustion problems.

On the whole I'm thinking more and more that we don't want to
touch this. Our recommendation for processing multi-megabyte
chunks of XML should be "don't". Unless we want to find or
write a replacement for libxml2 ... which we have discussed,
but so far nothing's happened.

regards, tom lane

#10

Jim Jones

jim.jones@uni-muenster.de

9 months ago

In reply to: Tom Lane (#9)

Re: Regression with large XML data input

On 24.07.25 21:23, Tom Lane wrote:

Oh, wait ... the plot thickens! The above statement is true
when testing on my Mac with libxml2 2.13.8 from MacPorts.
With either HEAD or f68d6aabb7e2^, I get errors similar to
what Erik just showed:

ERROR: invalid XML content
DETAIL: line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I get the same error with libxml2 2.9.14 on Ubuntu.

However, when testing on RHEL8 with libxml2 2.9.7, indeed
I get "Huge input lookup" with our current code but no
failure with f68d6aabb7e2^.

The way I interpret these results is that in older libxml2 versions,
xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
that does exist in newer versions. So even if we were to do some kind
of reversion, it would only prevent the error in libxml2 versions that
lack that check. And in those versions we'd probably be exposing
ourselves to resource-exhaustion problems.

On the whole I'm thinking more and more that we don't want to
touch this. Our recommendation for processing multi-megabyte
chunks of XML should be "don't". Unless we want to find or
write a replacement for libxml2 ... which we have discussed,
but so far nothing's happened.

I also believe that addressing this limitation may not be worth the
associated risks. Moreover, a 10MB text node is rather large and
probably exceeds the needs of most users.

Best, Jim

#11

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Jim Jones (#10)

Re: Regression with large XML data input

On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:

On 24.07.25 21:23, Tom Lane wrote:

However, when testing on RHEL8 with libxml2 2.9.7, indeed
I get "Huge input lookup" with our current code but no
failure with f68d6aabb7e2^.

The way I interpret these results is that in older libxml2 versions,
xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
that does exist in newer versions. So even if we were to do some kind
of reversion, it would only prevent the error in libxml2 versions that
lack that check. And in those versions we'd probably be exposing
ourselves to resource-exhaustion problems.

Linux distributions may not seem very eager to add this check, though?
The top of debian GID uses a version of libxml2 where the difference
shows up, so it means that we have years ahead with the old code.

If it were discussing things from the perspective where this new code
was added after a major version bump of Postgres, I would not argue
much about that: breakages happen every year and users adapt their
applications to it. Here, however, we are talking about a change in a
stable branch, across a minor version, which should be a bit more
flawless from a user perspective? I may be influenced by the point of
seeing a customer impacted by that, of course, there is no denying
that. The point is that this enforces one behavior that's part of
2.13 and onwards. Versions of PG before f68d6aabb7e2 were still OK
with that and the new code of Postgres closes the door completely.
Even if the behavior that Postgres had when linking with libxml2 2.12
or older was kind of "accidendal" because
xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
it was there, and users relied on that.

One possibility that I could see here for stable branches would be to
make the code a bit smarter depending on LIBXML_VERSION, where we
could keep the new code for 2.13 onwards, but keep a compatible
behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().

On the whole I'm thinking more and more that we don't want to
touch this. Our recommendation for processing multi-megabyte
chunks of XML should be "don't". Unless we want to find or
write a replacement for libxml2 ... which we have discussed,
but so far nothing's happened.

I also believe that addressing this limitation may not be worth the
associated risks. Moreover, a 10MB text node is rather large and
probably exceeds the needs of most users.

Yeah, still some people use it, so while I am OK to accept this as a
conclusion at the end and send back to this thread that workarounds
are required in applications to split the inputs, that was really
surprising. (Aka from the point of view of the customer whose
application suddenly fails after what should have been a "simple"
minor update.)
--
Michael

#12

Robert Treat

xzilla@users.sourceforge.net

9 months ago

In reply to: Michael Paquier (#11)

Re: Regression with large XML data input

On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:

On 24.07.25 21:23, Tom Lane wrote:

However, when testing on RHEL8 with libxml2 2.9.7, indeed
I get "Huge input lookup" with our current code but no
failure with f68d6aabb7e2^.

The way I interpret these results is that in older libxml2 versions,
xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
that does exist in newer versions. So even if we were to do some kind
of reversion, it would only prevent the error in libxml2 versions that
lack that check. And in those versions we'd probably be exposing
ourselves to resource-exhaustion problems.

Linux distributions may not seem very eager to add this check, though?
The top of debian GID uses a version of libxml2 where the difference
shows up, so it means that we have years ahead with the old code.

If it were discussing things from the perspective where this new code
was added after a major version bump of Postgres, I would not argue
much about that: breakages happen every year and users adapt their
applications to it. Here, however, we are talking about a change in a
stable branch, across a minor version, which should be a bit more
flawless from a user perspective? I may be influenced by the point of
seeing a customer impacted by that, of course, there is no denying
that. The point is that this enforces one behavior that's part of
2.13 and onwards. Versions of PG before f68d6aabb7e2 were still OK
with that and the new code of Postgres closes the door completely.
Even if the behavior that Postgres had when linking with libxml2 2.12
or older was kind of "accidendal" because
xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
it was there, and users relied on that.

One possibility that I could see here for stable branches would be to
make the code a bit smarter depending on LIBXML_VERSION, where we
could keep the new code for 2.13 onwards, but keep a compatible
behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().

While I am pretty sympathetic to the idea that we hang our hats on
"Postgres doesn't break things in minor version updates", and this
seems to betray that, one scenario where we would break things is if
it were the only reasonable option wrt a bug / security fix, which
this seems potentially close to. If we can come up with a work around
like the above though, it would certainly be nice to give people a
path forward even if it ends up with a breaking major version change.
This at least eliminates the "surprise!" factor.

On the whole I'm thinking more and more that we don't want to
touch this. Our recommendation for processing multi-megabyte
chunks of XML should be "don't". Unless we want to find or
write a replacement for libxml2 ... which we have discussed,
but so far nothing's happened.

I also believe that addressing this limitation may not be worth the
associated risks. Moreover, a 10MB text node is rather large and
probably exceeds the needs of most users.

Yeah, still some people use it, so while I am OK to accept this as a
conclusion at the end and send back to this thread that workarounds
are required in applications to split the inputs, that was really
surprising. (Aka from the point of view of the customer whose
application suddenly fails after what should have been a "simple"
minor update.)

There are a lot of public data sets that provide xml dumps as a
generic format for "non-commercial databases", and those can often be
quite large. I suspect we don't see those use cases a lot because
historically users have been forced to resort to perl/python/etc
scripts to convert the data prior to ingesting. Which is to say, I
think these use cases are more common than we think, and if there were
ever a stable implementation that supported these large use cases,
we'll start to see more of it.

Robert Treat
https://xzilla.net

#13

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Robert Treat (#12)

Re: Regression with large XML data input

Robert Treat <rob@xzilla.net> writes:

On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael@paquier.xyz> wrote:

If it were discussing things from the perspective where this new code
was added after a major version bump of Postgres, I would not argue
much about that: breakages happen every year and users adapt their
applications to it. Here, however, we are talking about a change in a
stable branch, across a minor version, which should be a bit more
flawless from a user perspective?

While I am pretty sympathetic to the idea that we hang our hats on
"Postgres doesn't break things in minor version updates", and this
seems to betray that, one scenario where we would break things is if
it were the only reasonable option wrt a bug / security fix, which
this seems potentially close to.

I'll be the first to say that I'm not too pleased with it either.
However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently.  So we don't have sole control --- or sole responsibility
--- here.

I'd be more excited about trying to avoid the failure if I were not
afraid that "avoid the failure" really means "re-expose a security
hazard". Why should we believe that if libxml2 throws a
resource-limit error (for identical inputs) in one code path and not
another, that's anything but a missed error check in the second path?
(Maybe this is the same thing Robert is saying, not quite sure.)

There are a lot of public data sets that provide xml dumps as a
generic format for "non-commercial databases", and those can often be
quite large. I suspect we don't see those use cases a lot because
historically users have been forced to resort to perl/python/etc
scripts to convert the data prior to ingesting. Which is to say, I
think these use cases are more common than we think, and if there were
ever a stable implementation that supported these large use cases,
we'll start to see more of it.

Yeah, it's a real shame that we don't have more-reliable
infrastructure for XML. I'm not volunteering to fix it though...

regards, tom lane

#14

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Tom Lane (#13)

Re: Regression with large XML data input

On Fri, Jul 25, 2025 at 02:21:26PM -0400, Tom Lane wrote:

I'll be the first to say that I'm not too pleased with it either.
However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently.  So we don't have sole control --- or sole responsibility
--- here.

This sentence is incorrect after I have double-checked the behaviors I
am seeing based on local builds of libxml2 2.9.7 and 2.9.14. For
example with the top of REL_15_STABLE, with and without
72c65d6658d4, I am getting (removing the exception in the example does
not matter if it's a success):
- libxml2 2.9.7 + top of REL_15_STABLE => test failure
- libxml2 2.9.14 + top of REL_15_STABLE => test failure
- libxml2 2.9.7 + top of REL_15_STABLE + revert of 72c65d6658d4
=> test success
- libxml2 2.9.14 + top of REL_15_STABLE + revert of 72c65d6658d4
=> test success

So if one uses a version of libxml2 2.9.X, he/she would be able to see
the large data case work with Postgres at 72c65d6658d4^1, and a
failure with 72c65d6658d4 and onwards. Taking Postgres in isolation
with any version of libxml2 in the 2.9.X series prevents the case to
work. This does not depend on 2.9.X, only on the fact that we link
Postgres to a newer major version of libxml2.

Please note that this is also the behavior I see in a Debian GID
environment and I guess any existing Debian release: we rely on
libxml2 2.9.X, so a minor upgrade of Postgres is the factor able to
trigger the behavior change. It seems to me that there's an argument
for compatibility with the 2.9.X series, which still seems quite
present in the wild, and that we could decide one solution over the
other in xml_parse() based on LIBXML_VERSION. What I am seeing is
that at fixed major version of libxml2, then Postgres holds the
responsibility here.
--
Michael

#15

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Michael Paquier (#14)

Re: Regression with large XML data input

Michael Paquier <michael@paquier.xyz> writes:

On Fri, Jul 25, 2025 at 02:21:26PM -0400, Tom Lane wrote:

However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently.  So we don't have sole control --- or sole responsibility
--- here.

This sentence is incorrect after I have double-checked the behaviors I
am seeing based on local builds of libxml2 2.9.7 and 2.9.14.

Hmm, okay, I misread Jim's results then. But there still remains
the big question: what reason is there to believe that it's safe
to return to the old behavior? If newer libxml2 versions report
XML_ERR_RESOURCE_LIMIT on the same input, doesn't it seem likely
that there's a live hazard in the old code?

regards, tom lane

#16

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Tom Lane (#15)

Re: Regression with large XML data input

On Sun, Jul 27, 2025 at 10:16:47PM -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

This sentence is incorrect after I have double-checked the behaviors I
am seeing based on local builds of libxml2 2.9.7 and 2.9.14.

Hmm, okay, I misread Jim's results then. But there still remains
the big question: what reason is there to believe that it's safe
to return to the old behavior? If newer libxml2 versions report
XML_ERR_RESOURCE_LIMIT on the same input, doesn't it seem likely
that there's a live hazard in the old code?

Yes, I guess that it's not entirely safe to use libxml2 2.9.X if we
rely on the old code, but it also sucks for customers that relied on
the old bevahior to not have a way out in stable branches, and there
are still plenty of environments that rely on 2.9.X.

That's the reason why I am suggesting an hybrid approach on stable
branches where the regression shows up, where it would be possible to
store large XML data values for *some* cases, keeping some
compatibility:
- if building Postgres with 2.13.X, use the new code, as on HEAD,
forcing the restrictions.
- if building Postgres with 2.12.X or older, then use the old code, as
in what Postgres did before f68d6aabb.

If one wants to enforce a stricter behavior, then it is enough to
upgrade the version of libxml2 on a stable branch of PG. If one
wants to preserve compatibility, then keep the older version of
libxml2 but be aware that, while things are compatible in terms of
input handling, libxml2 was unsafe and it was so for years. Hence,
based on the number of environments still using 2.9.X, that seems
worth the hack on stable branches.

I am not going to argue against the commits that have reached
REL_18_STABLE to add compatibility for libxml2 2.13.X, we can leave
this stuff as-is and enforce stronger restrictions across all versions
of libxml2, letting users deal with application changes across a major
version change of PG. So, while it is not perfect and I'm aware of
that my argument is not perfect, it would at least give packagers and
users the option to use the previous compatibility layer if they want,
leaving the stable branches of PG somewhat intact. What I think is
bad for users is the fact that Postgres closes entirely this window,
on a stable branch, even if this was accidental based on the previous
coding style. I understand that from the point of view of a
maintainer this is rather bad, but from the customer point of view the
current situation is also bad to deal with in the scope of a minor
upgrade, because applications suddenly break.
--
Michael

#17

David G. Johnston

david.g.johnston@gmail.com

9 months ago

In reply to: Michael Paquier (#16)

Re: Regression with large XML data input

On Sunday, July 27, 2025, Michael Paquier <michael@paquier.xyz> wrote:

I am not going to argue against the commits that have reached
REL_18_STABLE to add compatibility for libxml2 2.13.X, we can leave
this stuff as-is and enforce stronger restrictions across all versions
of libxml2, letting users deal with application changes across a major
version change of PG. So, while it is not perfect and I'm aware of
that my argument is not perfect, it would at least give packagers and
users the option to use the previous compatibility layer if they want,
leaving the stable branches of PG somewhat intact. What I think is
bad for users is the fact that Postgres closes entirely this window,
on a stable branch, even if this was accidental based on the previous
coding style. I understand that from the point of view of a
maintainer this is rather bad, but from the customer point of view the
current situation is also bad to deal with in the scope of a minor
upgrade, because applications suddenly break.

Is breaking with tradition and implementing the solution that locks down
the system but also gives DBAs the ability the make an informed runtime
decision to bypass said restriction on the table?

David J.

#18

Jim Jones

jim.jones@uni-muenster.de

9 months ago

In reply to: Michael Paquier (#16)

Re: Regression with large XML data input

On 28.07.25 04:47, Michael Paquier wrote:

I understand that from the point of view of a
maintainer this is rather bad, but from the customer point of view the
current situation is also bad to deal with in the scope of a minor
upgrade, because applications suddenly break.

I totally get it --- from the user’s perspective, it’s hard to see this
as a bugfix.

I was wondering whether using XML_PARSE_HUGE in xml_parse's options
could help address this, for example:

options = XML_PARSE_NOENT | XML_PARSE_DTDATTR | XML_PARSE_HUGE
| (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);

According to libxml2's parserInternals.h:

/**
* Maximum size allowed for a single text node when building a tree.
* This is not a limitation of the parser but a safety boundary feature,
* use XML_PARSE_HUGE option to override it.
* Introduced in 2.9.0
*/
#define XML_MAX_TEXT_LENGTH 10000000

/**
* Maximum size allowed when XML_PARSE_HUGE is set.
*/
#define XML_MAX_HUGE_LENGTH 1000000000

The XML_MAX_TEXT_LENGTH limit is what we're hitting now, but
XML_MAX_HUGE_LENGTH is extremely generous. Here's a quick PoC using
XML_PARSE_HUGE:

psql (19devel)
Type "help" for help.

postgres=# CREATE TABLE xmldata (message xml);
CREATE TABLE
postgres=# DO $$
DECLARE huge_size text := repeat('X', 1000000000);
BEGIN
INSERT INTO xmldata (message) VALUES
((('<foo><bar>' || huge_size ||'</bar></foo>')::xml));
END $$;
DO
postgres=# SELECT pg_size_pretty(length(message::text)::bigint) FROM
xmldata;
pg_size_pretty
----------------
954 MB
(1 row)

While XML_MAX_HUGE_LENGTH prevents unlimited memory usage, it still
opens the door to potential resource exhaustion. I couldn't find a way
to dynamically adjust this limit in libxml2.

One idea would be to guard XML_PARSE_HUGE behind a GUC --- say,
xml_enable_huge_parsing. That would at least allow controlled
environments to opt in. But of course, that wouldn't help current releases.

Best regards, Jim

#19

Erik Wienhold

ewie@ewie.name

9 months ago

In reply to: Jim Jones (#18)

Re: Regression with large XML data input

On 2025-07-28 09:45 +0200, Jim Jones wrote:

On 28.07.25 04:47, Michael Paquier wrote:

I understand that from the point of view of a maintainer this is
rather bad, but from the customer point of view the current
situation is also bad to deal with in the scope of a minor upgrade,
because applications suddenly break.

I totally get it --- from the user’s perspective, it’s hard to see
this as a bugfix.

I was wondering whether using XML_PARSE_HUGE in xml_parse's options
could help address this, for example:

options = XML_PARSE_NOENT | XML_PARSE_DTDATTR | XML_PARSE_HUGE
| (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);

This also came to my mind, but it was already tried and reverted soon
after for security reasons. [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f2743a7d70e7b2891277632121bb51e739743a47

One idea would be to guard XML_PARSE_HUGE behind a GUC --- say,
xml_enable_huge_parsing. That would at least allow controlled
environments to opt in. But of course, that wouldn't help current
releases.

+1 for new major releases. But normal users must not be allowed to
enable that GUC. So probably context PGC_SU_BACKEND.

I'm leaning towards Michael's proposal of adding a libxml2 version check
in the stable branches before REL_18_STABLE and parsing the content with
xmlParseBalancedChunkMemory on versions up to 2.12.x.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f2743a7d70e7b2891277632121bb51e739743a47

--
Erik Wienhold

#20

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Erik Wienhold (#19)

Re: Regression with large XML data input

Erik Wienhold <ewie@ewie.name> writes:

I'm leaning towards Michael's proposal of adding a libxml2 version check
in the stable branches before REL_18_STABLE and parsing the content with
xmlParseBalancedChunkMemory on versions up to 2.12.x.

I spent some time investigating this. It appears that even when using
our old code with xmlParseBalancedChunkMemory (I tested against our
commit 6082b3d5d^), libxml2 versions 2.12.x and 2.13.x will throw
a resource-exhaustion error; but earlier and later releases do not.
Drilling down with "git bisect", the first libxml2 commit that throws
an error is

commit 834b8123efc4a4c369671cad2f1b0eb744ae67e9
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date: Tue Aug 8 15:21:28 2023 +0200

parser: Stream data when reading from memory

Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.

and then the one that restores it to not throwing an error is

commit 7148b778209aaaf684c156c5e2e40d6e477f13f7
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date: Sun Jul 7 16:11:08 2024 +0200

parser: Optimize memory buffer I/O

Reenable zero-copy IO for zero-terminated static memory buffers.

Don't stream zero-terminated dynamic memory buffers on top of creating
a copy.

So apparently the "resource exhaustion" complaint is not about
anything deeper than not wanting to make a copy of a many-megabyte
input string, and the fact that it appeared and disappeared is an
artifact of libxml2 implementation choices that we have no control
over (and, IMO, no responsibility for working around).

Based on this, I'm okay with reverting to using
xmlParseBalancedChunkMemory: while that won't help users of 2.12.x or
2.13.x, it will help users of earlier and later libxml2. But I think
we should just do that across the board, not insert libxml2 version
tests nor do it differently in different PG versions.

I've not looked at the details of the proposed patches, but will
do so now that the direction to go in is apparent.

regards, tom lane

#21

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Tom Lane (#20)

#22

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Tom Lane (#21)