astreamer fixes

Started by Andrew Dunstan8 days ago6 messageshackers
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

After the recent rash of fixes to the astreamer code, I thought it might
be a good idea to take a closer look for more issues. The attached
proposes six fixes. Full disclosure: I found two of these (those in
astreamer_tar_parser_free() and astreamer_extractor_content() ), and
claude found the rest. I believe all those it found are indeed things
that should be fixed (and backpatched). The wrong data pointer issue is
one I suspect it would have been quite hard to find.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v1-0001-Fix-multiple-bugs-in-astreamer-pipeline-code.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-multiple-bugs-in-astreamer-pipeline-code.patchDownload+18-5
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: astreamer fixes

Andrew Dunstan <andrew@dunslane.net> writes:

After the recent rash of fixes to the astreamer code, I thought it might
be a good idea to take a closer look for more issues. The attached
proposes six fixes. Full disclosure: I found two of these (those in
astreamer_tar_parser_free() and astreamer_extractor_content() ), and
claude found the rest. I believe all those it found are indeed things
that should be fixed (and backpatched). The wrong data pointer issue is
one I suspect it would have been quite hard to find.

Bleah ...

Your changes in astreamer_file.c and astreamer_tar.c are clearly
correct fixes. I fear that astreamer_gzip.c is still several bricks
shy of a load though: looks like Claude noticed some issues all right,
but its fixes are wrong/incomplete. Reading the documentation in
/usr/include/zlib.h, I notice:

1. I do not think it's possible to get Z_STREAM_END from inflate()
in astreamer_gzip_decompressor_content, because we don't tell it
that we've reached end-of-stream. So the proposed changes there
are incorrect. We should indeed switch to whitelisting not
blacklisting result codes, but not expect Z_STREAM_END.

2. What Claude noticed I think, but failed to correct accurately, is
that astreamer_gzip_decompressor_finalize needs to invoke inflate()
with Z_FINISH, and pump it until it returns Z_STREAM_END. The
code as it stands is probably failing to produce the last few bytes
of decompressed output in many cases. We've not noticed because that
just results in truncating the undefined post-tar-trailer junk.

3. It sure looks to me like astreamer_lz4_decompressor_finalize and
astreamer_zstd_decompressor_finalize have related bugs. There's no
provision in them for flushing any buffered data out of those
libraries, either.

Maybe we should instrument astreamer_tar_parser_finalize to report
exactly how much trailer data it got, in order to check whether this
apparent bug is real. If the result is different between an
uncompressed tar file and the same file compressed, then it's
broken.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: astreamer fixes

I wrote:

1. I do not think it's possible to get Z_STREAM_END from inflate()
in astreamer_gzip_decompressor_content, because we don't tell it
that we've reached end-of-stream.

Nah, I'm mistaken about that. I forgot that a zlib-compressed stream
has its own internal end-of-stream marker, and that's what triggers
the Z_STREAM_END report.

2. What Claude noticed I think, but failed to correct accurately, is
that astreamer_gzip_decompressor_finalize needs to invoke inflate()
with Z_FINISH, and pump it until it returns Z_STREAM_END.

I spent some time testing this and could not devise a scenario where
there is leftover data still to be decompressed once
astreamer_gzip_decompressor_content finishes. Closer reading of
zlib.h says that inflate() won't report all the input data as having
been consumed if it runs out of output buffer space, so in practice
the logic in astreamer_gzip_decompressor_content is fine, except that
we're ignoring some error report codes that we shouldn't. So I think
we should take the proposed patch hunk

-		if (res == Z_STREAM_ERROR)
-			pg_fatal("could not decompress data: %s", zs->msg);
+		if (res != Z_OK && res != Z_STREAM_END && res != Z_BUF_ERROR)
+			pg_fatal("could not decompress data: %s",
+					 zs->msg ? zs->msg : "unknown error");

but I don't like the other bit

+		/* If we've hit the end of the compressed stream, stop. */
+		if (res == Z_STREAM_END)
+			break;

Paying attention to the amount of data consumed seems just as
good and takes less code.

3. It sure looks to me like astreamer_lz4_decompressor_finalize and
astreamer_zstd_decompressor_finalize have related bugs. There's no
provision in them for flushing any buffered data out of those
libraries, either.

I'm not quite convinced either way about those two. Perhaps they
are just as aggressive as zlib about dumping decompressed data,
but they might not be. I did try instrumenting tar_parser_finalize
and failed to observe any misbehavior in the regression tests, but
that's not a very stressful test.

regards, tom lane

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: astreamer fixes

On 2026-03-28 Sa 1:48 PM, Tom Lane wrote:

I wrote:

1. I do not think it's possible to get Z_STREAM_END from inflate()
in astreamer_gzip_decompressor_content, because we don't tell it
that we've reached end-of-stream.

Nah, I'm mistaken about that. I forgot that a zlib-compressed stream
has its own internal end-of-stream marker, and that's what triggers
the Z_STREAM_END report.

2. What Claude noticed I think, but failed to correct accurately, is
that astreamer_gzip_decompressor_finalize needs to invoke inflate()
with Z_FINISH, and pump it until it returns Z_STREAM_END.

I spent some time testing this and could not devise a scenario where
there is leftover data still to be decompressed once
astreamer_gzip_decompressor_content finishes. Closer reading of
zlib.h says that inflate() won't report all the input data as having
been consumed if it runs out of output buffer space, so in practice
the logic in astreamer_gzip_decompressor_content is fine, except that
we're ignoring some error report codes that we shouldn't. So I think
we should take the proposed patch hunk

-		if (res == Z_STREAM_ERROR)
-			pg_fatal("could not decompress data: %s", zs->msg);
+		if (res != Z_OK && res != Z_STREAM_END && res != Z_BUF_ERROR)
+			pg_fatal("could not decompress data: %s",
+					 zs->msg ? zs->msg : "unknown error");

but I don't like the other bit

+		/* If we've hit the end of the compressed stream, stop. */
+		if (res == Z_STREAM_END)
+			break;

Paying attention to the amount of data consumed seems just as
good and takes less code.

So IIUC, you agree with the patch except for this last piece, on which I
agree with your reasoning.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#4)
Re: astreamer fixes

Andrew Dunstan <andrew@dunslane.net> writes:

On 2026-03-28 Sa 1:48 PM, Tom Lane wrote:

but I don't like the other bit

+		/* If we've hit the end of the compressed stream, stop. */
+		if (res == Z_STREAM_END)
+			break;

Paying attention to the amount of data consumed seems just as
good and takes less code.

So IIUC, you agree with the patch except for this last piece, on which I
agree with your reasoning.

Right, I would leave out the above-quoted hunk but the rest of it
looks good.

regards, tom lane

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: astreamer fixes

On 2026-03-28 Sa 3:46 PM, Tom Lane wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

On 2026-03-28 Sa 1:48 PM, Tom Lane wrote:

but I don't like the other bit

+		/* If we've hit the end of the compressed stream, stop. */
+		if (res == Z_STREAM_END)
+			break;

Paying attention to the amount of data consumed seems just as
good and takes less code.

So IIUC, you agree with the patch except for this last piece, on which I
agree with your reasoning.

Right, I would leave out the above-quoted hunk but the rest of it
looks good.

Thanks. pushed.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com