Trying out libarchive for reading user-generated WAL tarballs

Started by Thomas Munro, about 10 hours ago (1 message, hackers)
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

Here's an experimental patch that gives you optional extra tar (and
potentially zip etc) support if compiled --with-libarchive, but only
for pg_waldump where we expect to meet user-generated archives. The
recent band-aid applied to pg_waldump/t/001_basic.pl becomes:

+# If we don't have libarchive, then we tell tar to stick to ustar format that
+# astreamer_tar.c can decode.  Otherwise we should be able to accept anything
+# that any current tar produces.
+@tar_p_flags = tar_portability_options($tar)
+  if !check_pg_config("#define USE_LIBARCHIVE");

I was compelled to try this to avoid being sucked into the rabbithole
of hacking on tar code, after pg_waldump broke my computer[1]. It
doesn't seem to make much sense to try to speedrun everything that
happened to archiving since 1988 when you're a database project. I
was encouraged by Robert's prediction[2] that we'd probably want to do
precisely this as soon as we started accepting user-generated
archives. I postdict the same!

libarchive is really easy to work with, widely used and seems well put
together. The only thing I was a bit sad about was the lack of an
async-friendly API that would let us push a raw byte stream into it.
So I tried modelling it as a "source only" astreamer that you pump by
calling astreamer_pull() when you want more content to be delivered to
the next streamer.
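
To make the shape of that idea concrete, here is a rough, self-contained
sketch of a "source only" stage that the caller pumps. None of these names
come from the patch or from astreamer.h: demo_source, demo_sink and
demo_pull() are invented stand-ins, and the in-memory buffer merely stands in
for whatever the archive library would hand back on each call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical downstream consumer: it only counts what it is given. */
typedef struct demo_sink
{
	size_t		bytes_seen;
	int			chunks_seen;
} demo_sink;

void
demo_sink_content(demo_sink *sink, const char *data, size_t len)
{
	(void) data;				/* a real sink would decode this */
	sink->bytes_seen += len;
	sink->chunks_seen++;
}

/* Hypothetical source: it owns its input and is pumped by the caller. */
typedef struct demo_source
{
	const char *data;			/* stand-in for the archive contents */
	size_t		len;
	size_t		pos;
	size_t		chunk_size;		/* how much to deliver per pull */
	demo_sink  *next;			/* the next stage in the chain */
} demo_source;

/*
 * Deliver at most one chunk to the next stage.  Returns 1 if content was
 * delivered and 0 at end of input.
 */
int
demo_pull(demo_source *src)
{
	size_t		n;

	assert(src->pos <= src->len);
	n = src->len - src->pos;
	if (n == 0)
		return 0;
	if (n > src->chunk_size)
		n = src->chunk_size;
	demo_sink_content(src->next, src->data + src->pos, n);
	src->pos += n;
	return 1;
}
```

Nothing is pushed into the source; the caller decides when to call
demo_pull() again, and stopping early just means not calling it.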

I don't immediately see why that'd be a problem, but I may lack
imagination. It's still incremental, can still stop earlier, and we
don't do any multiplexing or AIO in this or any other uses of
astreamers. It does mean that pg_waldump's read_archive_file() has to
treat this astreamer slightly differently though, which is annoying.
Perhaps that could be fixed if astreamer_file.c provided
"astreamer_file_reader" with the same semantics, so that it could
unconditionally call astreamer_pull(privateInfo->archive_streamer),
instead of doing the read, push-into-stream itself? Just a thought.
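
A minimal sketch of what that could look like, assuming a reader that wraps a
stdio FILE and exposes the same pull entry point as any other source. All of
the sketch_* names and the struct layout are hypothetical and are not taken
from astreamer_file.c or from the patch; the point is only that the caller's
loop no longer cares which kind of source sits behind it.

```c
#include <assert.h>
#include <stdio.h>

typedef struct sketch_streamer sketch_streamer;

/* Receives content; returns nonzero to ask the pump to stop early. */
typedef int (*sketch_content_fn) (void *arg, const char *data, size_t len);

struct sketch_streamer
{
	/* Returns 1 if content was delivered, 0 at end of input. */
	int			(*pull) (sketch_streamer *self);
	FILE	   *fp;				/* used by the file reader only */
	sketch_content_fn deliver;	/* stand-in for the next streamer */
	void	   *deliver_arg;
	int			stop_requested;
};

/* Read one buffer from the file and hand it to the next stage. */
int
sketch_file_reader_pull(sketch_streamer *self)
{
	char		buf[8192];
	size_t		n = fread(buf, 1, sizeof(buf), self->fp);

	/* A real reader would use ferror() to tell EOF from a read error. */
	if (n == 0)
		return 0;
	if (self->deliver(self->deliver_arg, buf, n))
		self->stop_requested = 1;
	return 1;
}

void
sketch_file_reader_init(sketch_streamer *s, FILE *fp,
						sketch_content_fn deliver, void *arg)
{
	assert(fp != NULL);
	s->pull = sketch_file_reader_pull;
	s->fp = fp;
	s->deliver = deliver;
	s->deliver_arg = arg;
	s->stop_requested = 0;
}

/* Example consumer: adds up the bytes it sees and never asks to stop. */
int
sketch_count_bytes(void *arg, const char *data, size_t len)
{
	(void) data;
	*(size_t *) arg += len;
	return 0;
}

/* The caller's loop is the same whichever source is plugged in. */
size_t
sketch_pump(sketch_streamer *s)
{
	size_t		pulls = 0;

	while (!s->stop_requested && s->pull(s))
		pulls++;
	return pulls;
}
```

With something along these lines, an archive-library-backed source would only
need to supply a different pull function, and the special case in the caller
would go away.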

[1]: /messages/by-id/CA+hUKGL2dppjO4o28ZY7n_LTWviKLAi-7KZ=tx5w2HGevCEYPA@mail.gmail.com
[2]: /messages/by-id/CA+TgmoYg0C4ZkuSD=mag+wbq=0GGiBm+-k1zM7LHJTDpioLYuw@mail.gmail.com

Attachments:

0001-libarchive-Add-configure-and-meson-options.patch (text/x-patch, +179 -1)
0002-libarchive-Provide-astreamer_libarchive.c.patch (text/x-patch, +278 -1)
0003-fixup-Use-more-efficient-zero-copy-API.patch (text/x-patch, +53 -11)
0004-pg_waldump-Use-astreamer_libarchive.c.patch (text/x-patch, +44 -4)