make dist using git archive
One of the goals is to make the creation of the distribution tarball
more directly traceable to the git repository. That is why we removed
the "make distprep" step.
Here I want to take another step in that direction, by changing "make
dist" to directly use "git archive", rather than the custom shell script
it currently runs.
The simple summary is that it would run
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-17devel.tar.gz
(with appropriate version numbers, of course), and that's the tarball we
would ship.
There are analogous commands for other compression formats.
The actual command gets subtly more complicated if you need to run this
in a separate build directory. In my attached patch, the make version
doesn't support vpath at the moment, just so that it's easier to
understand for now. The meson version is a bit hairy.
I have studied and tested this quite a bit, and I have found that the
archives produced this way are deterministic and reproducible, meaning
for a given commit the result file should always be bit-for-bit identical.
The exception is that if you use a git version older than 2.38.0, gzip
records the platform in the archive, so you'd get a different output on
Windows vs. macOS vs. "UNIX" (everything else). In git 2.38.0, this was
changed so that everything is recorded as "UNIX" now. This is just
something to keep in mind. This issue is specific to the gzip format,
it does not affect other compression formats.
Meson has its own distribution building command (meson dist), but opted
against using that. The main problem is that the way they have
implemented it, it is not deterministic in the above sense. (Another
point is of course that we probably want a "make" version for the time
being.)
But the target name "dist" in meson is reserved for that reason, so I
needed to call the custom target "pgdist".
I did take one idea from meson: It runs a check before git archive that
the checkout is clean. That way, you avoid mistakes because of
uncommitted changes. This works well in my "make" implementation. In
the meson implementation, I had to find a workaround, because a
custom_target cannot have a dependency on a run_target. As also
mentioned above, the whole meson implementation is a bit ugly.
Anyway, with the attached patch you can do
make dist
or
meson compile -C build pgdist
and it produces the same set of tarballs as before, except it's done
differently.
The actual build scripts need some fine-tuning, but the general idea is
correct, I think.
Attachments:
v0-0001-make-dist-uses-git-archive.patchtext/plain; charset=UTF-8; name=v0-0001-make-dist-uses-git-archive.patchDownload+50-23
Hi,
On Mon, Jan 22, 2024 at 3:32 PM Peter Eisentraut <peter@eisentraut.org> wrote:
One of the goals is to make the creation of the distribution tarball
more directly traceable to the git repository. That is why we removed
the "make distprep" step.Here I want to take another step in that direction, by changing "make
dist" to directly use "git archive", rather than the custom shell script
it currently runs.The simple summary is that it would run
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-17devel.tar.gz(with appropriate version numbers, of course), and that's the tarball we
would ship.There are analogous commands for other compression formats.
The actual command gets subtly more complicated if you need to run this
in a separate build directory. In my attached patch, the make version
doesn't support vpath at the moment, just so that it's easier to
understand for now. The meson version is a bit hairy.I have studied and tested this quite a bit, and I have found that the
archives produced this way are deterministic and reproducible, meaning
for a given commit the result file should always be bit-for-bit identical.The exception is that if you use a git version older than 2.38.0, gzip
records the platform in the archive, so you'd get a different output on
Windows vs. macOS vs. "UNIX" (everything else). In git 2.38.0, this was
changed so that everything is recorded as "UNIX" now. This is just
something to keep in mind. This issue is specific to the gzip format,
it does not affect other compression formats.Meson has its own distribution building command (meson dist), but opted
against using that. The main problem is that the way they have
implemented it, it is not deterministic in the above sense. (Another
point is of course that we probably want a "make" version for the time
being.)But the target name "dist" in meson is reserved for that reason, so I
needed to call the custom target "pgdist".I did take one idea from meson: It runs a check before git archive that
the checkout is clean. That way, you avoid mistakes because of
uncommitted changes. This works well in my "make" implementation. In
the meson implementation, I had to find a workaround, because a
custom_target cannot have a dependency on a run_target. As also
mentioned above, the whole meson implementation is a bit ugly.Anyway, with the attached patch you can do
make dist
or
meson compile -C build pgdist
I played this with meson build on macOS, the packages are generated
in source root but not build root, I'm sure if this is by design but I think
polluting *working directory* is not good.
Another thing I'd like to point out is, should we also introduce *git commit*
or maybe *git tag* to package name, something like:
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-17devel-`git rev-parse --short HEAD`.tar.gz
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-`git describe --tags`.tar.gz
and it produces the same set of tarballs as before, except it's done
differently.The actual build scripts need some fine-tuning, but the general idea is
correct, I think.
I think this is a good idea, thanks for working on this.
--
Regards
Junwang Zhao
On 22.01.24 13:10, Junwang Zhao wrote:
I played this with meson build on macOS, the packages are generated
in source root but not build root, I'm sure if this is by design but I think
polluting *working directory* is not good.
Yes, it's not good, but I couldn't find a way to make it work.
This is part of the complications with meson I referred to. The
@BUILD_ROOT@ placeholder in custom_target() is apparently always a
relative path, but it doesn't know that git -C changes the current
directory.
Another thing I'd like to point out is, should we also introduce *git commit*
or maybe *git tag* to package name, something like:git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-17devel-`git rev-parse --short HEAD`.tar.gz
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-`git describe --tags`.tar.gz
I'm not sure why we would need it built-in. It can be done by hand, of
course.
On Mon Jan 22, 2024 at 1:31 AM CST, Peter Eisentraut wrote:
From 4b128faca90238d0a0bb6949a8050c2501d1bd67 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Sat, 20 Jan 2024 21:54:36 +0100
Subject: [PATCH v0] make dist uses git archive---
GNUmakefile.in | 34 ++++++++++++----------------------
meson.build | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 22 deletions(-)diff --git a/GNUmakefile.in b/GNUmakefile.in index eba569e930e..3e04785ada2 100644 --- a/GNUmakefile.in +++ b/GNUmakefile.in @@ -87,29 +87,19 @@ update-unicode: | submake-generated-headers submake-libpgport distdir = postgresql-$(VERSION) dummy = =install=+GIT = git + dist: $(distdir).tar.gz $(distdir).tar.bz2 - rm -rf $(distdir) - -$(distdir).tar: distdir - $(TAR) chf $@ $(distdir) - -.INTERMEDIATE: $(distdir).tar - -distdir-location: - @echo $(distdir) - -distdir: - rm -rf $(distdir)* $(dummy) - for x in `cd $(top_srcdir) && find . \( -name CVS -prune \) -o \( -name .git -prune \) -o -print`; do \ - file=`expr X$$x : 'X\./\(.*\)'`; \ - if test -d "$(top_srcdir)/$$file" ; then \ - mkdir "$(distdir)/$$file" && chmod 777 "$(distdir)/$$file"; \ - else \ - ln "$(top_srcdir)/$$file" "$(distdir)/$$file" >/dev/null 2>&1 \ - || cp "$(top_srcdir)/$$file" "$(distdir)/$$file"; \ - fi || exit; \ - done - $(MAKE) -C $(distdir) distclean + +.PHONY: check-dirty-index +check-dirty-index: + $(GIT) diff-index --quiet HEAD + +$(distdir).tar.gz: check-dirty-index + $(GIT) archive --format tar.gz --prefix $(distdir)/ HEAD -o $@ + +$(distdir).tar.bz2: check-dirty-index + $(GIT) -c tar.tar.bz2.command='$(BZIP2) -c' archive --format tar.bz2 --prefix $(distdir)/ HEAD -o $@distcheck: dist rm -rf $(dummy) diff --git a/meson.build b/meson.build index c317144b6bc..f0d870c5192 100644 --- a/meson.build +++ b/meson.build @@ -3347,6 +3347,44 @@ run_target('help',+############################################################### +# Distribution archive +############################################################### + +git = find_program('git', required: false, native: true, disabler: true) +bzip2 = find_program('bzip2', required: false, native: true, disabler: true)
This doesn't need to be a disabler. git is fine as-is. See later
comment. Disablers only work like you are expecting when they are used
like how git is used. Once you call a method like .path(), all bets are
off.
+distdir = meson.project_name() + '-' + meson.project_version() + +check_dirty_index = run_target('check-dirty-index', + command: [git, 'diff-index', '--quiet', 'HEAD'])
Seems like you might want to add -C here too?
+ +tar_gz = custom_target('tar.gz', + build_always_stale: true, + command: [git, '-C', '@SOURCE_ROOT@', 'archive', + '--format', 'tar.gz', + '--prefix', distdir + '/', + '-o', '@BUILD_ROOT@/@OUTPUT@', + 'HEAD', '.'], + install: false, + output: distdir + '.tar.gz', +) + +tar_bz2 = custom_target('tar.bz2', + build_always_stale: true, + command: [git, '-C', '@SOURCE_ROOT@', '-c', 'tar.tar.bz2.command=' + bzip2.path() + ' -c', 'archive', + '--format', 'tar.bz2', + '--prefix', distdir + '/',
- '-o', '@BUILD_ROOT@/@OUTPUT@',
+ '-o', join_paths(meson.build_root(), '@OUTPUT@'),
This will generate the tarballs in the build directory. Do the same for
the previous target. Tested locally.
+ 'HEAD', '.'], + install: false, + output: distdir + '.tar.bz2', +)
The bz2 target should be wrapped in an `if bzip2.found()`. It is
possible for git to be found, but not bzip2. I might also define the bz2
command out of line. Also, you may want to add
these programs to meson_options.txt for overriding, even though the
"meson-ic" way is to use a machine file.
+ +alias_target('pgdist', [check_dirty_index, tar_gz, tar_bz2])
Are you intending for the check_dirty_index target to prohibit the other
two targets from running? Currently that is not the case. If it is what
you intend, use a stamp file or something to indicate a relationship.
Alternatively, inline the git diff-index into the other commands. These
might also do better as external scripts. It would reduce duplication
between the autotools and Meson builds.
+ + + ############################################################### # The End, The End, My Friend ###############################################################
I am not really following why we can't use the builtin Meson dist
command. The only difference from my testing is it doesn't use
a --prefix argument.
--
Tristan Partin
Neon (https://neon.tech)
On Tue, Jan 23, 2024 at 2:36 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 22.01.24 13:10, Junwang Zhao wrote:
I played this with meson build on macOS, the packages are generated
in source root but not build root, I'm sure if this is by design but I think
polluting *working directory* is not good.Yes, it's not good, but I couldn't find a way to make it work.
This is part of the complications with meson I referred to. The
@BUILD_ROOT@ placeholder in custom_target() is apparently always a
relative path, but it doesn't know that git -C changes the current
directory.Another thing I'd like to point out is, should we also introduce *git commit*
or maybe *git tag* to package name, something like:git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-17devel-`git rev-parse --short HEAD`.tar.gz
git archive --format tar.gz --prefix postgresql-17devel/ HEAD -o
postgresql-`git describe --tags`.tar.gzI'm not sure why we would need it built-in. It can be done by hand, of
course.
If this is only used by the release phase, one can do this by hand.
*commit id/tag* in package name can be used to identify the git source,
which might be useful for cooperation between QA and dev team,
but surely there are better ways for this, so I do not have a strong
opinion here.
--
Regards
Junwang Zhao
On 22.01.24 21:04, Tristan Partin wrote:
I am not really following why we can't use the builtin Meson dist
command. The only difference from my testing is it doesn't use a
--prefix argument.
Here are some problems I have identified:
1. meson dist internally runs gzip without the -n option. That makes
the tar.gz archive include a timestamp, which in turn makes it not
reproducible.
2. Because gzip includes a platform indicator in the archive, the
produced tar.gz archive is not reproducible across platforms. (I don't
know if gzip has an option to avoid that. git archive uses an internal
gzip implementation that handles this.)
3. Meson does not support tar.bz2 archives.
4. Meson uses git archive internally, but then unpacks and repacks the
archive, which loses the ability to use git get-tar-commit-id.
5. I have found that the tar archives created by meson and git archive
include the files in different orders. I suspect that the Python
tarfile module introduces some either randomness or platform dependency.
6. meson dist is also slower because of the additional work.
7. meson dist produces .sha256sum files but we have called them .sha256.
(This is obviously trivial, but it is something that would need to be
dealt with somehow nonetheless.)
Most or all of these issues are fixable, either upstream in Meson or by
adjusting our own requirements. But for now this route would have some
significant disadvantages.
On Tue Jan 23, 2024 at 3:30 AM CST, Peter Eisentraut wrote:
On 22.01.24 21:04, Tristan Partin wrote:
I am not really following why we can't use the builtin Meson dist
command. The only difference from my testing is it doesn't use a
--prefix argument.Here are some problems I have identified:
1. meson dist internally runs gzip without the -n option. That makes
the tar.gz archive include a timestamp, which in turn makes it not
reproducible.2. Because gzip includes a platform indicator in the archive, the
produced tar.gz archive is not reproducible across platforms. (I don't
know if gzip has an option to avoid that. git archive uses an internal
gzip implementation that handles this.)3. Meson does not support tar.bz2 archives.
4. Meson uses git archive internally, but then unpacks and repacks the
archive, which loses the ability to use git get-tar-commit-id.5. I have found that the tar archives created by meson and git archive
include the files in different orders. I suspect that the Python
tarfile module introduces some either randomness or platform dependency.6. meson dist is also slower because of the additional work.
7. meson dist produces .sha256sum files but we have called them .sha256.
(This is obviously trivial, but it is something that would need to be
dealt with somehow nonetheless.)Most or all of these issues are fixable, either upstream in Meson or by
adjusting our own requirements. But for now this route would have some
significant disadvantages.
Thanks Peter. I will bring these up with upstream!
--
Tristan Partin
Neon (https://neon.tech)
On Wed Jan 24, 2024 at 10:18 AM CST, Tristan Partin wrote:
On Tue Jan 23, 2024 at 3:30 AM CST, Peter Eisentraut wrote:
On 22.01.24 21:04, Tristan Partin wrote:
I am not really following why we can't use the builtin Meson dist
command. The only difference from my testing is it doesn't use a
--prefix argument.Here are some problems I have identified:
1. meson dist internally runs gzip without the -n option. That makes
the tar.gz archive include a timestamp, which in turn makes it not
reproducible.
It doesn't look like Python provides the facilities to affect this.
2. Because gzip includes a platform indicator in the archive, the
produced tar.gz archive is not reproducible across platforms. (I don't
know if gzip has an option to avoid that. git archive uses an internal
gzip implementation that handles this.)
Same reason as above.
3. Meson does not support tar.bz2 archives.
Submitted https://github.com/mesonbuild/meson/pull/12770.
4. Meson uses git archive internally, but then unpacks and repacks the
archive, which loses the ability to use git get-tar-commit-id.
Because Meson allows projects to distribute arbitrary files via
meson.add_dist_script(), and can include subprojects via `meson dist
--include-subprojects`, this doesn't seem like an easily solvable
problem.
5. I have found that the tar archives created by meson and git archive
include the files in different orders. I suspect that the Python
tarfile module introduces some either randomness or platform dependency.
Seems likely.
6. meson dist is also slower because of the additional work.
Not easily solvable due to 4.
7. meson dist produces .sha256sum files but we have called them .sha256.
(This is obviously trivial, but it is something that would need to be
dealt with somehow nonetheless.)Most or all of these issues are fixable, either upstream in Meson or by
adjusting our own requirements. But for now this route would have some
significant disadvantages.Thanks Peter. I will bring these up with upstream!
I think the solution to point 4 is to not unpack/repack if there are no
dist scripts and/or subprojects to distribute. I can take a look at
this later. I think this would also solve points 1, 2, 5, and 6 because
at that point meson is just calling git-archive.
--
Tristan Partin
Neon (https://neon.tech)
On 24.01.24 18:57, Tristan Partin wrote:
4. Meson uses git archive internally, but then unpacks and repacks
the > archive, which loses the ability to use git get-tar-commit-id.
Because Meson allows projects to distribute arbitrary files via
meson.add_dist_script(), and can include subprojects via `meson dist
--include-subprojects`, this doesn't seem like an easily solvable problem.
git archive has the --add-file option, which can probably do the same
thing. Subprojects are another thing, but obviously are more rarely used.
I think the solution to point 4 is to not unpack/repack if there are no
dist scripts and/or subprojects to distribute. I can take a look at this
later. I think this would also solve points 1, 2, 5, and 6 because at
that point meson is just calling git-archive.
I think that would be a useful direction.
On 22.01.24 21:04, Tristan Partin wrote:
+git = find_program('git', required: false, native: true, disabler: true) +bzip2 = find_program('bzip2', required: false, native: true, disabler: true)This doesn't need to be a disabler. git is fine as-is. See later
comment. Disablers only work like you are expecting when they are used
like how git is used. Once you call a method like .path(), all bets are
off.
ok, fixed
+distdir = meson.project_name() + '-' + meson.project_version() + +check_dirty_index = run_target('check-dirty-index', + command: [git, 'diff-index', '--quiet', 'HEAD'])Seems like you might want to add -C here too?
done
+tar_bz2 = custom_target('tar.bz2', + build_always_stale: true, + command: [git, '-C', '@SOURCE_ROOT@', '-c', 'tar.tar.bz2.command=' + bzip2.path() + ' -c', 'archive', + '--format', 'tar.bz2', + '--prefix', distdir + '/',- '-o', '@BUILD_ROOT@/@OUTPUT@', + '-o', join_paths(meson.build_root(), '@OUTPUT@'),This will generate the tarballs in the build directory. Do the same for
the previous target. Tested locally.
Fixed, thanks. I had struggled with this one.
+ 'HEAD', '.'], + install: false, + output: distdir + '.tar.bz2', +)The bz2 target should be wrapped in an `if bzip2.found()`.
Well, I think we want the dist step to fail if bzip2 is not there. At
least that is the current expectation.
+alias_target('pgdist', [check_dirty_index, tar_gz, tar_bz2])
Are you intending for the check_dirty_index target to prohibit the other
two targets from running? Currently that is not the case.
Yes, that was the hope, and that's how the make dist variant works. But
I couldn't figure this out with meson. Also, the above actually also
doesn't work with older meson versions, so I had to comment this out to
get CI to work.
If it is what
you intend, use a stamp file or something to indicate a relationship.
Alternatively, inline the git diff-index into the other commands. These
might also do better as external scripts. It would reduce duplication
between the autotools and Meson builds.
Yeah, maybe that's a direction.
The updated patch also supports vpath builds with make now.
I have also added a CI patch, for amusement. Maybe we'll want to keep
it, though.
On Thu Jan 25, 2024 at 10:04 AM CST, Peter Eisentraut wrote:
On 22.01.24 21:04, Tristan Partin wrote:
+ 'HEAD', '.'], + install: false, + output: distdir + '.tar.bz2', +)The bz2 target should be wrapped in an `if bzip2.found()`.
The way that this currently works is that you will fail at configure
time if bz2 doesn't exist on the system. Meson will try to resolve
a .path() method on a NotFoundProgram. You might want to define the bz2
target to just call `exit 1` in this case.
if bzip2.found()
# do your current target
else
bz2 = run_target('tar.bz2', command: ['exit', 1])
endif
This should cause pgdist to appropriately fail at run time when
generating the bz2 tarball.
Well, I think we want the dist step to fail if bzip2 is not there. At
least that is the current expectation.+alias_target('pgdist', [check_dirty_index, tar_gz, tar_bz2])
Are you intending for the check_dirty_index target to prohibit the other
two targets from running? Currently that is not the case.Yes, that was the hope, and that's how the make dist variant works. But
I couldn't figure this out with meson. Also, the above actually also
doesn't work with older meson versions, so I had to comment this out to
get CI to work.If it is what
you intend, use a stamp file or something to indicate a relationship.
Alternatively, inline the git diff-index into the other commands. These
might also do better as external scripts. It would reduce duplication
between the autotools and Meson builds.Yeah, maybe that's a direction.
For what it's worth, I run Meson 1.3, and the behavior of generating the
tarballs even though it is a dirty tree still occurred. In the new patch
you seem to say it was fixed in 0.60.
--
Tristan Partin
Neon (https://neon.tech)
On 25.01.24 17:25, Tristan Partin wrote:
The way that this currently works is that you will fail at configure
time if bz2 doesn't exist on the system. Meson will try to resolve a
.path() method on a NotFoundProgram. You might want to define the bz2
target to just call `exit 1` in this case.if bzip2.found()
# do your current target
else
bz2 = run_target('tar.bz2', command: ['exit', 1])
endifThis should cause pgdist to appropriately fail at run time when
generating the bz2 tarball.
Ok, done that way.
For what it's worth, I run Meson 1.3, and the behavior of generating the
tarballs even though it is a dirty tree still occurred. In the new patch
you seem to say it was fixed in 0.60.
The problem I'm referring to is that before 0.60, alias_target cannot
depend on run_target (only "build target"). This is AFAICT not
documented and might not have been an intentional change, but you can
trace it in the meson source code, and it shows in the PostgreSQL CI.
That's also why for the above bzip2 issue I have to use custom_target in
place of your run_target.
On Fri Jan 26, 2024 at 12:28 AM CST, Peter Eisentraut wrote:
On 25.01.24 17:25, Tristan Partin wrote:
For what it's worth, I run Meson 1.3, and the behavior of generating the
tarballs even though it is a dirty tree still occurred. In the new patch
you seem to say it was fixed in 0.60.The problem I'm referring to is that before 0.60, alias_target cannot
depend on run_target (only "build target"). This is AFAICT not
documented and might not have been an intentional change, but you can
trace it in the meson source code, and it shows in the PostgreSQL CI.
That's also why for the above bzip2 issue I have to use custom_target in
place of your run_target.
https://github.com/mesonbuild/meson/pull/12783
Thanks for finding these issues.
--
Tristan Partin
Neon (https://neon.tech)
Hello, meson developer here.
On 1/23/24 4:30 AM, Peter Eisentraut wrote:
On 22.01.24 21:04, Tristan Partin wrote:
I am not really following why we can't use the builtin Meson dist
command. The only difference from my testing is it doesn't use a
--prefix argument.Here are some problems I have identified:
1. meson dist internally runs gzip without the -n option. That makes
the tar.gz archive include a timestamp, which in turn makes it not
reproducible.
Well, it uses python tarfile which uses python gzip support under the
hood, but yes, that is true, python tarfile doesn't expose this tunable.
2. Because gzip includes a platform indicator in the archive, the
produced tar.gz archive is not reproducible across platforms. (I don't
know if gzip has an option to avoid that. git archive uses an internal
gzip implementation that handles this.)
This appears to be https://github.com/python/cpython/issues/112346
3. Meson does not support tar.bz2 archives.
Simple enough to add, but I'm a bit surprised as usually people seem to
want either gzip for portability or xz for efficient compression.
4. Meson uses git archive internally, but then unpacks and repacks the
archive, which loses the ability to use git get-tar-commit-id.
What do you use this for? IMO a more robust way to track the commit used
is to use gitattributes export-subst to write a `.git_archival.txt` file
containing the commit sha1 and other info -- this can be read even after
the file is extracted, which means it can also be used to bake the ID
into the built binaries e.g. as part of --version output.
5. I have found that the tar archives created by meson and git archive
include the files in different orders. I suspect that the Python
tarfile module introduces some either randomness or platform dependency.
Different orders is meaningless, the question is whether the order is
internally consistent. Python uses sorted() to guarantee a stable order,
which may be a different algorithm than the one git-archive uses to
guarantee a stable order. But the order should be stable and that is
what matters.
6. meson dist is also slower because of the additional work.
I'm amenable to skipping the extraction/recombination of subprojects and
running of dist scripts in the event that neither exist, as Tristan
offered to do, but...
7. meson dist produces .sha256sum files but we have called them .sha256.
(This is obviously trivial, but it is something that would need to be
dealt with somehow nonetheless.)Most or all of these issues are fixable, either upstream in Meson or by
adjusting our own requirements. But for now this route would have some
significant disadvantages.
Overall I feel like much of this is about requiring dist tarballs to be
byte-identical to other dist tarballs, although reproducible builds is
mainly about artifacts, not sources, and for sources it doesn't
generally matter unless the sources are ephemeral and generated
on-demand (in which case it is indeed very important to produce the same
tarball each time). A tarball is usually generated once, signed, and
uploaded to release hosting. Meson already guarantees the contents are
strictly based on the built tag.
--
Eli Schwartz
Attachments:
On 26.01.24 22:18, Eli Schwartz wrote:
Hello, meson developer here.
Hello, and thanks for popping in!
3. Meson does not support tar.bz2 archives.
Simple enough to add, but I'm a bit surprised as usually people seem to
want either gzip for portability or xz for efficient compression.
We may very well end up updating our requirements here before too long,
so I wouldn't bother with this on the meson side. Last time we
discussed this, there were still platforms under support that didn't
have xz easily available.
4. Meson uses git archive internally, but then unpacks and repacks the
archive, which loses the ability to use git get-tar-commit-id.What do you use this for? IMO a more robust way to track the commit used
is to use gitattributes export-subst to write a `.git_archival.txt` file
containing the commit sha1 and other info -- this can be read even after
the file is extracted, which means it can also be used to bake the ID
into the built binaries e.g. as part of --version output.
It's a marginal use case, for sure. But it is something that git
provides tooling for that is universally available. Any alternative
would be an ad-hoc solution that is specific to our project and would be
different for the next project.
5. I have found that the tar archives created by meson and git archive
include the files in different orders. I suspect that the Python
tarfile module introduces some either randomness or platform dependency.Different orders is meaningless, the question is whether the order is
internally consistent. Python uses sorted() to guarantee a stable order,
which may be a different algorithm than the one git-archive uses to
guarantee a stable order. But the order should be stable and that is
what matters.
(FWIW, I couldn't reproduce this anymore, so maybe it's not actually an
issue.)
Overall I feel like much of this is about requiring dist tarballs to be
byte-identical to other dist tarballs, although reproducible builds is
mainly about artifacts, not sources, and for sources it doesn't
generally matter unless the sources are ephemeral and generated
on-demand (in which case it is indeed very important to produce the same
tarball each time).
The source tarball is, in a way, also an artifact.
I think it's useful that others can easily independently verify that the
produced tarball matches what they have locally. It's not an absolute
requirement, but given that it is possible, it seems useful to take
advantage of it.
In a way, this also avoids the need for signing the tarball, which we
don't do. So maybe that contributes to a different perspective.
On 1/31/24 3:03 AM, Peter Eisentraut wrote:
What do you use this for? IMO a more robust way to track the commit used
is to use gitattributes export-subst to write a `.git_archival.txt` file
containing the commit sha1 and other info -- this can be read even after
the file is extracted, which means it can also be used to bake the ID
into the built binaries e.g. as part of --version output.It's a marginal use case, for sure. But it is something that git
provides tooling for that is universally available. Any alternative
would be an ad-hoc solution that is specific to our project and would be
different for the next project.
mercurial has the "archivemeta" config setting that exports similar
information, but forces the filename ".hg_archival.txt".
The setuptools-scm project follows this pattern by requiring the git
file to be called ".git_archival.txt" with a set pattern mimicking the
hg one:
https://setuptools-scm.readthedocs.io/en/latest/usage/#git-archives
So, I guess you could use this and then it would not be specific to your
project. :)
Overall I feel like much of this is about requiring dist tarballs to be
byte-identical to other dist tarballs, although reproducible builds is
mainly about artifacts, not sources, and for sources it doesn't
generally matter unless the sources are ephemeral and generated
on-demand (in which case it is indeed very important to produce the same
tarball each time).The source tarball is, in a way, also an artifact.
I think it's useful that others can easily independently verify that the
produced tarball matches what they have locally. It's not an absolute
requirement, but given that it is possible, it seems useful to take
advantage of it.In a way, this also avoids the need for signing the tarball, which we
don't do. So maybe that contributes to a different perspective.
Since you mention signing and not as a simple "aside"...
That's a fascinating perspective. I wonder how people independently
verify that what they have locally (I assume from git clones) matches
what the postgres committers have authorized.
I'm a bit skeptical that you can avoid the need to perform code-signing
at some stage, somewhere, somehow, by suggesting that people can simply
git clone, run some commands and compare the tarball. The point of
signing is to verify that no one has acquired an untraceable API token
they should not have and gotten write access to the authoritative server
then uploaded malicious code under various forged identities, possibly
overwriting previous versions, either in git or out of git.
Ideally git commits should be signed, but that requires large numbers of
people to have security-minded git commit habits. From a quick check of
the postgres commit logs, only one person seems to be regularly signing
commits, which does provide a certain measure of protection -- an
attacker cannot attack via `git push --force` across that boundary, and
those commits serve as verifiable states that multiple people have seen.
The tags aren't signed either, which is a big issue for verifiably
identifying the release artifacts published by the release manager. Even
if not every commit is signed, having signed tags provides a known
coordination point of code that has been broadly tested and code-signed
for mass use.
...
In summary, my opinion is that using git-get-tar-commit-id provides zero
security guarantees, and if that's not something you are worried about
then that's one thing, but if you were expecting it to *replace* signing
the tarball, then that's.... very much another thing entirely, and not
one I can agree at all with.
--
Eli Schwartz
Attachments:
On Wed, Jan 31, 2024 at 10:50 AM Eli Schwartz <eschwartz93@gmail.com> wrote:
Ideally git commits should be signed, but that requires large numbers of
people to have security-minded git commit habits. From a quick check of
the postgres commit logs, only one person seems to be regularly signing
commits, which does provide a certain measure of protection -- an
attacker cannot attack via `git push --force` across that boundary, and
those commits serve as verifiable states that multiple people have seen.The tags aren't signed either, which is a big issue for verifiably
identifying the release artifacts published by the release manager. Even
if not every commit is signed, having signed tags provides a known
coordination point of code that has been broadly tested and code-signed
for mass use.In summary, my opinion is that using git-get-tar-commit-id provides zero
security guarantees, and if that's not something you are worried about
then that's one thing, but if you were expecting it to *replace* signing
the tarball, then that's.... very much another thing entirely, and not
one I can agree at all with.
I read this part with interest. I think there's definitely something
to be said for strengthening some of our practices in this area. At
the same time, I think it's reasonable for Peter to want to pursue the
limited goal he stated in the original post, namely reproducible
tarball generation, without getting tangled up in possible policy
changes that might be controversial and might require a bunch of
planning and coordination. "GPG signatures are good" can be true
without "reproducible tarball generation is good" being false; and if
"git archive" allows for that and "meson dist" doesn't, then we're
unlikely to adopt "meson dist".
--
Robert Haas
EDB: http://www.enterprisedb.com
Small update: I noticed that on Windows (at least the one that is
running the CI job), I need to use git -c core.autocrlf=false, otherwise
git archive does line-ending conversion for the files it puts into the
archive. With this fix, all the archives produced by all the CI jobs
across the different platforms match, except the .tar.gz archive from
the Linux job, which I suspect suffers from an old git version. We
should get the Linux images updated to a newer Debian version soon
anyway, so I think that issue will go away.
On Sun Feb 11, 2024 at 5:09 PM CST, Peter Eisentraut wrote:
Small update: I noticed that on Windows (at least the one that is
running the CI job), I need to use git -c core.autocrlf=false, otherwise
git archive does line-ending conversion for the files it puts into the
archive. With this fix, all the archives produced by all the CI jobs
across the different platforms match, except the .tar.gz archive from
the Linux job, which I suspect suffers from an old git version. We
should get the Linux images updated to a newer Debian version soon
anyway, so I think that issue will go away.
I think with this change, it is unlikely I will be able to upstream
anything to Meson that would benefit Postgres here since setting this
option seems project dependent.
--
Tristan Partin
Neon (https://neon.tech)
On 12.02.24 18:26, Tristan Partin wrote:
On Sun Feb 11, 2024 at 5:09 PM CST, Peter Eisentraut wrote:
Small update: I noticed that on Windows (at least the one that is
running the CI job), I need to use git -c core.autocrlf=false,
otherwise git archive does line-ending conversion for the files it
puts into the archive. With this fix, all the archives produced by
all the CI jobs across the different platforms match, except the
.tar.gz archive from the Linux job, which I suspect suffers from an
old git version. We should get the Linux images updated to a newer
Debian version soon anyway, so I think that issue will go away.I think with this change, it is unlikely I will be able to upstream
anything to Meson that would benefit Postgres here since setting this
option seems project dependent.
Meson is vulnerable to the same problem: If the person who makes the
release had some crlf-related git setting activated in their
environment, then that would affect the tarball. And such a tarball
would be genuinely broken for non-Windows users, because at least some
parts of Unix systems can't process such CRLF files correctly.
(This is easy to test: Run meson dist with core.autocrlf=true on the
postgresql tree on a non-Windows system. It will fail during dist check.)