Regression tests fail with musl libc because libpq.so can't be loaded
Running the regression tests when building with musl libc fails, with
errors like the following:
ERROR: could not load library
"<builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so": Error
loading shared library libpq.so.5: No such file or directory (needed by
<builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so)
This was observed in Alpine Linux [1]https://github.com/alpinelinux/aports/commit/d67ceb66a1ca9e1899071c9ef09fffba29fa0417#diff-2bd25b5172fc52319de1b09086ac0db6314d2e9fa73497979f5198f8caaec1b9 and nixpkgs [2]https://github.com/NixOS/nixpkgs/commit/09ffd722072291f00f2a54d7404eb568a15e562a a few years ago. I
now looked at this a bit and this is what happens:
- The temporary install location is set via LD_LIBRARY_PATH in the
regression tests, so that postgres can find those libs.
- All tests which load an extension / shared module via dlopen() fail,
when the loaded library in turn depends on another library in
tmp_install - I think in practice it's libpq.so all the time.
- LD_LIBRARY_PATH is used correctly to look for the direct dependency
loaded in dlopen(...), but is not taken into account anymore when trying
to locate libpq.so. This step only fails with musl, but works fine with
glibc.
I can reproduce this with a simple Dockerfile (attached), which uses the
library/postgres-alpine image, moves libpq.so to a different folder and
points LD_LIBRARY_PATH at it. Build and run the dockerfile like this:
docker build . -t pg-musl && docker run --rm pg-musl
This Dockerfile can easily be adjusted to work with the debian image -
which shows that doing the same with glibc works just fine.
Even though this originated in "just" the regression tests, I'm filing
this as a bug, because:
- The docs explicitly mention LD_LIBRARY_PATH support to point at a
different /lib folder in [3]https://www.postgresql.org/docs/current/install-post.html.
- This can clearly break outside the test-suite as shown with the
Dockerfile.
I tried a few more things:
- When I add an /etc/ld-musl-$(ARCH).path file and add the path to
libpq.so's libdir to it, libpq.so can be found.
- When I add the path to libpq.so as an rpath to the postgres binary,
libpq.so can be found.
Both is not surprising, but just confirms musl-ld actually works as
expected. It's just LD_LIBRARY_PATH that seems to not be passed on.
To rule out a musl bug, I also put together a very simple test-case of
an executable loading liba with dlopen(), which depends on libb and then
constructing the same scenario with LD_LIBRARY_PATH. This works fine
when compiled with glibc and musl, too. Thus, I believe the problem to
be somewhere in how postgres loads those libraries.
Best,
Wolfgang
[1]: https://github.com/alpinelinux/aports/commit/d67ceb66a1ca9e1899071c9ef09fffba29fa0417#diff-2bd25b5172fc52319de1b09086ac0db6314d2e9fa73497979f5198f8caaec1b9
https://github.com/alpinelinux/aports/commit/d67ceb66a1ca9e1899071c9ef09fffba29fa0417#diff-2bd25b5172fc52319de1b09086ac0db6314d2e9fa73497979f5198f8caaec1b9
[2]: https://github.com/NixOS/nixpkgs/commit/09ffd722072291f00f2a54d7404eb568a15e562a
https://github.com/NixOS/nixpkgs/commit/09ffd722072291f00f2a54d7404eb568a15e562a
[3]: https://www.postgresql.org/docs/current/install-post.html
Attachments:
Wolfgang Walther <walther@technowledgy.de> writes:
- LD_LIBRARY_PATH is used correctly to look for the direct dependency
loaded in dlopen(...), but is not taken into account anymore when trying
to locate libpq.so. This step only fails with musl, but works fine with
glibc.
Why do you think this is our bug and not musl's? We do not even have
any code that knows anything about indirect library dependencies.
regards, tom lane
Tom Lane:
Why do you think this is our bug and not musl's?
Because I tried to reproduce with musl directly with a very simple
example as mentioned in:
To rule out a musl bug, I also put together a very simple test-case of an executable loading liba with dlopen(), which depends on libb and then constructing the same scenario with LD_LIBRARY_PATH. This works fine when compiled with glibc and musl, too. Thus, I believe the problem to be somewhere in how postgres loads those libraries.
My test case looked like the attached. To compile it with musl via
Dockerfile:
docker build . -t musl-dlopen && docker run --rm musl-dlopen
a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.
This works fine with both musl and glibc.
(Note: I also tried putting liba.so and libb.so in different folders,
adding both to LD_LIBRARY_PATH - but that worked fine as well.)
Now my very simple example probably does something different than
postgres, so that the problem doesn't appear there. But since it seems
possible to do this with musl in principle, it should be possible to do
it differently in postgres to make it work, too.
Any ideas?
Best,
Wolfgang
Attachments:
On Sat, Mar 16, 2024 at 11:55 AM Wolfgang Walther <walther@technowledgy.de>
wrote:
Tom Lane:
Why do you think this is our bug and not musl's?
Because I tried to reproduce with musl directly with a very simple
example as mentioned in:To rule out a musl bug, I also put together a very simple test-case of
an executable loading liba with dlopen(), which depends on libb and then
constructing the same scenario with LD_LIBRARY_PATH. This works fine when
compiled with glibc and musl, too. Thus, I believe the problem to be
somewhere in how postgres loads those libraries.My test case looked like the attached. To compile it with musl via
Dockerfile:docker build . -t musl-dlopen && docker run --rm musl-dlopen
a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.
This works fine with both musl and glibc.
(Note: I also tried putting liba.so and libb.so in different folders,
adding both to LD_LIBRARY_PATH - but that worked fine as well.)Now my very simple example probably does something different than
postgres, so that the problem doesn't appear there. But since it seems
possible to do this with musl in principle, it should be possible to do
it differently in postgres to make it work, too.Any ideas?
On Alpine Linux, which uses musl libc, you have to run `make install`
before you can run `make check`. Have you tried
that?
(Note to self: need a new Alpine buildfarm member)
cheers
andrew
Andrew Dunstan:
On Alpine Linux, which uses musl libc, you have to run `make install`
before you can run `make check`. Have you tried that?
I can see how that could work around the problem, because the library
would already be available in the default library path / rpath and
LD_LIBRARY_PATH would not be needed.
However, this would only be a workaround for the specific case of
running the regression tests, not a solution. Using LD_LIBRARY_PATH, as
documented, would still not be possible.
In my case, I am just using docker images with Alpine to easily
reproduce the problem. I am building with NixOS / nixpkgs' pkgsMusl. The
order of check and install phases can't be changed there, AFAICT. The
workaround I use right now is to temporarily patch rpath of the postgres
binary - this will be reset during the install phase anyway. This works,
but again is not a real solution.
Best,
Wolfgang
Andrew Dunstan <andrew@dunslane.net> writes:
On Alpine Linux, which uses musl libc, you have to run `make install`
before you can run `make check`. Have you tried that?
We have the same situation on macOS. There, it seems to be the result
of a "security feature" that strips DYLD_LIBRARY_PATH from the process
environment when make executes a shell. There's not much we can do
about that, and I suspect there is not much we can do about musl's
behavior either. (I am not a fan of proposals to modify the binaries
under test, because then you are not testing what you intend to
install.)
regards, tom lane
On Sun, Mar 17, 2024 at 4:56 AM Wolfgang Walther
<walther@technowledgy.de> wrote:
Any ideas?
I'd look into whether there is a difference in the rules it uses for
deciding not to trust LD_LIBRARY_PATH, which seems to be around here
somewhere:
https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812
I wonder if you can break into an affected program and check out the
magic there. FWIW on MacOS something equivalent happens at the moment
we execute a shell, because the system shell is 'code signed' and that
OS treats signed stuff similar to setuid binaries for this purpose
(IIRC setting SHELL to point to a suitable unsigned shell could work
around the problem there?)
Another interesting thing that came up when I googled musl/glibc
differences -- old but looks plausibly still true (not that I expect
our code to be modifying that stuff in place, just something to
check):
Tom Lane:
We have the same situation on macOS. There, it seems to be the result
of a "security feature" that strips DYLD_LIBRARY_PATH from the process
environment when make executes a shell.
I'm not sure whether this explanation is sufficient for the musl case,
because LD_LIBRARY_PATH does make a difference: The direct dependency
(libpqwalreceiver.so) can still be found if it's moved elsewhere and
LD_LIBRARY_PATH points at it. So clearly the LD_LIBRARY_PATH variable is
still set after make executed the shell - it's just not in effect on the
*indirect* dependency (libpq.so) anymore.
Best,
Wolfgang
Thomas Munro:
I'd look into whether there is a difference in the rules it uses for
deciding not to trust LD_LIBRARY_PATH, which seems to be around here
somewhere:https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812
Yeah, I have been looking at that, too. I had also experimented a bit
with setuid/setgid for that matter, but that didn't lead anywhere, yet.
I'm not 100% sure, but I think this would also not match my other
observation, that LD_LIBRARY_PATH does work for libpqwalreceiver (direct
dep), but not libpq (indirect dep).
Another interesting thing that came up when I googled musl/glibc
differences -- old but looks plausibly still true (not that I expect
our code to be modifying that stuff in place, just something to
check):
To me, this seems very much like what could happen - it matches all my
observations, so far. But I can't tell how likely that is, not knowing
much of the postgres code.
Best,
Wolfgang
On Sun, Mar 17, 2024 at 9:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Another interesting thing that came up when I googled musl/glibc
differences -- old but looks plausibly still true (not that I expect
our code to be modifying that stuff in place, just something to
check):
Hmm, that does mention setproctitle, and our ps_status.c does indeed
clobber some stuff in that region (in fact our ps_status.c is likely
derived from the setproctitle() function from sendmail AFAICT). But
that's in our "backend" server processes, unlike the problems we have
on Macs... oh but you're failing to load libpqwalreceiver.so which
makes some sense for the backend hypothesis. What happens if you hack
ps_status.c to use PS_USE_NONE?
Thomas Munro:
Hmm, that does mention setproctitle, and our ps_status.c does indeed
clobber some stuff in that region (in fact our ps_status.c is likely
derived from the setproctitle() function from sendmail AFAICT). But
that's in our "backend" server processes, unlike the problems we have
on Macs... oh but you're failing to load libpqwalreceiver.so which
makes some sense for the backend hypothesis. What happens if you hack
ps_status.c to use PS_USE_NONE?
Nailed it. PS_USE_NONE fixes it.
Best,
Wolfgang
On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
Nailed it. PS_USE_NONE fixes it.
Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case? It seems wrong that the current test passes a case that doesn't actually work.
Christophe Pettus:
Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case? It seems wrong that the current test passes a case that doesn't actually work.
The missing macro is on purpose and unlikely to change:
https://openwall.com/lists/musl/2013/03/29/13
I also found this thread, which discusses exactly our case:
https://www.openwall.com/lists/musl/2022/08/17/1
Some quotes from that thread:
I understand that what Postgres et al are doing is a nasty hack.
And:
Applications that *really* want setproctitle type functionality can
presumably do something like re-exec themselves with a suitably large
argv[0] to give them safe space to overwrite with their preferred
message, rather than UB trying to relocate the environment (and auxv?
how? they can't tell libc they moved it) to some other location.
Could that be a more portable way of doing this?
Best,
Wolfgang
On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de> wrote:
The missing macro is on purpose and unlikely to change: https://openwall.com/lists/musl/2013/03/29/13
Indeed.
I also found this thread, which discusses exactly our case: https://www.openwall.com/lists/musl/2022/08/17/1
While getting proper setproctitle functionality on musl would be great, my goal was more modest: Have it correctly set PS_USE_NONE when compiling against musl.
On Sun, Mar 17, 2024 at 11:45 AM Christophe Pettus <xof@thebuild.com> wrote:
On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de>
wrote:
The missing macro is on purpose and unlikely to change:
https://openwall.com/lists/musl/2013/03/29/13
Indeed.
That seems a little shortsighted. If other libc implementations find it
appropriate to have similar macros why should they be different?
I also found this thread, which discusses exactly our case:
https://www.openwall.com/lists/musl/2022/08/17/1
While getting proper setproctitle functionality on musl would be great, my
goal was more modest: Have it correctly set PS_USE_NONE when compiling
against musl.
One simple thing might be for us to enclose the block in ps_status.c at
lines 49-59 in #ifndef PS_USE_NONE/#endif. Then you could compile with
-DPS_USE_NONE in your CPPFLAGS.
cheers
andrew
On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should they be different?
It's a philosophical argument against checking for particular libc implementations instead of particular features. I'm not unsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature #defines.
On Mon, Mar 18, 2024 at 10:06 AM Christophe Pettus <xof@thebuild.com> wrote:
On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should they be different?It's a philosophical argument against checking for particular libc implementations instead of particular features. I'm not unsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature #defines.
I like their philosophy, and I like their FAQ. Down with software
monocultures, up with standards and cooperation. But anyway...
I wondered for a moment if there could be a runtime way to test if
we'd broken stuff, but it seems hard without a way to ask the runtime
linker for its search path to see if it has any pointers into the
environment. We can't, that "env_path" variable in dynlink.c is not
accessible to us by any reasonable means. And yeah, this whole thing
is a nasty invasive hack that harks back to the 80s I assume, before
many systems provided a clean way to do this (and some never did)...
Hmm, I can think of one dirty hack on top of our existing dirty hack
that might work. I feel bad typing this out, but here goes nothing:
In save_ps_display_args(), we compute end_of_area, stepping past
contiguous arguments and environment variables. But what if we
terminated that if we saw an environment entry beginning "LD_"? We'd
still feel free to clobber the memory up to that point (rather than
limiting ourselves to the argv space, another more conservative choice
that might truncate a few PS display messages, or maybe not given the
typical postmaster arguments, maye that'd work out OK), and we'd still
copy the environment to somewhere new, but anything like "LD_XXX" that
the runtime linker might have stashed a pointer to would remain valid.
/me runs away and hides
On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
We'd
still feel free to clobber the memory up to that point (rather than
limiting ourselves to the argv space, another more conservative choice
that might truncate a few PS display messages, or maybe not given the
typical postmaster arguments, maye that'd work out OK), and we'd still
copy the environment to somewhere new, but anything like "LD_XXX" that
the runtime linker might have stashed a pointer to would remain valid.
/me runs away and hides
It doesn't lack for bravery! (And I have to just comment that the linker storing pointers into that space as a way of finding libraries... well, that doesn't get them the moral high ground for nasty hacks.)
I'm comfortable with "if you are using musl, you don't get the ps messages" as a first solution, if we can find a way of detecting a libc that passes the other tests but doesn't support any of the existing hacks.
On Mon, Mar 18, 2024 at 10:34 PM Christophe Pettus <xof@thebuild.com> wrote:
On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
We'd
still feel free to clobber the memory up to that point (rather than
limiting ourselves to the argv space, another more conservative choice
that might truncate a few PS display messages, or maybe not given the
typical postmaster arguments, maye that'd work out OK), and we'd still
copy the environment to somewhere new, but anything like "LD_XXX" that
the runtime linker might have stashed a pointer to would remain valid.
/me runs away and hidesIt doesn't lack for bravery! (And I have to just comment that the linker storing pointers into that space as a way of finding libraries... well, that doesn't get them the moral high ground for nasty hacks.)
FWIW here is a blind patch if someone wants to try it out... no musl here.
(Hmm, I think it's not that unreasonable on their part to assume the
initial environment is immutable if their implementation doesn't
mutate it, and our doing so is undeniably UB; surprising, maybe, given
that the technique works on that other popular brand of C library on
that kind of kernel, not to mention dozens of old Unixen of yore...
The real solution may turn out to be the prctl() described in that
thread, where you can tell the kernel where you're planning to move
your argv and it can find it to show ps/top, but I checked and you
still can't call that without special privileges, so maybe someone
should get onto complaining to kernel hackers about that? That thread
is wrong about us clobbering auxv BTW, we're not animals!)
Attachments:
0001-Don-t-clobber-LD_-environment-variables.patchapplication/octet-stream; name=0001-Don-t-clobber-LD_-environment-variables.patchDownload
From 1b2aa67cc1e692065f6cc18b70ee49131ec4682a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Mar 2024 00:25:34 +1300
Subject: [PATCH] Don't clobber LD_* environment variables.
Our PS_USE_CLOBBER_ARGV code relocates the environment, which itself is
allowed, in order to steal the old space to make a bigger argv[0] for
ps/top to show, which is probably formally undefined behavior.
Unfortunately that corrupts musl's copy of LD_LIBRARY_PATH if set,
because it stashes a pointer to the initial value before main() begins.
It probably doesn't matter for installed servers but breaks the
regression tests.
Here we look out for variables named LD_* while computing how much space
to steal, so we can avoid clobbering them. No change in behaviour if
not found, but otherwise you might potentially get ps status messages
truncated to a smaller size than before depending on the length of
preceding clobberable variables. There doesn't seem to be a nice way to
distinguish musl from glibc, and the truncation size shouldn't be too
small or at least has an easy mitigation: define a dummy variable.
Reported-by: Wolfgang Walther <walther@technowledgy.de>
Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
---
src/backend/utils/misc/ps_status.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e48..782dea352e 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -146,11 +146,16 @@ save_ps_display_args(int argc, char **argv)
}
/*
- * check for contiguous environ strings following argv
+ * Check for contiguous environ strings following argv, but skip
+ * LD_* variables that the runtime linker might be interested in. The
+ * musl libc implementation is known to stash pointers directly to
+ * their values in the initial environment, assuming them to be
+ * immutable.
*/
for (i = 0; environ[i] != NULL; i++)
{
- if (end_of_area + 1 == environ[i])
+ if (end_of_area + 1 == environ[i] &&
+ strncmp(environ[i], "LD_", 3) != 0)
end_of_area = environ[i] + strlen(environ[i]);
}
--
2.39.3 (Apple Git-146)
Thomas Munro <thomas.munro@gmail.com> writes:
(Hmm, I think it's not that unreasonable on their part to assume the
initial environment is immutable if their implementation doesn't
mutate it, and our doing so is undeniably UB; surprising, maybe, given
that the technique works on that other popular brand of C library on
that kind of kernel, not to mention dozens of old Unixen of yore...
Does their implementation also ignore the effects of putenv() or
setenv() on LD_LIBRARY_PATH? They have no moral high ground
whatsoever if that's the case. But if it doesn't, an alternative
route to a solution could be to scan the original environment, strdup
and putenv each entry to move it to freshly malloc'd space, and
then reclaim the old environment area.
regards, tom lane
On Tue, Mar 19, 2024 at 3:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
(Hmm, I think it's not that unreasonable on their part to assume the
initial environment is immutable if their implementation doesn't
mutate it, and our doing so is undeniably UB; surprising, maybe, given
that the technique works on that other popular brand of C library on
that kind of kernel, not to mention dozens of old Unixen of yore...Does their implementation also ignore the effects of putenv() or
setenv() on LD_LIBRARY_PATH? They have no moral high ground
whatsoever if that's the case. But if it doesn't, an alternative
route to a solution could be to scan the original environment, strdup
and putenv each entry to move it to freshly malloc'd space, and
then reclaim the old environment area.
Yes, the musl linker/loader ignores putenv()/setenv() changes to
LD_LIBRARY_PATH after process start (that is, changes only effect the
search path when injected into a new program with exec*()). As does
glibc, it's just that it captures by copy instead of reference
(according to one of the links above, I didn't check the source). So
setenv() has no effect on dlopen() in *this* program, and using putenv
in that way won't help. We simply can't move the value of
LD_LIBRARY_PATH (though my patch could be a little sneakier and steal
all the bytes right up to the = sign to get more space for our
message!).
One way to tell if a copy has been made is to trace a program that does:
getenv("LD_LIBRARY_PATH")[2] = 'X';
dlopen("foo.so", RTLD_NOW | RTLD_GLOBAL);
... when run with LD_LIBRARY_PATH set to /asdf. On FreeBSD I see it
tries to open "/aXdf...", so now I know that FreeBSD also captures it
by reference like musl. But we don't use the clobber trick on
FreeBSD, it has a proper setproctitle() function that knows how to
negotiate with the kernel, so it doesn't matter. It also ignores
changes made with setent()/putenv(), because those create fresh
entries but leave the initial environment strings untouched.
Solaris also ignores changes made after startup (it's in the dlopen
man page), and from a very quick look at its ld_lib_setup() I think it
achieved that with a copy. I believe its ancestor SunOS 4 invented
all of these conventions (and the mmap/virtual memory concepts they
rode in on), later nailed down to some degree in the System V ABI and
very widely adopted, but I don't see anything in the latter that
specifically addresses this point, eg LD_LIBRARY copy vs reference and
interaction with dlopen() (perhaps I didn't look hard enough). I'm
not sure what else you can point to to make strong claims about this
stuff, but I bet every system ignores changes after startup, it's just
that they found two ways to achieve that. POSIX says of dlopen that
the "file [argument] is used in an implementation-defined manner", and
of environ that we're welcome to swap a whole new environ, but doesn't
seem to tell us anything about the one that is replaced (who owns it?
is the initial one set up at execution time special? etc). The line
banning manipulation of the pointers environ refers to doesn't exactly
describe what we're doing (we're manipulating the strings pointed to
by the *previous* environ). UB.
On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... (though my patch could be a little sneakier and steal
all the bytes right up to the = sign to get more space for our
message!).
Here's one like that. No musl here -- does this work Wolfgang? Do we
think it's generous enough with space in practice that we could just
always do this for __linux__ systems without anyone noticing (ie
including glibc users)? Should we be more specific about which LD_*
variables? Do people not doing hacking/testing ever really set those,
eg on production servers? This code path was once used by up to a
dozen or so OSes but they're all dead, only Linux, Solaris and macOS
left, and I don't have any reason to think they suffer from this
problem and Macs don't even follow the SysV LD_ naming convention,
hence gating on Linux.
Attachments:
v2-0001-Don-t-clobber-LD_-environment-variables.patchapplication/octet-stream; name=v2-0001-Don-t-clobber-LD_-environment-variables.patchDownload
From a3a9a31bd2e95d93b3f2111e414fe0f3cda16423 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Mar 2024 00:25:34 +1300
Subject: [PATCH v2] Don't clobber LD_* environment variables.
Our PS_USE_CLOBBER_ARGV code relocates the environment, which itself is
allowed, in order to steal the old space to make a bigger argv[0] for
ps/top to show, which is probably formally undefined behavior.
Unfortunately that corrupts musl's copy of LD_LIBRARY_PATH if set,
because it stashes a pointer to the initial value before main() begins.
It probably doesn't matter for installed servers but breaks the
regression tests.
Here we look out for variables named LD_* while computing how much space
to steal, so we can avoid clobbering them. No change in behaviour if
not found, but otherwise you might potentially get ps status messages
truncated to a smaller size than before depending on the length of
preceding clobberable variables. There doesn't seem to be a nice way to
distinguish musl from glibc, and the truncation size shouldn't be too
small or at least has an easy mitigation: define a dummy variable.
Reported-by: Wolfgang Walther <walther@technowledgy.de>
Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
---
src/backend/utils/misc/ps_status.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e48..e71f54623b 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,31 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
+ {
+
+#if defined(__linux__)
+ /*
+ * If we see a variable named LD_XXX, give up here. The musl
+ * runtime linker is known to stash pointers to such values
+ * before main() starts, so we don't want to corrupt them or
+ * dlopen() might break. Since they presumably need only the
+ * value, it should be safe to steal the space right up to the
+ * equal sign.
+ */
+ if (strncmp(environ[i], "LD_", 3) == 0)
+ {
+ char *equals = strchr(environ[i], '=');
+
+ if (equals)
+ end_of_area = equals + 1;
+ else
+ end_of_area = environ[i] + strlen(environ[i]);
+ break;
+ }
+#endif
+
end_of_area = environ[i] + strlen(environ[i]);
+ }
}
ps_buffer = argv[0];
--
2.39.3 (Apple Git-146)
On Tue, Mar 19, 2024 at 11:48:50AM +1300, Thomas Munro wrote:
On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... (though my patch could be a little sneakier and steal
all the bytes right up to the = sign to get more space for our
message!).Here's one like that. No musl here -- does this work Wolfgang? Do we
think it's generous enough with space in practice that we could just
always do this for __linux__ systems without anyone noticing (ie
including glibc users)? Should we be more specific about which LD_*
variables? Do people not doing hacking/testing ever really set those,
eg on production servers? This code path was once used by up to a
dozen or so OSes but they're all dead, only Linux, Solaris and macOS
left, and I don't have any reason to think they suffer from this
problem and Macs don't even follow the SysV LD_ naming convention,
hence gating on Linux.
So this would truncate the process title on all Linux that have an LD_
environment entry, even those without musl?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
So this would truncate the process title on all Linux that have an LD_
environment entry, even those without musl?
Yep. How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
postmaster near you? You'd always get that much, plus as much of
/proc/XXX/environ as we can find before you reach LD_XXX=, which on a
typical system would, I guess, usually be never. If it's a problem
you could try to arrange for LD_ XXX to come later in environ[]. What
I observe is that they seem to get copied in backwards, wrt the
environment exported by the parent, so if you set DUMMY=XXXXXXXX just
before starting the process it'll make sacrificial space in the right
place (but I'm not sure where that effect is coming from so don't
quote me).
On Wed, Mar 20, 2024 at 02:12:54PM +1300, Thomas Munro wrote:
On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
So this would truncate the process title on all Linux that have an LD_
environment entry, even those without musl?Yep. How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
postmaster near you? You'd always get that much, plus as much of
$ cat /proc/2000/cmdline |wc -c
30
/proc/XXX/environ as we can find before you reach LD_XXX=, which on a
typical system would, I guess, usually be never. If it's a problem
you could try to arrange for LD_ XXX to come later in environ[]. What
I observe is that they seem to get copied in backwards, wrt the
environment exported by the parent, so if you set DUMMY=XXXXXXXX just
before starting the process it'll make sacrificial space in the right
place (but I'm not sure where that effect is coming from so don't
quote me).
I am just cautious about changing behavior on our most common platform
for a libc library I have never heard of.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Thomas Munro <thomas.munro@gmail.com> writes:
Yep. How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
postmaster near you? You'd always get that much, plus as much of
/proc/XXX/environ as we can find before you reach LD_XXX=, which on a
typical system would, I guess, usually be never.
I'd be happier about this if the target pattern were more restrictive.
Is there reason to think that musl keeps a pointer to anything besides
LD_LIBRARY_PATH?
regards, tom lane
On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Yep. How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
postmaster near you? You'd always get that much, plus as much of
/proc/XXX/environ as we can find before you reach LD_XXX=, which on a
typical system would, I guess, usually be never.I'd be happier about this if the target pattern were more restrictive.
Is there reason to think that musl keeps a pointer to anything besides
LD_LIBRARY_PATH?
Also LD_PRELOAD:
https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824
Yeah we could do just those two.
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd be happier about this if the target pattern were more restrictive.
Is there reason to think that musl keeps a pointer to anything besides
LD_LIBRARY_PATH?
Also LD_PRELOAD:
https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824
Yeah we could do just those two.
+1 for stopping only at one of those two names.
regards, tom lane
On Wed, Mar 20, 2024 at 2:27 PM Bruce Momjian <bruce@momjian.us> wrote:
I am just cautious about changing behavior on our most common platform
for a libc library I have never heard of.
Yeah, I hear you. I don't have a dog in this race, I just like
retro-computing mysteries and arguments about the meaning of
standards... That said I'm pretty sure no one should be running
production PostgreSQL systems held together by LD_LIBRARY_PATH, and if
they do, it looks like systemd/rc.d scripts set a bunch of PG_OOM_BLAH
stuff just before the start the cluster that would provide extra chaff
in front of LD_XXX stuff defined earlier, and then pg_ctl inserts even
more, and even if they don't use any of that stuff and just run the
postmaster directly with some other homegrown tooling, now we're down
to very niche/expert scenarios where it should be acceptable to point
to this thread that says "try setting an extra dummy variable after
you set LD_XXX!".
On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
+1 for stopping only at one of those two names.
Here's one like that for Wolfgang to test on musl.
Attachments:
v3-0001-Don-t-clobber-LD_LIBRARY_PATH-or-LD_PRELOAD.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Don-t-clobber-LD_LIBRARY_PATH-or-LD_PRELOAD.patchDownload
From b6d49f538c752ec427b748db18eadfb8b4a8a317 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Mar 2024 00:25:34 +1300
Subject: [PATCH v3] Don't clobber LD_LIBRARY_PATH or LD_PRELOAD.
Our PS_USE_CLOBBER_ARGV code relocates the environment, which itself is
allowed, in order to steal the old space to make a bigger argv[0] for
ps/top to show, which is undefined behavior, but a widely used technique
going back to sendmail.
Unfortunately that corrupts musl's copy of LD_LIBRARY_PATH if set,
because it stashes a pointer to the initial value before main() begins.
It probably doesn't matter for installed servers but breaks the
regression tests and seems generally bad.
Stop clobbering memory at or beyond those those variable names. Though
LD_LIBRARY_PATH is probably rare on production systems, anyone who is
defining it should hopefully be unlikely to see any new truncation in
practice, because pg_ctl defines some environment variables that appear
at the start of environ[] in the postmaster. If that's still not
enough, affected users may be able to define extra dummy variables.
Though most Linux users use glibc, it doesn't seem that easy or nice to
to try to distinguish glibc from musl, so apply this logic to all
__linux__ systems.
Reported-by: Wolfgang Walther <walther@technowledgy.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
---
src/backend/utils/misc/ps_status.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..d989e29364a 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,31 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+
+#if defined(__linux__)
+ /*
+ * If we see variables that the musl runtime linker is known
+ * to stash pointers to, give up so we don't break later calls
+ * to dlopen().
+ */
+ if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
+ strstr(environ[i], "LD_PRELOAD=") == environ[i])
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loops will not find contiguous space, but we
+ * don't break early because we want to count the total
+ * number.
+ */
+ end_of_area = strchr(environ[i], '=');
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
--
2.39.2
On 17.03.24 11:33, Christophe Pettus wrote:
On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
Nailed it. PS_USE_NONE fixes it.
Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case? It seems wrong that the current test passes a case that doesn't actually work.
We could turn it around and do
#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endif
On Wed, Mar 20, 2024 at 2:03 AM Peter Eisentraut <peter@eisentraut.org>
wrote:
On 17.03.24 11:33, Christophe Pettus wrote:
On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de>
wrote:
Nailed it. PS_USE_NONE fixes it.
Given the musl (still?) does not define a preprocessor macro specific to
it, is there a way of improving the test in pg_status.c to catch this
case? It seems wrong that the current test passes a case that doesn't
actually work.We could turn it around and do
#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endif
I like it. Neat and minimal.
cheers
andrew
Thomas Munro:
On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
+1 for stopping only at one of those two names.
Here's one like that for Wolfgang to test on musl.
Works fine.
Peter Eisentraut:
We could turn it around and do
#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endif
This works as well.
I also put together a PoC of what was mentioned in musl's mailing list:
Instead of clobbering environ at all, exec yourself again with padded
argv0. This works, too. Attached.
Best,
Wolfgang
Attachments:
0001-ps-status-exec.patchtext/x-patch; charset=UTF-8; name=0001-ps-status-exec.patchDownload
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..f8035c15df8 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -23,8 +23,6 @@
#include "utils/guc.h"
#include "utils/ps_status.h"
-extern char **environ;
-
/* GUC variable */
bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
@@ -38,7 +36,7 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
* use the function setproctitle(const char *, ...)
* (other BSDs)
* PS_USE_CLOBBER_ARGV
- * write over the argv and environment area
+ * write over the argv area
* (Linux and most SysV-like systems)
* PS_USE_WIN32
* push the string out as the name of a Windows event
@@ -69,9 +67,9 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
#ifndef PS_USE_NONE
-#ifndef PS_USE_CLOBBER_ARGV
/* all but one option need a buffer to write their ps line in */
#define PS_BUFFER_SIZE 256
+#ifndef PS_USE_CLOBBER_ARGV
static char ps_buffer[PS_BUFFER_SIZE];
static const size_t ps_buffer_size = PS_BUFFER_SIZE;
#else /* PS_USE_CLOBBER_ARGV */
@@ -105,9 +103,7 @@ static char **save_argv;
* from being clobbered by subsequent ps_display actions.
*
* (The original argv[] will not be overwritten by this routine, but may be
- * overwritten during init_ps_display. Also, the physical location of the
- * environment strings may be moved, so this should be called before any code
- * that might try to hang onto a getenv() result.)
+ * overwritten during init_ps_display.)
*
* Note that in case of failure this cannot call elog() as that is not
* initialized yet. We rely on write_stderr() instead.
@@ -126,7 +122,6 @@ save_ps_display_args(int argc, char **argv)
*/
{
char *end_of_area = NULL;
- char **new_environ;
int i;
/*
@@ -145,38 +140,32 @@ save_ps_display_args(int argc, char **argv)
return argv;
}
- /*
- * check for contiguous environ strings following argv
- */
- for (i = 0; environ[i] != NULL; i++)
- {
- if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
- }
-
ps_buffer = argv[0];
last_status_len = ps_buffer_size = end_of_area - argv[0];
- /*
- * move the environment out of the way
- */
- new_environ = (char **) malloc((i + 1) * sizeof(char *));
- if (!new_environ)
+ if (ps_buffer_size < PS_BUFFER_SIZE)
{
- write_stderr("out of memory\n");
+ char *argv0_padded;
+ int len = strlen(argv[0]);
+ char *self = argv[0];
+
+ /* right pad argv[0] with slashes up to PS_BUFFER_SIZE */
+ argv0_padded = (char *) malloc(len + PS_BUFFER_SIZE - ps_buffer_size);
+ memcpy(argv0_padded, argv[0], len);
+ MemSet(argv0_padded + len, '/', PS_BUFFER_SIZE - len - 1);
+ argv[0] = argv0_padded;
+
+ execvp(self, argv);
+ write_stderr("exec failed: %s\n%s", strerror(errno), self);
exit(1);
}
- for (i = 0; environ[i] != NULL; i++)
+ else
{
- new_environ[i] = strdup(environ[i]);
- if (!new_environ[i])
- {
- write_stderr("out of memory\n");
- exit(1);
- }
+ /* We might have already padded argv[0], so we must strip
+ * it again to make get_program_name work. */
+ int len = strlen(argv[0]);
+ while (len > 0 && argv[0][len - 1] == '/') argv[0][--len] = '\0';
}
- new_environ[i] = NULL;
- environ = new_environ;
}
/*
--
2.44.0
On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
Peter Eisentraut:
We could turn it around and do
#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endifThis works as well.
Yes, I prefer this. I am worried the environ hackery will bite us
someday and the cause will be hard to find.
I also put together a PoC of what was mentioned in musl's mailing list:
Instead of clobbering environ at all, exec yourself again with padded argv0.
This works, too. Attached.
It is hard to imagine why we would add an extra exec on every Linux
server start for this.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On Wed, Mar 20, 2024 at 09:35:51AM -0400, Bruce Momjian wrote:
On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
I also put together a PoC of what was mentioned in musl's mailing list:
Instead of clobbering environ at all, exec yourself again with padded argv0.
This works, too. Attached.It is hard to imagine why we would add an extra exec on every Linux
server start for this.
I guess we could conditionally exec only if we find we must, but then
such exec cases would be rare and rarely tested.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Bruce Momjian:
I guess we could conditionally exec only if we find we must, but then
such exec cases would be rare and rarely tested.
I think you might be seriously underestimating how often musl is used.
Alpine Linux uses musl and is very widespread in the container world
because of smaller image size.
The library/postgres docker image has been pulled about 8 billion times
since 2014 [1]https://hub.docker.com/v2/repositories/library/postgres. While we can't really tell how many of those pulled the
alpine variant of the image, comparing the alpine [2]https://hub.docker.com/v2/repositories/library/alpine and ubuntu/debian
[3,4] base images gives a rough estimate of >50% using alpine in general.
This is certainly not rare.
But yeah, buildfarm coverage for musl would be good, I agree. Maybe even
directly in CI?
Best,
Wolfgang
[1]: https://hub.docker.com/v2/repositories/library/postgres
[2]: https://hub.docker.com/v2/repositories/library/alpine
[3]: https://hub.docker.com/v2/repositories/library/ubuntu
[4]: https://hub.docker.com/v2/repositories/library/debian
On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
I think you might be seriously underestimating how often musl is used.
Alpine Linux uses musl and is very widespread in the container world
because of smaller image size
The last time I looked, its collation support didn't work at all...
Yours,
Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
I think you might be seriously underestimating how often musl is used.
Alpine Linux uses musl and is very widespread in the container world
because of smaller image size
The last time I looked, its collation support didn't work at all...
I think the same is true of some of the BSDen, so that's not a
large impediment to us. But in any case, if somebody wants
Alpine or musl to be considered a supported platform, they'd
best step up and run a buildfarm animal.
regards, tom lane
Bruce Momjian:
On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
Peter Eisentraut:
We could turn it around and do
#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endifThis works as well.
Yes, I prefer this. I am worried the environ hackery will bite us
someday and the cause will be hard to find.
Well, the environ hackery already bit and it sure was hard to find. But
this approach would still clobber environ happily... which is undefined
behavior. But certainly the opt-in to known-to-be-good libc variants is
a better approach than before.
Between this and "stop clobbering at LD_LIBRARY_PATH", I prefer the
latter, though.
I also put together a PoC of what was mentioned in musl's mailing list:
Instead of clobbering environ at all, exec yourself again with padded argv0.
This works, too. Attached.It is hard to imagine why we would add an extra exec on every Linux
server start for this.
Would this be a problem? For a running server this would happen only
once when the postmaster starts up, AFAICT.
Best,
Wolfgang
Laurenz Albe:
On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
I think you might be seriously underestimating how often musl is used.
Alpine Linux uses musl and is very widespread in the container world
because of smaller image sizeThe last time I looked, its collation support didn't work at all...
IIUC, using icu collations should work. I didn't extensively try,
though. But yeah - musl itself doesn't do it, knowingly so.
Best,
Wolfgang
Tom Lane:
But in any case, if somebody wants
Alpine or musl to be considered a supported platform, they'd
best step up and run a buildfarm animal.
Yeah, I was already thinking about that. But I guess we'd need to first
make the test suite pass on musl. i.e. $subject, but there are also some
smaller issues after that, before the full test suite will pass.
Best,
Wolfgang
On 20.03.24 15:37, Wolfgang Walther wrote:
It is hard to imagine why we would add an extra exec on every Linux
server start for this.Would this be a problem? For a running server this would happen only
once when the postmaster starts up, AFAICT.
I wonder if it would cause issues with systemd or similar, if the PID of
the running process is not the one that systemd started. If so, there
is probably a workaround, but it would have to be analyzed.
Peter Eisentraut:
Would this be a problem? For a running server this would happen only
once when the postmaster starts up, AFAICT.I wonder if it would cause issues with systemd or similar, if the PID of
the running process is not the one that systemd started. If so, there
is probably a workaround, but it would have to be analyzed.
I don't think that exec even creates a new PID. The current process is
replaced, so the PID stays the same.
Best,
Wolfgang
On Wed, Mar 20, 2024 at 03:24:58PM +0100, Wolfgang Walther wrote:
Bruce Momjian:
I guess we could conditionally exec only if we find we must, but then
such exec cases would be rare and rarely tested.I think you might be seriously underestimating how often musl is used.
Alpine Linux uses musl and is very widespread in the container world because
of smaller image size.The library/postgres docker image has been pulled about 8 billion times
since 2014 [1]. While we can't really tell how many of those pulled the
alpine variant of the image, comparing the alpine [2] and ubuntu/debian
[3,4] base images gives a rough estimate of >50% using alpine in general.
Uh, what is the current behavior of Postgres on musl? It just fails if
the process title is longer than argv[0] plus the environment space to
the LD_ environment variable, and then linking fails for certain
extensions? If there are many downloads, why would we only be getting
this report now?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Bruce Momjian:
The library/postgres docker image has been pulled about 8 billion times
since 2014 [1]. While we can't really tell how many of those pulled the
alpine variant of the image, comparing the alpine [2] and ubuntu/debian
[3,4] base images gives a rough estimate of >50% using alpine in general.Uh, what is the current behavior of Postgres on musl? It just fails if
the process title is longer than argv[0] plus the environment space to
the LD_ environment variable, and then linking fails for certain
extensions? If there are many downloads, why would we only be getting
this report now?
The process title works fine. It's just the way how space is cleared for
the process title, that is causing problems elsewhere.
The thing that is broken when running postgres on alpine/musl is, to put
libpq in a custom path and use LD_LIBRARY_PATH to find it when loading
libpqwalreceiver (+ some contrib modules). Nobody does that, especially
not in a container environment where postgres is likely the only thing
running in that container, so there is no point in using any custom
library paths or anything - the image is built once and made to work,
and everybody else is just using that working image.
The much more practical problem is that the test suite doesn't run,
because it makes use of LD_LIBRARY_PATH for that purpose. In the past,
the packagers for alpine only disabled the failing tests, but IIRC they
have given up on that and just disabled the whole test suite by now.
Best,
Wolfgang
On Wed, Mar 20, 2024 at 08:29:21PM +0100, Wolfgang Walther wrote:
Bruce Momjian:
The library/postgres docker image has been pulled about 8 billion times
since 2014 [1]. While we can't really tell how many of those pulled the
alpine variant of the image, comparing the alpine [2] and ubuntu/debian
[3,4] base images gives a rough estimate of >50% using alpine in general.Uh, what is the current behavior of Postgres on musl? It just fails if
the process title is longer than argv[0] plus the environment space to
the LD_ environment variable, and then linking fails for certain
extensions? If there are many downloads, why would we only be getting
this report now?The process title works fine. It's just the way how space is cleared for the
process title, that is causing problems elsewhere.The thing that is broken when running postgres on alpine/musl is, to put
libpq in a custom path and use LD_LIBRARY_PATH to find it when loading
libpqwalreceiver (+ some contrib modules). Nobody does that, especially not
in a container environment where postgres is likely the only thing running
in that container, so there is no point in using any custom library paths or
anything - the image is built once and made to work, and everybody else is
just using that working image.The much more practical problem is that the test suite doesn't run, because
it makes use of LD_LIBRARY_PATH for that purpose. In the past, the packagers
for alpine only disabled the failing tests, but IIRC they have given up on
that and just disabled the whole test suite by now.
Thanks, that is very clear.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On Thu, Mar 21, 2024 at 2:35 AM Bruce Momjian <bruce@momjian.us> wrote:
Yes, I prefer this. I am worried the environ hackery will bite us
someday and the cause will be hard to find.
Some speculation on how widespread this trick is: as mentioned, it
seems to come from sendmail (I didn't spend the time to find a repo
that has ancient versions, so this is someone's random snapshot repo,
but it's referring to pretty old dead systems):
https://github.com/Distrotech/sendmail/blob/master/sendmail/conf.c#L2436
That was once ubiquitous, back in the day. One of the most widespread
envon-clobberers these days must be openssh:
And funnily enough, googling LD_LIBRARY_PATH and openssh brings up a
few unsolved/unanswered questions about mysterious breakage (though I
didn't see any that mentioned musl by name and there could be other
explanations, *shrug*).
There is also Chromium/Chrome:
https://github.com/chromium/chromium/blob/main/base/process/set_process_title_linux.cc#L136
That code has some interesting commentary and points to a commit in
Linux which mentions setproctitle() and making sure it still works
(funny because setproctitle() isn't a function in any standard
userspace library AFAIK so I guess it just means the trick in
general), and also mentions the failure of attempts to get an official
way to do this negotiated between the relevant projects.
Of course we have to distinguish between the basic argv[] clobbering
trick which is barely even a trick, and the more advanced environ
stealing trick which confuses musl. A very widespread user of the
basic trick would be systemd, which tries to use the prctl() if it can
to get a much bigger window of memory to write on, but otherwise falls
back to accepting a small one. I guess we'd do the same if we could,
ie if a future Linux version didn't require CAP_SYS_RESOURCES to do
it:
Anyway, it looks like there is plenty of will out there to keep this
working, albeit in a weird semi-supported state whose cruftiness is
undeniable.
Thomas Munro:
Of course we have to distinguish between the basic argv[] clobbering
trick which is barely even a trick, and the more advanced environ
stealing trick which confuses musl.
Right. The latter not only confuses musl, but also makes
/proc/<pid>/environ return garbage. This is also mentioned at the bottom
of main.c, which has a workaround for the specific case of UBSan
depending on that. This is kind of funny: Because we are relying on
undefined behavior regarding the modification of environ, we need a
workaround for the "UndefinedBehaviorSanitizer" - I guess by failing
without this workaround, it wanted to tell us something..
This happens on glibc, too.
So summarizing:
1. The simple approach is to use PS_USE_CLOBBER_ARGV on Linux only for
glibc and other known-to-be-good-and-identifiable libc variants,
otherwise default to PS_USE_NONE. This will not only keep the problem
for /proc/../environ for glibc users, but also disable ps status for
musl entirely. Considering that probably the biggest use-case for musl
is to run postgres in containers, it's quite likely to actually run more
than just one cluster on a single machine. In this case... ps status
would be especially handy to identify which cluster a process belongs to.
2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH
/ LD_PRELOAD is found to keep those intact. This will keep ps status
support on musl, which is good. But the /proc/.../environ problem will
still be there, unchanged.
Both of those approaches rely on the undefined behavior of clobbering
environ.
3. The logical consequence of this is, to stop clobbering environ and
use only the available argv space. However, this will quickly leave us
with a very small ps status buffer to work with, making the feature less
useful. Note, that this could happen theoretically by starting postgres
with the fewest arguments and environment possible, too. Not sure what
the minimal buffer size is that could be achieved that way. The point
is: The buffer size is not guaranteed at all.
4. The upstream (musl) suggestion of which I sent a PoC was to "exec
yourself with a bigger argv". This works. I chose to pad argv0 with
trailing slashes. Those can safely be stripped away again, because any
argv0 which would come with a trailing slash to start with, would not be
the current executable, but a directory - so would fail exec immediately
anyway. This keeps /proc/.../environ intact and does not rely on
undefined behavior. Additionally, we get a guaranteed ps buffer size of
256, which is what we use on BSDs and Windows, too.
I wonder why we actually fall back to PS_USE_NONE by default.. and how
much of that is related to the environment clobbering to start with?
Could we even use the exec-approach as the fallback in all other cases
except BSDs and Windows and get rid of PS_USE_NONE? Clobbering only argv
sure seems way safer to do than what we do right now.
Best,
Wolfgang
Here is what we could with this:
2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH
/ LD_PRELOAD is found to keep those intact.
We could backpatch this down to v12. This would be one step to make the
test suite pass on Alpine Linux with musl and ultimately allow setting
up a buildfarm animal for that.
It does not solve the /proc/.../environ problem, but at least keeps ps
status working on musl as it did before, so not a regression.
4. The upstream (musl) suggestion of which I sent a PoC was to "exec
yourself with a bigger argv".
We could do this in HEAD now ...
Could we even use the exec-approach as the fallback in all other cases
except BSDs and Windows and get rid of PS_USE_NONE?
... and then remove PS_USE_NONE at the beginning of the v18 cycle.
This would give a bit more time for those "other systems", which were
previously falling back PS_USE_NONE and would then clobber argv, too.
Opinions?
Best,
Wolfgang
On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
4. The upstream (musl) suggestion of which I sent a PoC was to "exec
yourself with a bigger argv".We could do this in HEAD now ...
Just a thought: if we want to go this way, do we need a new exec call?
We already control the initial exec in pg_ctl.c.
Could we even use the exec-approach as the fallback in all other cases
except BSDs and Windows and get rid of PS_USE_NONE?... and then remove PS_USE_NONE at the beginning of the v18 cycle.
This would give a bit more time for those "other systems", which were
previously falling back PS_USE_NONE and would then clobber argv, too.
RIght. It's unspecified by POSIX whether ps shows changes to those
strings (and there are systems that don't), but it can't hurt to do so
anyway, and it'd be better than having a PS_USE_NONE code path that is
untested. I dimly recall that it turned out that PS_USE_NONE was
actually broken for a while without anyone noticing.
Thomas Munro <thomas.munro@gmail.com> writes:
Just a thought: if we want to go this way, do we need a new exec call?
We already control the initial exec in pg_ctl.c.
I'm resistant to assuming the postmaster is launched through pg_ctl.
systemd, for example, might well prefer not to do that, not to
mention all the troglodytes still using 1990s launch scripts.
A question that seems worth debating in this thread is how much
updating the process title is even worth nowadays. It feels like
a hangover from before we had pg_stat_activity and other monitoring
support. So I don't feel a huge need to support it on musl.
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.
regards, tom lane
On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.
What about meeting musl halfway: clobber argv, but only clobber
environ for the libcs known to tolerate that? Then musl might see
truncation at 30-60 characters or whatever it is, but that's probably
enough to see your cluster_name and backend type/user name which is
pretty useful information.
On Thu, Mar 21, 2024 at 7:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Just a thought: if we want to go this way, do we need a new exec call?
We already control the initial exec in pg_ctl.c.I'm resistant to assuming the postmaster is launched through pg_ctl.
systemd, for example, might well prefer not to do that, not to
mention all the troglodytes still using 1990s launch scripts.A question that seems worth debating in this thread is how much
updating the process title is even worth nowadays. It feels like
a hangover from before we had pg_stat_activity and other monitoring
support. So I don't feel a huge need to support it on musl.
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.
+1
cheers
andrew
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.
What about meeting musl halfway: clobber argv, but only clobber
environ for the libcs known to tolerate that? Then musl might see
truncation at 30-60 characters or whatever it is, but that's probably
enough to see your cluster_name and backend type/user name which is
pretty useful information.
No real objection here. I do wonder about the point you (or somebody)
made upthread that we don't have any testing of the PS_USE_NONE case;
but that could be addressed some other way.
regards, tom lane
On Fri, Mar 22, 2024 at 11:42:52AM +1300, Thomas Munro wrote:
On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
4. The upstream (musl) suggestion of which I sent a PoC was to "exec
yourself with a bigger argv".We could do this in HEAD now ...
Just a thought: if we want to go this way, do we need a new exec call?
We already control the initial exec in pg_ctl.c.Could we even use the exec-approach as the fallback in all other cases
except BSDs and Windows and get rid of PS_USE_NONE?... and then remove PS_USE_NONE at the beginning of the v18 cycle.
This would give a bit more time for those "other systems", which were
previously falling back PS_USE_NONE and would then clobber argv, too.RIght. It's unspecified by POSIX whether ps shows changes to those
strings (and there are systems that don't), but it can't hurt to do so
anyway, and it'd be better than having a PS_USE_NONE code path that is
untested. I dimly recall that it turned out that PS_USE_NONE was
actually broken for a while without anyone noticing.
Actually, I was thinking the opposite. Since the musl libc is widely
used, it will be tested, and I don't want to disable process display
updates for such a common platform.
I suggest we use the #ifdef test to continue our existing behavior for
the libraries we know about, like glibc, and use the LD_* process title
truncation hack for libc's we don't recognize.
Attached is a prototype patch which implements this based on previous
patches.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Attachments:
proctitle.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..154862f9fb0 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,32 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+
+ /* Do this for Linux libc libraries that might be unsafe. */
+#if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
+ /*
+ * If we see variables that the musl runtime linker is known
+ * to stash pointers to, give up so we don't break later calls
+ * to dlopen().
+ */
+ if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
+ strstr(environ[i], "LD_PRELOAD=") == environ[i])
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loops will not find contiguous space, but we
+ * don't break early because we want to count the total
+ * number.
+ */
+ end_of_area = strchr(environ[i], '=');
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
On Fri, Mar 22, 2024 at 12:20:11PM +1300, Thomas Munro wrote:
On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.What about meeting musl halfway: clobber argv, but only clobber
environ for the libcs known to tolerate that? Then musl might see
truncation at 30-60 characters or whatever it is, but that's probably
enough to see your cluster_name and backend type/user name which is
pretty useful information.
I just posted a prototype patch to implement this.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Hi,
On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
return garbage. This is also mentioned at the bottom of main.c, which has a
workaround for the specific case of UBSan depending on that. This is kind of
funny: Because we are relying on undefined behavior regarding the
modification of environ, we need a workaround for the
"UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
it wanted to tell us something..
I don't think that's quite a fair description. Ubsan is basically doing
undefined things itself, so it's turtles all the way down.
So summarizing:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.
Greetings,
Andres Freund
Sent from my iPad
On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
return garbage. This is also mentioned at the bottom of main.c, which has a
workaround for the specific case of UBSan depending on that. This is kind of
funny: Because we are relying on undefined behavior regarding the
modification of environ, we need a workaround for the
"UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
it wanted to tell us something..I don't think that's quite a fair description. Ubsan is basically doing
undefined things itself, so it's turtles all the way down.So summarizing:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.
We had one till 3 months ago. It’s on my list to recreate.
Cheers
Andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.
We had one till 3 months ago. It’s on my list to recreate.
How was it passing? The issue discussed in this thread has surely
been there for a long time, and Wolfgang mentioned that he sees
others.
regards, tom lane
On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:age
On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.We had one till 3 months ago. It’s on my list to recreate.
How was it passing? The issue discussed in this thread has surely
been there for a long time, and Wolfgang mentioned that he sees
others.
The buildfarm client has a switch that delays running regression tests until after the install stages.
Cheers
Andrew
Andres Freund:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.
I am already working with Andrew to set up a buildfarm animal to run
Alpine Linux/musl. I can look into the CI task as well. Are you
suggesting to change an existing task to run with Alpine/musl or to add
a new task for it? It would be docker image based for sure.
Best,
Wolfgang
Andrew Dunstan:
On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
How was it passing? The issue discussed in this thread has surely
been there for a long time, and Wolfgang mentioned that he sees
others.The buildfarm client has a switch that delays running regression tests until after the install stages.
Hm. So while that switch makes the animal pass the build, it did hide
exactly this problem. Not sure whether this switch should be used at
all, then. Was this switch only implemented for the specific case of
Alpine/musl or is there a different reason for it, as well?
The other issues I had been seeing were during make check-world, but not
make check. Those were things around setlocale() / /bin/locale, IIRC.
Not sure whether all of the tests are run by the buildfarm?
Best,
Wolfgang
Tom Lane:
Thomas Munro <thomas.munro@gmail.com> writes:
Just a thought: if we want to go this way, do we need a new exec call?
We already control the initial exec in pg_ctl.c.I'm resistant to assuming the postmaster is launched through pg_ctl.
systemd, for example, might well prefer not to do that, not to
mention all the troglodytes still using 1990s launch scripts.
Right, the systemd example in the docs is not using pg_ctl.
But, it should be easily possible to have:
- pg_ctl call postgres with a padded argv0
- postgres call itself with padding, if it wasn't called with that already
This way, there would be no additional exec call when started through
pg_ctl, but one more call when started directly.
Best,
Wolfgang
Bruce Momjian:
I suggest we use the #ifdef test to continue our existing behavior for
the libraries we know about, like glibc, and use the LD_* process title
truncation hack for libc's we don't recognize.Attached is a prototype patch which implements this based on previous
patches.
The condition to check for linux/glibc in your patch is slightly off:
#if ! defined(__linux__) || (! defined(__GLIBC__) &&
defined(__UCLIBC__ ))
should be
#if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
With the latter, it passes tests with musl.
Best,
Wolfgang
Wolfgang Walther:
The other issues I had been seeing were during make check-world, but not
make check. Those were things around setlocale() / /bin/locale, IIRC.
Not sure whether all of the tests are run by the buildfarm?
Ah, those other tests only fail when building --with-icu, but Andrew's
animal didn't do that.
Best,
Wolfgang
On Fri, Mar 22, 2024 at 4:02 AM Wolfgang Walther <walther@technowledgy.de>
wrote:
Andrew Dunstan:
On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
How was it passing? The issue discussed in this thread has surely
been there for a long time, and Wolfgang mentioned that he sees
others.The buildfarm client has a switch that delays running regression tests
until after the install stages.
Hm. So while that switch makes the animal pass the build, it did hide
exactly this problem. Not sure whether this switch should be used at
all, then. Was this switch only implemented for the specific case of
Alpine/musl or is there a different reason for it, as well?
Alpine was the main motivation, but it's also probably useful on Macs with
SIP enabled.
ISTR raising the Alpine issue back then (2018) but I can't find a reference
now.
cheers
andrew
On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
Bruce Momjian:
I suggest we use the #ifdef test to continue our existing behavior for
the libraries we know about, like glibc, and use the LD_* process title
truncation hack for libc's we don't recognize.Attached is a prototype patch which implements this based on previous
patches.The condition to check for linux/glibc in your patch is slightly off:
#if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
should be
#if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
With the latter, it passes tests with musl.
Yes, my logic was wrong. Not sure what I was thinking, frankly.
I am not a big fan of negating a complex conditional, but would rather
pass the negation into the conditional, new patch attached.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Attachments:
proctitle.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..c0786d30937 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,32 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+
+ /* Do this for Linux libc libraries that might be unsafe. */
+#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__ ))
+ /*
+ * If we see variables that the musl runtime linker is known
+ * to stash pointers to, give up so we don't break later calls
+ * to dlopen().
+ */
+ if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
+ strstr(environ[i], "LD_PRELOAD=") == environ[i])
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loops will not find contiguous space, but we
+ * don't break early because we want to count the total
+ * number.
+ */
+ end_of_area = strchr(environ[i], '=');
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
On Fri, Mar 22, 2024 at 09:36:19AM -0400, Bruce Momjian wrote:
On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
Bruce Momjian:
I suggest we use the #ifdef test to continue our existing behavior for
the libraries we know about, like glibc, and use the LD_* process title
truncation hack for libc's we don't recognize.Attached is a prototype patch which implements this based on previous
patches.The condition to check for linux/glibc in your patch is slightly off:
#if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
should be
#if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
With the latter, it passes tests with musl.
Yes, my logic was wrong. Not sure what I was thinking, frankly.
I am not a big fan of negating a complex conditional, but would rather
pass the negation into the conditional, new patch attached.
With no one "hoping this patch dies in a fire"*, I have updated it with
more details, which I now think is committable to master. Is this
something to backpatch? Seems too rare a bug to me.
* Robert Haas, /messages/by-id/CA+TgmoYsyrCNmg+Yh6rgP7K8r-bYPjCeF1tPxENRFwD4VZAZvw@mail.gmail.com
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
Attachments:
proctitle.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..b4ed2b1a365 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,37 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+
+#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__ ))
+ /*
+ * The musl runtime linker stores pointers to variable
+ * values which are defined in the process's environment.
+ * Therefore, in these cases we cannot overwrite such
+ * variable values when setting the process title or dynamic
+ * linking (dlopen) might fail. Here, we truncate the update
+ * of the process title when either of two important dynamic
+ * linking environment variables are set. Musl does not
+ * define any compiler symbols, so we have to do this for
+ * any Linux libc we don't know is safe.
+ */
+ if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
+ strstr(environ[i], "LD_PRELOAD=") == environ[i])
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loops will not find contiguous space, but we
+ * don't break early because we want to count the total
+ * number.
+ */
+ end_of_area = strchr(environ[i], '=');
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
On Thu, Mar 21, 2024 at 11:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Some speculation on how widespread this trick is: as mentioned, it
Just one more, which I just ran into by accident while looking for
something else, which I'm posting just in case this thread ever gets
used to try to convince musl hackers to change their end to support
it:
https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_setproctitle.c#L158
Bruce Momjian:
With no one "hoping this patch dies in a fire"*, I have updated it with
more details, which I now think is committable to master. Is this
something to backpatch? Seems too rare a bug to me.
I would like to see this backpatched to be able to run the regression
tests easily on all stable branches.
I have taken your's/Thomas' patch and extended it with a few more taking
in many of the ideas in this thread:
0001 Don't clobber LD_*
This is the patch you posted. This applies cleanly all the way down to
v12. This fixes the bug and allows running most of the tests with musl -
yeah!
I also confirmed, that this will not create practical problems with
library/postgres docker image, where this is likely used the most. While
"postgres" is called by default without any arguments here, plenty of
environment variables are passed. The docker image does use LD_PRELOAD
to trick initdb, but that's not set when running the postmaster, so not
a problem here.
This use-case also shows why the proposed patch to still partially
clobber environ at this stage is better than to not clobber environ at
all - in this case, the docker image would essentially have no ps status
at all by default.-
0002 Allow setting PS_USE_NONE via CPPFLAGS
This was proposed by Andrew and applies cleanly down to v12. Thus, it
could be backpatched, too. First and foremost this would allow setting a
buildfarm animal to use this flag to make sure this code path is
actually build/tested at all. This is something that Thomas and Tom
hinted at.
0003 Don't ever clobber environ again
This is the approach I had previously posted as a PoC. This would not be
backpatched, but I suggested this could go into v17 now. This avoids the
undefined behavior and sets the table to eventually set ps status via
argv by default and remove PS_USE_NONE later.
Compared to the PoC patch, I decided not to pad argv[0], because this
will break the ps status display for the postmaster itself. Instead exec
is called with an additional argument consisting of exactly 255 spaces.
I also tried avoiding the additional exec-call if postgres was called
via pg_ctl, as suggested by Peter. This quickly turned out to be more
invasive than I would have liked this to be, though.
The current approach works very well, the environment doesn't need to be
copied anymore and the workaround for /proc/<pid>/environ in main.c can
go away, too.
0004 Default to PS_USE_CLOBBER_ARGV
This changes the default to display ps status on all other systems, too.
This could potentially go in now as well, or be delayed to the beginning
of the v18 cycle. In the unlikely event that this breaks something on a
platform not considered here and we get a bug report, we can easily
advise to compile with CPPFLAGS=-DPS_USE_NONE, which is still there at
this stage.
0005 Remove PS_USE_NONE
However, if no reports come in and no problems are detected with 0004,
then this can be entirely removed. This for "later", whenever that is.
Best,
Wolfgang
Attachments:
0001-Don-t-clobber-LD_-environment-variables.patchtext/x-patch; charset=UTF-8; name=0001-Don-t-clobber-LD_-environment-variables.patchDownload
From 55bc8a12773d51c0ef070dfbb2714d08801390f5 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Sun, 24 Mar 2024 17:12:45 +0100
Subject: [PATCH 1/5] Don't clobber LD_* environment variables
Our PS_USE_CLOBBER_ARGV code relocates the environment, which itself is
allowed, in order to steal the old space to make a bigger argv[0] for
ps/top to show, which is probably formally undefined behavior.
Unfortunately that corrupts musl's copy of LD_LIBRARY_PATH if set,
because it stashes a pointer to the initial value before main() begins.
It probably doesn't matter for installed servers but breaks the
regression tests.
Here we look out for variables named LD_* while computing how much space
to steal, so we can avoid clobbering them. No change in behaviour if
not found, but otherwise you might potentially get ps status messages
truncated to a smaller size than before depending on the length of
preceding clobberable variables. Musl does not define any compiler
symbols, so we do this for any Linux libc we don't know is safe. The
truncation size shouldn't be too small or at least has an easy mitigation:
define a dummy variable.
Reported-by: Wolfgang Walther <walther@technowledgy.de>
Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
---
src/backend/utils/misc/ps_status.c | 32 +++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e483..7532af17dbb 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,37 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+
+#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__ ))
+ /*
+ * The musl runtime linker stores pointers to variable values
+ * which are defined in the process's environment. Therefore,
+ * in these cases we cannot overwrite such variable values
+ * when setting the process title or dynamic linking (dlopen)
+ * might fail. Here, we truncate the update of the process
+ * title when either of two important dynamic linking
+ * environment variables are set. Musl does not define any
+ * compiler symbols, so we have to do this for any Linux libc
+ * we don't know is safe.
+ */
+ if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
+ strstr(environ[i], "LD_PRELOAD=") == environ[i])
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loops will not find contiguous space, but we
+ * don't break early because we want to count the total
+ * number.
+ */
+ end_of_area = strchr(environ[i], '=');
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
--
2.44.0
0002-Allow-disabling-ps-status-display-via-CPPFLAGS.patchtext/x-patch; charset=UTF-8; name=0002-Allow-disabling-ps-status-display-via-CPPFLAGS.patchDownload
From 26b65a2f4a33463b28d15419159f49569a61d003 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Mon, 25 Mar 2024 21:08:39 +0100
Subject: [PATCH 2/5] Allow disabling ps status display via CPPFLAGS
The PS_USE_NONE is deemed the safest choice and thus the default if none of
the other options can safely be taken. However this option is also rarely
tested, if at all.
This change allows to explicitly build with CPPFLAGS=-DPS_USE_NONE for two
reasons:
- This can be used to have an animal in the buildfarm actually build this
code path.
- This will allow us to possibly change the default to PS_USE_CLOBBER_ARGV
later and still have an escape hatch in case users report problems with
that default.
Suggested by Andrew Dunstan.
---
src/backend/utils/misc/ps_status.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 7532af17dbb..d3513b22693 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -46,6 +46,8 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
* don't update ps display
* (This is the default, as it is safest.)
*/
+#ifndef PS_USE_NONE
+
#if defined(HAVE_SETPROCTITLE_FAST)
#define PS_USE_SETPROCTITLE_FAST
#elif defined(HAVE_SETPROCTITLE)
@@ -58,6 +60,7 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
#define PS_USE_NONE
#endif
+#endif /* not PS_USE_NONE */
/* Different systems want the buffer padded differently */
#if defined(__linux__) || defined(__darwin__)
--
2.44.0
0003-Don-t-clobber-any-environment-variables-for-ps-statu.patchtext/x-patch; charset=UTF-8; name=0003-Don-t-clobber-any-environment-variables-for-ps-statu.patchDownload
From 2834cbbbdd0794346eb6f78e9799c563b18b135c Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Mon, 25 Mar 2024 22:07:09 +0100
Subject: [PATCH 3/5] Don't clobber any environment variables for ps status
Clobbering the environment area for ps status is formally undefined
behavior and thus makes up a bad default.
Here we replace the clobbering of the environment with a different
mechanisum to increase the available argv area for ps status: If
the given arguments have not been long enough, yet, we exec ourself
with an extra argument of spaces. This argument of spaces is then
removed again for further processing, but the additional space is
used for ps status.
This a more portable approach, because it doesn't rely on undefined
behavior and can possibly be made the default instead of PS_USE_NONE
later. It also allows to remove the workaround for UBSan in main.c
and makes /proc/<pid>/environ work again.
---
src/backend/main/main.c | 41 ++---------
src/backend/utils/misc/ps_status.c | 114 +++++++++++++----------------
src/include/utils/ps_status.h | 2 +-
3 files changed, 58 insertions(+), 99 deletions(-)
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index bfd0c5ed658..f2e333d74eb 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -80,14 +80,16 @@ main(int argc, char *argv[])
* Remember the physical location of the initially given argv[] array for
* possible use by ps display. On some platforms, the argv[] storage must
* be overwritten in order to set the process title for ps. In such cases
- * save_ps_display_args makes and returns a new copy of the argv[] array.
+ * save_ps_display_args makes a new copy of the argv[] array.
*
- * save_ps_display_args may also move the environment strings to make
- * extra room. Therefore this should be done as early as possible during
- * startup, to avoid entanglements with code that might save a getenv()
- * result pointer.
+ * save_ps_display_args may also exec ourself again with an extra argument
+ * to increase the available argv area. Therefore this should be done as
+ * early as possible during startup, to avoid doing more things than
+ * necessary twice. Once called with an the extra argument, this will be
+ * removed again by save_ps_display_args, which then returns the new
+ * number for argc.
*/
- argv = save_ps_display_args(argc, argv);
+ argc = save_ps_display_args(argc, argv);
/*
* Fire up essential subsystems: error and memory management
@@ -414,30 +416,3 @@ check_root(const char *progname)
}
#endif /* WIN32 */
}
-
-/*
- * At least on linux, set_ps_display() breaks /proc/$pid/environ. The
- * sanitizer library uses /proc/$pid/environ to implement getenv() as it wants
- * to work independent of libc. When just using undefined and alignment
- * sanitizers, the sanitizer library is only initialized when the first error
- * occurs, by which time we've often already called set_ps_display(),
- * preventing the sanitizer libraries from seeing the options.
- *
- * We can work around that by defining __ubsan_default_options, a weak symbol
- * libsanitizer uses to get defaults from the application, and return
- * getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we
- * don't end up relying on a not-yet-working getenv().
- *
- * As this function won't get called when not running a sanitizer, it doesn't
- * seem necessary to only compile it conditionally.
- */
-const char *__ubsan_default_options(void);
-const char *
-__ubsan_default_options(void)
-{
- /* don't call libc before it's guaranteed to be initialized */
- if (!reached_main)
- return "";
-
- return getenv("UBSAN_OPTIONS");
-}
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index d3513b22693..c1161e85507 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -23,8 +23,6 @@
#include "utils/guc.h"
#include "utils/ps_status.h"
-extern char **environ;
-
/* GUC variable */
bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
@@ -38,7 +36,7 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
* use the function setproctitle(const char *, ...)
* (other BSDs)
* PS_USE_CLOBBER_ARGV
- * write over the argv and environment area
+ * write over the argv area
* (Linux and most SysV-like systems)
* PS_USE_WIN32
* push the string out as the name of a Windows event
@@ -78,6 +76,7 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
static char ps_buffer[PS_BUFFER_SIZE];
static const size_t ps_buffer_size = PS_BUFFER_SIZE;
#else /* PS_USE_CLOBBER_ARGV */
+#define PS_BUFFER_MIN_SIZE 256
static char *ps_buffer; /* will point to argv area */
static size_t ps_buffer_size; /* space determined at run time */
static size_t last_status_len; /* use to minimize length of clobber */
@@ -107,15 +106,13 @@ static char **save_argv;
* If needed, we make a copy of the original argv[] array to preserve it
* from being clobbered by subsequent ps_display actions.
*
- * (The original argv[] will not be overwritten by this routine, but may be
- * overwritten during init_ps_display. Also, the physical location of the
- * environment strings may be moved, so this should be called before any code
- * that might try to hang onto a getenv() result.)
+ * The original argv[] will not be overwritten by this routine, but may be
+ * overwritten during init_ps_display.
*
* Note that in case of failure this cannot call elog() as that is not
* initialized yet. We rely on write_stderr() instead.
*/
-char **
+int
save_ps_display_args(int argc, char **argv)
{
save_argc = argc;
@@ -125,11 +122,9 @@ save_ps_display_args(int argc, char **argv)
/*
* If we're going to overwrite the argv area, count the available space.
- * Also move the environment to make additional room.
*/
{
char *end_of_area = NULL;
- char **new_environ;
int i;
/*
@@ -145,71 +140,60 @@ save_ps_display_args(int argc, char **argv)
{
ps_buffer = NULL;
ps_buffer_size = 0;
- return argv;
- }
-
- /*
- * check for contiguous environ strings following argv
- */
- for (i = 0; environ[i] != NULL; i++)
- {
- if (end_of_area + 1 == environ[i])
- {
-
-#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__ ))
- /*
- * The musl runtime linker stores pointers to variable values
- * which are defined in the process's environment. Therefore,
- * in these cases we cannot overwrite such variable values
- * when setting the process title or dynamic linking (dlopen)
- * might fail. Here, we truncate the update of the process
- * title when either of two important dynamic linking
- * environment variables are set. Musl does not define any
- * compiler symbols, so we have to do this for any Linux libc
- * we don't know is safe.
- */
- if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
- strstr(environ[i], "LD_PRELOAD=") == environ[i])
- {
- /*
- * We can overwrite the name, but stop at the equals sign.
- * Future loops will not find contiguous space, but we
- * don't break early because we want to count the total
- * number.
- */
- end_of_area = strchr(environ[i], '=');
- }
- else
-#endif
- {
- end_of_area = environ[i] + strlen(environ[i]);
- }
- }
+ return argc;
}
ps_buffer = argv[0];
last_status_len = ps_buffer_size = end_of_area - argv[0];
+ }
- /*
- * move the environment out of the way
- */
- new_environ = (char **) malloc((i + 1) * sizeof(char *));
- if (!new_environ)
- {
- write_stderr("out of memory\n");
- exit(1);
- }
- for (i = 0; environ[i] != NULL; i++)
+ /*
+ * If the argv area wasn't big enough, also exec ourself an additional
+ * argument to make more room.
+ *
+ * Create a new argument which is just a big number of spaces. While this
+ * could technically be a valid filename and thus a valid argument to one
+ * of the other options, it should still be safe enough to break the case
+ * of passing exactly 256 spaces as a path. If someone really wants to do
+ * that, they could work around this, by just adding another argument
+ * after that, because we match the very last argument to this pattern.
+ */
+ {
+ char *extra_arg;
+
+ extra_arg = (char *) malloc(PS_BUFFER_MIN_SIZE);
+ MemSet(extra_arg, ' ', PS_BUFFER_MIN_SIZE - 1);
+ extra_arg[PS_BUFFER_MIN_SIZE - 1] = NULL;
+
+ if (ps_buffer_size <= PS_BUFFER_MIN_SIZE)
{
- new_environ[i] = strdup(environ[i]);
- if (!new_environ[i])
+ char **padded_argv;
+ int i;
+
+ padded_argv = (char **) malloc((argc + 2) * sizeof(char *));
+ if (!padded_argv)
{
write_stderr("out of memory\n");
exit(1);
}
+ for (i = 0; i < argc; i++)
+ padded_argv[i] = argv[i];
+ padded_argv[argc] = extra_arg;
+ padded_argv[argc + 1] = NULL;
+
+ execvp(argv[0], padded_argv);
+ write_stderr("exec failed: %s\n", strerror(errno));
+ exit(1);
+ }
+ else if (strcmp(argv[argc - 1], extra_arg) == 0)
+ {
+ /*
+ * To hide the extra_arg spaces on the ps output for the
+ * postmaster, we need to overwrite them with PS_PADDING.
+ */
+ MemSet(argv[argc - 1], PS_PADDING, PS_BUFFER_MIN_SIZE - 1);
+ save_argc = --argc;
}
- new_environ[i] = NULL;
- environ = new_environ;
}
/*
@@ -258,7 +242,7 @@ save_ps_display_args(int argc, char **argv)
}
#endif /* PS_USE_CLOBBER_ARGV */
- return argv;
+ return argc;
}
/*
diff --git a/src/include/utils/ps_status.h b/src/include/utils/ps_status.h
index ff5a2b2b8a2..7cd2cce8f2c 100644
--- a/src/include/utils/ps_status.h
+++ b/src/include/utils/ps_status.h
@@ -21,7 +21,7 @@
extern PGDLLIMPORT bool update_process_title;
-extern char **save_ps_display_args(int argc, char **argv);
+extern int save_ps_display_args(int argc, char **argv);
extern void init_ps_display(const char *fixed_part);
--
2.44.0
0004-Update-ps-status-on-all-systems-by-default.patchtext/x-patch; charset=UTF-8; name=0004-Update-ps-status-on-all-systems-by-default.patchDownload
From 78d0aebce3fa2722d1b6eef0c937251e8b5005ca Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Mon, 25 Mar 2024 22:18:11 +0100
Subject: [PATCH 4/5] Update ps status on all systems by default
With the new approach, PS_USE_CLOBBER_ARGV does not overwrite any
environment variables anymore and can be made the default.
For the unlikely event that a system could have problems other than
the ps status not working with this, PS_USE_NONE is kept for the
moment. This can still be explicitly enabled via CPPFLAGS.
---
src/backend/utils/misc/ps_status.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index c1161e85507..d1df1b39ee0 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -35,14 +35,14 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
* PS_USE_SETPROCTITLE
* use the function setproctitle(const char *, ...)
* (other BSDs)
- * PS_USE_CLOBBER_ARGV
- * write over the argv area
- * (Linux and most SysV-like systems)
* PS_USE_WIN32
* push the string out as the name of a Windows event
+ * PS_USE_CLOBBER_ARGV
+ * write over the argv area
+ * (This is the default)
* PS_USE_NONE
* don't update ps display
- * (This is the default, as it is safest.)
+ * (This can be enabled manually with CPPFLAGS=-DPS_USE_NONE.)
*/
#ifndef PS_USE_NONE
@@ -50,12 +50,10 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
#define PS_USE_SETPROCTITLE_FAST
#elif defined(HAVE_SETPROCTITLE)
#define PS_USE_SETPROCTITLE
-#elif defined(__linux__) || defined(__sun) || defined(__darwin__)
-#define PS_USE_CLOBBER_ARGV
#elif defined(WIN32)
#define PS_USE_WIN32
#else
-#define PS_USE_NONE
+#define PS_USE_CLOBBER_ARGV
#endif
#endif /* not PS_USE_NONE */
--
2.44.0
0005-Remove-obsolete-way-of-disabling-ps-status.patchtext/x-patch; charset=UTF-8; name=0005-Remove-obsolete-way-of-disabling-ps-status.patchDownload
From 57be007a32e0963db003a5ac9ca3cfbc0deef974 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Mon, 25 Mar 2024 22:21:10 +0100
Subject: [PATCH 5/5] Remove obsolete way of disabling ps status
This can be committed once it's clear that nobody actually needs it.
---
src/backend/utils/misc/ps_status.c | 30 ------------------------------
1 file changed, 30 deletions(-)
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index d1df1b39ee0..5d61259cc22 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -40,12 +40,7 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
* PS_USE_CLOBBER_ARGV
* write over the argv area
* (This is the default)
- * PS_USE_NONE
- * don't update ps display
- * (This can be enabled manually with CPPFLAGS=-DPS_USE_NONE.)
*/
-#ifndef PS_USE_NONE
-
#if defined(HAVE_SETPROCTITLE_FAST)
#define PS_USE_SETPROCTITLE_FAST
#elif defined(HAVE_SETPROCTITLE)
@@ -56,8 +51,6 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
#define PS_USE_CLOBBER_ARGV
#endif
-#endif /* not PS_USE_NONE */
-
/* Different systems want the buffer padded differently */
#if defined(__linux__) || defined(__darwin__)
#define PS_PADDING '\0'
@@ -66,8 +59,6 @@ bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
#endif
-#ifndef PS_USE_NONE
-
#ifndef PS_USE_CLOBBER_ARGV
/* all but one option need a buffer to write their ps line in */
#define PS_BUFFER_SIZE 256
@@ -92,8 +83,6 @@ static size_t ps_buffer_nosuffix_len;
static void flush_ps_display(void);
-#endif /* not PS_USE_NONE */
-
/* save the original argv[] location here */
static int save_argc;
static char **save_argv;
@@ -254,15 +243,12 @@ save_ps_display_args(int argc, char **argv)
void
init_ps_display(const char *fixed_part)
{
-#ifndef PS_USE_NONE
bool save_update_process_title;
-#endif
Assert(fixed_part || MyBackendType);
if (!fixed_part)
fixed_part = GetBackendTypeDesc(MyBackendType);
-#ifndef PS_USE_NONE
/* no ps display for stand-alone backend */
if (!IsUnderPostmaster)
return;
@@ -318,10 +304,8 @@ init_ps_display(const char *fixed_part)
update_process_title = true;
set_ps_display("");
update_process_title = save_update_process_title;
-#endif /* not PS_USE_NONE */
}
-#ifndef PS_USE_NONE
/*
* update_ps_display_precheck
* Helper function to determine if updating the process title is
@@ -346,7 +330,6 @@ update_ps_display_precheck(void)
return true;
}
-#endif /* not PS_USE_NONE */
/*
* set_ps_display_suffix
@@ -356,7 +339,6 @@ update_ps_display_precheck(void)
void
set_ps_display_suffix(const char *suffix)
{
-#ifndef PS_USE_NONE
size_t len;
/* first, check if we need to update the process title */
@@ -398,7 +380,6 @@ set_ps_display_suffix(const char *suffix)
/* and set the new title */
flush_ps_display();
-#endif /* not PS_USE_NONE */
}
/*
@@ -408,7 +389,6 @@ set_ps_display_suffix(const char *suffix)
void
set_ps_display_remove_suffix(void)
{
-#ifndef PS_USE_NONE
/* first, check if we need to update the process title */
if (!update_ps_display_precheck())
return;
@@ -426,7 +406,6 @@ set_ps_display_remove_suffix(void)
/* and set the new title */
flush_ps_display();
-#endif /* not PS_USE_NONE */
}
/*
@@ -440,7 +419,6 @@ set_ps_display_with_len(const char *activity, size_t len)
{
Assert(strlen(activity) == len);
-#ifndef PS_USE_NONE
/* first, check if we need to update the process title */
if (!update_ps_display_precheck())
return;
@@ -466,10 +444,8 @@ set_ps_display_with_len(const char *activity, size_t len)
/* Transmit new setting to kernel, if necessary */
flush_ps_display();
-#endif /* not PS_USE_NONE */
}
-#ifndef PS_USE_NONE
static void
flush_ps_display(void)
{
@@ -506,7 +482,6 @@ flush_ps_display(void)
}
#endif /* PS_USE_WIN32 */
}
-#endif /* not PS_USE_NONE */
/*
* Returns what's currently in the ps display, in case someone needs
@@ -526,12 +501,7 @@ get_ps_display(int *displen)
}
#endif
-#ifndef PS_USE_NONE
*displen = (int) (ps_buffer_cur_len - ps_buffer_fixed_size);
return ps_buffer + ps_buffer_fixed_size;
-#else
- *displen = 0;
- return "";
-#endif
}
--
2.44.0
On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
Bruce Momjian:
With no one "hoping this patch dies in a fire"*, I have updated it with
more details, which I now think is committable to master. Is this
something to backpatch? Seems too rare a bug to me.I would like to see this backpatched to be able to run the regression tests
easily on all stable branches.
You want to risk destabilizing Postgres by backpatching something this
complex so the regression tests can be run on all stable branches? I
think you are overestimating our desire to take on risk.
Also, in my patch, the parentheses here:
#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__))
are unnecessary so they should be removed:
#if defined(__linux__) && ! defined(__GLIBC__) && ! defined(__UCLIBC__)
I am only willing to apply my patch, and only to master. Other
committers might be more willing.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On 22.03.24 20:44, Bruce Momjian wrote:
+ * linking (dlopen) might fail. Here, we truncate the update + * of the process title when either of two important dynamic + * linking environment variables are set. Musl does not + * define any compiler symbols, so we have to do this for + * any Linux libc we don't know is safe. + */ + if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] || + strstr(environ[i], "LD_PRELOAD=") == environ[i])
What determines which variables require this treatment?
On Mon, Mar 25, 2024 at 11:46:09PM +0100, Peter Eisentraut wrote:
On 22.03.24 20:44, Bruce Momjian wrote:
+ * linking (dlopen) might fail. Here, we truncate the update + * of the process title when either of two important dynamic + * linking environment variables are set. Musl does not + * define any compiler symbols, so we have to do this for + * any Linux libc we don't know is safe. + */ + if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] || + strstr(environ[i], "LD_PRELOAD=") == environ[i])What determines which variables require this treatment?
Thomas Munro came up with that part of the patch. I just combined his
patch with the macro test.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
What determines which variables require this treatment?
That came from me peeking at their code:
/messages/by-id/CA+hUKGKNK5V8XwwJJZm36s3EUy8V51xu4XiE8=26n=Wq3OGd4A@mail.gmail.com
I had originally proposed to avoid anything beginning "LD_" but Tom
suggested being more specific. I doubt LD_PRELOAD can really hurt you
though (the linker probably only needs the value at the start by
definition, not at later dlopen() time (?)). I dunno. If you're
asking if there is any standard or whatever supplying these names, the
System V or at least ELF standards talk about LD_LIBRARY_PATH (though
those standards don't know/care what happens after userspace takes
over control of the environment concept, they just talk about how the
world is created when you exec a process, so they AFAICS they don't
address this clobbering stuff, and AFAIK other LD_XXX stuff is
probably implementation specific).
Bruce Momjian <bruce@momjian.us> writes:
On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
I would like to see this backpatched to be able to run the regression tests
easily on all stable branches.
You want to risk destabilizing Postgres by backpatching something this
complex so the regression tests can be run on all stable branches? I
think you are overestimating our desire to take on risk.
If we want a buildfarm animal testing this platform, we kind of need
to support it on all branches. Having said that, I agree with you
that we are looking for a minimalist fix not a maximalist one.
I think the 0001 patch is about right, but the rest seem to be solving
problems we don't have.
regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
What determines which variables require this treatment?
I had originally proposed to avoid anything beginning "LD_" but Tom
suggested being more specific. I doubt LD_PRELOAD can really hurt you
though (the linker probably only needs the value at the start by
definition, not at later dlopen() time (?)).
Oh, good point. So we could simplify the patch by only looking for
LD_LIBRARY_PATH.
regards, tom lane
On Mon, Mar 25, 2024 at 07:14:25PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
I would like to see this backpatched to be able to run the regression tests
easily on all stable branches.You want to risk destabilizing Postgres by backpatching something this
complex so the regression tests can be run on all stable branches? I
think you are overestimating our desire to take on risk.If we want a buildfarm animal testing this platform, we kind of need
to support it on all branches. Having said that, I agree with you
that we are looking for a minimalist fix not a maximalist one.
I think the 0001 patch is about right, but the rest seem to be solving
problems we don't have.
I could support the minimalist patch applied to all branches.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
I wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
I had originally proposed to avoid anything beginning "LD_" but Tom
suggested being more specific. I doubt LD_PRELOAD can really hurt you
though (the linker probably only needs the value at the start by
definition, not at later dlopen() time (?)).
Oh, good point. So we could simplify the patch by only looking for
LD_LIBRARY_PATH.
I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached. (I failed to resist the
temptation to rewrite the comments.)
regards, tom lane
Attachments:
v2-0001-Don-t-clobber-LD_-environment-variables.patchtext/x-diff; charset=us-ascii; name=v2-0001-Don-t-clobber-LD_-environment-variables.patchDownload
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e48..92cd2c7899 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,33 @@ save_ps_display_args(int argc, char **argv)
for (i = 0; environ[i] != NULL; i++)
{
if (end_of_area + 1 == environ[i])
- end_of_area = environ[i] + strlen(environ[i]);
+ {
+ /*
+ * The musl runtime linker keeps a static pointer to the
+ * initial value of LD_LIBRARY_PATH, if that is defined in the
+ * process's environment. Therefore, we must not overwrite the
+ * value of that setting and thus cannot advance end_of_area
+ * beyond it. Musl does not define any identifying compiler
+ * symbol, so we have to do this for any Linux libc we don't
+ * know is safe.
+ */
+#if defined(__linux__) && (!defined(__GLIBC__) && !defined(__UCLIBC__ ))
+ if (strncmp(environ[i], "LD_LIBRARY_PATH=", 16) == 0)
+ {
+ /*
+ * We can overwrite the name, but stop at the equals sign.
+ * Future loop iterations will not find any more
+ * contiguous space, but we don't break early because we
+ * need to count the total number of environ[] entries.
+ */
+ end_of_area = environ[i] + 15;
+ }
+ else
+#endif
+ {
+ end_of_area = environ[i] + strlen(environ[i]);
+ }
+ }
}
ps_buffer = argv[0];
On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached. (I failed to resist the
temptation to rewrite the comments.)
LGTM.
Hi,
On 2024-03-22 08:55:52 +0100, Wolfgang Walther wrote:
Andres Freund:
FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.I am already working with Andrew to set up a buildfarm animal to run Alpine
Linux/musl. I can look into the CI task as well. Are you suggesting to
change an existing task to run with Alpine/musl or to add a new task for it?
It would be docker image based for sure.
I'd rather adapt one of the existing tasks, to avoid increasing CI costs
unduly.
The way we currently run CI for testing of not-yet-merged patches runs
all tasks other than macos as full VMs, that turned out to be faster &
cheaper.
FWIW, except for one small issue, building postgres against musl works on
debian and the tests pass if I install first.
The small problem mentioned above is that on debian linux/fs.h isn't available
when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
fail to compile. I assume that's not the case on "fully musl" distro?
Greetings,
Andres Freund
On Tue, Mar 26, 2024 at 12:49:55PM +1300, Thomas Munro wrote:
On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached. (I failed to resist the
temptation to rewrite the comments.)LGTM.
+1
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On 26.03.24 00:43, Tom Lane wrote:
I wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
I had originally proposed to avoid anything beginning "LD_" but Tom
suggested being more specific. I doubt LD_PRELOAD can really hurt you
though (the linker probably only needs the value at the start by
definition, not at later dlopen() time (?)).Oh, good point. So we could simplify the patch by only looking for
LD_LIBRARY_PATH.I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached. (I failed to resist the
temptation to rewrite the comments.)
Yeah, I was more looking for a comment for posterity for *why* we need
to preserve this variable in particular. The updated comment looks
reasonable.
Bruce Momjian:
You want to risk destabilizing Postgres by backpatching something this
complex so the regression tests can be run on all stable branches? I
think you are overestimating our desire to take on risk.
I specifically wrote about backpatching the first (and maybe second)
patch. None of that is risky.
Patches 3-5 were not meant for backpatching at all.
Best,
Wolfgang
Tom Lane:
If we want a buildfarm animal testing this platform, we kind of need
to support it on all branches. Having said that, I agree with you
that we are looking for a minimalist fix not a maximalist one.
I think the 0001 patch is about right, but the rest seem to be solving
problems we don't have.
The second patch potentially solves the problem of PS_USE_NONE not being
tested. Of course you could also set up a buildfarm animal on some other
platform, which is sure to fall through to PS_USE_NONE, but that seems
to have not worked in the past:
Thomas Munro:
I dimly recall that it turned out that PS_USE_NONE was
actually broken for a while without anyone noticing
Best,
Wolfgang
The need to do $subject came up in [1]/messages/by-id/fddd1cd6-dc16-40a2-9eb5-d7fef2101488@technowledgy.de. Moving this to a separate
discussion on -hackers, because there are more issues to solve than just
the LD_LIBRARY_PATH problem.
Andres Freund:
FWIW, except for one small issue, building postgres against musl works on
debian and the tests pass if I install first.The small problem mentioned above is that on debian linux/fs.h isn't available
when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
fail to compile. I assume that's not the case on "fully musl" distro?
Correct, I have not seen this before on Alpine.
Here is my progress setting up a buildfarm animal to run on Alpine Linux
and the issues I found, so far:
The animal runs in a docker container via GitHub Actions in [2]https://github.com/technowledgy/postgresql-buildfarm-alpine/actions/workflows/run.yaml. Right
now it's still running with --test, until I get the credentials to
activate it.
I tried to enable everything (except systemd, because Alpine doesn't
have it) and run all tests. The LDAP tests are failing right now, but
that is likely something that I need to fix in the Dockerfile - it's
failing to start the slapd, IIRC. There are other issues, though - all
of them have open pull requests in that repo [3]https://github.com/technowledgy/postgresql-buildfarm-alpine/pulls.
I also had to skip the recovery check. Andrew mentioned that he had to
do that, too, when he was still running his animal on Alpine. Not sure
what this is about, yet.
Building --with-icu fails two tests. One of them (001_initdb) is fixed
by having the "locale" command in your PATH, which is not the case on
Alpine by default. I assume this will not break on your debian/musl
build, Andres - but it will also probably not return any sane values,
because it will run glibc's locale command.
I haven't looked into that in detail, yet, but I think the other test
(icu/010_database) fails because it expects that setlocale(LC_COLLATE,
<illegal_value>) throws an error. I think it doesn't do that on musl,
because LC_COLLATE is not implemented.
Those failing tests are not "just failing", but probably mean that we
need to do something about how we deal with locale/setlocale on musl.
The last failure is about building --with-nls. This fails with something
like:
ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to
`libintl_gettext'
Of course, gettext-dev is installed, libintl.so is available in /usr/lib
and it also contains the symbol. So not sure what's happening here.
Andres, did you build --with-icu and/or --with-nls on debian/musl? Did
you run the recovery tests?
Best,
Wolfgang
[1]: /messages/by-id/fddd1cd6-dc16-40a2-9eb5-d7fef2101488@technowledgy.de
/messages/by-id/fddd1cd6-dc16-40a2-9eb5-d7fef2101488@technowledgy.de
[2]: https://github.com/technowledgy/postgresql-buildfarm-alpine/actions/workflows/run.yaml
https://github.com/technowledgy/postgresql-buildfarm-alpine/actions/workflows/run.yaml
[3]: https://github.com/technowledgy/postgresql-buildfarm-alpine/pulls
walther@technowledgy.de writes:
The second patch potentially solves the problem of PS_USE_NONE not being
tested. Of course you could also set up a buildfarm animal on some other
platform, which is sure to fall through to PS_USE_NONE, but that seems
to have not worked in the past:
Thomas Munro:
I dimly recall that it turned out that PS_USE_NONE was
actually broken for a while without anyone noticing
I think what Thomas is recollecting is this:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_15_BR [0fb6954aa] 2022-03-27 12:57:46 -0400
Branch: REL_14_STABLE Release: REL_14_3 [3f7a59c59] 2022-03-27 12:57:52 -0400
Branch: REL_13_STABLE Release: REL_13_7 [9016a2a3d] 2022-03-27 12:57:57 -0400
Fix breakage of get_ps_display() in the PS_USE_NONE case.
Commit 8c6d30f21 caused this function to fail to set *displen
in the PS_USE_NONE code path. If the variable's previous value
had been negative, that'd lead to a memory clobber at some call
sites. We'd managed not to notice due to very thin test coverage
of such configurations, but this appears to explain buildfarm member
lorikeet's recent struggles.
Credit to Andrew Dunstan for spotting the problem. Back-patch
to v13 where the bug was introduced.
Discussion: /messages/by-id/136102.1648320427@sss.pgh.pa.us
The problem wasn't lack of coverage, it was that the failure was
intermittent and erratic enough to be very hard to diagnose.
I think that's more bad luck than anything else.
regards, tom lane
Peter Eisentraut <peter@eisentraut.org> writes:
On 26.03.24 00:43, Tom Lane wrote:
I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached. (I failed to resist the
temptation to rewrite the comments.)
Yeah, I was more looking for a comment for posterity for *why* we need
to preserve this variable in particular. The updated comment looks
reasonable.
OK, pushed after a bit more comment-fiddling.
regards, tom lane
Here's an update on the progress to run musl (Alpine Linux) in the
buildfarm.
Wolfgang Walther:
The animal runs in a docker container via GitHub Actions in [2]. Right
now it's still running with --test, until I get the credentials to
activate it.
The animals have been activated and are reporting now. Thanks, Andrew!
I tried to enable everything (except systemd, because Alpine doesn't
have it) and run all tests. The LDAP tests are failing right now, but
that is likely something that I need to fix in the Dockerfile - it's
failing to start the slapd, IIRC. There are other issues, though - all
of them have open pull requests in that repo [3].
ldap tests are enabled, just a missing package.
I also had to skip the recovery check. Andrew mentioned that he had to
do that, too, when he was still running his animal on Alpine. Not sure
what this is about, yet.
This was about a missing init process in the docker image. Without an
init process reaping zombie processes, the recovery tests end up with
some supposed-to-be-terminated backends still running and can't start
them up again. Fixed by adding a minimal init process with "tinit".
Building --with-icu fails two tests. One of them (001_initdb) is fixed
by having the "locale" command in your PATH, which is not the case on
Alpine by default. I assume this will not break on your debian/musl
build, Andres - but it will also probably not return any sane values,
because it will run glibc's locale command.
I haven't looked into that in detail, yet, but I think the other test
(icu/010_database) fails because it expects that setlocale(LC_COLLATE,
<illegal_value>) throws an error. I think it doesn't do that on musl,
because LC_COLLATE is not implemented.
Those failing tests are not "just failing", but probably mean that we
need to do something about how we deal with locale/setlocale on musl.
I still need to look into this in depth.
The last failure is about building --with-nls. This fails with something
like:ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to
`libintl_gettext'Of course, gettext-dev is installed, libintl.so is available in /usr/lib
and it also contains the symbol. So not sure what's happening here.
This is an Alpine Linux packaging issue. Theoretically, it could be made
to work by introducing some configure/meson flag like "--with-gettext"
or so, to prefer gettext's libintl over the libc-builtin. However, NixOS
/ nixpkgs with its pkgsMusl overlay manages to solve this issue just
fine, builds with --enable-nls and gettext work. Thus, I conclude this
is best solved upstream in Alpine Linux.
TLDR: The only real issue which is still open from PostgreSQL's side is
around locales and ICU - certainly the pain point in musl. Will look
into it further.
Best,
Wolfgang
About building one of the CI tasks with musl:
Andres Freund:
I'd rather adapt one of the existing tasks, to avoid increasing CI costs unduly.
I looked into this and I think the only task that could be changed is
the SanityCheck. This is because this builds without any additional
features enabled. I guess that makes sense, because otherwise those
dependencies would first have to be built with musl-gcc as well.
FWIW, except for one small issue, building postgres against musl works on debian and the tests pass if I install first.
After the fix for LD_LIBRARY_PATH this now works as expected without
installing first. I confirmed it works on debian with CC=musl-gcc.
The small problem mentioned above is that on debian linux/fs.h isn't available
when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
fail to compile.
According to [1]https://debian-bugs-dist.debian.narkive.com/VlFkLigg/bug-789789-musl-fails-to-compile-stuff-that-depends-on-kernel-headers, this can be worked around by linking some folders:
ln -s /usr/include/linux /usr/include/x86_64-linux-musl/
ln -s /usr/include/asm-generic /usr/include/x86_64-linux-musl/
ln -s /usr/include/x86_64-linux-gnu/asm /usr/include/x86_64-linux-musl/
Please find a patch to use musl-gcc in SanityCheck attached. Logs from
the CI run are in [2]https://cirrus-ci.com/task/5741892590108672. It has this in the configure phase:
[13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'
[13:19:52.712] C compiler for the host machine: ccache musl-gcc (gcc
10.2.1 "cc (Debian 10.2.1-6) 10.2.1 20210110")
[13:19:52.712] C linker for the host machine: musl-gcc ld.bfd 2.35.2
[13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'
So meson picks up musl-gcc properly. I also double checked that without
the links above, the build does indeed fail with the linux/fs.h error.
I assume the installation of musl-tools should be done in the
pg-vm-images repo instead of the additional script here?
Best,
Wolfgang
[1]: https://debian-bugs-dist.debian.narkive.com/VlFkLigg/bug-789789-musl-fails-to-compile-stuff-that-depends-on-kernel-headers
https://debian-bugs-dist.debian.narkive.com/VlFkLigg/bug-789789-musl-fails-to-compile-stuff-that-depends-on-kernel-headers
[2]: https://cirrus-ci.com/task/5741892590108672
Attachments:
0001-Build-SanityCheck-against-musl.patchtext/x-patch; charset=UTF-8; name=0001-Build-SanityCheck-against-musl.patchDownload
From 4a69d9851e7bbd7cd521d236847af9ebf5e6253b Mon Sep 17 00:00:00 2001
From: Wolfgang Walther <walther@technowledgy.de>
Date: Sun, 31 Mar 2024 15:17:43 +0200
Subject: [PATCH] Build SanityCheck against musl
---
.cirrus.tasks.yml | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 72f553e69f4..5815c51abbe 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -69,6 +69,7 @@ task:
CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
# no options enabled, should be small
CCACHE_MAXSIZE: "150M"
+ CC: ccache musl-gcc
# While containers would start up a bit quicker, building is a bit
# slower. This way we don't have to maintain a container image.
@@ -77,6 +78,14 @@ task:
ccache_cache:
folder: $CCACHE_DIR
+ setup_additional_packages_script: |
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install musl-tools
+ ln -s -t /usr/include/x86_64-linux-musl/ \
+ /usr/include/linux \
+ /usr/include/asm-generic \
+ /usr/include/x86_64-linux-gnu/asm
+
create_user_script: |
useradd -m postgres
chown -R postgres:postgres .
--
2.44.0
On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther
<walther@technowledgy.de> wrote:
The animal runs in a docker container via GitHub Actions in [2].
Great idea :-)
On 31.03.24 15:34, walther@technowledgy.de wrote:
I'd rather adapt one of the existing tasks, to avoid increasing CI
costs unduly.I looked into this and I think the only task that could be changed is
the SanityCheck.
I think SanityCheck should run a simple, "average" environment, like the
current Debian one. Otherwise, niche problems with musl or multi-arch
or whatever will throw off the entire build pipeline.
Peter Eisentraut:
On 31.03.24 15:34, walther@technowledgy.de wrote:
I'd rather adapt one of the existing tasks, to avoid increasing CI
costs unduly.I looked into this and I think the only task that could be changed is
the SanityCheck.I think SanityCheck should run a simple, "average" environment, like the
current Debian one. Otherwise, niche problems with musl or multi-arch
or whatever will throw off the entire build pipeline.
All the errors/problems I have seen so far, while setting up the
buildfarm animal on Alpine Linux, have been way beyond what SanityCheck
does. Problems only appeared in the tests suites, of which sanity check
only runs *very* basic ones. I don't have much experience with the
"cross" setup, that "musl on debian" essentially is, though.
All those things are certainly out of scope for CI - they are tested in
the build farm instead.
I do agree: SanityCheck doesn't feel like the right place to put this.
But on the other side.. if it really fails to *build* with musl, then it
shouldn't make a difference whether you will be notified about that
immediately or later in the CI pipeline. It certainly needs the fewest
additional resources to put it there.
I'm not sure what Andres meant with "adopting one of the existing
tasks". It could fit as another step into the "Linux - Debian Bullseye -
Autoconf" task, too. A bit similar to how the meson task build for 32
and 64bit. This would still not be an entirely new task like I proposed
initially (to run in Docker).
Best,
Wolfgang
Wolfgang Walther <walther@technowledgy.de> writes:
Peter Eisentraut:
I think SanityCheck should run a simple, "average" environment, like the
current Debian one. Otherwise, niche problems with musl or multi-arch
or whatever will throw off the entire build pipeline.
I do agree: SanityCheck doesn't feel like the right place to put this.
But on the other side.. if it really fails to *build* with musl, then it
shouldn't make a difference whether you will be notified about that
immediately or later in the CI pipeline. It certainly needs the fewest
additional resources to put it there.
That is not the concern here. What I think Peter is worried about,
and certainly what I'm worried about, is that a breakage in
SanityCheck comprehensively breaks all CI testing for all Postgres
developers. One buildfarm member that's failing does not halt
progress altogether, so it's not even in the same ballpark of
being as critical. So I agree with Peter that SanityCheck had
better use a very common, vanilla environment.
To be blunt, I do not think we need to test musl in the CI pipeline.
I see it as one of the niche platforms that the buildfarm exists
to test.
regards, tom lane
Tom Lane:
That is not the concern here. What I think Peter is worried about,
and certainly what I'm worried about, is that a breakage in
SanityCheck comprehensively breaks all CI testing for all Postgres
developers.
You'd have to commit a failing patch first to break CI for all other
developers. If you're only going to commit patches that pass those CI
tasks, then this is not going to happen. Then it only becomes a question
of how much feedback *you* get from a single CI run of your own patch.
To be blunt, I do not think we need to test musl in the CI pipeline.
I see it as one of the niche platforms that the buildfarm exists
to test.
I don't really have an opinion on this. I'm fine with having musl in the
buildfarm only. I don't expect the core build itself to fail with musl
anyway, this has been working fine for years. Andres asked for it to be
added to CI, so maybe he sees more value on top of just "building with
musl"?
Best,
Wolfgang
Wolfgang Walther <walther@technowledgy.de> writes:
That is not the concern here. What I think Peter is worried about,
and certainly what I'm worried about, is that a breakage in
SanityCheck comprehensively breaks all CI testing for all Postgres
developers.
You'd have to commit a failing patch first to break CI for all other
developers.
No, what I'm more worried about is some change in the environment
causing the build to start failing. When that happens, it'd better
be an environment that many of us are familiar with and can test/fix.
regards, tom lane
Tom Lane:
You'd have to commit a failing patch first to break CI for all other
developers.No, what I'm more worried about is some change in the environment
causing the build to start failing. When that happens, it'd better
be an environment that many of us are familiar with and can test/fix.
The way I understand how this work is, that the images for the VMs in
which those CI tasks run, are not just dynamically updated - but are
actually tested before they are used in CI. So the environment doesn't
just change suddenly.
See e.g. [1]https://github.com/anarazel/pg-vm-images/pull/91 for a pull request to the repo containing those images to
update the linux debian image from bullseye to bookworm. This is exactly
the image we're talking about. Before this image is used in postgres CI,
it's tested and confirmed that it actually works there. If one of the
jobs was using musl - that would be tested as well. So this job would
not just suddenly start failing for everybody.
I do see the "familiarity" argument for the SanityCheck task, but for a
different reason: Even though it's unlikely for this job to fail for
musl specific reasons - if you're not familiar with musl and can't
easily test it locally, you might not be able to tell immediately
whether it's musl specific or not. If musl was run in one of the later
jobs, it's much different: You see all tests failing - alright, not musl
specific. You see only the musl test failing - yeah, musl problem. This
should give developers much more confidence looking at the results.
Best,
Wolfgang