Speed up clean meson builds by ~25%
Building the generated ecpg preproc file can take a long time. You can
check how long using:
ninja -C build src/interfaces/ecpg/preproc/ecpg.p/meson-generated_.._preproc.c.o
This moves that file much closer to the top of our build order, so
building it can be pipelined much better with other files.
It improved clean build times on my machine (10 cores/20 threads) from ~40
seconds to ~30 seconds.
You can check improvements for yourself with:
ninja -C build clean && ninja -C build all
Attachments:
v1-0001-Speed-up-clean-parallel-meson-builds-a-lot.patchapplication/octet-stream; name=v1-0001-Speed-up-clean-parallel-meson-builds-a-lot.patchDownload+7-5
On Fri, 5 Apr 2024 at 00:45, Jelte Fennema-Nio <me@jeltef.nl> wrote:
It improved clean build times on my machine (10 cores/20 threads) from ~40
seconds to ~30 seconds.
After discussing this off-list with Bilal, I realized that this gain
is only happening for clang builds on my system. Because those take a
long time as was also recently discussed in[1]/messages/by-id/CA+hUKGLvJ7-=fS-J9kN=aZWrpyqykwqCBbxXLEhUa9831dPFcg@mail.gmail.com. My builds don't take
nearly as long though. I tried with clang 15 through 18 and they all
took 10-22 seconds to run and clang comes from apt.llvm.org on Ubuntu
22.04
[1]: /messages/by-id/CA+hUKGLvJ7-=fS-J9kN=aZWrpyqykwqCBbxXLEhUa9831dPFcg@mail.gmail.com
Hi,
On 2024-04-05 15:36:34 +0200, Jelte Fennema-Nio wrote:
On Fri, 5 Apr 2024 at 00:45, Jelte Fennema-Nio <me@jeltef.nl> wrote:
It improved clean build times on my machine (10 cores/20 threads) from ~40
seconds to ~30 seconds.After discussing this off-list with Bilal, I realized that this gain
is only happening for clang builds on my system. Because those take a
long time as was also recently discussed in[1]. My builds don't take
nearly as long though. I tried with clang 15 through 18 and they all
took 10-22 seconds to run and clang comes from apt.llvm.org on Ubuntu
22.04[1]: /messages/by-id/CA+hUKGLvJ7-=fS-J9kN=aZWrpyqykwqCBbxXLEhUa9831dPFcg@mail.gmail.com
I recommend opening a bug report for clang, best with an already preprocessed
input file.
We're going to need to do something about this from our side as well, I
suspect. The times aren't great with gcc either, even if not as bad as with
clang.
Greetings,
Andres Freund
On Fri, 5 Apr 2024 at 17:24, Andres Freund <andres@anarazel.de> wrote:
I recommend opening a bug report for clang, best with an already preprocessed
input file.
We're going to need to do something about this from our side as well, I
suspect. The times aren't great with gcc either, even if not as bad as with
clang.
Agreed. While not a full solution, I think this patch is still good to
address some of the pain: Waiting 10 seconds at the end of my build
with only 1 of my 10 cores doing anything.
So while this doesn't decrease CPU time spent it does decrease
wall-clock time significantly in some cases, with afaict no downsides.
On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
Agreed. While not a full solution, I think this patch is still good to
address some of the pain: Waiting 10 seconds at the end of my build
with only 1 of my 10 cores doing anything.So while this doesn't decrease CPU time spent it does decrease
wall-clock time significantly in some cases, with afaict no downsides.
Well, this is also painful with ./configure. So, even if we are going
to move away from it at this point, we still need to support it for a
couple of years. It looks to me that letting the clang folks know
about the situation is the best way forward.
--
Michael
Hello Michael,
08.04.2024 08:23, Michael Paquier wrote:
On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
Agreed. While not a full solution, I think this patch is still good to
address some of the pain: Waiting 10 seconds at the end of my build
with only 1 of my 10 cores doing anything.So while this doesn't decrease CPU time spent it does decrease
wall-clock time significantly in some cases, with afaict no downsides.Well, this is also painful with ./configure. So, even if we are going
to move away from it at this point, we still need to support it for a
couple of years. It looks to me that letting the clang folks know
about the situation is the best way forward.
As I wrote in [1]/messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com, I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...
[1]: /messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com
Best regards,
Alexander
On 05.04.24 18:19, Jelte Fennema-Nio wrote:
On Fri, 5 Apr 2024 at 17:24, Andres Freund <andres@anarazel.de> wrote:
I recommend opening a bug report for clang, best with an already preprocessed
input file.We're going to need to do something about this from our side as well, I
suspect. The times aren't great with gcc either, even if not as bad as with
clang.Agreed. While not a full solution, I think this patch is still good to
address some of the pain: Waiting 10 seconds at the end of my build
with only 1 of my 10 cores doing anything.So while this doesn't decrease CPU time spent it does decrease
wall-clock time significantly in some cases, with afaict no downsides.
I have tested this with various compilers, and I can confirm that this
shaves off about 5 seconds from the build wall-clock time, which
represents about 10%-20% of the total time. I think this is a good patch.
Interestingly, if I apply the analogous changes to the makefiles, I
don't get any significant improvements. (Depends on local
circumstances, of course.) But I would still suggest to keep the
makefiles aligned.
On Mon, 8 Apr 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...
Using the attached script I got these timings. Clang is significantly
slower in all of them. But especially with -Og the difference between
is huge.
gcc 11.4.0: 7.276s
clang 18.1.3: 17.216s
gcc 11.4.0 --debug: 7.441s
clang 18.1.3 --debug: 18.164s
gcc 11.4.0 --debug -Og: 2.418s
clang 18.1.3 --debug -Og: 14.864s
I reported this same issue to the LLVM project here:
https://github.com/llvm/llvm-project/issues/87973
Attachments:
compile.sh.txttext/plain; charset=US-ASCII; name=compile.sh.txtDownload
On Mon, 8 Apr 2024 at 10:02, Peter Eisentraut <peter@eisentraut.org> wrote:
I have tested this with various compilers, and I can confirm that this
shaves off about 5 seconds from the build wall-clock time, which
represents about 10%-20% of the total time. I think this is a good patch.
Great to hear.
Interestingly, if I apply the analogous changes to the makefiles, I
don't get any significant improvements. (Depends on local
circumstances, of course.) But I would still suggest to keep the
makefiles aligned.
Attached is a patch that also updates the Makefiles, but indeed I
don't get any perf gains there either.
On Mon, 8 Apr 2024 at 07:23, Michael Paquier <michael@paquier.xyz> wrote:
Well, this is also painful with ./configure. So, even if we are going
to move away from it at this point, we still need to support it for a
couple of years. It looks to me that letting the clang folks know
about the situation is the best way forward.
I reported the issue to the clang folks:
https://github.com/llvm/llvm-project/issues/87973
But even if my patch doesn't help for ./configure, it still seems like
a good idea to me to still reduce compile times when using meson while
we wait for clang folks to address the issue.
Attachments:
v2-0001-Speed-up-clean-parallel-meson-builds-a-lot.patchapplication/octet-stream; name=v2-0001-Speed-up-clean-parallel-meson-builds-a-lot.patchDownload+10-8
Hello Jelte,
08.04.2024 11:36, Jelte Fennema-Nio wrote:
On Mon, 8 Apr 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...Using the attached script I got these timings. Clang is significantly
slower in all of them. But especially with -Og the difference between
is huge.gcc 11.4.0: 7.276s
clang 18.1.3: 17.216s
gcc 11.4.0 --debug: 7.441s
clang 18.1.3 --debug: 18.164s
gcc 11.4.0 --debug -Og: 2.418s
clang 18.1.3 --debug -Og: 14.864sI reported this same issue to the LLVM project here:
https://github.com/llvm/llvm-project/issues/87973
Maybe we're talking about different problems.
At [1] Thomas (and then I) was unhappy with more than 200 seconds
duration...
/messages/by-id/CA+hUKGLvJ7-=fS-J9kN=aZWrpyqykwqCBbxXLEhUa9831dPFcg@mail.gmail.com
Best regards,
Alexander
Hi,
On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Michael,
08.04.2024 08:23, Michael Paquier wrote:
On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
Agreed. While not a full solution, I think this patch is still good to
address some of the pain: Waiting 10 seconds at the end of my build
with only 1 of my 10 cores doing anything.So while this doesn't decrease CPU time spent it does decrease
wall-clock time significantly in some cases, with afaict no downsides.Well, this is also painful with ./configure. So, even if we are going
to move away from it at this point, we still need to support it for a
couple of years. It looks to me that letting the clang folks know
about the situation is the best way forward.As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...[1] /messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com
I had this problem on my local computer. My build times are:
gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25s
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Mon, Apr 08, 2024 at 12:23:56PM +0300, Nazir Bilal Yavuz wrote:
On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin <exclusion@gmail.com> wrote:
As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...[1] /messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com
I had this problem on my local computer. My build times are:
gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25s
Interesting. A parallel build of ecpg shows similar numbers here:
clang-16: 101s
clang-17: 112s
clang-18: 14s
gcc: 10s
Most of the time is still spent on preproc.c with clang-18, but that's
much, much faster (default version of clang is 16 on Debian GID where
I've run these numbers).
--
Michael
On Tue, Apr 9, 2024 at 5:01 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 08, 2024 at 12:23:56PM +0300, Nazir Bilal Yavuz wrote:
On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin <exclusion@gmail.com> wrote:
As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...[1] /messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com
I had this problem on my local computer. My build times are:
gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25sInteresting. A parallel build of ecpg shows similar numbers here:
clang-16: 101s
clang-17: 112s
clang-18: 14s
gcc: 10s
I don't expect it to get fixed BTW, because it's present in 16.0.6,
and .6 is the terminal release, if I understand their system
correctly. They're currently only doing bug fixes for 18, and even
there not for much longer. Interesting that not everyone saw this at
first, perhaps the bug arrived in a minor release that some people
didn't have yet? Or perhaps there is something special required to
trigger it?
Hi,
On 2024-04-09 17:13:52 +1200, Thomas Munro wrote:
On Tue, Apr 9, 2024 at 5:01 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 08, 2024 at 12:23:56PM +0300, Nazir Bilal Yavuz wrote:
On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin <exclusion@gmail.com> wrote:
As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
is fixed already.
Perhaps it's worth rechecking...[1] /messages/by-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8@gmail.com
I had this problem on my local computer. My build times are:
gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25sInteresting. A parallel build of ecpg shows similar numbers here:
clang-16: 101s
clang-17: 112s
clang-18: 14s
gcc: 10sI don't expect it to get fixed BTW, because it's present in 16.0.6,
and .6 is the terminal release, if I understand their system
correctly. They're currently only doing bug fixes for 18, and even
there not for much longer. Interesting that not everyone saw this at
first, perhaps the bug arrived in a minor release that some people
didn't have yet? Or perhaps there is something special required to
trigger it?
I think we need to do something about the compile time of this file, even with
gcc. Our main grammar already is an issue and stacking all the ecpg stuff on
top makes it considerably worse.
ISTM there's a bunch of pretty pointless stuff in the generated preproc.y,
which do seem to have some impact on compile time. E.g. a good bit of the file
is just stuff like
reserved_keyword:
ALL
{
$$ = mm_strdup("all");
}
...
Why are strduping all of these? We could instead just use the value of the
token, instead of forcing the compiler to generate branches for all individual
keywords etc.
I don't know off-hand if the keyword lookup machinery ends up with an
uppercase keyword, but if so, that'd be easy enough to change.
It actually looks to me like the many calls to mm_strdup() might actually be
what's driving clang nuts. I hacked up preproc.y to not need those calls for
unreserved_keyword
col_name_keyword
type_func_name_keyword
reserved_keyword
bare_label_keyword
by removing the actions and defining those tokens to be of type str. There are
many more such calls that could be dealt with similarly.
That alone reduced compile times with
clang-16 -O1 from 18.268s to 12.516s
clang-16 -O2 from 345.188 to 158.084s
clang-19 -O2 from 26.018s to 15.200s
I suspect what is happening is that clang tries to optimize the number of
calls to mm_strdup(), by separating the argument setup from the function
call. Which leads to a control flow graph with *many* incoming edges to the
basic block containing the function call to mm_strdup(), triggering a normally
harmless O(N^2) or such.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I think we need to do something about the compile time of this file, even with
gcc. Our main grammar already is an issue and stacking all the ecpg stuff on
top makes it considerably worse.
Seems reasonable, if we can.
Why are strduping all of these?
IIRC, the issue is that the mechanism for concatenating the tokens
back together frees the input strings
static char *
cat2_str(char *str1, char *str2)
{
char * res_str = (char *)mm_alloc(strlen(str1) + strlen(str2) + 2);
strcpy(res_str, str1);
if (strlen(str1) != 0 && strlen(str2) != 0)
strcat(res_str, " ");
strcat(res_str, str2);
free(str1); <------------------
free(str2); <------------------
return res_str;
}
So that ought to dump core if you don't make all the productions
return malloc'd strings. How did you work around that?
(Maybe it'd be okay to just leak all the strings?)
regards, tom lane
Hi,
On 2024-04-09 15:33:10 -0700, Andres Freund wrote:
Which leads to a control flow graph with *many* incoming edges to the basic
block containing the function call to mm_strdup(), triggering a normally
harmless O(N^2) or such.
With clang-16 -O2 there is a basic block with 3904 incoming basic blocks. With
the hacked up preproc.y it's 2968. A 30% increase leading to a doubling of
runtime imo seems consistent with my theory of there being some ~quadratic
behaviour.
I suspect that this is also what's causing gram.c compilation to be fairly
slow, with both clang and gcc. There aren't as many pstrdup()s in gram.y as
the are mm_strdup() in preproc.y, but there still are many.
ISTM that there are many pstrdup()s that really make very little sense. I
think it's largely because there are many rules declared %type <str>, which
prevents us from returning a string constant without a warning.
There may be (?) some rules that modify strings returned by subsidiary rules,
but if so, it can't be many.
Greetings,
Andres
Hi,
On 2024-04-09 19:00:41 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I think we need to do something about the compile time of this file, even with
gcc. Our main grammar already is an issue and stacking all the ecpg stuff on
top makes it considerably worse.Seems reasonable, if we can.
Why are strduping all of these?
IIRC, the issue is that the mechanism for concatenating the tokens
back together frees the input strings
Ah, that explains it - but also seems somewhat unnecessary.
So that ought to dump core if you don't make all the productions
return malloc'd strings. How did you work around that?
I just tried to get to the point of understanding the reasons for slow
compilation, not to actually keep it working :). I.e. I didn't.
(Maybe it'd be okay to just leak all the strings?)
Hm. The input to ecpg can be fairly large, I guess. And we have fun code like
cat_str(), which afaict is O(arguments^2) in its memory usage if we wouldn't
free?
Not immediately sure what the right path is.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2024-04-09 19:00:41 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Why are strduping all of these?
IIRC, the issue is that the mechanism for concatenating the tokens
back together frees the input strings
Ah, that explains it - but also seems somewhat unnecessary.
I experimented with replacing mm_strdup() with
#define mm_strdup(x) (x)
As you did, I wasn't trying to get to a working result, so I didn't do
anything about removing all the free's or fixing the cast-away-const
warnings. The result was disappointing though. On my Mac laptop
(Apple clang version 15.0.0), the compile time for preproc.o went from
6.7sec to 5.5sec. Which is better, but not enough better to persuade
me to do all the janitorial work of restructuring ecpg's
string-slinging. I think we haven't really identified the problem.
As a comparison point, compiling gram.o on the same machine
takes 1.3sec. So I am seeing a problem here, sure enough,
although not as bad as it is in some other clang versions.
regards, tom lane
Hi,
On 2024-04-09 19:44:03 -0400, Tom Lane wrote:
I experimented with replacing mm_strdup() with
#define mm_strdup(x) (x)
As you did, I wasn't trying to get to a working result, so I didn't do
anything about removing all the free's or fixing the cast-away-const
warnings. The result was disappointing though. On my Mac laptop
(Apple clang version 15.0.0), the compile time for preproc.o went from
6.7sec to 5.5sec. Which is better, but not enough better to persuade
me to do all the janitorial work of restructuring ecpg's
string-slinging. I think we haven't really identified the problem.
With what level of optimization was that? It kinda looks like their version
might be from before the worst of the issue...
FWIW, just redefining mm_strdup() that way doesn't help much here either,
even with an affected compiler. The gain increases substantially after
simplifying unreserved_keyword etc to just use the default action.
I think having the non-default actions for those branches leaves you with a
similar issue, each of the actions just set a register, storing that and going
to the loop iteration is the same.
FWIW:
clang-19 -O2
"plain" 0m24.354s
mm_strdup redefined 0m23.741s
+use default action 0m14.218s
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2024-04-09 19:44:03 -0400, Tom Lane wrote:
As you did, I wasn't trying to get to a working result, so I didn't do
anything about removing all the free's or fixing the cast-away-const
warnings. The result was disappointing though. On my Mac laptop
(Apple clang version 15.0.0), the compile time for preproc.o went from
6.7sec to 5.5sec. Which is better, but not enough better to persuade
me to do all the janitorial work of restructuring ecpg's
string-slinging. I think we haven't really identified the problem.
With what level of optimization was that? It kinda looks like their version
might be from before the worst of the issue...
Just the autoconf-default -O2.
FWIW, just redefining mm_strdup() that way doesn't help much here either,
even with an affected compiler. The gain increases substantially after
simplifying unreserved_keyword etc to just use the default action.
Hm.
In any case, this is all moot unless we can come to a new design for
how ecpg does its string-mashing. Thoughts?
I thought for a bit about not allocating strings as such, but just
passing around pointers into the source text plus lengths, and
reassembling the string data only at the end when we need to output it.
Not sure how well that would work, but it could be a starting point.
regards, tom lane