Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Started by Michael Paquierover 4 years ago15 messages
#1Michael Paquier
michael@paquier.xyz

Hi all,

serinus has been complaining about the new gcd functions in 13~:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2021-06-03%2003%3A44%3A14

The overflow detection is going wrong the way up and down, like here:
 SELECT gcd((-9223372036854775808)::int8, (-9223372036854775808)::int8); -- overflow
-ERROR:  bigint out of range
+         gcd
+----------------------
+ -9223372036854775808
+(1 row)

That seems like a compiler bug to me as this host uses recent GCC
snapshots, and I cannot see a problem in GCC 10.2 on my own dev box.
But perhaps I am missing something?

Thanks,
--
Michael

#2Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#1)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

On Thu, 3 Jun 2021 at 08:26, Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

serinus has been complaining about the new gcd functions in 13~:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&amp;dt=2021-06-03%2003%3A44%3A14

The overflow detection is going wrong the way up and down, like here:
SELECT gcd((-9223372036854775808)::int8, (-9223372036854775808)::int8); -- overflow
-ERROR:  bigint out of range
+         gcd
+----------------------
+ -9223372036854775808
+(1 row)

That seems like a compiler bug to me as this host uses recent GCC
snapshots, and I cannot see a problem in GCC 10.2 on my own dev box.
But perhaps I am missing something?

Huh, yeah. The code is pretty clear that that should throw an error:

if (arg1 == PG_INT64_MIN)
{
if (arg2 == 0 || arg2 == PG_INT64_MIN)
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("bigint out of range")));

and FWIW it works OK on my dev box with gcc 10.2.1 and the same cflags.

Regards,
Dean

In reply to: Michael Paquier (#1)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Hello

I build gcc version 12.0.0 20210603 (experimental) locally, then tried REL_13_STABLE with same CFLAGS as serinus
./configure --prefix=/home/melkij/tmp/pgdev/inst CFLAGS="-O3 -ggdb -g3 -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers" --enable-tap-tests --enable-cassert --enable-debug
check-world passed.

regards, Sergei

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#1)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Michael Paquier <michael@paquier.xyz> writes:

serinus has been complaining about the new gcd functions in 13~:

moonjelly, which also runs a bleeding-edge gcc, started to fail the same
way at about the same time. Given that none of our code in that area
has changed, it's hard to think it's anything but a broken compiler.
Maybe somebody should report that to gcc upstream?

regards, tom lane

#5Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#4)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

serinus has been complaining about the new gcd functions in 13~:

moonjelly, which also runs a bleeding-edge gcc, started to fail the same
way at about the same time. Given that none of our code in that area
has changed, it's hard to think it's anything but a broken compiler.

Maybe somebody should report that to gcc upstream?

Yes.

I will isolate a small case (hopefully) and do a report over week-end,
after checking that the latest version is still broken.

--
Fabien.

#6Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#5)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

serinus has been complaining about the new gcd functions in 13~:

moonjelly, which also runs a bleeding-edge gcc, started to fail the same
way at about the same time. Given that none of our code in that area
has changed, it's hard to think it's anything but a broken compiler.

Maybe somebody should report that to gcc upstream?

Yes.

I will isolate a small case (hopefully) and do a report over week-end, after
checking that the latest version is still broken.

Not needed in the end, the problem has disappeared with today's
gcc recompilation.

--
Fabien.

#7Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#1)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

On 2021-Jun-03, Michael Paquier wrote:

Hi all,

serinus has been complaining about the new gcd functions in 13~:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&amp;dt=2021-06-03%2003%3A44%3A14

Hello, this problem is still happening; serinus' configure output says
it's running a snapshot from 20210527, and Fabien mentioned downthread
that his compiler stopped complaining on 2021-06-05. Andres, maybe the
compiler in serinus is due for an update too?

Thanks

--
�lvaro Herrera 39�49'30"S 73�17'W
"Once again, thank you and all of the developers for your hard work on
PostgreSQL. This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#7)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Hello, this problem is still happening; serinus' configure output says
it's running a snapshot from 20210527, and Fabien mentioned downthread
that his compiler stopped complaining on 2021-06-05. Andres, maybe the
compiler in serinus is due for an update too?

Yeah, serinus is visibly still running one of the broken revisions:

configure: using compiler=gcc (Debian 20210527-1) 12.0.0 20210527 (experimental) [master revision 262e75d22c3:7bb6b9b2f47:9d3a953ec4d2695e9a6bfa5f22655e2aea47a973]

It'd sure be nice if seawasp stopped spamming the buildfarm failure log,
too. That seems to be a different issue:

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f3827947859 in __GI_abort () at abort.c:79
#2 0x00007f3827947729 in __assert_fail_base (fmt=0x7f3827add588 "%s%s%s:%u: %s%sAssertion `%s' failed.\\n%n", assertion=0x7f381c3ce4c8 "S->getValue() && \\"Releasing SymbolStringPtr with zero ref count\\"", file=0x7f381c3ce478 "/home/fabien/llvm-src/llvm/include/llvm/ExecutionEngine/Orc/SymbolStringPool.h", line=91, function=<optimized out>) at assert.c:92
#3 0x00007f3827958f36 in __GI___assert_fail (assertion=0x7f381c3ce4c8 "S->getValue() && \\"Releasing SymbolStringPtr with zero ref count\\"", file=0x7f381c3ce478 "/home/fabien/llvm-src/llvm/include/llvm/ExecutionEngine/Orc/SymbolStringPool.h", line=91, function=0x7f381c3ce570 "llvm::orc::SymbolStringPtr::~SymbolStringPtr()") at assert.c:101
#4 0x00007f381c23c98d in llvm::orc::SymbolStringPtr::~SymbolStringPtr (this=0x277a8b0, __in_chrg=<optimized out>) at /home/fabien/llvm-src/llvm/include/llvm/ExecutionEngine/Orc/SymbolStringPool.h:91
#5 0x00007f381c24f879 in std::_Destroy<llvm::orc::SymbolStringPtr> (__pointer=0x277a8b0) at /home/fabien/gcc-10-bin/include/c++/10.3.1/bits/stl_construct.h:140
#6 0x00007f381c24d10c in std::_Destroy_aux<false>::__destroy<llvm::orc::SymbolStringPtr*> (__first=0x277a8b0, __last=0x277a998) at /home/fabien/gcc-10-bin/include/c++/10.3.1/bits/stl_construct.h:152
#7 0x00007f381c2488a6 in std::_Destroy<llvm::orc::SymbolStringPtr*> (__first=0x277a8b0, __last=0x277a998) at /home/fabien/gcc-10-bin/include/c++/10.3.1/bits/stl_construct.h:185
#8 0x00007f381c243c51 in std::_Destroy<llvm::orc::SymbolStringPtr*, llvm::orc::SymbolStringPtr> (__first=0x277a8b0, __last=0x277a998) at /home/fabien/gcc-10-bin/include/c++/10.3.1/bits/alloc_traits.h:738
#9 0x00007f381c23f1c3 in std::vector<llvm::orc::SymbolStringPtr, std::allocator<llvm::orc::SymbolStringPtr> >::~vector (this=0x7ffc73440a10, __in_chrg=<optimized out>) at /home/fabien/gcc-10-bin/include/c++/10.3.1/bits/stl_vector.h:680
#10 0x00007f381c26112c in llvm::orc::JITDylib::removeTracker (this=0x18b4240, RT=...) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:1464
#11 0x00007f381c264cb9 in operator() (__closure=0x7ffc73440d00) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:2054
#12 0x00007f381c264d29 in llvm::orc::ExecutionSession::runSessionLocked<llvm::orc::ExecutionSession::removeResourceTracker(llvm::orc::ResourceTracker&)::<lambda()> >(struct {...} &&) (this=0x187d110, F=...) at /home/fabien/llvm-src/llvm/include/llvm/ExecutionEngine/Orc/Core.h:1291
#13 0x00007f381c264ebc in llvm::orc::ExecutionSession::removeResourceTracker (this=0x187d110, RT=...) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:2051
#14 0x00007f381c25734c in llvm::orc::ResourceTracker::remove (this=0x1910c30) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:53
#15 0x00007f381c25a9c1 in llvm::orc::JITDylib::clear (this=0x18b4240) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:622
#16 0x00007f381c26305e in llvm::orc::ExecutionSession::endSession (this=0x187d110) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/Core.cpp:1777
#17 0x00007f381c3373a3 in llvm::orc::LLJIT::~LLJIT (this=0x18a73b0, __in_chrg=<optimized out>) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/LLJIT.cpp:1002
#18 0x00007f381c38af48 in LLVMOrcDisposeLLJIT (J=0x18a73b0) at /home/fabien/llvm-src/llvm/lib/ExecutionEngine/Orc/OrcV2CBindings.cpp:561
#19 0x00007f381c596612 in llvm_shutdown (code=<optimized out>, arg=140722242323824) at llvmjit.c:892
#20 0x00000000007d4385 in proc_exit_prepare (code=code@entry=0) at ipc.c:209
#21 0x00000000007d4288 in proc_exit (code=code@entry=0) at ipc.c:107

regards, tom lane

#9Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#8)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

It'd sure be nice if seawasp stopped spamming the buildfarm failure log,
too.

There was a silent API breakage (same API, different behavior, how nice…)
in llvm main that Andres figured out, which will have to be fixed at some
point, so this is reminder that it is still a todo… Not sure when a fix is
planned, though. I'm afraid portability may require that different code is
executed depending on llvm version. Or maybe we should wrestle a revert on
llvm side? Hmmm…

So I'm not very confident that the noise will go away quickly, sorry.

--
Fabien.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#9)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Fabien COELHO <coelho@cri.ensmp.fr> writes:

It'd sure be nice if seawasp stopped spamming the buildfarm failure log,
too.

There was a silent API breakage (same API, different behavior, how nice…)
in llvm main that Andres figured out, which will have to be fixed at some
point, so this is reminder that it is still a todo…

If it were *our* todo, that would be one thing; but it isn't.

Not sure when a fix is
planned, though. I'm afraid portability may require that different code is
executed depending on llvm version. Or maybe we should wrestle a revert on
llvm side? Hmmm…

So I'm not very confident that the noise will go away quickly, sorry.

Could you please just shut down the animal until that's dealt with?
It's extremely unpleasant to have to root through a lot of useless
failures to find the ones that might be of interest. Right now
serinus and seawasp are degrading this report nearly to uselessness:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

regards, tom lane

#11Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#10)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Hello Tom,

So I'm not very confident that the noise will go away quickly, sorry.

Could you please just shut down the animal until that's dealt with?

Hmmm… Obviously I can.

However, please note that the underlying logic of "a test is failing,
let's just remove it" does not sound right to me at all:-(

The test is failing because there is a problem, and shuting down the test
to improve a report does not in any way help to fix it, it just helps to
hide it.

It's extremely unpleasant to have to root through a lot of useless
failures

I do not understand how they are useless. Pg does not work properly with
current LLVM, and keeps on not working. I think that this information is
worthy, even if I do not like it and would certainly prefer a quick fix.

to find the ones that might be of interest. Right now serinus and
seawasp are degrading this report nearly to uselessness:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

IMHO, the report should be improved, not the test removed.

If you insist I will shut down the animal, bit I'd prefer not to.

I think that the reminder has value, and just because some report is not
designed to handle this nicely does not seem like a good reason to do
that.

--
Fabien.

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#11)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Fabien COELHO <coelho@cri.ensmp.fr> writes:

Could you please just shut down the animal until that's dealt with?

The test is failing because there is a problem, and shuting down the test
to improve a report does not in any way help to fix it, it just helps to
hide it.

Our buildfarm is run for the use of the Postgres project, not the LLVM
project. I'm not really happy that it contains any experimental-compiler
animals at all, but as long as they're unobtrusive I can stand it.
serinus and seawasp are being the opposite of unobtrusive.

If you don't want to shut it down entirely, maybe backing it off to
run only once a week would be an acceptable compromise. Since you
only update its compiler version once a week, I doubt we learn much
from runs done more often than that anyway.

regards, tom lane

#13Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#12)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

Hello Tom,

Could you please just shut down the animal until that's dealt with?

The test is failing because there is a problem, and shuting down the test
to improve a report does not in any way help to fix it, it just helps to
hide it.

Our buildfarm is run for the use of the Postgres project, not the LLVM
project.

The point of these animals is to have early warning of upcoming compiler
changes. Given the release cycle of the project and the fact that a
version is expected to work for 5 years, this is a clear benefit for
postgres, IMO. When the compiler is broken, it is noisy, too bad.

In this instance the compiler is not broken, but postgres is.

If the consensus is that these animals are useless, I'll remove them, and
be sad that the community is not able to see their value.

I'm not really happy that it contains any experimental-compiler
animals at all, but as long as they're unobtrusive I can stand it.
serinus and seawasp are being the opposite of unobtrusive.

I think that the problem is the report, not the failing animal.

In French we say "ce n’est pas en cassant le thermomètre qu’on fait tomber
la fièvre", which is an equivalent of "don't shoot the messenger".

If you don't want to shut it down entirely, maybe backing it off to
run only once a week would be an acceptable compromise. Since you
only update its compiler version once a week, I doubt we learn much
from runs done more often than that anyway.

Hmmm… I can slow it down. We will wait one week to learn that the problems
have been fixed, wow.

<Sigh>.

--
Fabien.

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#10)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

On Sat, Jun 19, 2021 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fabien COELHO <coelho@cri.ensmp.fr> writes:

It'd sure be nice if seawasp stopped spamming the buildfarm failure log,
too.

There was a silent API breakage (same API, different behavior, how nice…)
in llvm main that Andres figured out, which will have to be fixed at some
point, so this is reminder that it is still a todo…

If it were *our* todo, that would be one thing; but it isn't.

Over on the other thread[1]/messages/by-id/CA+hUKGLEy8mgtN7BNp0ooFAjUedDTJj5dME7NxLU-m91b85siA@mail.gmail.com we learned that this is an API change
affecting reference counting semantics[2]https://github.com/llvm/llvm-project/commit/c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671, so unless there is some
discussion somewhere about reverting the LLVM change that I'm unaware
of, I'm guessing we're going to need to change our code sooner or
later. I have a bleeding edge LLVM on my dev machine, and I'm willing
to try to reproduce the crash and write the trivial patch (that is,
figure out the right preprocessor incantation to detect the version or
feature, and bump the reference count as appropriate), if Andres
and/or Fabien aren't already on the case.

[1]: /messages/by-id/CA+hUKGLEy8mgtN7BNp0ooFAjUedDTJj5dME7NxLU-m91b85siA@mail.gmail.com
[2]: https://github.com/llvm/llvm-project/commit/c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671

#15Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Thomas Munro (#14)
Re: Failures with gcd functions with GCC snapshots GCC and -O3 (?)

There was a silent API breakage (same API, different behavior, how nice…)
in llvm main that Andres figured out, which will have to be fixed at some
point, so this is reminder that it is still a todo…

If it were *our* todo, that would be one thing; but it isn't.

Over on the other thread[1] we learned that this is an API change
affecting reference counting semantics[2], so unless there is some
discussion somewhere about reverting the LLVM change that I'm unaware
of, I'm guessing we're going to need to change our code sooner or
later.

Indeed, I'm afraid the solution will have to be on pg side.

I have a bleeding edge LLVM on my dev machine, and I'm willing to try to
reproduce the crash and write the trivial patch (that is, figure out the
right preprocessor incantation to detect the version or feature, and
bump the reference count as appropriate), if Andres and/or Fabien aren't
already on the case.

I'm not in the case, I'm only the one running the farm animal which barks
too annoyingly for Tom.

--
Fabien.