Recent failures on buildfarm member hornet

Started by Tom Laneover 5 years ago12 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

hornet has failed its last five runs with

2020-10-05 22:45:42.784 UTC [34734498:40] pg_regress/create_aggregate LOG: statement: create aggregate my_percentile_disc(float8 ORDER BY anyelement) (
stype = internal,
sfunc = ordered_set_transition,
finalfunc = percentile_disc_final,
finalfunc_extra = true,
finalfunc_modify = read_write
);
TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498)

After looking at the commits immediately preceding the first failure, and
digging around in the aggregate-related code, it seems like commit
cc99baa43 (Improve pg_list.h's linitial(), lsecond() and co macros)
must've broke it somehow. The nearest thing that I can see to a theory
is that where DefineAggregate does
numDirectArgs = intVal(lsecond(args));
it's coming out with the wrong result, leading to a failure of the
numDirectArgs-vs-numArgs sanity check in AggregateCreate. But how could
that be? I hesitate to blame the compiler twice in one week. OTOH, it's
a not-very-mainstream compiler on a not-very-mainstream architecture.

Noah, can you poke into this in a little more detail and try to verify
what is happening?

regards, tom lane

#2Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#1)
Re: Recent failures on buildfarm member hornet

On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote:

hornet has failed its last five runs with

2020-10-05 22:45:42.784 UTC [34734498:40] pg_regress/create_aggregate LOG: statement: create aggregate my_percentile_disc(float8 ORDER BY anyelement) (
stype = internal,
sfunc = ordered_set_transition,
finalfunc = percentile_disc_final,
finalfunc_extra = true,
finalfunc_modify = read_write
);
TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498)

After looking at the commits immediately preceding the first failure, and
digging around in the aggregate-related code, it seems like commit
cc99baa43 (Improve pg_list.h's linitial(), lsecond() and co macros)
must've broke it somehow. The nearest thing that I can see to a theory
is that where DefineAggregate does
numDirectArgs = intVal(lsecond(args));
it's coming out with the wrong result, leading to a failure of the
numDirectArgs-vs-numArgs sanity check in AggregateCreate.

Building the tree with -O0 suppresses the problem. (xlc does not have -O1.)
Building just aggregatecmds.c and pg_aggregate.c that way does not, so I
suppose the damage arose elsewhere.

But how could
that be? I hesitate to blame the compiler twice in one week. OTOH, it's
a not-very-mainstream compiler on a not-very-mainstream architecture.

A compiler bug is plausible. The compiler is eight years old, and we're not
seeing the problem on 32-bit (mandrill) or on 2019-vintage xlc (hoverfly).
Shall I make this animal use -O0 on v14+?

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#2)
Re: Recent failures on buildfarm member hornet

Noah Misch <noah@leadboat.com> writes:

On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote:

hornet has failed its last five runs with
TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498)

Building the tree with -O0 suppresses the problem. (xlc does not have -O1.)

OK, that's pretty unsurprising.

Building just aggregatecmds.c and pg_aggregate.c that way does not, so I
suppose the damage arose elsewhere.

Now that *is* surprising. Could you poke a little further to determine
which module is getting miscompiled? (gram.o seems like the next most
likely bet.)

regards, tom lane

#4Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#3)
Re: Recent failures on buildfarm member hornet

On Tue, Oct 06, 2020 at 09:56:49PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Mon, Oct 05, 2020 at 08:42:15PM -0400, Tom Lane wrote:

hornet has failed its last five runs with
TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", Line: 216, PID: 34734498)

Building the tree with -O0 suppresses the problem. (xlc does not have -O1.)

OK, that's pretty unsurprising.

Building just aggregatecmds.c and pg_aggregate.c that way does not, so I
suppose the damage arose elsewhere.

Now that *is* surprising. Could you poke a little further to determine
which module is getting miscompiled? (gram.o seems like the next most
likely bet.)

gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o,
but those seem far less likely.)

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#4)
Re: Recent failures on buildfarm member hornet

Noah Misch <noah@leadboat.com> writes:

On Tue, Oct 06, 2020 at 09:56:49PM -0400, Tom Lane wrote:

Now that *is* surprising. Could you poke a little further to determine
which module is getting miscompiled? (gram.o seems like the next most
likely bet.)

gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o,
but those seem far less likely.)

This suggests that the problem is misoptimization of gram.y's
makeOrderedSetArgs:

/* don't merge into the next line, as list_concat changes directargs */
ndirectargs = list_length(directargs);

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

I think that if the compiler did what the comment says not to, it'd match
this symptom. However, I'm baffled as to why our recent pg_list.h changes
would've affected that.

regards, tom lane

#6David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#5)
Re: Recent failures on buildfarm member hornet

Thanks for investigating this, Noah.

On Thu, 8 Oct 2020 at 06:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o,
but those seem far less likely.)

This suggests that the problem is misoptimization of gram.y's
makeOrderedSetArgs:

It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4.

David

#7Noah Misch
noah@leadboat.com
In reply to: David Rowley (#6)
Re: Recent failures on buildfarm member hornet

On Thu, Oct 08, 2020 at 09:15:12AM +1300, David Rowley wrote:

On Thu, 8 Oct 2020 at 06:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

gram.o was it. (The way I rebuilt gram.o also rebuilt parser.o and scan.o,
but those seem far less likely.)

This suggests that the problem is misoptimization of gram.y's
makeOrderedSetArgs:

It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4.

Attached. Both generated like this:

(cd src/backend/parser && xlc_r -D_LARGE_FILES=1 -qnoansialias -g -O2 -S -qmaxmem=33554432 -I. -I. -I../../../src/include -c gram.c; )

(Again, the compiler is eight years old, and we're not seeing the problem on
32-bit (mandrill) or on 2019-vintage xlc (hoverfly). As soon as the situation
no longer piques your curiosity, I'm happy to make hoverfly use -O0 on v14+.)

Attachments:

gram-cc99baa~1.s.gzapplication/x-gunzipDownload+4-6
gram-cc99baa.s.gzapplication/x-gunzipDownload+4-1
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#7)
Re: Recent failures on buildfarm member hornet

Noah Misch <noah@leadboat.com> writes:

On Thu, Oct 08, 2020 at 09:15:12AM +1300, David Rowley wrote:

It would be interesting to see gram.s from both cc99baa4~1 and cc99baa4.

Attached. Both generated like this:

Hm. I'm too lazy to go bone up on PPC64 ABI conventions, but this does
look suspiciously like the compiler is doing what I feared:

GOOD:

lwa r31,4(r27) # fetching list_length(directargs) ?
.line 16295
ori r3,r27,0x0000
bl .list_concat{PR}
ori r0,r0,0x0000
std r3,112(SP)
.line 16295
extsw r3,r31 # ... and passing it to makeInteger
bl .makeInteger{PR}
ori r0,r0,0x0000

BAD:

ori r3,r31,0x0000
bl .list_concat{PR}
ori r0,r0,0x0000
std r3,112(SP)
.line 16288
lwa r3,4(r31) # fetching list_length(directargs) ?
bl .makeInteger{PR}
ori r0,r0,0x0000

(I'm confused about why the line numbers don't match up, since cc99baa4
did not touch gram.y. But whatever.)

I'm tempted to propose the attached small code rearrangement, which
might dissuade the compiler from thinking it can get away with this.
While I concur with your point that an old xlc version might not be
that exciting, there could be other compilers doing the same thing
in the future.

regards, tom lane

Attachments:

rearrange-makeOrderedSetArgs.patchtext/x-diff; charset=us-ascii; name=rearrange-makeOrderedSetArgs.patchDownload+3-3
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: Recent failures on buildfarm member hornet

I wrote:

I'm tempted to propose the attached small code rearrangement, which
might dissuade the compiler from thinking it can get away with this.
While I concur with your point that an old xlc version might not be
that exciting, there could be other compilers doing the same thing
in the future.

After thinking about it a bit more, I'm not even convinced that what
xlc seems to be doing is illegal per C spec. There are no sequence
points within

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

and therefore there's an argument to be made that the compiler
doesn't have to care whether any side-effects of list_concat() occur
before or after the evaluation of makeInteger(ndirectargs).
If the potential side-effects of list_concat() can be disregarded
until the end of this statement, then the code change is perfectly legal.

Maybe some very careful language-lawyering could prove differently,
but it's not as open-and-shut as one could wish. So now I'm thinking
that we need this patch anyway, xlc or not.

regards, tom lane

#10Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#9)
Re: Recent failures on buildfarm member hornet

On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote:

I wrote:

I'm tempted to propose the attached small code rearrangement, which
might dissuade the compiler from thinking it can get away with this.

That patch does get hornet's -O2 build to again pass "make check". It doesn't
harm the code, so we may as well use it.

While I concur with your point that an old xlc version might not be
that exciting, there could be other compilers doing the same thing
in the future.

After thinking about it a bit more, I'm not even convinced that what
xlc seems to be doing is illegal per C spec. There are no sequence
points within

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

There is, however, a sequence point between list_length(directargs) and
list_concat(), and the problem arises because xlc reorders those two. It's
true that makeInteger() could run before or after list_concat(), but that
alone would not have been a problem.

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#10)
Re: Recent failures on buildfarm member hornet

Noah Misch <noah@leadboat.com> writes:

On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote:

After thinking about it a bit more, I'm not even convinced that what
xlc seems to be doing is illegal per C spec. There are no sequence
points within

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

There is, however, a sequence point between list_length(directargs) and
list_concat(), and the problem arises because xlc reorders those two. It's
true that makeInteger() could run before or after list_concat(), but that
alone would not have been a problem.

Yeah, that is the theory on which the existing code is built,
specifically that the list_length fetch must occur before list_concat
runs. What I am wondering about is a more aggressive interpretation of
"sequence point", namely that the compiler is free to disregard exactly
when list_concat's side-effects occur between this statement's sequence
points. I'm not sure that the C spec allows that interpretation, but
I'm not sure it doesn't, either.

regards, tom lane

#12Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#11)
Re: Recent failures on buildfarm member hornet

On Wed, Oct 07, 2020 at 06:22:04PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Wed, Oct 07, 2020 at 06:03:16PM -0400, Tom Lane wrote:

After thinking about it a bit more, I'm not even convinced that what
xlc seems to be doing is illegal per C spec. There are no sequence
points within

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

There is, however, a sequence point between list_length(directargs) and
list_concat(), and the problem arises because xlc reorders those two. It's
true that makeInteger() could run before or after list_concat(), but that
alone would not have been a problem.

Yeah, that is the theory on which the existing code is built,
specifically that the list_length fetch must occur before list_concat
runs. What I am wondering about is a more aggressive interpretation of
"sequence point", namely that the compiler is free to disregard exactly
when list_concat's side-effects occur between this statement's sequence
points. I'm not sure that the C spec allows that interpretation, but
I'm not sure it doesn't, either.

C does not allow that, because the list_length() happens in a different "full
expression" (different statement):

ndirectargs = list_length(directargs);/* noah adds: a sequence point is here */

return list_make2(list_concat(directargs, orderedargs),
makeInteger(ndirectargs));

A compiler may implement like this:

ndirectargs = list_length(directargs);
a := list_concat(directargs, orderedargs);
b := makeInteger(ndirectargs);
return list_make2(a, b);

Or this:

ndirectargs = list_length(directargs);
b := makeInteger(ndirectargs);
a := list_concat(directargs, orderedargs);
return list_make2(a, b);

But not like this:

a := list_concat(directargs, orderedargs);
ndirectargs = list_length(directargs);
b := makeInteger(ndirectargs);
return list_make2(a, b);