012_subtransactions.pl vs clang -fsanitize=undefined

Started by Justin Pryzbyover 3 years ago5 messageshackers
Jump to latest
#1Justin Pryzby
pryzby@telsasoft.com

I found that -fsanitize causes the test to fail, going back to REL_10_STABLE,
for any clang in:

1:11.1.0-6
1:12.0.1-19ubuntu3
1:13.0.1-2ubuntu2
1:14.0.0-1ubuntu1

| time ./configure --enable-cassert --enable-debug --enable-tap-tests --with-CC=clang-13 CFLAGS='-fsanitize=undefined'
| time { make -j4 clean; make -j4; } >/dev/null
| time PROVE_TESTS=t/012_subtransactions.pl make -C ./src/test/recovery check
|
| t/012_subtransactions.pl .. 2/?
| # Failed test 'Visible'
| # at t/012_subtransactions.pl line 111.
| # got: '-1'
| # expected: '8128'
| ...
| # Looks like you failed 6 tests of 12.

I haven't found any combination of options which cause it to fail differently,
so I'm not even sure if the problem is in postgres, the test case, clang or
libubsan. Note that optimization seems to avoid the problem, which is why
"kestrel" shows no issue.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-07-23%2022%3A17%3A48

--
Justin

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#1)
Re: 012_subtransactions.pl vs clang -fsanitize=undefined

On Mon, Jul 25, 2022 at 8:59 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I found that -fsanitize causes the test to fail, going back to REL_10_STABLE,
for any clang in:

1:11.1.0-6
1:12.0.1-19ubuntu3
1:13.0.1-2ubuntu2
1:14.0.0-1ubuntu1

| time ./configure --enable-cassert --enable-debug --enable-tap-tests --with-CC=clang-13 CFLAGS='-fsanitize=undefined'
| time { make -j4 clean; make -j4; } >/dev/null
| time PROVE_TESTS=t/012_subtransactions.pl make -C ./src/test/recovery check
|
| t/012_subtransactions.pl .. 2/?
| # Failed test 'Visible'
| # at t/012_subtransactions.pl line 111.
| # got: '-1'
| # expected: '8128'
| ...
| # Looks like you failed 6 tests of 12.

I haven't found any combination of options which cause it to fail differently,
so I'm not even sure if the problem is in postgres, the test case, clang or
libubsan. Note that optimization seems to avoid the problem, which is why
"kestrel" shows no issue.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&amp;dt=2022-07-23%2022%3A17%3A48

Yeah I've seen this too... it'd be good to figure out how to fix it:

/messages/by-id/CA+hUKGLDA-GuQKRvDF3abHadDrrYZ33N9e4DEOGwKH3JqdYSCQ@mail.gmail.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#2)
Re: 012_subtransactions.pl vs clang -fsanitize=undefined

Thomas Munro <thomas.munro@gmail.com> writes:

On Mon, Jul 25, 2022 at 8:59 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I found that -fsanitize causes the test to fail, going back to REL_10_STABLE,
for any clang in:

Yeah I've seen this too... it'd be good to figure out how to fix it:
/messages/by-id/CA+hUKGLDA-GuQKRvDF3abHadDrrYZ33N9e4DEOGwKH3JqdYSCQ@mail.gmail.com

Yeah, reproduces here too with RHEL8's clang 13.0.1. I also see
that the failures are due to "stack depth exceeded" errors from that
recursive hs_subxids() function. As best I can tell, the stack depth
failure is entirely honest:

(gdb) p stack_base_ptr
$1 = 0x7ffd92032100 "\360[\\\006"
(gdb) p $sp
$2 = (void *) 0x7ffd91e305a0
(gdb) p 0x7ffd92032100 - 0x7ffd91e305a0
$3 = 2104160

I can get at most 82 recursion levels without failure.
With my normal build it can get to 678 levels before dying.

I think what's happening is just that this build configuration
eats stack extravagantly. Maybe it keeps all locals on the stack
instead of in registers? I'm too lazy to check out the assembly
code.

I thought for a moment of rewriting hs_subxids() to iterate instead
of recurse, but it doesn't look like that's tremendously feasible in
plpgsql --- the only way to make nested subtransactions is to recurse.
It could probably be done from the client by issuing a series of
SAVEPOINT commands, but not nearly as elegantly.

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: 012_subtransactions.pl vs clang -fsanitize=undefined

I wrote:

I think what's happening is just that this build configuration
eats stack extravagantly.

That's definitely it, but I don't entirely see why. Here are a
couple of major offenders though:

(gdb) x/8i ExecInterpExpr
0x11a5530 <ExecInterpExpr>: push %rbp
0x11a5531 <ExecInterpExpr+1>: mov %rsp,%rbp
0x11a5534 <ExecInterpExpr+4>: sub $0x2f40,%rsp
0x11a553b <ExecInterpExpr+11>: mov %rdi,-0x10(%rbp)
0x11a553f <ExecInterpExpr+15>: mov %rsi,-0x18(%rbp)
0x11a5543 <ExecInterpExpr+19>: mov %rdx,-0x20(%rbp)
0x11a5547 <ExecInterpExpr+23>: jmpq 0x11a554c <ExecInterpExpr+28>
0x11a554c <ExecInterpExpr+28>: cmpq $0x0,-0x10(%rbp)

(gdb) p 0x2f40
$51 = 12096

(gdb) x/8i ExecInitExprRec
0x11672e0 <ExecInitExprRec>: push %rbp
0x11672e1 <ExecInitExprRec+1>: mov %rsp,%rbp
0x11672e4 <ExecInitExprRec+4>: sub $0x3c80,%rsp
0x11672eb <ExecInitExprRec+11>: mov %rdi,-0x8(%rbp)
0x11672ef <ExecInitExprRec+15>: mov %rsi,-0x10(%rbp)
0x11672f3 <ExecInitExprRec+19>: mov %rdx,-0x18(%rbp)
0x11672f7 <ExecInitExprRec+23>: mov %rcx,-0x20(%rbp)
0x11672fb <ExecInitExprRec+27>: lea -0x60(%rbp),%rdi

(gdb) p 0x3c80
$52 = 15488

It looks like this build eats about 24K of stack per plpgsql recursion
level, of which ExecInterpExpr alone accounts for half. Why is that?
It has no large local variables, mostly just ints and pointers.
There are a lot of them, but even if you presume that each one gets
its own dedicated bit of the stack frame, it's hard to arrive at 12K.

I'd almost call this a compiler bug.

regards, tom lane

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#4)
Re: 012_subtransactions.pl vs clang -fsanitize=undefined

On Mon, Jul 25, 2022 at 10:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I think what's happening is just that this build configuration
eats stack extravagantly.

That's definitely it, but I don't entirely see why. Here are a
couple of major offenders though:

Interesting. I wonder where we can read about what stuff clang puts
on the stack to implement the undefined behaviour checker (and what
GCC does differently here), but today I will resist the urge to go
looking.

As for workarounds (and as a note for my future self next time I'm
testing with UBSan), this is enough for the test to pass on my dev box
(4MB is not enough):

--- a/src/test/recovery/t/012_subtransactions.pl
+++ b/src/test/recovery/t/012_subtransactions.pl
@@ -16,6 +16,7 @@ $node_primary->append_conf(
        'postgresql.conf', qq(
        max_prepared_transactions = 10
        log_checkpoints = true
+       max_stack_depth = 5MB
 ));

It's also possible to tell it to keep out of certain functions:

https://github.com/llvm/llvm-project/blob/main/clang/docs/UndefinedBehaviorSanitizer.rst#disabling-instrumentation-with-attribute-no-sanitize-undefined