[PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )
Hi folks
The attached patch series adds support for detecting coding errors where a
stack-allocated ErrorContextCallback is not popped from the
error_context_stack before the variable leaves scope.
With the attached patches this code will emit a backtrace and abort() on
cassert builds at the "return" statement where it leaks the stack pointer:
ErrorContextCallback cb pg_errcontext_check();
...
cb.previous = error_context_stack;
error_context_stack = &cb;
...
if (something)
return; /* whoops! */
...
error_context_stack = error_context_stack->previous;
as I discuss on this old blog article here:
https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/
Note the pgl_errcontext_check() in the above, which enables the check.
The infrastructure added by the patch could well be useful for other
similar validation steps. In fact, I've added a similar validator that
detects escapes from PG_TRY() blocks, so you get a neat assertion failure
at the point of escape instead of a crash at a __longjmp() with a
nonsensical stack. Example using a deliberately introduced mistake in
postgres_fdw as a test:
Without the PG_TRY() patch:
#0 __longjmp () at ../sysdeps/x86_64/__longjmp.S:111
#1 0xc35a5d9e1a902cfc in ?? ()
Backtrace stopped: Cannot access memory at address 0xc0ea5d9e1a902cfc
With the PG_TRY() patch (abbreviated):
#3 0x0000000000abbd4b in pg_try_check_return_guard
#4 0x00007f8e2e20e41a in fetch_more_data
#5 0x00007f8e2e20a80a in postgresIterateForeignScan
...
Note that valgrind's memcheck, gcc's -fstack-protector, etc are all quite
bad at detecting this class of error, so being able to do so automatically
should be nice. I've had an interesting time chasing down error context
stack mistakes a few times in the past
We unfortunately cannot use it for RAII-style coding because the underlying
__attribute__((callback(..))) is not available on MSVC. As usual. But it
remains useful for checking and assertions.
The patch is supplied as a small series:
(1) Add some documentation on error context callback usage
(2) Add pg_errcontext_check() to detect ErrorContextCallback escapes
(3) Enable pg_errcontext_check() for all in-tree users
(4) Document that generally PG_CATCH() should PG_RE_THROW()
(5) Assert() if returning from inside a PG_TRY() / PG_CATCH() block
(6) [RFC] Add a more succinct API for error context stack callbacks
(7) [RFC] Use __has_attribute for compiler attribute detection where
possible
I also added a patch (6) on top as a RFC of sorts, which adds a more
succinct API for error context setup and teardown. It's not required for
these changes, but it might make sense to adopt it at the same time if
we're changing usage across the tree anyway.
A further patch (7) switches over to using __has_attribute instead of
explicit compiler version checks. Again, optional, more of a RFC.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Attachments:
0001-Add-some-documentation-on-error-context-callback-usa.patchtext/x-patch; charset=US-ASCII; name=0001-Add-some-documentation-on-error-context-callback-usa.patchDownload+54-2
0002-Add-pg_errcontext_check-to-detect-ErrorContextCallba.patchtext/x-patch; charset=US-ASCII; name=0002-Add-pg_errcontext_check-to-detect-ErrorContextCallba.patchDownload+91-2
0004-Document-that-generally-PG_CATCH-should-PG_RE_THROW.patchtext/x-patch; charset=US-ASCII; name=0004-Document-that-generally-PG_CATCH-should-PG_RE_THROW.patchDownload+9-2
0005-Assert-if-returning-from-inside-a-PG_TRY-PG_CATCH-bl.patchtext/x-patch; charset=US-ASCII; name=0005-Assert-if-returning-from-inside-a-PG_TRY-PG_CATCH-bl.patchDownload+22-1
0003-Enable-pg_errcontext_check-for-all-in-tree-users.patchtext/x-patch; charset=US-ASCII; name=0003-Enable-pg_errcontext_check-for-all-in-tree-users.patchDownload+48-49
0006-Add-a-more-succinct-API-for-error-context-stack-call.patchtext/x-patch; charset=US-ASCII; name=0006-Add-a-more-succinct-API-for-error-context-stack-call.patchDownload+17-10
0007-Use-__has_attribute-for-compiler-attribute-detection.patchtext/x-patch; charset=US-ASCII; name=0007-Use-__has_attribute-for-compiler-attribute-detection.patchDownload+95-14
Craig Ringer <craig@2ndquadrant.com> writes:
The attached patch series adds support for detecting coding errors where a
stack-allocated ErrorContextCallback is not popped from the
error_context_stack before the variable leaves scope.
So my immediate thoughts about this are
(1) It's mighty invasive for what it accomplishes. AFAIK we have
had few of this class of bug, and so I'm not excited about touching
every callback use in the tree to add defenses. It's also not great
that future code additions won't be protected unless they remember
to add these magic annotations. The PG_TRY application is better since
it's wrapped into the existing macro.
(2) I don't like that this is adding runtime overhead to try to detect
such bugs.
(3) I think it's a complete failure that such a bug will only be
detected if the faulty code path is actually taken. I think it's
quite likely that any such bugs that are lurking are in "can't
happen", or at least hard-to-hit, corner cases --- if they were in
routinely tested paths, we'd have noticed them by now. So it'd be far
more helpful if we had a static-analysis way to detect such issues.
Thinking about (3), I wonder whether there's a way to instruct Coverity
to detect such errors.
regards, tom lane
On Thu, 3 Sep 2020 at 22:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
The attached patch series adds support for detecting coding errors where
a
stack-allocated ErrorContextCallback is not popped from the
error_context_stack before the variable leaves scope.So my immediate thoughts about this are
(1) It's mighty invasive for what it accomplishes. AFAIK we have
had few of this class of bug, and so I'm not excited about touching
every callback use in the tree to add defenses. It's also not great
that future code additions won't be protected unless they remember
to add these magic annotations. The PG_TRY application is better since
it's wrapped into the existing macro.
I agree with you there. I'd actually like to change how we set up
errcontext callbacks anyway, as I think they're ugly, verbose and
error-prone.
See patch 6 for a RFC on that.
With regards to PG_TRY() you may note that patch 5 adds a similar check to
detect escapes from PG_TRY() / PG_CATCH() bodies.
(2) I don't like that this is adding runtime overhead to try to detect
such bugs.
Right. Only in cassert builds, though.
Frankly I'd be happy enough even having it as something I can turn on when
I wanted it. I've had a heck of a time tracking down errcontext escapes in
the past. I wrote it for an extension, then figured I'd submit it to core
and see if anyone wanted it.
Even if we don't adopt it in PostgreSQL, now it's out there to help out
others debugging similar issues.
(3) I think it's a complete failure that such a bug will only be
detected if the faulty code path is actually taken. I think it's
quite likely that any such bugs that are lurking are in "can't
happen", or at least hard-to-hit, corner cases --- if they were in
routinely tested paths, we'd have noticed them by now. So it'd be far
more helpful if we had a static-analysis way to detect such issues.Thinking about (3), I wonder whether there's a way to instruct Coverity
to detect such errors.
I think calling it a "complete failure" is ridiculous. By the same
rationale, we shouldn't bother with Assert(...) at all. But I agree that
it's definitely far from as good as having a statically verifiable check
would be, and this *should* be something we can statically verify.
I actually had a pretty good look around for static analysis options to see
if I could find anything that might help us out before I landed up with
this approach.
The sticking point is the API we use. By using auto stack variables (and
doing so in pure C where they cannot have destructors) and using direct
assignments to globals, we cannot use the majority of static analysis
annotations since they tend to be aimed at function-based APIs. And for
some reason most static analysis tools appear to be poor at discovering
escapes of pointers to stack variables leaving scope, at least unless you
use annotated functions that record ownership transfers. Which we can't
because ... direct assignment.
It's one of the reasons I want to wrap the errcontext API per patch 6 in
the above series. The current API is extremely verbose, hard to validate
and check, and difficult to apply static analysis to.
If we had error context setup/teardown macros we could implement them using
static inline functions and annotate them as appropriate for the target
toolchain and available static analysis tools.
So what do you think of patch 6?
I'll try tweaking the updated API to add annotateable functions and see if
that helps static analysis tools detect issues, then report back.
I personally think it'd be well worth adopting a wrapped API and
simultaneously renaming ErrorContextCallback to ErrorContextCallbackData to
ensure that code must be updated to compile with the changes. It'd be
simple to provide a backwards compatibility header that extension authors
can copy into their trees so they can use the new API when building against
old postgres, greatly reducing the impact on extension authors. (We do that
sort of thing constantly in pglogical and BDR).
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
On Fri, 4 Sep 2020 at 14:13, Craig Ringer <craig@2ndquadrant.com> wrote:
I actually had a pretty good look around for static analysis options to
see if I could find anything that might help us out before I landed up with
this approach.
Apparently not good enough.
https://clang.llvm.org/docs/analyzer/checkers.html#core-stackaddressescape-c
Using a test program:
return_stack_escape.c:14:3: warning: Address of stack memory associated
with local variable 'g' is still referred to by the global variable
'guard_ptr' upon returning to the caller. This will be a dangling reference
return do_fail;
^~~~~~~~~~~~~~
1 warning generated.
so ... that's interesting. I'll need to do some checking and verify that
it's effective on the actual problem I originally had, but if so, I shall
proceed with kicking myself now.
Handily, the same thing can be used to detect PG_TRY() escapes.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
On Fri, 4 Sep 2020 at 14:55, Craig Ringer <craig@2ndquadrant.com> wrote:
Using a test program:
return_stack_escape.c:14:3: warning: Address of stack memory associated
with local variable 'g' is still referred to by the global variable
'guard_ptr' upon returning to the caller. This will be a dangling reference
return do_fail;
^~~~~~~~~~~~~~
1 warning generated.
Example here:
https://github.com/ringerc/scrapcode/tree/master/c/clang_return_stack_checks
So I find that actually, the __attribute__((callback(fn)) approach is
unnecessary for the purpose proposed.
I still think we should be looking at tidying up the error contexts API,
but it's easy enough for extensions to provide a wrapper over it, so that's
not a big deal really.
I will submit my docs patches separately to ensure they get some attention,
and I'll experiment with llvm threadsafety annotations / capabilities.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Craig Ringer <craig@2ndquadrant.com> writes:
Example here:
https://github.com/ringerc/scrapcode/tree/master/c/clang_return_stack_checks
So I find that actually, the __attribute__((callback(fn)) approach is
unnecessary for the purpose proposed.
I tested this by injecting some faults of the described sort.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..eaf7716816 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3025,6 +3025,8 @@ CopyFrom(CopyState cstate)
myslot = CopyMultiInsertInfoNextFreeSlot(&multiInsertInfo,
resultRelInfo);
+ if (!myslot)
+ return 0;
}
/*
leads to
/home/tgl/pgsql/src/backend/commands/copy.c:3029:6: warning: Address of stack memory associated with local variable 'errcallback' is still referred to by the global variable 'error_context_stack' upon returning to the caller. This will be a dangling reference
return 0;
^~~~~~~~
So that's good. However, a similar experiment with returning from inside
a PG_TRY does *not* produce a warning. I suspect maybe the compiler
throws up its hands when it sees sigsetjmp?
(These results from clang 10.0.0 on Fedora 32.)
regards, tom lane
On Tue, 8 Sep 2020 at 10:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So that's good. However, a similar experiment with returning from inside
a PG_TRY does *not* produce a warning. I suspect maybe the compiler
throws up its hands when it sees sigsetjmp?
I find that odd myself, given that in PG_TRY() we:
PG_exception_stack = &_local_sigjmp_buf;
and a return within that block should count as a pointer to
_local_sigjmp_buf escaping. Yet I confirm that I don't get a warning on
clang-analyzer 10.0.0 (fedora 32) in my builds either. Even with -O0.
TL;DR:
-------
It's not due to sigsetjmp() at all though; see
https://github.com/ringerc/scrapcode/blob/master/c/clang_return_stack_checks/mask_local_escape.c
for part of what's going on.
I haven't worked out why sometimes an unrelated stack escape gets hidden
though.
Attached patches demonstrate tests and are DEFINITELY NOT proposals for
core!
SUMMARY
-----
I found per test details below (links and demo patches supplied, patches
NOT proposed for inclusion in pg) that scan-build seems to be able to track
a local escape via simple indirection such as copying the address to a
local first, but cannot track it when it is assigned to a struct member in
a global pointer if that global pointer is then replaced, even if it's
restored immediately.
Demonstration code to illustrate this is at
https://github.com/ringerc/scrapcode/blob/master/c/clang_return_stack_checks/mask_local_escape.c
This produces no warnings from scan-build but leaks a pointer to an
auto-variable on the stack, then dereferences it to read garbage. I
deliberately clobber the stack to ensure that it's predictable garbage, so
we get:
$ make run_mask_local_escape
./mask_local_escape
&g = 0x7fff1b31e6bc
&g has escaped: 0x7fff1b31e6bc
value is: 0x7f7f7f7f
I think we're using a similar pattern for PG_error_stack, and that's why we
don't get any warnings.
In this case valgrind detects the issue when the bogus data is actually
read. Run "valgrind ./mask_local_escape". But I'm pretty sure I've seen
other cases where it does not.
So I'm wondering if the explicit __attribute__((callback(cbfn))) approach
has some merits after all, for cassert builds. Assuming we care about this
error, since it's a bit of an academic exercise at this point. I just
needed to understand what was going on and why there was no warning.
DETAILS OF TESTS
-----
If I wrap the sigjmp_buf in a struct defined in Pg itself and allow that to
escape I see the same behaviour, so it's not specific to sigjmp_buf.
I tried adding a dummy guard variable to PG_TRY() and PG_END_TRY() solely
for the purpose of tracking this sort of escape. But I notice that a
warning from scan-build is only emitted in PG_CATCH() or PG_FINALLY(),
never from inside the PG_TRY() body. I'd expect to see a warning from
either branch, as it's escaping either way.
scan-build will raise other warnings from within the PG_CATCH() block, such
as if I introduce a
+ int foo = 0;
+ fprintf(stdout, "%d", 10/foo);
so it's not that scan-build as a whole is failing to take the sigsetjmp()
== 0 branch. It raises *other* warnings. This can also be verified by
replacing sigsetjmp() with a dummy static inline function that always takes
one branch or the other.
I landed up writing the experiment code at
https://github.com/ringerc/scrapcode/blob/master/c/clang_return_stack_checks/
to explore the behaviour some more.
The file sigjmp_escape.c explores various combinations of guard variables
and early returns. The file sigjmp_escape_hdr.c and its companions
sigjmp_escape_hdr_try.{c,h} check whether indirection via separate header
and compilation unit makes any difference (it didn't).
Initially I found that in my PG_TRY()-alike test code I got a warning like:
sigjmp_escape_hdr.c:36:4: warning: Address of stack memory associated with
local variable '_local_sigjmp_buf' is still referred to by the global
variable 'TEST_exception_stack' upon returning to the caller. This will be
a dangling reference
return;
^~~~~~
just like I want to see from inside a PG_TRY() block.
But with some further testing I noticed that it seems the checker may not
notice escapes where there's even the simplest level of indirection, and
also that one candidate escape may mask another. I'm not totally sure yet.
In the sigjmp_escape.c test, if I build it with both inner and outer guards
(-DUSE_INNER_GUARDS), scan-build will only complain about the inner guard
pointer g_branch_1 escaping. It will not complain about the stack pointer
to the outer guard escaping, nor about the possibility of
should_return_early causing the sigjmp_buf to escape into the global
sigjmp_buf * jbuf. Yet running
./sigjmp_escape 1 1
shows that both can occur.
This still doesn't explain how it *also* misses the escape of &b into jbuf.
Is it losing track of the unrelated escape when it stops tracking the first
one?
The test case sigjmp_escape_hdr doesn't attempt to use the guards, and it
reports the sigjmp_buf escape fine as shown above.
If sigjmp_escape.c is built with -DUSE_INNER_GUARDS -DPOP_ONLY_INNER_GUARDS
then it raises no warnings at all, BUT has both the sigjmp_buf leak AND a
guard variable stack escape as seen by:
./sigjmp_escape_inner_pop 1 1
We've seen above that it loses the outer guard pointer escape because it
doesn't track any sort of indirection of assignments.
What's a bit surprising is that it *also* fails to complain about the
escape of the sigjmp_buf pointer &b into the global variable jbuf in this
case, even though it successfully detects the escape with other
combinations of guard uses. It's like it can't track multiple things at
once properly.
RELATED: detecting return from PG_FINALLY()
--------
Note that even if scan-build did detect the escape of the sigjmp_buf,
returning from PG_FINALLY() is a coding error that we cannot detect this
way since we've restored PG_errror_stack so there's no local auto ptr to
escape. To do that we'd need our own stack that we pushed in PG_TRY() and
didn't pop until PG_END_TRY(), like in the demo patch attached. That works,
but adds another local to each PG_TRY() that's otherwise useless. If we
really felt like it we could use it to produce a poor-man's backtrace
through each PG_TRY() at the cost of one sizeof(void) per PG_TRY() + some
constdata storage, but that seems like a solution looking for a problem.
Also, AFAICS it's actually harmless, if ugly, to return from PG_CATCH(). It
should still be discouraged.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise