gcov coverage data not full with immediate stop

Started by Alexander Lakhinalmost 6 years ago12 messageshackers
Jump to latest
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

I've found that gcov coverage data miss some information when a postgres
node stopped in 'immediate' mode.
For example, on the master branch:
make coverage-clean; time make check -C src/test/recovery/; make
coverage-html
generates a coverage report with 106193 lines/6318 functions for me
(`make check` takes 1m34s).
But with the attached simple patch I get a coverage report with 106540
lines/6332 functions (and `make check` takes 2m5s).
(IMO, the slowdown of the test is significant.)

So if we want to make the coverage reports more precise, I see the three
ways:
1. Change the stop mode in teardown_node to fast (probably only when
configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems
too tedious and troublesome;
3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

Best regards,
Alexander

Attachments:

pgnode-stop.patchtext/x-patch; charset=UTF-8; name=pgnode-stop.patchDownload+1-1
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Lakhin (#1)
Re: gcov coverage data not full with immediate stop

(Strangely, I was just thinking about these branches of mine as I
closed my week last Friday...)

On 2020-May-10, Alexander Lakhin wrote:

So if we want to make the coverage reports more precise, I see the three
ways:
1. Change the stop mode in teardown_node to fast (probably only when
configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems
too tedious and troublesome;
3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

I tried your idea 3 a long time ago and my experiments didn't show an
increase in coverage [1]/messages/by-id/20190531170503.GA24057@alvherre.pgsql. But I like this idea the best, and maybe I
did something wrong. Attached is the patch I had (on top of
fc115d0f9fc6), but I don't know if it still applies.

(The second attachment is another branch I had on this, I don't remember
why; that one was on top of 438e51987dcc. The curious thing is that I
didn't add the __gcov_flush to quickdie in this one. Maybe what we need
is a mix of both.)

I think we should definitely get this fixed for pg13 ...

[1]: /messages/by-id/20190531170503.GA24057@alvherre.pgsql

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-add-gcov_flush-call-in-quickdie.patchtext/x-diff; charset=us-asciiDownload+18-2
0001-gcov_flush-stuff.patchtext/x-diff; charset=us-asciiDownload+126-1
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: gcov coverage data not full with immediate stop

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2020-May-10, Alexander Lakhin wrote:

3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

I tried your idea 3 a long time ago and my experiments didn't show an
increase in coverage [1]. But I like this idea the best, and maybe I
did something wrong. Attached is the patch I had (on top of
fc115d0f9fc6), but I don't know if it still applies.

Putting ill-defined, not-controlled-by-us work into a quickdie signal
handler sounds like a really bad idea to me. Maybe it's all right,
since presumably it would only appear in specialized test builds; but
even so, how much could you trust the results?

I think we should definitely get this fixed for pg13 ...

-1 for shoving in such a thing so late in the cycle. We've survived
without it for years, we can do so for a few months more.

regards, tom lane

#4Alexander Lakhin
exclusion@gmail.com
In reply to: Alvaro Herrera (#2)
Re: gcov coverage data not full with immediate stop

Hello Alvaro,
11.05.2020 06:42, Alvaro Herrera wrote:

(Strangely, I was just thinking about these branches of mine as I
closed my week last Friday...)

On 2020-May-10, Alexander Lakhin wrote:

So if we want to make the coverage reports more precise, I see the three
ways:
1. Change the stop mode in teardown_node to fast (probably only when
configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems
too tedious and troublesome;
3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

I tried your idea 3 a long time ago and my experiments didn't show an
increase in coverage [1]. But I like this idea the best, and maybe I
did something wrong. Attached is the patch I had (on top of
fc115d0f9fc6), but I don't know if it still applies.

Thanks for the reference to that discussion and your patch.
As I see the issue with that patch is that quickdie() is not the only
SIGQUIT handler. When a backend is interrupted with SIGQUIT, it's
exiting in SignalHandlerForCrashExit().
In fact if I only add __gcov_flush() in SignalHandlerForCrashExit(), it
raises test coverage for `make check -C src/test/recovery/` from
106198 lines/6319 functions
to
106420 lines/6328 functions

It's not yet clear to me what happens when __gcov_flush() called inside
__gcov_flush().
The test coverage changes to:
108432 lines/5417 functions
(number of function calls decreased)
And for example in coverage/src/backend/utils/cache/catcache.c.gcov.html
I see
���� 147���������� 8 : int2eqfast(Datum a, Datum b)
...
���� 153���������� 0 : int2hashfast(Datum datum)
but without __gcov_flush in quickdie() we have:
���� 147������ 78038 : int2eqfast(Datum a, Datum b)
...
���� 153����� 255470 : int2hashfast(Datum datum)
So it needs more investigation.

But I can confirm that calling __gcov_flush() in
SignalHandlerForCrashExit() really improves a code coverage report.
I tried to develop a test to elevate a coverage for gist:
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html
(Please look at the attached test if it could be interesting.)
and came to this issue with a coverage. I tried to play with
GCOV_PREFIX, but without luck.
Yesterday I found the more recent discussion:
/messages/by-id/44ecae53-9861-71b7-1d43-4658acc52519@2ndquadrant.com
(where probably the same problem came out).

Finally I've managed to get an expected coverage when I performed
$node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit()
helps too).

Best regards,
Alexander

Attachments:

021_indexes-test.patchtext/x-patch; charset=UTF-8; name=021_indexes-test.patchDownload+102-0
#5Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Alexander Lakhin (#4)
Re: gcov coverage data not full with immediate stop

On Mon, May 11, 2020 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves a code coverage report.

Finally I've managed to get an expected coverage when I performed $node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit() helps too).

What happens if a coverage tool other than gcov is used? From that
perspective, it's better to perform a clean shutdown in the TAP tests
instead of immediate if that's possible.

--
Best Wishes,
Ashutosh Bapat

#6Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: gcov coverage data not full with immediate stop

On Mon, May 11, 2020 at 12:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we should definitely get this fixed for pg13 ...

-1 for shoving in such a thing so late in the cycle. We've survived
without it for years, we can do so for a few months more.

I agree, but also, we should start thinking about when to branch. I,
too, have patches that aren't critical enough to justify pushing them
post-freeze, but which are still good improvements that I'd like to
get into the tree. I'm queueing them right now to avoid the risk of
destabilizing things, but that generates more work, for me and for
other people, if their patches force me to rebase or the other way
around. I know there's always a concern with removing the focus on
release N too soon, but the open issues list is 3 items long right
now, and 2 of those look like preexisting issues, not new problems in
v13. Meanwhile, we have 20+ active committers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: gcov coverage data not full with immediate stop

Robert Haas <robertmhaas@gmail.com> writes:

I agree, but also, we should start thinking about when to branch. I,
too, have patches that aren't critical enough to justify pushing them
post-freeze, but which are still good improvements that I'd like to
get into the tree. I'm queueing them right now to avoid the risk of
destabilizing things, but that generates more work, for me and for
other people, if their patches force me to rebase or the other way
around. I know there's always a concern with removing the focus on
release N too soon, but the open issues list is 3 items long right
now, and 2 of those look like preexisting issues, not new problems in
v13. Meanwhile, we have 20+ active committers.

Yeah. Traditionally we've waited till the start of the next commitfest
(which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
differently). But it seems like things are slow enough that perhaps
we could branch earlier, like June 1, and give the committers a chance
to deal with some of their own stuff before starting the CF.

This is the wrong thread to be debating that in, though. Also I wonder
if this is really RMT turf?

regards, tom lane

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: gcov coverage data not full with immediate stop

On Mon, May 11, 2020 at 4:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is the wrong thread to be debating that in, though.

True.

Also I wonder if this is really RMT turf?

I think it is, but the RMT is permitted -- even encouraged -- to
consider the views of non-RMT members when making its decision.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Tom Lane (#7)
Re: gcov coverage data not full with immediate stop

On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. Traditionally we've waited till the start of the next commitfest
(which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
differently). But it seems like things are slow enough that perhaps
we could branch earlier, like June 1, and give the committers a chance
to deal with some of their own stuff before starting the CF.

The RMT discussed this question informally yesterday. The consensus is
that we should wait and see what the early feedback from Beta 1 is
before making a final decision. An earlier June 1 branch date is an
idea that certainly has some merit, but we'd like to put off making a
final decision on that for at least another week, and possibly as long
as two weeks.

Can that easily be accommodated?

--
Peter Geoghegan

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#9)
Re: gcov coverage data not full with immediate stop

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. Traditionally we've waited till the start of the next commitfest
(which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
differently). But it seems like things are slow enough that perhaps
we could branch earlier, like June 1, and give the committers a chance
to deal with some of their own stuff before starting the CF.

The RMT discussed this question informally yesterday. The consensus is
that we should wait and see what the early feedback from Beta 1 is
before making a final decision. An earlier June 1 branch date is an
idea that certainly has some merit, but we'd like to put off making a
final decision on that for at least another week, and possibly as long
as two weeks.

Can that easily be accommodated?

There's no real lead time needed AFAICS: when we are ready to branch,
we can just do it. So sure, let's wait till the end of May to decide.
If things look bad then, we could reconsider again mid-June.

regards, tom lane

#11Michael Paquier
michael@paquier.xyz
In reply to: Ashutosh Bapat (#5)
Re: gcov coverage data not full with immediate stop

On Mon, May 11, 2020 at 05:56:33PM +0530, Ashutosh Bapat wrote:

What happens if a coverage tool other than gcov is used? From that
perspective, it's better to perform a clean shutdown in the TAP tests
instead of immediate if that's possible.

Nope, as that's the fastest path we have to shut down any remaining
nodes at the end of a test per the END{} block at the end of
PostgresNode.pm, and I would rather keep it this way because people
tend to like keeping around a lot of clusters alive at the end of any
new test added and shutdown checkpoints are not free either even if
fsync is enforced to off in the tests.

I think that a solution turning around __gcov_flush() could be the
best deal we have, as discussed last year in the thread Álvaro quoted
upthread, and I would vote for waiting until v14 opens for business
before merging something we consider worth it.
--
Michael

In reply to: Tom Lane (#10)
Re: gcov coverage data not full with immediate stop

On Tue, May 12, 2020 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Can that easily be accommodated?

There's no real lead time needed AFAICS: when we are ready to branch,
we can just do it. So sure, let's wait till the end of May to decide.
If things look bad then, we could reconsider again mid-June.

Great. Let's review it at the end of May, before actually branching.

--
Peter Geoghegan