9.4 failure on skink in _bt_newroot/XLogCheckBuffer

Started by Andres Freund almost 10 years ago · 4 messages · hackers
#1 Andres Freund
andres@anarazel.de

The valgrind animal just reported a large object related failure on 9.4:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=skink&dt=2016-05-19%2006%3A23%3A05

==9952== VALGRINDERROR-BEGIN
==9952== Conditional jump or move depends on uninitialised value(s)
==9952== at 0x4DC6D3: XLogCheckBuffer (xlog.c:2077)
==9952== by 0x4E5E52: XLogInsert (xlog.c:956)
==9952== by 0x4ACB10: _bt_newroot (nbtinsert.c:2123)
==9952== by 0x4ACEFF: _bt_insert_parent (nbtinsert.c:1727)
==9952== by 0x4AD4B7: _bt_insertonpg (nbtinsert.c:776)
==9952== by 0x4AE56F: _bt_doinsert (nbtinsert.c:191)
==9952== by 0x4B3409: btinsert (nbtree.c:251)
==9952== by 0x7A87E3: FunctionCall6Coll (fmgr.c:1437)
==9952== by 0x4A8D36: index_insert (indexam.c:226)
==9952== by 0x4FC62C: CatalogIndexInsert (indexing.c:136)
==9952== by 0x6A7210: inv_write (inv_api.c:723)
==9952== by 0x5E2985: lo_write (be-fsstubs.c:223)
==9952== Uninitialised value was created by a stack allocation
==9952== at 0x4AC481: _bt_newroot (nbtinsert.c:1989)
==9952==
==9952== VALGRINDERROR-END

I've not analyzed the problem beyond noticing that xlog.c:2077 is
    if (rdata->buffer_std)
which suggests an actual bug.

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: 9.4 failure on skink in _bt_newroot/XLogCheckBuffer

Andres Freund <andres@anarazel.de> writes:

The valgrind animal just reported a large object related failure on 9.4:

The proximate cause seems to be that _bt_newroot isn't bothering to
fill the buffer_std field here:

/* Make a full-page image of the left child if needed */
rdata[2].data = NULL;
rdata[2].len = 0;
rdata[2].buffer = lbuf;
rdata[2].next = NULL;

which is indeed an actual bug, but the only consequence would be poor
compression of the full-page image (if the value chanced to be zero),
so it's not much of a problem.
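To illustrate the point above, here is a minimal sketch using a simplified stand-in for the 9.4-era XLogRecData struct. The struct layout, the plain int used for the buffer, and the helper names fill_left_child_buggy and fill_left_child_fixed are assumptions made for illustration, not PostgreSQL source. It shows how assigning fields one at a time leaves any skipped field with whatever value it had before, and what a fix would presumably look like (setting buffer_std to true is an assumption, based on the remark that a zero value merely costs compression).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the 9.4-era XLogRecData; NOT the real definition. */
typedef struct XLogRecDataSketch
{
    char       *data;        /* start of rmgr data to include */
    uint32_t    len;         /* length of rmgr data */
    int         buffer;      /* buffer associated with data (int here for simplicity) */
    bool        buffer_std;  /* does the buffer use the standard page layout? */
    struct XLogRecDataSketch *next;
} XLogRecDataSketch;

/* Mirrors the quoted snippet: every field is assigned except buffer_std. */
void fill_left_child_buggy(XLogRecDataSketch *rd, int lbuf)
{
    assert(rd != NULL);
    rd->data = NULL;
    rd->len = 0;
    rd->buffer = lbuf;
    rd->next = NULL;
    /* rd->buffer_std is untouched, so it keeps its previous (possibly garbage) value */
}

/* Hypothetical fix: assign buffer_std along with the other fields. */
void fill_left_child_fixed(XLogRecDataSketch *rd, int lbuf)
{
    assert(rd != NULL);
    rd->data = NULL;
    rd->len = 0;
    rd->buffer = lbuf;
    rd->buffer_std = true;   /* assumed correct value for a btree page */
    rd->next = NULL;
}
```

With a stack-allocated array, as in _bt_newroot, the "previous value" in the buggy variant is simply whatever happened to be on the stack, which is what valgrind flags.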

What remains unclear is how come this only fails once in a blue moon.
Seems like any valgrind run of the regression tests should have caught it.

regards, tom lane


#3 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: 9.4 failure on skink in _bt_newroot/XLogCheckBuffer

Hi Tom,

On 2016-05-21 17:18:14 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

The valgrind animal just reported a large object related failure on 9.4:

The proximate cause seems to be that _bt_newroot isn't bothering to
fill the buffer_std field here:

/* Make a full-page image of the left child if needed */
rdata[2].data = NULL;
rdata[2].len = 0;
rdata[2].buffer = lbuf;
rdata[2].next = NULL;

which is indeed an actual bug, but the only consequence would be poor
compression of the full-page image (if the value chanced to be zero),
so it's not much of a problem.

Thanks for fixing that one!

What remains unclear is how come this only fails once in a blue moon.
Seems like any valgrind run of the regression tests should have caught it.

Looks like a timing issue. The relevant access to the uninitialized
buffer_std field only happens when
if (*lsn <= RedoRecPtr)
{
which presumably is not that likely to be hit. Even under valgrind, the
individual tests are likely to finish within a checkpoint timeout.
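A rough model of why the uninitialized read is rare, as a sketch only: the function below is a simplified stand-in for the check described above, not the actual XLogCheckBuffer. The names check_buffer_sketch, RecDataSketch, and the buffer_std_reads counter are hypothetical. The point is that buffer_std is only consulted when the page LSN is at or before the redo pointer, that is, when a full-page image is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtrSketch;   /* stand-in for XLogRecPtr */

typedef struct
{
    bool buffer_std;                 /* the field left unset in _bt_newroot */
} RecDataSketch;

/* Counts how often buffer_std is actually read (for demonstration only). */
int buffer_std_reads = 0;

/*
 * Returns true if a full-page image would be needed. buffer_std is read
 * only inside the (page_lsn <= redo_ptr) branch, so an uninitialized value
 * goes unnoticed whenever the page was already modified since the redo point.
 */
bool check_buffer_sketch(const RecDataSketch *rd,
                         XLogRecPtrSketch page_lsn,
                         XLogRecPtrSketch redo_ptr,
                         bool *hole_removed)
{
    *hole_removed = false;

    if (page_lsn <= redo_ptr)
    {
        buffer_std_reads++;
        if (rd->buffer_std)
            *hole_removed = true;    /* standard layout: unused space can be omitted */
        return true;
    }
    return false;                    /* buffer_std never examined on this path */
}
```

In this model, a page whose LSN is newer than the redo pointer never touches buffer_std, which matches the observation that most valgrind runs do not trip over the bug.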

Greetings,

Andres Freund


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: 9.4 failure on skink in _bt_newroot/XLogCheckBuffer

Andres Freund <andres@anarazel.de> writes:

On 2016-05-21 17:18:14 -0400, Tom Lane wrote:

What remains unclear is how come this only fails once in a blue moon.
Seems like any valgrind run of the regression tests should have caught it.

Looks like a timing issue.

Yeah, I came to the same conclusion after a while.

regards, tom lane
