Some bogus results from prairiedog

Started by Tom Lane over 11 years ago (7 messages)
#1 Tom Lane
tgl@sss.pgh.pa.us

According to
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
but there's not a lot of evidence as to why in the buildfarm report,
because the postmaster log file is truncated well before where things got
interesting. Fortunately, I was able to capture a copy of check.log
before it got overwritten by the next run. I find that the place where
the webserver report stops matches this section of check.log:

[53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
[53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
[53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
[53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');

The ^@'s represent nul bytes, which I find runs of elsewhere in the file
as well. I think they are an artifact of OS X buffering policy caused by
multiple processes writing into the same file without any interlocks.
Perhaps we ought to consider making buildfarm runs use the logging
collector by default? But in any case, it seems uncool that either the
buildfarm log-upload process, or the buildfarm web server, is unable to
cope with log files containing nul bytes.

Anyway, to cut to the chase, the crash seems to be from this:

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)

which the postmaster shortly later says this about:

[53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
[53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
[53cd99b6.130e:4] LOG: terminating any other active server processes

So there is still something rotten in the fastpath lock logic.

regards, tom lane

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Some bogus results from prairiedog

On 07/22/2014 12:24 AM, Tom Lane wrote:

> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting. Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run. I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\
> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well. I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default? But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.

The data is there, just click on the "check" stage link at the top of
the page to see it in raw form.

cheers

andrew

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: Some bogus results from prairiedog

On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting. Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run. I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\
> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well. I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default? But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.
>
> Anyway, to cut to the chase, the crash seems to be from this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>
> which the postmaster shortly later says this about:
>
> [53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
> [53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
> [53cd99b6.130e:4] LOG: terminating any other active server processes
>
> So there is still something rotten in the fastpath lock logic.

Gosh, that sucks.

The inconstancy of this problem would seem to suggest some kind of
locking bug rather than a flat-out concurrency issue, but it looks to
me like everything relevant is marked volatile. But ... maybe the
problem could be in s_lock.h? That machine is powerpc, and powerpc's
definition of S_UNLOCK looks like this:

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" lwsync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" sync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif /* USE_PPC_LWSYNC */

I believe Andres recently reported that this isn't enough to prevent
compiler reordering; for that, the __asm__ __volatile__ would need to
be augmented with ::: "memory". If that's correct, then the compiler
could decide to issue the volatile store before the lwsync or sync.
Then, the CPU could decide to perform the volatile store to the
spinlock before performing the update to FastPathStrongRelationLocks.
That would explain this problem.
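
For illustration, a sketch of what that change would look like (untested,
and assuming the "memory" clobber is the only missing piece):

/* Sketch: PPC S_UNLOCK with a "memory" clobber added, so the compiler
 * cannot reorder the unlocking store ahead of the barrier instruction. */
#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" lwsync \n" ::: "memory"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" sync \n" ::: "memory"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif /* USE_PPC_LWSYNC */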

The other possible explanation (at least, AFAICS) is that
lock_twophase_postcommit() is trying to release a strong lock count
that it doesn't actually hold. That would indicate a
strong-lock-count handling bug upstream someplace - e.g. that the
count got released when the transaction was prepared, or somesuch. I
certainly can't rule that out, although I don't know exactly where to
look. We could add an assertion to AtPrepare_Locks(), just before
setting locallock->holdsStrongLockCount = FALSE, that
locallock->holdsStrongLockCount is true if and only if the lock tag
and mode suggest that it ought to be. But that's really just guessing
wildly.
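
To make that concrete, the cross-check I'm imagining is roughly this, in
AtPrepare_Locks() right before the flag is cleared (untested sketch;
whether the invariant actually holds in every case is exactly the question):

    /* Sketch: verify the local flag agrees with what the tag and mode imply */
    Assert(locallock->holdsStrongLockCount ==
           ConflictsWithRelationFastPath(&locallock->tag.lock,
                                         locallock->tag.mode));
    locallock->holdsStrongLockCount = FALSE;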

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#2)
Re: Some bogus results from prairiedog

On 07/22/2014 10:55 AM, Andrew Dunstan wrote:

> On 07/22/2014 12:24 AM, Tom Lane wrote:
>
>> According to
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
>>
>> prairiedog saw a crash in "make check" on the 9.4 branch earlier
>> tonight;
>> but there's not a lot of evidence as to why in the buildfarm report,
>> because the postmaster log file is truncated well before where things
>> got
>> interesting. Fortunately, I was able to capture a copy of check.log
>> before it got overwritten by the next run. I find that the place where
>> the webserver report stops matches this section of check.log:
>>
>> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx
>> on test_range_gist using gist (ir);
>> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist
>> select int4range(g, g+10) from generate_series(1,2000) g;
>> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\
>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG:
>> statement: INSERT INTO num_exp_div VALUES
>> (7,8,'-1108.80577182462841041118');
>> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES
>> (7,9,'-107955289.045047420');
>> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES
>> (7,9,'-58101680.954952580');
>>
>> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
>> as well. I think they are an artifact of OS X buffering policy
>> caused by
>> multiple processes writing into the same file without any interlocks.
>> Perhaps we ought to consider making buildfarm runs use the logging
>> collector by default? But in any case, it seems uncool that either the
>> buildfarm log-upload process, or the buildfarm web server, is unable to
>> cope with log files containing nul bytes.
>
> The data is there, just click on the "check" stage link at the top of
> the page to see it in raw form.

I have made a change in the upload receiver app to escape nul bytes in
the main log field too. This will operate prospectively.

Using the logging collector would be a larger change, but we can look at
it if this isn't sufficient.

cheers

andrew

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: Some bogus results from prairiedog

Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Anyway, to cut to the chase, the crash seems to be from this:
>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>> So there is still something rotten in the fastpath lock logic.
>
> Gosh, that sucks.
>
> The inconstancy of this problem would seem to suggest some kind of
> locking bug rather than a flat-out concurrency issue, but it looks to
> me like everything relevant is marked volatile.

I don't think that you need any big assumptions about machine-specific
coding issues to spot the problem. The assert in question is here:

    /*
     * Decrement strong lock count. This logic is needed only for 2PC.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

and it sure looks to me like that
"ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
shared-memory lock object you just released. If someone else had managed
to recycle that locktable entry for some other purpose, the
ConflictsWithRelationFastPath call might incorrectly return true.

I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
something.
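
That is, assuming I'm reading it right, the block would become (untested):

    /*
     * Decrement strong lock count. This logic is needed only for 2PC.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(locktag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

using the locktag the caller passed in, which is still valid, rather than
the shared lock object we may just have handed back.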

regards, tom lane

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: Some bogus results from prairiedog

On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>
>> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> Anyway, to cut to the chase, the crash seems to be from this:
>>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>>> So there is still something rotten in the fastpath lock logic.
>>
>> Gosh, that sucks.
>>
>> The inconstancy of this problem would seem to suggest some kind of
>> locking bug rather than a flat-out concurrency issue, but it looks to
>> me like everything relevant is marked volatile.
>
> I don't think that you need any big assumptions about machine-specific
> coding issues to spot the problem.

I don't think that I'm making what could be described as big
assumptions; I think we should fix and back-patch the PPC64 spinlock
change.

But...

> The assert in question is here:
>
>     /*
>      * Decrement strong lock count. This logic is needed only for 2PC.
>      */
>     if (decrement_strong_lock_count
>         && ConflictsWithRelationFastPath(&lock->tag, lockmode))
>     {
>         uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);
>
>         SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
>         Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
>         FastPathStrongRelationLocks->count[fasthashcode]--;
>         SpinLockRelease(&FastPathStrongRelationLocks->mutex);
>     }
>
> and it sure looks to me like that
> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
> shared-memory lock object you just released. If someone else had managed
> to recycle that locktable entry for some other purpose, the
> ConflictsWithRelationFastPath call might incorrectly return true.
>
> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
> something.

...this is probably the real cause of the failures we've actually been
seeing. I'll go back-patch that change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: Some bogus results from prairiedog

Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> and it sure looks to me like that
>> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
>> shared-memory lock object you just released. If someone else had managed
>> to recycle that locktable entry for some other purpose, the
>> ConflictsWithRelationFastPath call might incorrectly return true.
>>
>> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
>> something.
>
> ...this is probably the real cause of the failures we've actually been
> seeing. I'll go back-patch that change.

For the archives' sake --- I took another look at the two previous
instances we'd seen of FastPathStrongRelationLocks assertion failures:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

Neither of these tripped at the bug site in LockRefindAndRelease, but
that's not so surprising, because there was no Assert there at the time.
An occurrence of the bug would have silently left a negative entry in the
FastPathStrongRelationLocks->count[] array, which would have triggered an
assertion by the next process unlucky enough to use the same array entry.

In the prairiedog case, the assert happened in a test running concurrently
with the prepared_xacts test (which is presumably where the bug occurred),
so that this seems a highly plausible explanation. In the rover_firefly
case, the assert happened quite a bit later than prepared_xacts; but it
might've been just luck that that particular array entry (out of 1024)
wasn't re-used for awhile.

So it's not certain that this one bug explains all three cases,
but it seems well within the bounds of plausibility that it does.
Also, the narrow width of the race condition window (from
LWLockRelease(partitionLock) to an inlined test in the next line)
is consistent with the very low failure rate we've seen in the buildfarm.

BTW, I wonder whether it would be interesting for testing purposes to
have a build option that causes SpinLockRelease and LWLockRelease to
do a sched_yield before returning. One could assume that that would
greatly increase the odds of detecting this type of bug.
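
Something along these lines is what I have in mind for the spinlock case
(only a sketch; the LOCK_RELEASE_YIELD symbol is invented for illustration):

/* In spin.h: optionally yield the CPU right after releasing, to widen
 * any race window between the release and whatever the caller does next. */
#ifdef LOCK_RELEASE_YIELD
#include <sched.h>
#define SpinLockRelease(lock) \
    do { \
        S_UNLOCK(lock); \
        sched_yield(); \
    } while (0)
#else
#define SpinLockRelease(lock) S_UNLOCK(lock)
#endif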

regards, tom lane
