recent failures on lorikeet
I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE'
which fails.
2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG: statement: VACUUM FULL pg_class;
2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG: statement: CREATE TABLE persons OF person_type;
2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG: statement: CREATE TABLE a (aa TEXT);
*** starting debugger for pid 59584, tid 9640
2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG: server process (PID 59584) exited with exit code 127
2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL: Failed process was running: CREATE TABLE persons OF person_type;
2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG: terminating any other active server processes
Getting stack traces on this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions would be welcome. Note that we're not getting the same issue
on REL_13_STABLE, where the same group of tests runs together (inherit,
typed_table, vacuum).
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE'
which fails.
Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.) I hadn't spotted the association with a concurrent "VACUUM
FULL pg_class" before, that does seem interesting.
Getting stack traces on this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions would be welcome. Note that we're not getting the same issue
on REL_13_STABLE, where the same group of tests runs together (inherit,
typed_table, vacuum).
If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix. Seems like a stack trace is a necessity to
narrow it down.
regards, tom lane
On 6/14/21 9:39 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE'
which fails.
Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.) I hadn't spotted the association with a concurrent "VACUUM
FULL pg_class" before, that does seem interesting.
Getting stack traces on this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions would be welcome. Note that we're not getting the same issue
on REL_13_STABLE, where the same group of tests runs together (inherit,
typed_table, vacuum).
If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix. Seems like a stack trace is a necessity to
narrow it down.
Some of the failures have produced stack traces and some have not; I'm not sure why. The one from
June 13 has this:
---- backtrace ----
??
??:0
WaitOnLock
src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended
src/backend/storage/lmgr/lock.c:1119
LockRelationOid
src/backend/storage/lmgr/lmgr.c:135
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ScanPgRelation
src/backend/utils/cache/relcache.c:322
RelationBuildDesc
src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation
src/backend/utils/cache/relcache.c:2045
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ExecInitPartitionInfo
src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting
src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable
src/backend/executor/nodeModifyTable.c:2559
ExecutePlan
src/backend/executor/execMain.c:1557
The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.
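For anyone not familiar with that setting, here's a minimal stand-alone
sketch (not the real ps_status.c; the buffer and function names are
invented for the example): a single boolean, analogous to the
update_process_title GUC, turns the title-setting call into a no-op,
which is roughly what disabling the setting on this animal does.
#include <stdbool.h>
#include <stdio.h>

static bool update_process_title = false;   /* "off" on this animal */
static char ps_buffer[64] = "postgres: pg_regress inherit";

static void
sketch_set_ps_display(const char *activity)
{
    /* With title updates disabled, leave the title buffer alone. */
    if (!update_process_title)
        return;
    snprintf(ps_buffer, sizeof(ps_buffer), "%s", activity);
}

int
main(void)
{
    sketch_set_ps_display("postgres: pg_regress inherit waiting");
    printf("%s\n", ps_buffer);               /* still the original title */
    return 0;
}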
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 6/14/21 12:33 PM, Andrew Dunstan wrote:
The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.
Of course I meant the line (1831) in lock.c.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away.
Oooh, that certainly seems like a smoking gun.
I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.
So what we've got there is
old_status = get_ps_display(&len);
new_status = (char *) palloc(len + 8 + 1);
memcpy(new_status, old_status, len);
strcpy(new_status + len, " waiting");
set_ps_display(new_status);
new_status[len] = '\0'; /* truncate off " waiting" */
Line 1831 is the strcpy, but it seems entirely impossible that that
could fail, unless palloc has shirked its job. I'm thinking that
the crash is really in the memcpy --- looking at the other lines
in your trace, fingering the line after the call seems common.
What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.
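To make that concrete, here's a self-contained sketch (not PostgreSQL
code; fake_get_ps_display() and the tiny static buffer are invented for
the illustration) of how a getter that misreports the title length would
push the memcpy off the end of the real buffer:
#include <stdlib.h>
#include <string.h>

/* Pretend ps-display buffer, far smaller than the length reported below. */
static char title_buf[16] = "SELECT";

/* A getter that hands back a bogus length (the hypothesized failure). */
static const char *
fake_get_ps_display(int *displen)
{
    *displen = 4096;            /* wrong: title_buf is only 16 bytes */
    return title_buf;
}

int
main(void)
{
    const char *old_status;
    char       *new_status;
    int         len;

    old_status = fake_get_ps_display(&len);
    new_status = malloc(len + 8 + 1);       /* stands in for palloc() */
    /* Reads len bytes from a 16-byte buffer: the out-of-bounds access
     * happens here, not on the strcpy that follows. */
    memcpy(new_status, old_status, len);
    strcpy(new_status + len, " waiting");
    free(new_status);
    return 0;
}
A bad pointer from the getter would fault on the same memcpy line; either
way the strcpy is innocent.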
A platform-specific problem in get_ps_display() seems plausible
enough. The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.
regards, tom lane
I wrote:
What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.
A platform-specific problem in get_ps_display() seems plausible
enough. The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.
If I understand correctly that you're only seeing this in v13 and
HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
deserves a hard look.
regards, tom lane