recent failures on lorikeet

Started by Andrew Dunstanover 4 years ago6 messages

andrew@dunslane.net

over 4 years ago

I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE'
which fails.

2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG: statement: VACUUM FULL pg_class;
2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG: statement: CREATE TABLE persons OF person_type;
2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG: statement: CREATE TABLE a (aa TEXT);
*** starting debugger for pid 59584, tid 9640
2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG: server process (PID 59584) exited with exit code 127
2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL: Failed process was running: CREATE TABLE persons OF person_type;
2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG: terminating any other active server processes

Getting stack traces in this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions might be useful. Note that we're not getting the same issue
on REL_13_STABLE, where the same group pf tests run together (inherit
typed_table, vacuum)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Andrew Dunstan (#1)

Re: recent failures on lorikeet

Andrew Dunstan <andrew@dunslane.net> writes:

I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE'
which fails.

Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.) I hadn't spotted the association with a concurrent "VACUUM
FULL pg_class" before, that does seem interesting.

Getting stack traces in this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions might be useful. Note that we're not getting the same issue
on REL_13_STABLE, where the same group pf tests run together (inherit
typed_table, vacuum)

If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix. Seems like a stack trace is a necessity to
narrow it down.

regards, tom lane

Andrew Dunstan

andrew@dunslane.net

over 4 years ago

In reply to: Tom Lane (#2)

Re: recent failures on lorikeet

On 6/14/21 9:39 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I've been looking at the recent spate of intermittent failures on my
Cygwin animal lorikeet. Most of them look something like this, where
there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE'
which fails.

Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.) I hadn't spotted the association with a concurrent "VACUUM
FULL pg_class" before, that does seem interesting.

Getting stack traces in this platform can be very difficult. I'm going
to try forcing complete serialization of the regression tests
(MAX_CONNECTIONS=1) to see if the problem goes away. Any other
suggestions might be useful. Note that we're not getting the same issue
on REL_13_STABLE, where the same group pf tests run together (inherit
typed_table, vacuum)

If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix. Seems like a stack trace is a necessity to
narrow it down.

Some have given stack traces and some not, not sure why. The one from
June 13 has this:

---- backtrace ----
??
??:0
WaitOnLock
src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended
src/backend/storage/lmgr/lock.c:1119
LockRelationOid
src/backend/storage/lmgr/lmgr.c:135
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ScanPgRelation
src/backend/utils/cache/relcache.c:322
RelationBuildDesc
src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation
src/backend/utils/cache/relcache.c:2045
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ExecInitPartitionInfo
src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting
src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable
src/backend/executor/nodeModifyTable.c:2559
ExecutePlan
src/backend/executor/execMain.c:1557

The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Andrew Dunstan

andrew@dunslane.net

over 4 years ago

In reply to: Andrew Dunstan (#3)

Re: recent failures on lorikeet

On 6/14/21 12:33 PM, Andrew Dunstan wrote:

The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.

Of course I meant the line (1831) in lock.c.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Andrew Dunstan (#3)

Re: recent failures on lorikeet

Andrew Dunstan <andrew@dunslane.net> writes:

The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.

So what we've got there is

old_status = get_ps_display(&len);
new_status = (char *) palloc(len + 8 + 1);
memcpy(new_status, old_status, len);
strcpy(new_status + len, " waiting");
set_ps_display(new_status);
new_status[len] = '\0'; /* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that
could fail, unless palloc has shirked its job. I'm thinking that
the crash is really in the memcpy --- looking at the other lines
in your trace, fingering the line after the call seems common.

What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.

A platform-specific problem in get_ps_display() seems plausible
enough. The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Tom Lane (#5)

Re: recent failures on lorikeet

I wrote:

What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.
A platform-specific problem in get_ps_display() seems plausible
enough. The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.

If I understand correctly that you're only seeing this in v13 and
HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
deserves a hard look.

regards, tom lane