Runaway backend at 100% CPU PostgreSQL v8.3.5

Started by Stephen R. van den Bergabout 17 years ago5 messages

I'm running Debian PostgreSQL v8.3.5-1 on x86 in 32-bit mode.
Every once in a while, some backends start taking 100% CPU, as can be seen
below in the excerpt from the process table:

27256 ? Ss 0:04 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/post
27299 ? Ss 0:00 \_ postgres: writer process
27300 ? Ss 0:00 \_ postgres: wal writer process
27301 ? Ss 0:00 \_ postgres: autovacuum launcher process
27302 ? Ss 0:26 \_ postgres: stats collector process
12076 ? Rs 5681:50 \_ postgres: cms cms 10.0.0.5(59125) PARSE
12853 ? Rs 5549:25 \_ postgres: cms cms 10.0.0.5(48437) PARSE
25985 ? Ss 0:00 \_ postgres: cms cms 10.0.0.18(45807) idle
25986 ? Ss 0:00 \_ postgres: cms cms 10.0.0.18(45808) idle

It seems that the backend is stuck in some kind of endless loop. Since
it's a production Debian server, the backend is not compiled with debugging
turned on. The best I can do is ltrace it, in hopes of someone recognising
the infinite sequence.
Ltracing the 12853 process reveals the following libc calls:

Label a:
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState") = 13
[pid 12853] strcpy(0x84fb0e8, "ExecutorState") = 0x84fb0e8
[pid 12853] malloc(8192) = 0x860da48
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb060, "ExprContext") = 0x84fb060
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext") = 0x84fbda0
[pid 12853] strncpy(0x860e498, "parent", 64) = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64) = 0x860e500
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd810, "ExprContext") = 0x85cd810
[pid 12853] malloc(8256) = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64) = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64) = 0x860e858
[pid 12853] strlen("SPI TupTable") = 12
[pid 12853] strcpy(0x85cd1b0, "SPI TupTable") = 0x85cd1b0
[pid 12853] malloc(8192) = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fb088) = 0x860e738
[pid 12853] free(0x8600a28) = <void>
[pid 12853] free(0x860da48) = <void>
[pid 12853] free(0x860fa50) = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState") = 13
[pid 12853] strcpy(0x85cd1b0, "ExecutorState") = 0x85cd1b0
[pid 12853] malloc(8192) = 0x860da48
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext") = 0x84fb0e8
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb060, "ExprContext") = 0x84fb060
[pid 12853] strncpy(0x860e498, "parent", 64) = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64) = 0x860e500
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext") = 0x84fbda0
[pid 12853] malloc(8256) = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64) = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64) = 0x860e858
[pid 12853] strlen("SPI TupTable") = 12
[pid 12853] strcpy(0x85cd810, "SPI TupTable") = 0x85cd810
[pid 12853] malloc(8192) = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x85cd150) = 0x860e738
[pid 12853] free(0x8600a28) = <void>
[pid 12853] free(0x860da48) = <void>
[pid 12853] free(0x860fa50) = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState") = 13
[pid 12853] strcpy(0x85cd810, "ExecutorState") = 0x85cd810
[pid 12853] malloc(8192) = 0x860da48
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext") = 0x85cd1b0
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext") = 0x84fb0e8
[pid 12853] strncpy(0x860e498, "parent", 64) = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64) = 0x860e500
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb060, "ExprContext") = 0x84fb060
[pid 12853] malloc(8256) = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64) = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64) = 0x860e858
[pid 12853] strlen("SPI TupTable") = 12
[pid 12853] strcpy(0x84fbda0, "SPI TupTable") = 0x84fbda0
[pid 12853] malloc(8192) = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x85cd7b0) = 0x860e738
[pid 12853] free(0x8600a28) = <void>
[pid 12853] free(0x860da48) = <void>
[pid 12853] free(0x860fa50) = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState") = 13
[pid 12853] strcpy(0x84fbda0, "ExecutorState") = 0x84fbda0
[pid 12853] malloc(8192) = 0x860da48
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd810, "ExprContext") = 0x85cd810
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext") = 0x85cd1b0
[pid 12853] strncpy(0x860e498, "parent", 64) = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64) = 0x860e500
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext") = 0x84fb0e8
[pid 12853] malloc(8256) = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64) = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64) = 0x860e858
[pid 12853] strlen("SPI TupTable") = 12
[pid 12853] strcpy(0x84fb060, "SPI TupTable") = 0x84fb060
[pid 12853] malloc(8192) = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fbd40) = 0x860e738
[pid 12853] free(0x8600a28) = <void>
[pid 12853] free(0x860da48) = <void>
[pid 12853] free(0x860fa50) = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState") = 13
[pid 12853] strcpy(0x84fb060, "ExecutorState") = 0x84fb060
[pid 12853] malloc(8192) = 0x860da48
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext") = 0x84fbda0
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd810, "ExprContext") = 0x85cd810
[pid 12853] strncpy(0x860e498, "parent", 64) = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64) = 0x860e500
[pid 12853] strlen("ExprContext") = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext") = 0x85cd1b0
[pid 12853] malloc(8256) = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64) = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64) = 0x860e858
[pid 12853] strlen("SPI TupTable") = 12
[pid 12853] strcpy(0x84fb0e8, "SPI TupTable") = 0x84fb0e8
[pid 12853] malloc(8192) = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fb000) = 0x860e738
[pid 12853] free(0x8600a28) = <void>
[pid 12853] free(0x860da48) = <void>
[pid 12853] free(0x860fa50) = <void>
Jump back to label a:, repeat ad infinitum, chewing up 100% CPU

Below a complete process list of all running backends:

27256 ? Ss 0:04 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/post
27299 ? Ss 0:00 \_ postgres: writer process
27300 ? Ss 0:00 \_ postgres: wal writer process
27301 ? Ss 0:00 \_ postgres: autovacuum launcher process
27302 ? Ss 0:26 \_ postgres: stats collector process
12076 ? Rs 5681:50 \_ postgres: cms cms 10.0.0.5(59125) PARSE
12853 ? Ts 5549:25 \_ postgres: cms cms 10.0.0.5(48437) PARSE
25985 ? Ss 0:00 \_ postgres: cms cms 10.0.0.18(45807) idle
25986 ? Ss 0:00 \_ postgres: cms cms 10.0.0.18(45808) idle
21100 ? Ss 0:00 \_ postgres: cms cms 10.0.0.12(45544) idle
21101 ? Ss 0:00 \_ postgres: cms cms 10.0.0.12(45545) idle
26777 ? Ss 0:00 \_ postgres: cms cms 10.0.0.12(43123) idle
24711 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(34684) idle
24715 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(34687) idle
24716 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(34688) idle
24721 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(34693) idle
8355 ? Ss 0:01 \_ postgres: cms cms 10.0.0.5(59188) idle
13504 ? Ss 0:05 \_ postgres: cms cms 10.0.0.5(35244) idle
17568 ? Ss 0:01 \_ postgres: cms cms 10.0.0.5(50414) idle
17615 ? Ss 0:02 \_ postgres: cms cms 10.0.0.5(42313) idle
17617 ? Ss 0:01 \_ postgres: cms cms 10.0.0.5(42315) idle
20412 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43902) idle
20425 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43905) idle
20427 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43907) idle
20428 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43908) idle
20429 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43909) idle
20430 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43910) idle
20431 ? Ss 0:00 \_ postgres: cms cms 10.0.0.5(43911) idle

--
Sincerely,
Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen R. van den Berg (#1)
Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

"Stephen R. van den Berg" <srb@cuci.nl> writes:

It seems that the backend is stuck in some kind of endless loop. Since
it's a production Debian server, the backend is not compiled with debugging
turned on. The best I can do is ltrace it, in hopes of someone recognising
the infinite sequence.

Well, it seems to be repeatedly executing a SPI call, which suggests but
doesn't prove that it's in a PL function ... but how would it get into
one while in PARSE state?

Anyway there's not much data here. You might try attaching to the
backend process with gdb and trying "bt" and "p debug_query_string";
I think you should get something out of that even without debug symbols.
If not, maybe turning on query logging would be worthwhile.

regards, tom lane

In reply to: Tom Lane (#2)
Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

Tom Lane wrote:

"Stephen R. van den Berg" <srb@cuci.nl> writes:

It seems that the backend is stuck in some kind of endless loop. Since
it's a production Debian server, the backend is not compiled with debugging
turned on. The best I can do is ltrace it, in hopes of someone recognising
the infinite sequence.

Well, it seems to be repeatedly executing a SPI call, which suggests but
doesn't prove that it's in a PL function ... but how would it get into
one while in PARSE state?

Anyway there's not much data here. You might try attaching to the
backend process with gdb and trying "bt" and "p debug_query_string";

Interesting, this is what I get:
(gdb) bt
#0 0xb5b3fd54 in ?? () from /usr/lib/postgresql/8.3/lib/plpgsql.so
#1 0x085e9228 in ?? ()
#2 0x00000048 in ?? ()
#3 0xbfec723b in ?? ()
#4 0x083cdb24 in ?? ()
#5 0xbfec74b4 in ?? ()
#6 0x0840d858 in ?? ()
#7 0xbfec6ff8 in ?? ()
#8 0x082e5102 in RevalidateCachedPlan ()
#9 0xb5b4077c in ?? () from /usr/lib/postgresql/8.3/lib/plpgsql.so
#10 0x00000017 in ?? ()
#11 0x085e9248 in ?? ()
#12 0x085e9240 in ?? ()
#13 0x085e9244 in ?? ()
#14 0xccdf2081 in ?? ()
#15 0x2066f5ee in ?? ()
#16 0x00000000 in ?? ()
(gdb) p debug_query_string
$1 = 138920681
(gdb) p (char*)debug_query_string
$2 = 0x847c2e9 "SELECT csi.sid,csi.name\n FROM cmssite AS csi\n WHERE csi.sid=cmsfindsid('506435')\n LIMIT 1;"

Whereas:

CREATE OR REPLACE FUNCTION
cmsfindsid(INT) RETURNS INT AS '
DECLARE
vfrom ALIAS FOR $1;
vcsid INT;
vold INT;
vsid INT;
BEGIN
vcsid:=vfrom;
LOOP
vold:=vcsid;
SELECT INTO vcsid,vsid parent,sid
FROM cmsstruct
WHERE csid=vcsid
LIMIT 1;
EXIT WHEN vsid IS NOT NULL OR vcsid=vold;
END LOOP;
RETURN vsid;
END;
' LANGUAGE PLPGSQL STABLE STRICT;

Where cmsstruct is a table where csid is a PRIMARY KEY INT column.

And, no, there have not been any function or table redefinitions.
--
Sincerely,
Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen R. van den Berg (#3)
Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

"Stephen R. van den Berg" <srb@cuci.nl> writes:

BEGIN
vcsid:=vfrom;
LOOP
vold:=vcsid;
SELECT INTO vcsid,vsid parent,sid
FROM cmsstruct
WHERE csid=vcsid
LIMIT 1;
EXIT WHEN vsid IS NOT NULL OR vcsid=vold;
END LOOP;
RETURN vsid;
END;

And are you certain that's never an infinite loop? The query string
gives a starting point that one suspects doesn't terminate.

regards, tom lane

In reply to: Tom Lane (#4)
Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

Tom Lane wrote:

"Stephen R. van den Berg" <srb@cuci.nl> writes:

BEGIN
vcsid:=vfrom;
LOOP

EXIT WHEN vsid IS NOT NULL OR vcsid=vold;
END LOOP;
RETURN vsid;
END;

And are you certain that's never an infinite loop? The query string
gives a starting point that one suspects doesn't terminate.

Indeed, you're right. It starts at a deleted item due to a race condition.
Thanks, sorry for the noise.
--
Sincerely,
Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"