PostgreSQL backend process crashed with signal 9

Started by Aleksey Romanov · about 10 years ago · 8 messages · bugs
#1 Aleksey Romanov
drednout.by@gmail.com

When a plpythonu stored procedure is called while the test table is locked
by another session, I get a reproducible segfault of the PostgreSQL backend
process.

Steps to reproduce:
0) initialize database schema and data:
$ psql -h localhost -p 5434 -U test < init_schema.sql
$ cat init_schema.sql
DROP TABLE IF EXISTS test;
CREATE TABLE IF NOT EXISTS test
(
id BIGSERIAL PRIMARY KEY,
amount BIGINT,
counter BIGINT DEFAULT 0
);
CREATE INDEX ON test (counter) where counter > 0;
INSERT INTO test (amount) VALUES (1);
INSERT INTO test (amount) VALUES (2);
INSERT INTO test (amount) VALUES (3);
INSERT INTO test (amount) VALUES (4);
INSERT INTO test (amount) VALUES (5);

CREATE OR REPLACE FUNCTION test_plpy_postgres_crash(record_id INTEGER)
RETURNS boolean
AS $$
select_query = """
    SELECT * FROM test WHERE id = $1
"""
if "select_plan" in GD:
    select_plan = GD["select_plan"]
else:
    select_plan = plpy.prepare(select_query, ["integer"])
    GD["select_plan"] = select_plan

select_cursor = plpy.cursor(select_plan, [record_id])
return True
$$ LANGUAGE plpythonu;

1) open two psql client sessions to PostgreSQL 9.5.2:
$ psql -h localhost -p 5434 -U test
$ psql -h localhost -p 5434 -U test

2) enter the following statement in the one of client sessions:
test=# select pg_sleep(1) ; begin; lock test ; select sum(counter) from test ; select pg_sleep(60);
pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
sum
-----
0
(1 row)

3) after the table lock has been taken successfully, enter the following
statement in the second client session:
test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);

4) wait several seconds and press CTRL+C in the second client session
5) the PostgreSQL backend process crash occurs:
test=# select pg_sleep(1) ; begin; lock test ; select sum(counter) from test ; select pg_sleep(60);
pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
sum
-----
0
(1 row)

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.

test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);
^CCancel request sent
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.

Dmesg log:
$ dmesg | tail
[1904728.889471] postgres[21138]: segfault at fffffffffffffff0 ip 00007f25462c7724 sp 00007fff96c155a0 error 5 in postgres[7f2545eb9000+5f1000]
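
(A backtrace would pin down the faulting code better than the dmesg line.
A hypothetical way to capture one; the PID is illustrative and gdb with
debug symbols is assumed to be available:)

$ psql -h localhost -p 5434 -U test -c 'SELECT pg_backend_pid();'   # note the backend PID
$ gdb -p <backend_pid>    # attach to that backend from another shell
(gdb) continue
# ... reproduce steps 2-4 in the psql sessions; gdb stops on SIGSEGV ...
(gdb) backtrace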

PostgreSQL version:
test=# select version();
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)

System info:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux aromanov 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

--
Best regards,
Aleksey Romanov
Web Developer, Wargaming.net

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aleksey Romanov (#1)
Re: PostgreSQL backend process crashed with signal 9

Aleksey Romanov <drednout.by@gmail.com> writes:

When a plpythonu stored procedure is called while the test table is locked
by another session, I get a reproducible segfault of the PostgreSQL backend
process.

Hm, I can't reproduce that in either HEAD or 9.5. I get

^CCancel request sent
ERROR: spiexceptions.QueryCanceled: canceling statement due to user request
CONTEXT: Traceback (most recent call last):
PL/Python function "test_plpy_postgres_crash", line 8, in <module>
select_plan = plpy.prepare(select_query, ["integer"])
PL/Python function "test_plpy_postgres_crash"

regards, tom lane


#3 Pavel Suderevsky
psuderevsky@gmail.com
In reply to: Tom Lane (#2)
Re: PostgreSQL backend process crashed with signal 9

I've faced similar behaviour in a much simpler situation: it crashed on a
simple nested loop, and all other child postmaster processes were
terminated likewise.

Reproduced on:
1. 9.5.1 + Ubuntu 14.04.4 LTS
2. 9.5.1 + CentOS 7.2.1511
3. 9.5.2 + CentOS 7.2.1511

DDL:

platform_db=# \d node_statuses
                  Table "platform_db.node_statuses"
        Column         |          Type           |            Modifiers
-----------------------+-------------------------+---------------------------------
 node_id               | integer                 | not null
 status_id             | integer                 | not null
 msg                   | character varying(1024) | default NULL::character varying
 node_status_timestamp | bigint                  | not null
Indexes:
    "NODE_STATUSES_NODE_ID_FK" btree (node_id)
    "NODE_STATUSES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_status_timestamp)
    "NODE_STATUSES_TIMESTAMP_IDX" btree (node_status_timestamp)

platform_db=# \d node_status_values
    Table "platform_db.node_status_values"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 node_id              | integer | not null
 status_id            | integer | not null
 node_value_timestamp | bigint  | not null
Indexes:
    "NODE_STATUS_VALUES_NODE_ID_FK" btree (node_id)
    "NODE_STATUS_VALUES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_value_timestamp)

SQL:

SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;

Postgres log:

2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by signal 9: Killed
2016-04-05 18:37:31 MSK DETAIL: Failed process was running: COMMIT
2016-04-05 18:37:32 MSK LOG: terminating any other active server processes
2016-04-05 18:37:32 MSK WARNING: terminating connection because of crash of another server process
2016-04-05 18:37:32 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-04-05 18:37:32 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % WARNING: terminating connection because of crash of another server process
2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % HINT: In a moment you should be able to reconnect to the database and repeat your command.
2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % WARNING: terminating connection because of crash of another server process
2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % HINT: In a moment you should be able to reconnect to the database and repeat your command.
2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % WARNING: terminating connection because of crash of another server process
2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-04-05 18:37:37 MSK LOG: statistics collector process (PID 1012) was terminated by signal 9: Killed
2016-04-05 18:37:37 MSK LOG: all server processes terminated; reinitializing
2016-04-05 18:37:39 MSK LOG: database system was interrupted; last known up at 2016-04-05 18:31:44 MSK
2016-04-05 18:37:39 MSK ::1 zabbix@zabbix [unknown] % FATAL: the database system is in recovery mode
2016-04-05 18:37:41 MSK LOG: database system was not properly shut down; automatic recovery in progress
2016-04-05 18:37:41 MSK LOG: redo starts at 298/AB146828
2016-04-05 18:37:41 MSK ::1 zabbix@zabbix [unknown] % FATAL: the database system is in recovery mode
2016-04-05 18:37:44 MSK [local] postgres@platform_db [unknown] % FATAL: the database system is in recovery mode
2016-04-05 18:37:46 MSK LOG: invalid record length at 298/C07ACA78
2016-04-05 18:37:46 MSK LOG: redo done at 298/C07ACA50
2016-04-05 18:37:46 MSK LOG: last completed transaction was at log time 2016-04-05 18:37:28.455638+03
2016-04-05 18:37:48 MSK ::1 zabbix@zabbix [unknown] % FATAL: the database system is in recovery mode
2016-04-05 18:38:08 MSK LOG: MultiXact member wraparound protections are now enabled
2016-04-05 18:38:08 MSK LOG: autovacuum launcher started
2016-04-05 18:38:08 MSK LOG: database system is ready to accept connections

On CentOS, just:

2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % LOG: could not send data to client: Broken pipe
2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % STATEMENT: SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % FATAL: connection to client lost

System log [ubuntu]:

Apr 5 18:37:30 db_server kernel: [3905852.822122] postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Apr 5 18:37:30 db_server kernel: [3905852.822131] postgres cpuset=/ mems_allowed=0-1
Apr 5 18:37:30 db_server kernel: [3905852.822136] CPU: 3 PID: 9191 Comm: postgres Not tainted 3.13.0-77-generic #121-Ubuntu
Apr 5 18:37:30 db_server kernel: [3905852.822139] Hardware name: Supermicro H8DMR-82/H8DMR-82, BIOS 080014 12/15/2006
...
Apr 5 18:37:31 db_server kernel: [3905854.860083] Out of memory: Kill process 2847 (postgres) score 391 or sacrifice child
Apr 5 18:37:31 db_server kernel: [3905854.860201] Killed process 2847 (postgres) total-vm:6613344kB, anon-rss:34224kB, file-rss:6387036kB

Counts:

platform_db=# SELECT count(*) FROM node_statuses ns where node_id = 1147;
 count
-------
  6819
(1 row)

platform_db=# SELECT count(*) FROM node_status_values nsv where node_id = 1147;
 count
-------
  6814
(1 row)

platform_db=# SELECT count(*) FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
  count
----------
 46423776
(1 row)

Query plan:

Nested Loop  (cost=477.14..570040.76 rows=45261752 width=53) (actual time=3.223..252614.947 rows=46382904 loops=1)
  ->  Bitmap Heap Scan on node_statuses ns  (cost=241.60..2336.38 rows=6862 width=37) (actual time=1.724..21.248 rows=6813 loops=1)
        Recheck Cond: (node_id = 1147)
        Heap Blocks: exact=1397
        ->  Bitmap Index Scan on "NODE_STATUSES_NODE_ID_FK"  (cost=0.00..239.89 rows=6862 width=0) (actual time=1.472..1.472 rows=6813 loops=1)
              Index Cond: (node_id = 1147)
  ->  Materialize  (cost=235.54..1948.97 rows=6596 width=16) (actual time=0.002..12.092 rows=6808 loops=6813)
        ->  Bitmap Heap Scan on node_status_values nsv  (cost=235.54..1915.99 rows=6596 width=16) (actual time=1.471..17.282 rows=6808 loops=1)
              Recheck Cond: (node_id = 1147)
              Heap Blocks: exact=1357
              ->  Bitmap Index Scan on "NODE_STATUS_VALUES_NODE_ID_FK"  (cost=0.00..233.89 rows=6596 width=0) (actual time=1.238..1.238 rows=6808 loops=1)
                    Index Cond: (node_id = 1147)
Planning time: 0.210 ms
Execution time: 334897.948 ms
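
(For what it's worth, the plan already shows where the 46M rows come from:
the join condition only fixes node_id, so the nested loop pairs every outer
row with every inner row for node 1147, i.e. 6813 x 6808 = 46,382,904,
which is exactly the nested loop's actual row count. A quick sanity check;
the session below is purely illustrative:)

platform_db=# SELECT 6813 * 6808 AS expected_join_rows;
 expected_join_rows
--------------------
           46382904
(1 row)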

Memory on the host:

             total       used       free     shared    buffers     cached
Mem:           15G        15G       453M       2.7G       195M        12G
-/+ buffers/cache:        2.2G        13G
Swap:           0B         0B         0B

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Suderevsky (#3)
Re: PostgreSQL backend process crashed with signal 9

Pavel Suderevsky <psuderevsky@gmail.com> writes:

I've faced similar behaviour in a much simpler situation: it crashed on a
simple nested loop, and all other child postmaster processes were
terminated likewise.

2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by signal 9: Killed

Well, signal 9 is not an internal-to-Postgres error; that's something
outside the database killing the process. If you're on a Linux machine
it's most likely the dreaded OOM killer doing it. The usual
recommendation is to configure your system to reduce or eliminate
memory overcommit:

http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
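
A minimal sketch of that tuning, with illustrative values rather than
one-size-fits-all recommendations:

# Disable heuristic overcommit so runaway allocations fail with ENOMEM
# instead of summoning the OOM killer:
$ sudo sysctl -w vm.overcommit_memory=2
$ sudo sysctl -w vm.overcommit_ratio=80   # commit limit = swap + 80% of RAM
# Persist across reboots:
$ echo 'vm.overcommit_memory = 2' | sudo tee -a /etc/sysctl.conf
$ echo 'vm.overcommit_ratio = 80' | sudo tee -a /etc/sysctl.conf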

The OP's log did not indicate a signal 9, though --- that was SIGSEGV
(sig 11, typically).

regards, tom lane


#5 Pavel Suderevsky
psuderevsky@gmail.com
In reply to: Tom Lane (#4)
Re: PostgreSQL backend process crashed with signal 9

Yes, the OOM killer did the job, but is it normal for such a lightweight
query to consume so much memory that the OOM killer gets invoked?

[dbtest3] mars_gedik=# select pg_size_pretty(pg_relation_size('node_statuses'));
 pg_size_pretty
----------------
 11 MB
(1 row)

[dbtest3] mars_gedik=# select pg_size_pretty(pg_relation_size('node_status_values'));
 pg_size_pretty
----------------
 11 MB
(1 row)


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Suderevsky (#5)
Re: PostgreSQL backend process crashed with signal 9

Pavel Suderevsky <psuderevsky@gmail.com> writes:

Yes, the OOM killer did the job, but is it normal for such a lightweight
query to consume so much memory that the OOM killer gets invoked?

[ shrug... ] The OOM killer is widely considered broken. Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.
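
The same docs page also describes exempting just the postmaster while
leaving its children killable; a sketch for an init-script context (the
paths and data directory are illustrative):

# Run as root in the script that launches the postmaster:
echo -1000 > /proc/self/oom_score_adj          # protect the postmaster
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0                   # backends stay killable
su postgres -c 'pg_ctl start -D /var/lib/postgresql/9.5/main'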

regards, tom lane


#7 Pavel Suderevsky
psuderevsky@gmail.com
In reply to: Tom Lane (#6)
Re: PostgreSQL backend process crashed with signal 9

[ shrug... ] The OOM killer is widely considered broken. Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.

Actually the issue is not the OOM killer invocation but the amount of
memory that PostgreSQL consumes for a not-very-heavy operation.

Observing memory:

[root@dbtest3 ~]# for i in `seq 25`; do free -m && sleep 1 ; done
              total        used        free      shared  buff/cache   available
Mem:            993          64         831           3          97         811
Swap:          1231         181        1050

Starting SQL:

[dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;

Total memory consumed by the process, including shared memory (according
to the pmap utility):

[root@dbtest3 ~]# for i in `seq 25`; do pmap 16227 | tail -1 && sleep 1; done
total 189172K
total 220264K
total 438276K
total 657148K
total 861352K
total 1137396K
total 1320612K
total 1564020K
total 1809472K
total 2038492K

Observing memory once again:

              total        used        free      shared  buff/cache   available
Mem:            993         881          62           2          49          18
Swap:          1231        1113         118

And the OOM killer kicks in.

If I set vm.overcommit_memory=2, then I get:

[dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
out of memory for query result

But the process stays alive and the OOM killer hasn't been invoked.

[root@dbtest3 ~]# for i in `seq 25`; do pmap 16479 | tail -1 && sleep 1; done
total 189172K
total 205776K
total 419796K
total 644212K
total 860560K
total 1110732K
total 1320480K
total 484948K
total 400404K
total 313236K
total 189172K
total 189172K
total 189172K
total 189172K
total 189172K
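
(If the full 46M-row result is ever genuinely needed, fetching it through
a cursor keeps either end from materializing the whole thing at once; a
minimal sketch against the tables above:)

$ psql -d platform_db <<'SQL'
BEGIN;
-- Stream the join in batches instead of building one huge result set
DECLARE big_cur CURSOR FOR
    SELECT * FROM node_statuses ns
    INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id
    WHERE ns.node_id = 1147;
FETCH FORWARD 10000 FROM big_cur;   -- repeat until no rows come back
CLOSE big_cur;
COMMIT;
SQL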


#8 daveg
daveg@sonic.net
In reply to: Pavel Suderevsky (#7)
Re: PostgreSQL backend process crashed with signal 9

On Wed, 6 Apr 2016 21:22:23 +0300
Pavel Suderevsky <psuderevsky@gmail.com> wrote:

[ shrug... ] The OOM killer is widely considered broken. Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.

Actually the issue is not the OOM killer invocation but the amount of
memory that PostgreSQL consumes for a not-very-heavy operation.

I suggest you open a new bug for this. The original report was for a SEGV,
which should never happen. This report is for excessive memory growth leading
to OOM. Both cause a crash, but they are different issues and tying them
together may delay solving either of them.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
