BUG #14781: server process was terminated by signal 11: Segmentation fault

Started by Maksim Karaba | over 8 years ago | 9 messages | bugs

Maksim_Karaba@epam.com

over 8 years ago

The following bug has been logged on the website:

Bug reference: 14781
Logged by: Maksim Karaba
Email address: maksim_karaba@epam.com
PostgreSQL version: 9.6.4
Operating system: CentOS Linux release 7.3.1611 (Core)
Description:

Hi!
We are getting error "server process was terminated by signal 11:
Segmentation fault" and server goes to recovery mode on production system.
It repeats several times a day, always on complicated updates using
postgres_fdw. Could you provide any workaround? We are running PostgreSQL 9.6.4.
Error from /var/log/messages:
bigdatadb kernel: postgres[17467]: segfault at 58 ip 00000000005c2f94 sp 00007ffdc06b2260 error 4 in postgres[400000+616000]
bigdatadb abrt-hook-ccpp: Process 17467 (postgres) of user 26 killed by SIGSEGV - dumping core
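
The kernel line above can be read mechanically. The following sketch is an illustration only, not something posted in the thread: it assumes the usual Linux x86-64 layout of the message ("segfault at ADDR ip IP sp SP error CODE in NAME[BASE+SIZE]") and the standard x86 page-fault error-code bits (bit 0 = protection violation, bit 1 = write access, bit 2 = user mode). The struct and function names are made up for this example; the numbers are copied from the log line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of one kernel "segfault at ..." line (hypothetical helper type). */
typedef struct SegfaultInfo {
    uint64_t fault_addr;  /* address whose access faulted ("at 58") */
    uint64_t ip;          /* instruction pointer at the time of the fault */
    uint64_t error_code;  /* x86 page-fault error code ("error 4") */
    uint64_t map_base;    /* start of the mapping ("postgres[400000+...") */
    uint64_t map_size;    /* length of the mapping ("...+616000]") */
} SegfaultInfo;

/* Values taken verbatim from the log line in the report. */
SegfaultInfo reported_fault(void)
{
    SegfaultInfo s = {
        .fault_addr = 0x58,
        .ip = 0x5c2f94,
        .error_code = 4,
        .map_base = 0x400000,
        .map_size = 0x616000,
    };
    return s;
}

/* True if the faulting instruction lies inside the named mapping. */
bool ip_in_mapping(const SegfaultInfo *s)
{
    return s->ip >= s->map_base && s->ip < s->map_base + s->map_size;
}

/* Offset of the faulting instruction from the start of the mapping. */
uint64_t ip_offset(const SegfaultInfo *s)
{
    assert(s->ip >= s->map_base);
    return s->ip - s->map_base;
}

/* Bit 0 set: protection violation; clear: the page was not present. */
bool fault_was_protection(const SegfaultInfo *s)
{
    return (s->error_code & 0x1) != 0;
}

/* Bit 1 set: the access was a write; clear: it was a read. */
bool fault_was_write(const SegfaultInfo *s)
{
    return (s->error_code & 0x2) != 0;
}

/* Bit 2 set: the fault happened in user mode. */
bool fault_from_user_mode(const SegfaultInfo *s)
{
    return (s->error_code & 0x4) != 0;
}

/* Heuristic only: an address inside the first page usually means
 * "NULL pointer plus a small field offset". */
bool looks_like_null_deref(const SegfaultInfo *s)
{
    return s->fault_addr < 4096;
}
```

With the reported values, the faulting instruction sits at offset 0x1c2f94 inside the postgres binary, and the fault is a user-mode read of an unmapped address just above zero. That is what one would expect from reading a struct field through a NULL pointer, and it agrees with node=0x0 in frame #0 of the backtrace below.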

And here is the backtrace from the core dump:

#0  ExecProcNode (node=0x0) at execProcnode.c:380
        result = <optimized out>
        __func__ = "ExecProcNode"
#1  0x00007f534be36554 in postgresRecheckForeignScan (node=<optimized out>, slot=0x6afbf68) at postgres_fdw.c:2059
        scanrelid = <optimized out>
        outerPlan = <optimized out>
        result = <optimized out>
#2  0x00000000005e3b7c in ForeignRecheck (node=node@entry=0x6afba48, slot=slot@entry=0x6afbf68) at nodeForeignscan.c:101
        fdwroutine = 0x6a48b88
        econtext = 0x6afbb58
#3  0x00000000005ca5a6 in ExecScanFetch (recheckMtd=0x5e3b40 <ForeignRecheck>, accessMtd=0x5e3bb0 <ForeignNext>, node=0x6afba48) at execScan.c:85
        slot = <optimized out>
        scanrelid = <optimized out>
        estate = <optimized out>
#4  ExecScan (node=node@entry=0x6afba48, accessMtd=accessMtd@entry=0x5e3bb0 <ForeignNext>, recheckMtd=recheckMtd@entry=0x5e3b40 <ForeignRecheck>) at execScan.c:180
        econtext = 0x6afbb58
        qual = 0x0
        projInfo = 0x6afc0d0
        isDone = ExprSingleResult
        resultSlot = <optimized out>
#5  0x00000000005e3c5f in ExecForeignScan (node=node@entry=0x6afba48) at nodeForeignscan.c:119
No locals.
#6  0x00000000005c36a8 in ExecProcNode (node=node@entry=0x6afba48) at execProcnode.c:465
        result = <optimized out>
        __func__ = "ExecProcNode"
#7  0x00000000005e0889 in ExecSort (node=node@entry=0x6afb7d8) at nodeSort.c:103
        plannode = <optimized out>
        outerNode = 0x6afba48
        tupDesc = <optimized out>
        estate = 0x6a42dc8
        dir = ForwardScanDirection
        tuplesortstate = 0x6b70868
        slot = <optimized out>
#8  0x00000000005c3648 in ExecProcNode (node=node@entry=0x6afb7d8) at execProcnode.c:495
        result = <optimized out>
        __func__ = "ExecProcNode"
#9  0x00000000005e448e in begin_partition (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1082
        outerslot = <optimized out>
        outerPlan = 0x6afb7d8
        numfuncs = 1
        i = <optimized out>
#10 0x00000000005e6a4b in ExecWindowAgg (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1691
        result = <optimized out>
        isDone = ExprSingleResult
        econtext = <optimized out>
        i = <optimized out>
        numfuncs = <optimized out>
        __func__ = "ExecWindowAgg"
#11 0x00000000005c3618 in ExecProcNode (node=0x6a50b28) at execProcnode.c:507

Thank you!

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Maksim Karaba (#1)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

maksim_karaba@epam.com writes:

We are getting error "server process was terminated by signal 11:
Segmentation fault" and server goes to recovery mode on production system.
It repeats several times a day, always on complicated updates using
postgres_fdw.

Please show the failing query or queries. (If you don't have postmaster
logs showing them, "p debug_query_string" in the core files should get
the info.) Also, please show EXPLAIN VERBOSE plans for the query(s),
as well as schema information (psql \d output would do) for the
referenced tables.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane


Maksim Karaba

Maksim_Karaba@epam.com

over 8 years ago

In reply to: Tom Lane (#2)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Thank you for answering so quickly.

(gdb) p debug_query_string
$1 = 0x2c2e748 "DO $$DECLARE\nBEGIN\n PERFORM dwh.load_trade_record_inc();\nEND$$;"

We added logging to the function and found out that the error always occurs on the same update
(in attachments)
Plan explain verbose in attachments

MAKSIM KARABA
Senior Systems Engineer, EPAM
Office: +375 17 389 0100 x 53194 Cell: +375296772871 Email: maksim_karaba@epam.com
Minsk, Belarus (GMT+3) epam.com

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 4:47 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault


Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Maksim Karaba (#3)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Maksim Karaba <Maksim_Karaba@epam.com> writes:

We added logging to the function and found out that the error always occurs on the same update
(in attachments)
Plan explain verbose in attachments

The point of my questions was that it's going to be difficult for anyone
to fix this unless we can reproduce the problem. What you've provided
doesn't even begin to make that possible. Please see the advice about
providing self-contained test cases at
https://www.postgresql.org/docs/current/static/bug-reporting.html

regards, tom lane


Maksim Karaba

Maksim_Karaba@epam.com

over 8 years ago

In reply to: Tom Lane (#4)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Unfortunately we cannot reproduce this issue on other servers, only on production system.
And we cannot provide internal database info, schema structure and tables info.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 6:10 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault



Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Maksim Karaba (#5)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Maksim Karaba <Maksim_Karaba@epam.com> writes:

Unfortunately we cannot reproduce this issue on other servers, only on production system.
And we cannot provide internal database info, schema structure and tables info.

[ shrug... ] We may just have to wait for somebody to be more
forthcoming.

FWIW, the stack trace seems to indicate that an incorrect plan has been
generated, ie one that has a remote join node without an EPQ recheck
subplan. That mistake in itself is probably pretty deterministic. The
reason you can't reproduce the crash easily is that the lack of a subplan
only manifests as a crash if we enter the EPQ recheck code, and that only
happens if the query tries to update a row that's just been updated by
some concurrent query. So it's not going to crash except under concurrent
load, which probably also explains why the bug wasn't found long ago.

If you want to push this forward rather than wait for somebody else
to hit the problem, you could try adding something like

    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");

near the beginning of postgresBeginForeignScan and then running your app
on a test server. I'm not sure offhand that the estate filters are
exactly right, but any statement that produces this warning would be
pretty suspect. At that point you could work on sanitizing the query +
tables + test data to get to a publishable test case; you could probably
boil your real query down quite a bit and still get the failure.
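
To make the failure mode concrete, here is a small self-contained C model of the condition above. It is only a sketch under stated assumptions: the Toy* types and toy_* functions are hypothetical stand-ins invented for this illustration, not PostgreSQL's executor structures or API, and the recheck is reduced to "pull one row from the local subplan". It shows the same predicate as the suggested warning, plus a recheck that reports a missing subplan instead of dereferencing a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT the real PostgreSQL definitions. */
typedef enum ToyCmdType { TOY_CMD_SELECT, TOY_CMD_UPDATE } ToyCmdType;

typedef struct ToyPlanState {
    int rows_left;              /* rows the local subplan can still return */
} ToyPlanState;

typedef struct ToyForeignScan {
    unsigned scanrelid;         /* 0 models a pushed-down (remote) join */
    ToyPlanState *outer_plan;   /* local subplan for the recheck; may be NULL */
    ToyCmdType command;         /* kind of statement being executed */
    bool has_row_marks;         /* models SELECT ... FOR UPDATE/SHARE */
} ToyForeignScan;

typedef enum ToyRecheck {
    TOY_RECHECK_PASS,
    TOY_RECHECK_FAIL,
    TOY_RECHECK_NO_SUBPLAN
} ToyRecheck;

/* Same shape as the condition proposed in the thread: a remote join
 * with no local subplan, in a statement that may need a recheck. */
bool toy_lacks_epq_support(const ToyForeignScan *fs)
{
    assert(fs != NULL);
    return fs->scanrelid == 0 && fs->outer_plan == NULL &&
           (fs->command != TOY_CMD_SELECT || fs->has_row_marks);
}

/* Defensive recheck: the backtrace shows the subplan being executed
 * with node=0x0; this model returns a distinct status instead. */
ToyRecheck toy_recheck(ToyForeignScan *fs)
{
    assert(fs != NULL);
    if (fs->outer_plan == NULL)
        return TOY_RECHECK_NO_SUBPLAN;
    if (fs->outer_plan->rows_left > 0) {
        fs->outer_plan->rows_left--;
        return TOY_RECHECK_PASS;
    }
    return TOY_RECHECK_FAIL;
}
```

In this model the bad plan is detectable as soon as the scan is set up, whereas the crash itself needs the recheck path to run, which matches the observation that the failure only shows up under concurrent updates.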

regards, tom lane


Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Tom Lane (#6)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Tom Lane wrote:

Maksim Karaba <Maksim_Karaba@epam.com> writes:

Unfortunately we cannot reproduce this issue on other servers, only on production system.
And we cannot provide internal database info, schema structure and tables info.

[ shrug... ] We may just have to wait for somebody to be more
forthcoming.

FWIW, the stack trace seems to indicate that an incorrect plan has been
generated, ie one that has a remote join node without an EPQ recheck
subplan. That mistake in itself is probably pretty deterministic. The
reason you can't reproduce the crash easily is that the lack of a subplan
only manifests as a crash if we enter the EPQ recheck code, and that only
happens if the query tries to update a row that's just been updated by
some concurrent query. So it's not going to crash except under concurrent
load, which probably also explains why the bug wasn't found long ago.

One way to figure out the exact bug is to explore the sequence of WAL
records that leads to the tuple causing the crash; it should be possible
to create a reproducer by writing an isolationtester script that
produces the same WAL sequence. That's how we found the bug fixed in
https://git.postgresql.org/pg/commitdiff/459c64d3227f8 for example.

If you want to push this forward rather than wait for somebody else
to hit the problem, you could try adding something like

    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");

near the beginning of postgresBeginForeignScan and then running your app
on a test server.

Hmm, is there a reason this cannot be included as a sanity check always?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Alvaro Herrera (#7)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

If you want to push this forward rather than wait for somebody else
to hit the problem, you could try adding something like

    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");

near the beginning of postgresBeginForeignScan and then running your app
on a test server.

Hmm, is there a reason this cannot be included as a sanity check always?

That's off-the-cuff rather than something I'm sure is correct. But
yeah, I was wondering about pushing something like that into the
standard code.

regards, tom lane


Maksim Karaba

Maksim_Karaba@epam.com

over 8 years ago

In reply to: Tom Lane (#8)

Re: BUG #14781: server process was terminated by signal 11: Segmentation fault

Thanks for pointing to the possible root cause.
Our dev team has finally figured out the reason and fixed it.
The cause was the use of postgres_fdw-based cursors like

for scr in (select f1, f2 ... from foreign table (postgres_fdw) ) loop

    for scr1 in (select .. from foreign table (postgres_fdw) where ... = scr.f1) loop

        Complicated update of a local table using foreign tables as the source

        update of one record in a foreign table /**/
    end loop;
end loop;

The dev team has refactored it to use loops based on arrays instead of FDW cursors, to reduce the duration of the foreign session.

MAKSIM KARABA

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 7:57 PM
To: Alvaro Herrera <alvherre@2ndquadrant.com>
Cc: Maksim Karaba <Maksim_Karaba@epam.com>; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault


