BUG #14781: server process was terminated by signal 11: Segmentation fault
The following bug has been logged on the website:
Bug reference: 14781
Logged by: Maksim Karaba
Email address: maksim_karaba@epam.com
PostgreSQL version: 9.6.4
Operating system: CentOS Linux release 7.3.1611 (Core)
Description:
Hi!
We are getting the error "server process was terminated by signal 11:
Segmentation fault", and the server goes into recovery mode on our production
system. It happens several times a day, always on complicated updates using
postgres_fdw. Could you suggest a workaround? The version is PostgreSQL 9.6.4.
Error from /var/log/messages:
bigdatadb kernel: postgres[17467]: segfault at 58 ip 00000000005c2f94 sp 00007ffdc06b2260 error 4 in postgres[400000+616000]
bigdatadb abrt-hook-ccpp: Process 17467 (postgres) of user 26 killed by SIGSEGV - dumping core
And here is the full backtrace from the core dump:
#0  ExecProcNode (node=0x0) at execProcnode.c:380
        result = <optimized out>
        __func__ = "ExecProcNode"
#1  0x00007f534be36554 in postgresRecheckForeignScan (node=<optimized out>, slot=0x6afbf68) at postgres_fdw.c:2059
        scanrelid = <optimized out>
        outerPlan = <optimized out>
        result = <optimized out>
#2  0x00000000005e3b7c in ForeignRecheck (node=node@entry=0x6afba48, slot=slot@entry=0x6afbf68) at nodeForeignscan.c:101
        fdwroutine = 0x6a48b88
        econtext = 0x6afbb58
#3  0x00000000005ca5a6 in ExecScanFetch (recheckMtd=0x5e3b40 <ForeignRecheck>, accessMtd=0x5e3bb0 <ForeignNext>, node=0x6afba48) at execScan.c:85
        slot = <optimized out>
        scanrelid = <optimized out>
        estate = <optimized out>
#4  ExecScan (node=node@entry=0x6afba48, accessMtd=accessMtd@entry=0x5e3bb0 <ForeignNext>, recheckMtd=recheckMtd@entry=0x5e3b40 <ForeignRecheck>) at execScan.c:180
        econtext = 0x6afbb58
        qual = 0x0
        projInfo = 0x6afc0d0
        isDone = ExprSingleResult
        resultSlot = <optimized out>
#5  0x00000000005e3c5f in ExecForeignScan (node=node@entry=0x6afba48) at nodeForeignscan.c:119
No locals.
#6  0x00000000005c36a8 in ExecProcNode (node=node@entry=0x6afba48) at execProcnode.c:465
        result = <optimized out>
        __func__ = "ExecProcNode"
#7  0x00000000005e0889 in ExecSort (node=node@entry=0x6afb7d8) at nodeSort.c:103
        plannode = <optimized out>
        outerNode = 0x6afba48
        tupDesc = <optimized out>
        estate = 0x6a42dc8
        dir = ForwardScanDirection
        tuplesortstate = 0x6b70868
        slot = <optimized out>
#8  0x00000000005c3648 in ExecProcNode (node=node@entry=0x6afb7d8) at execProcnode.c:495
        result = <optimized out>
        __func__ = "ExecProcNode"
#9  0x00000000005e448e in begin_partition (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1082
        outerslot = <optimized out>
        outerPlan = 0x6afb7d8
        numfuncs = 1
        i = <optimized out>
#10 0x00000000005e6a4b in ExecWindowAgg (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1691
        result = <optimized out>
        isDone = ExprSingleResult
        econtext = <optimized out>
        i = <optimized out>
        numfuncs = <optimized out>
        __func__ = "ExecWindowAgg"
#11 0x00000000005c3618 in ExecProcNode (node=0x6a50b28) at execProcnode.c:507
Thank you!
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
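The kernel log line in the report can be decoded by hand, and doing so already points toward a NULL pointer dereference. What follows is only a minimal sketch in C, assuming the usual Linux x86-64 log format ("segfault at <address> ip <ip> ... error <code> in <name>[<start>+<length>]", all values in hex) and the x86 page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode). The struct and function names are made up for illustration and belong neither to PostgreSQL nor to the kernel; the "first page" test is a heuristic, not a proof.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Values copied from the kernel log line in the report:
 * "segfault at 58 ip 00000000005c2f94 ... error 4 in postgres[400000+616000]"
 * FaultInfo is a hypothetical helper type, not a kernel or PostgreSQL struct.
 */
typedef struct FaultInfo
{
    uint64_t fault_addr;   /* address whose access faulted */
    uint64_t ip;           /* instruction pointer at the time of the fault */
    uint64_t map_start;    /* start of the mapping that contains ip */
    uint64_t map_len;      /* length of that mapping */
    unsigned error_code;   /* x86 page-fault error code */
} FaultInfo;

static const FaultInfo reported_fault = {0x58, 0x5c2f94, 0x400000, 0x616000, 4};

/* True if the instruction pointer lies inside the reported mapping. */
static bool ip_in_mapping(const FaultInfo *f)
{
    return f->ip >= f->map_start && f->ip - f->map_start < f->map_len;
}

/* Offset of the faulting instruction from the start of the mapping. */
static uint64_t ip_offset(const FaultInfo *f)
{
    assert(ip_in_mapping(f));
    return f->ip - f->map_start;
}

/* x86 page-fault error code bits. */
static bool page_was_present(const FaultInfo *f) { return (f->error_code & 1u) != 0; }
static bool was_write(const FaultInfo *f)        { return (f->error_code & 2u) != 0; }
static bool from_user_mode(const FaultInfo *f)   { return (f->error_code & 4u) != 0; }

/*
 * Heuristic (an assumption of this sketch): a faulting address inside the
 * first 4 kB usually means a struct field was accessed through a NULL pointer.
 */
static bool looks_like_null_deref(const FaultInfo *f)
{
    return f->fault_addr < 4096;
}

/* Print a short human-readable summary of the decoded fields. */
static void describe_fault(const FaultInfo *f)
{
    printf("ip offset in mapping: 0x%llx\n", (unsigned long long) ip_offset(f));
    printf("access: %s from %s mode, page %s\n",
           was_write(f) ? "write" : "read",
           from_user_mode(f) ? "user" : "kernel",
           page_was_present(f) ? "present" : "not present");
    printf("likely NULL dereference: %s\n",
           looks_like_null_deref(f) ? "yes" : "no");
}
```

Calling describe_fault(&reported_fault) with the reported values places the instruction pointer 0x1c2f94 bytes into the postgres mapping and decodes error code 4 as a user-mode read of a page that is not mapped. The faulting address 0x58 is well inside the first page, which fits frame #0 of the backtrace, where ExecProcNode is called with node=0x0.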
maksim_karaba@epam.com writes:
> We are getting the error "server process was terminated by signal 11:
> Segmentation fault", and the server goes into recovery mode on our production
> system. It happens several times a day, always on complicated updates using
> postgres_fdw.
Please show the failing query or queries. (If you don't have postmaster
logs showing them, "p debug_query_string" in the core files should get
the info.) Also, please show EXPLAIN VERBOSE plans for the query(s),
as well as schema information (psql \d output would do) for the
referenced tables.
https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
regards, tom lane
Thank you for answering so quickly.
(gdb) p debug_query_string
$1 = 0x2c2e748 "DO $$DECLARE\nBEGIN\n PERFORM dwh.load_trade_record_inc();\nEND$$;"
We added logging to the function and found that the error always occurs on the same UPDATE statement (attached).
The EXPLAIN VERBOSE plan is attached as well.
MAKSIM KARABA
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 4:47 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
Maksim Karaba <Maksim_Karaba@epam.com> writes:
> We added logging to the function and found that the error always occurs on the same UPDATE statement (attached).
> The EXPLAIN VERBOSE plan is attached as well.
The point of my questions was that it's going to be difficult for anyone
to fix this unless we can reproduce the problem. What you've provided
doesn't even begin to make that possible. Please see the advice about
providing self-contained test cases at
https://www.postgresql.org/docs/current/static/bug-reporting.html
regards, tom lane
Unfortunately, we cannot reproduce this issue on other servers, only on the production system.
And we cannot provide internal database info, schema structure, or table info.
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 6:10 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
Maksim Karaba <Maksim_Karaba@epam.com> writes:
> Unfortunately, we cannot reproduce this issue on other servers, only on the production system.
> And we cannot provide internal database info, schema structure, or table info.
[ shrug... ] We may just have to wait for somebody to be more
forthcoming.
FWIW, the stack trace seems to indicate that an incorrect plan has been
generated, ie one that has a remote join node without an EPQ recheck
subplan. That mistake in itself is probably pretty deterministic. The
reason you can't reproduce the crash easily is that the lack of a subplan
only manifests as a crash if we enter the EPQ recheck code, and that only
happens if the query tries to update a row that's just been updated by
some concurrent query. So it's not going to crash except under concurrent
load, which probably also explains why the bug wasn't found long ago.
If you want to push this forward rather than wait for somebody else
to hit the problem, you could try adding something like
    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");
near the beginning of postgresBeginForeignScan and then running your app
on a test server. I'm not sure offhand that the estate filters are
exactly right, but any statement that produces this warning would be
pretty suspect. At that point you could work on sanitizing the query +
tables + test data to get to a publishable test case; you could probably
boil your real query down quite a bit and still get the failure.
regards, tom lane
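The condition proposed above can be tried out in isolation, away from a PostgreSQL build. Below is a minimal sketch in C with stand-in types: ForeignScanSketch, PlanStateSketch, CmdTypeSketch and all function names are hypothetical and are not PostgreSQL's definitions. It models only the four facts the check looks at, plus a guarded version of the recheck step that the backtrace shows calling ExecProcNode with node=0x0. As in the original suggestion, whether the command-type and row-mark filters are exactly right remains an open question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the statement type; NOT PostgreSQL's CmdType. */
typedef enum { SKETCH_SELECT, SKETCH_UPDATE, SKETCH_DELETE } CmdTypeSketch;

/* Stand-in for an executor node; only its presence or absence matters here. */
typedef struct PlanStateSketch
{
    const char *label;
} PlanStateSketch;

/* The facts the proposed check inspects, gathered into one hypothetical struct. */
typedef struct ForeignScanSketch
{
    unsigned         scanrelid;      /* 0: the node stands for a remote join */
    PlanStateSketch *outer_plan;     /* local subplan for EPQ rechecks, or NULL */
    CmdTypeSketch    command_type;   /* type of the enclosing statement */
    bool             has_row_marks;  /* e.g. SELECT ... FOR UPDATE */
} ForeignScanSketch;

/* Mirrors the suggested condition: a remote join that may need an EPQ
 * recheck but has no local subplan to run it with. */
static bool lacks_epq_support(const ForeignScanSketch *fs)
{
    return fs->scanrelid == 0 && fs->outer_plan == NULL &&
           (fs->command_type != SKETCH_SELECT || fs->has_row_marks);
}

/* Emits the warning at "begin scan" time instead of crashing later. */
static bool warn_if_suspect(const ForeignScanSketch *fs)
{
    if (lacks_epq_support(fs))
    {
        fprintf(stderr, "WARNING: foreign join plan lacks EPQ support\n");
        return true;
    }
    return false;
}

typedef enum { RECHECK_OK, RECHECK_NO_SUBPLAN } RecheckResult;

/* Guarded recheck: reports the missing subplan rather than running a NULL
 * node, which is what frame #0 of the backtrace shows happening. */
static RecheckResult recheck_sketch(const ForeignScanSketch *fs)
{
    if (fs->scanrelid > 0)
        return RECHECK_OK;            /* plain foreign table scan, no subplan needed here */
    if (fs->outer_plan == NULL)
        return RECHECK_NO_SUBPLAN;    /* the case that would dereference NULL */
    return RECHECK_OK;                /* a real executor would run the subplan here */
}
```

In this model the warning fires at scan start for every suspect plan, whereas the unguarded recheck only goes wrong when it is actually entered. That matches the explanation above of why the crash needs a concurrent update of the same row, while the faulty plan itself should be detectable deterministically.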
Tom Lane wrote:
> Maksim Karaba <Maksim_Karaba@epam.com> writes:
>> Unfortunately, we cannot reproduce this issue on other servers, only on the production system.
>> And we cannot provide internal database info, schema structure, or table info.
>
> [ shrug... ] We may just have to wait for somebody to be more
> forthcoming.
>
> FWIW, the stack trace seems to indicate that an incorrect plan has been
> generated, ie one that has a remote join node without an EPQ recheck
> subplan. That mistake in itself is probably pretty deterministic. The
> reason you can't reproduce the crash easily is that the lack of a subplan
> only manifests as a crash if we enter the EPQ recheck code, and that only
> happens if the query tries to update a row that's just been updated by
> some concurrent query. So it's not going to crash except under concurrent
> load, which probably also explains why the bug wasn't found long ago.
One way to figure out the exact bug is to explore the sequence of WAL
records that leads to the tuple causing the crash; it should be possible
to create a reproducer by writing an isolationtester script that
produces the same WAL sequence. That's how we found the bug fixed in
https://git.postgresql.org/pg/commitdiff/459c64d3227f8 for example.
> If you want to push this forward rather than wait for somebody else
> to hit the problem, you could try adding something like
>
>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>          estate->es_rowMarks))
>         elog(WARNING, "foreign join plan lacks EPQ support");
>
> near the beginning of postgresBeginForeignScan and then running your app
> on a test server.
Hmm, is there a reason this cannot be included as a sanity check always?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> If you want to push this forward rather than wait for somebody else
>> to hit the problem, you could try adding something like
>>
>>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>>          estate->es_rowMarks))
>>         elog(WARNING, "foreign join plan lacks EPQ support");
>>
>> near the beginning of postgresBeginForeignScan and then running your app
>> on a test server.
>
> Hmm, is there a reason this cannot be included as a sanity check always?
That's off-the-cuff rather than something I'm sure is correct. But
yeah, I was wondering about pushing something like that into the
standard code.
regards, tom lane
Thanks for pointing to the possible root cause.
Our dev team has finally figured out the reason and fixed it.
The problem was the use of postgres_fdw-based cursor loops like this:
for scr in (select f1, f2 ... from foreign table (postgres_fdw) ) loop
    for scr1 in (select .. from foreign table (postgres_fdw) where ... = scr.f1) loop
        -- complicated update of a local table using foreign tables as the source
        -- update of one record in the foreign table
    end loop;
end loop;
The dev team has refactored it to loop over arrays instead of over FDW queries, to reduce the duration of the foreign session.
MAKSIM KARABA
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 7:57 PM
To: Alvaro Herrera <alvherre@2ndquadrant.com>
Cc: Maksim Karaba <Maksim_Karaba@epam.com>; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault