BUG #14420: Parallel worker segfault
The following bug has been logged on the website:
Bug reference: 14420
Logged by: Rick Otten
Email address: rotten@windfish.net
PostgreSQL version: 9.6.1
Operating system: Ubuntu 16.04
Description:
This just happened on my production 9.6.1 system:
2016-11-11 21:31:26.215 UTC LOG: worker process: parallel worker for PID
1737 (PID 6792) was terminated by signal 11: Segmentation fault
2016-11-11 21:31:26.215 UTC LOG: terminating any other active server
processes
I had a lot of stuff running concurrently, so I'm not sure which query
caused the failure. There is nothing else obvious or illuminating in the
logs that I can see. After it recovered, everything went back to normal.
I figured I should report this even if I don't have much to go on yet.
Is there a core file somewhere that would be of help? (I'm not seeing
one.)
Should I turn up debug or enable something to catch more information if it
happens again?
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.3.1-14ubuntu2) 5.3.1 20160413, 64-bit
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Sat, Nov 12, 2016 at 3:36 AM, <rotten@windfish.net> wrote:
The following bug has been logged on the website:
Bug reference: 14420
Logged by: Rick Otten
Email address: rotten@windfish.net
PostgreSQL version: 9.6.1
Operating system: Ubuntu 16.04
Description: This just happened on my production 9.6.1 system:
2016-11-11 21:31:26.215 UTC LOG: worker process: parallel worker for PID
1737 (PID 6792) was terminated by signal 11: Segmentation fault
2016-11-11 21:31:26.215 UTC LOG: terminating any other active server
processes
I had a lot of stuff running concurrently, so I'm not sure which query
caused the failure. There is nothing else obvious or illuminating in the
logs that I can see. After it recovered, everything went back to normal.
It is difficult to diagnose the problem from the above information. Can you
share the server log? I am not sure whether we will find anything in it, but
at least we can try.
I figured I should report this even if I don't have much to go on yet.
Thanks for the report.
Is there a core file somewhere that would be of help? (I'm not seeing
one.)
If a core file is generated, it should be in the $PGDATA directory. I have
found a few articles [1][2] on the web related to this which could be useful
to you.
Should I turn up debug or enable something to catch more information if it
happens again?
I think a core file can help here.
[1]: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
[2]: http://bajis-postgres.blogspot.in/2014/03/want-to-ensure-that-postgres-generates.html
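The advice above can be sketched as a few shell steps on Ubuntu. The pg_ctl invocation and paths are assumptions based on a typical install, not taken from this thread; adjust them for your system:

```shell
# Check where the kernel writes core files. On Ubuntu, a pattern beginning
# with "|/usr/share/apport/apport" means apport intercepts the dump instead
# of leaving a core file in the process's working directory ($PGDATA).
cat /proc/sys/kernel/core_pattern

# Remove the soft core-size limit in the shell that launches the server,
# so a crashing backend is allowed to write a full core image.
ulimit -c unlimited

# Hypothetical restart: pg_ctl's -c (--core-files) flag asks the server to
# allow core dumps after a crash. Uncomment and use your real $PGDATA.
# pg_ctl -D "$PGDATA" -c start
```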
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 12, 2016 at 9:32 AM, Rick Otten <rotten@windfish.net> wrote:
Please keep pgsql-bugs in the loop. It is important to keep everyone
in the loop, not only because that is how this community works, but
also because others may see something that you or I can't.
PostgreSQL was not started with the "-c" option. I'll look into enabling
that before this happens again.
Makes sense.
I'll read more from the other debugging article and see if there is anything
I can do there as well. Thanks.
There were no files generated and dropped in PGDATA this time, unfortunately.
Sorry, I know this isn't much to go on, but it is all I know at this time.
There wasn't much else that wasn't routine in the logs before or after the
two lines I pasted below, other than a bunch of warnings for the 30 or 40
transactions that were in progress, followed by this:
Okay, I don't think we can get anything from these logs. Once a core
file is available, we can try to find the reason, but it would be much
better if we could construct an independent test that reproduces this
problem. One possible approach is to identify the culprit query. You
might want to log long-running queries, as parallelism will generally
be used for such queries.
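Logging long-running queries, as suggested above, can be done with a postgresql.conf fragment along these lines; the 1000 ms threshold is illustrative, not a value recommended in this thread:

```
# postgresql.conf -- log any statement that runs longer than 1 second
log_min_duration_statement = 1000   # in milliseconds; -1 disables

# Include timestamp, PID, user, and database so logged statements can be
# correlated with crash reports like the one above
log_line_prefix = '%m [%p] %u@%d '
```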
2016-11-11 21:31:26.292 UTC WARNING: terminating connection because of crash
of another server process
2016-11-11 21:31:26.292 UTC DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2016-11-11 21:31:26.292 UTC HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2016-11-11 21:31:26.301 UTC WARNING: terminating connection because of crash
of another server process
2016-11-11 21:31:26.301 UTC DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2016-11-11 21:31:26.301 UTC HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2016-11-11 21:31:30.762 UTC [unknown] x.x.x.x [unknown] LOG: connection
received: host=x.x.x.x port=47692
2016-11-11 21:31:30.762 UTC clarivoy x.x.x.x some_user FATAL: the database
system is in recovery mode
2016-11-11 21:31:31.766 UTC LOG: all server processes terminated;
reinitializing
2016-11-11 21:31:33.526 UTC LOG: database system was interrupted; last known
up at 2016-11-11 21:29:28 UTC
2016-11-11 21:31:33.660 UTC LOG: database system was not properly shut down;
automatic recovery in progress
2016-11-11 21:31:33.674 UTC LOG: redo starts at 1DD/4F5A0320
2016-11-11 21:31:33.957 UTC LOG: unexpected pageaddr 1DC/16AEC000 in log
segment 00000001000001DD00000056, offset 11452416
2016-11-11 21:31:33.958 UTC LOG: redo done at 1DD/56AEB7F8
2016-11-11 21:31:33.958 UTC LOG: last completed transaction was at log time
2016-11-11 21:31:26.07448+00
2016-11-11 21:31:34.705 UTC LOG: MultiXact member wraparound protections are
now enabled
2016-11-11 21:31:34.724 UTC LOG: autovacuum launcher started
2016-11-11 21:31:34.725 UTC LOG: database system is ready to accept
connections
After that the database was pretty much back to normal. Because everything
connects from various pgbouncer instances running elsewhere, they quickly
reconnected and started working again without having to restart any
applications or services.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Sorry about forgetting to CC the bugs list when I replied.
I've enabled "-c", and made sure my PGDATA directory has enough space to
collect a full core image. If we get one, I'll let you know.
There were a lot of queries happening at the time of the segfault. The
only new or unusual one that I am aware of was doing a UNION ALL
between two nearly identical queries, where one side was doing a
parallel query scan and the other side wasn't. I had just refactored it
from using an "OR" operator in the WHERE clause, because it was much
faster that way.
Since Friday I have run another 1M or so of those queries, but it hasn't
segfaulted again.
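The refactor described above might look roughly like the following. The table and column names here are hypothetical, invented only to show the shape of the rewrite; the actual query is not given in this thread:

```sql
-- Original form: a single scan with an OR in the WHERE clause.
SELECT id, payload
FROM   events
WHERE  source = 'web' OR source = 'batch';

-- Refactored form: UNION ALL of two nearly identical queries. Because the
-- branches are planned separately, the planner may choose a parallel scan
-- for one branch and a non-parallel plan for the other, matching the mix
-- of parallel and non-parallel workers described above.
SELECT id, payload FROM events WHERE source = 'web'
UNION ALL
SELECT id, payload FROM events WHERE source = 'batch';
```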