signal 11 segfaults with parallel workers

Started by Rick Otten over 8 years ago, 24 messages, bugs
#1Rick Otten
rottenwindfish@gmail.com

Starting a couple of weeks ago, our PostgreSQL database has been crashing,
almost daily, with a signal 11 seg fault on a query as the triggering event:

2017-07-11 23:00:29.984 UTC LOG: worker process: parallel worker for
PID 1055 (PID 12405) was terminated by signal 11: Segmentation fault
2017-07-12 23:01:56.432 UTC LOG: worker process: parallel worker for
PID 5752 (PID 32552) was terminated by signal 11: Segmentation fault
2017-07-14 23:00:46.856 UTC LOG: worker process: parallel worker for
PID 24280 (PID 9639) was terminated by signal 11: Segmentation fault
2017-07-15 23:01:24.317 UTC LOG: worker process: parallel worker for
PID 1561 (PID 15153) was terminated by signal 11: Segmentation fault
2017-07-16 23:00:26.722 UTC LOG: worker process: parallel worker for
PID 5776 (PID 7912) was terminated by signal 11: Segmentation fault
2017-07-17 18:58:14.155 UTC LOG: worker process: parallel worker for
PID 11427 (PID 9998) was terminated by signal 11: Segmentation fault
2017-07-17 19:08:04.103 UTC LOG: worker process: parallel worker for
PID 10190 (PID 11907) was terminated by signal 11: Segmentation fault
2017-07-18 23:01:09.775 UTC LOG: worker process: parallel worker for
PID 29445 (PID 360) was terminated by signal 11: Segmentation fault
2017-07-19 18:46:58.676 UTC LOG: worker process: parallel worker for
PID 7080 (PID 27710) was terminated by signal 11: Segmentation fault
2017-07-20 23:00:35.270 UTC LOG: worker process: parallel worker for
PID 19153 (PID 21218) was terminated by signal 11: Segmentation fault
2017-07-21 23:00:41.085 UTC LOG: worker process: parallel worker for
PID 19161 (PID 30720) was terminated by signal 11: Segmentation fault
2017-07-22 23:00:22.169 UTC LOG: worker process: parallel worker for
PID 4903 (PID 6931) was terminated by signal 11: Segmentation fault
2017-07-25 23:02:03.688 UTC LOG: worker process: parallel worker for
PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

As near as I can tell there were no specific changes preceding this
pattern which might be a root cause. Since then I've tried patching the
Linux instance and bounced the database server, and bumped up the number of
connections (because we were running low sometimes). None of those changes
impacted the regular crashing pattern.

On Sunday (2017-07-23) I set DEBUG5 on all log events, and set it to also
log all queries, so I could try to learn more about what was happening. I
found the culprit query last night, from one of our daily jobs. It was
doing a five-worker parallel sequential scan on a moderately sized table
(maybe 70 columns by 2.5M rows).

I was not able to force the database to crash by running this query by
hand. I tried a number of times. Although it did happen to someone else
on the 17th.

For now, I've put an index on the relevant columns to avoid the parallel
sequential scan for that query. I also repacked the table. Hopefully we
won't crash tonight too.
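For reference, the workaround amounts to something like the following sketch; the table and column names here are illustrative, not taken from the thread:

```sql
-- Give the planner a cheap index path so it stops choosing the
-- parallel sequential scan for the daily job's query.
CREATE INDEX CONCURRENTLY daily_job_filter_idx
    ON some_big_table (filter_col_a, filter_col_b);
```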

There wasn't much extra in the logs to share about the crash. This is from
when it crashed:

2017-07-25 23:02:03.688 UTC DEBUG: reaping dead processes
2017-07-25 23:02:03.688 UTC LOG: worker process: parallel worker for
PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

And this is when it spun out those parallel workers, just prior to the
segfault, that let me identify the query in question:

2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"

Here is what /var/log/kern.log had to say about the one from last night:

Jul 25 23:02:01 core-gce kernel: [738031.417934] postgres[11279]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417953] postgres[11278]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417967] postgres[11280]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417989] postgres[11276]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]

I'm running on Ubuntu 16.04.2 in Google Compute Engine, on a 16-core
VM with 104GB of RAM, using the Ubuntu PostgreSQL 9.6.3 package.

$ pg_config --configure
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl'
'--with-libxml' '--with-libxslt'
'--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6'
'--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python'
'--mandir=/usr/share/postgresql/9.6/man'
'--docdir=/usr/share/doc/postgresql-doc-9.6'
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/'
'--datadir=/usr/share/postgresql/9.6'
'--bindir=/usr/lib/postgresql/9.6/bin'
'--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/'
'--includedir=/usr/include/postgresql/' '--enable-nls'
'--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests'
'--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld'
'--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo'
'--with-systemd' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat
-Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie
-fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro
-Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5
-L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi'
'--with-ldap' '--with-selinux' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'

$ pg_config --ldflags
-L../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now
-Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5
-Wl,--as-needed

$ pg_config --cflags
-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -Wmissing-format-attribute -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2
-fstack-protector-strong -Wformat -Werror=format-security
-I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer

I have these two most relevant settings enabled in my configuration:
max_worker_processes = 16
max_parallel_workers_per_gather = 16

If you need anything else, please let me know. I wish I could reproduce
the error every time I ran the query, but it doesn't seem to work that way,
and of course now the query plan is completely different, but I'm sure I
can run other queries that would induce parallel sequential scans on my
tables.

#2Michael Paquier
michael@paquier.xyz
In reply to: Rick Otten (#1)
Re: signal 11 segfaults with parallel workers

On Wed, Jul 26, 2017 at 4:47 PM, Rick Otten <rottenwindfish@gmail.com> wrote:

If you need anything else, please let me know. I wish I could reproduce the
error every time I ran the query, but it doesn't seem to work that way, and
of course now the query plan is completely different, but I'm sure I can run
other queries that would induce parallel sequential scans on my tables.

Backtrace of the core files generated with debug symbols on, and a
minimum test case to reproduce the failure usually help.
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Rick Otten
rottenwindfish@gmail.com
In reply to: Michael Paquier (#2)
Re: signal 11 segfaults with parallel workers

I don't have any core files. I suppose that is something I have to enable
specifically? I'm game to turn it on in case we core dump again.

If I could get it to fail every time I ran the query, I'm sure I could
build a test case for you. Sorry. :-(

On Wed, Jul 26, 2017 at 10:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jul 26, 2017 at 4:47 PM, Rick Otten <rottenwindfish@gmail.com>
wrote:

If you need anything else, please let me know. I wish I could reproduce
the error every time I ran the query, but it doesn't seem to work that way,
and of course now the query plan is completely different, but I'm sure I
can run other queries that would induce parallel sequential scans on my
tables.

Backtrace of the core files generated with debug symbols on, and a
minimum test case to reproduce the failure usually help.
--
Michael

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Otten (#3)
Re: signal 11 segfaults with parallel workers

Rick Otten <rottenwindfish@gmail.com> writes:

I don't have any core files. I suppose that is something I have to enable
specifically? I'm game to turn it on in case we core dump again.

If you're not seeing core files, you probably need to take measures
to make the postmaster run with "ulimit -c unlimited". It's fairly
common for daemon processes to get launched under "ulimit -c 0"
by default, for largely-misguided-imo security reasons.
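On a systemd-managed install like the Ubuntu one in this thread, that usually means a unit override rather than a shell ulimit. A sketch; the exact unit name and path depend on the packaging:

```ini
# /etc/systemd/system/postgresql@.service.d/core.conf  (path assumed)
[Service]
LimitCORE=infinity
```

followed by a `systemctl daemon-reload` and a restart of the service.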

regards, tom lane


#5Rick Otten
rottenwindfish@gmail.com
In reply to: Tom Lane (#4)
Re: signal 11 segfaults with parallel workers

I'll restart the database tonight to pick up the ulimit change and let you
know if I capture a core file in the near future.

On Wed, Jul 26, 2017 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

I don't have any core files. I suppose that is something I have to
enable specifically? I'm game to turn it on in case we core dump again.

If you're not seeing core files, you probably need to take measures
to make the postmaster run with "ulimit -c unlimited". It's fairly
common for daemon processes to get launched under "ulimit -c 0"
by default, for largely-misguided-imo security reasons.

regards, tom lane

#6daveg
daveg@sonic.net
In reply to: Tom Lane (#4)
Re: signal 11 segfaults with parallel workers

On Wed, 26 Jul 2017 11:43:22 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

I don't have any core files. I suppose that is something I have to enable
specifically? I'm game to turn it on in case we core dump again.

If you're not seeing core files, you probably need to take measures
to make the postmaster run with "ulimit -c unlimited". It's fairly
common for daemon processes to get launched under "ulimit -c 0"
by default, for largely-misguided-imo security reasons.

If you are using pg_ctl to start postgresql you can add the "-c" flag to your
pg_ctl command to enable core files.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


#7Rick Otten
rottenwindfish@gmail.com
In reply to: Rick Otten (#5)
Re: signal 11 segfaults with parallel workers

FWIW, the database crashed again tonight. It hasn't been quiet enough yet
to be able to restart it in a controlled fashion to enable cores.
Hopefully I'll get a chance this weekend!

2017-07-27 23:01:20.411 UTC LOG: worker process: parallel worker for
PID 31472 (PID 2186) was terminated by signal 11: Segmentation fault

Since I didn't have statement logging and debug turned on this time, I can
only guess which query seg faulted.

Is enabling DEBUG in the postgresql.conf sufficient to enable debug symbols
in the core, or do I have to rebuild the postgresql binaries to get that?
Is the core of any use without debug symbols enabled?

On Wed, Jul 26, 2017 at 12:26 PM, Rick Otten <rottenwindfish@gmail.com>
wrote:

I'll restart the database tonight to pick up the ulimit change and let you
know if I capture a core file in the near future.

On Wed, Jul 26, 2017 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

I don't have any core files. I suppose that is something I have to
enable specifically? I'm game to turn it on in case we core dump again.

If you're not seeing core files, you probably need to take measures
to make the postmaster run with "ulimit -c unlimited". It's fairly
common for daemon processes to get launched under "ulimit -c 0"
by default, for largely-misguided-imo security reasons.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Otten (#7)
Re: signal 11 segfaults with parallel workers

Rick Otten <rottenwindfish@gmail.com> writes:

Is enabling DEBUG in the postgresql.conf sufficient to enable debug symbols
in the core, or do I have to rebuild the postgresql binaries to get that?

You would need to recompile (with --enable-debug added to configure
switches) if they're not there already. But if you used somebody's
packaging rather than a homebrew build, you can probably get the
symbols installed without doing your own build.

Is the core of any use without debug symbols enabled?

You should still be able to get a stack trace out of it, but the trace
would be much more informative with debug symbols. See
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

regards, tom lane


#9Rick Otten
rottenwindfish@gmail.com
In reply to: Tom Lane (#8)
Re: signal 11 segfaults with parallel workers

I'm using the Ubuntu PostgreSQL 9.6.3 package from this repo:
deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main

It looks like there is a "-dbg" package available:
postgresql-9.6-dbg - debug symbols for postgresql-9.6

I'll give that a try when I get the restart opportunity.

On Thu, Jul 27, 2017 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

Is enabling DEBUG in the postgresql.conf sufficient to enable debug
symbols in the core, or do I have to rebuild the postgresql binaries to get
that?

You would need to recompile (with --enable-debug added to configure
switches) if they're not there already. But if you used somebody's
packaging rather than a homebrew build, you can probably get the
symbols installed without doing your own build.

Is the core of any use without debug symbols enabled?

You should still be able to get a stack trace out of it, but the trace
would be much more informative with debug symbols. See
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

regards, tom lane

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Rick Otten (#9)
Re: signal 11 segfaults with parallel workers

Rick Otten wrote:

I'm using the Ubuntu PostgreSQL 9.6.3 package from this repo:
deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main

It looks like there is a "-dbg" package available:
postgresql-9.6-dbg - debug symbols for postgresql-9.6

I'll give that a try when I get the restart opportunity.

You can install the -dbg package without waiting for a restart; it won't
disrupt anything. Also, if you already got a core from the last crash,
installing that package now would be enough to be able to extract info
from the core file, assuming the -dbg package is the same version as the
package version that was running when it crashed.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11Rick Otten
rottenwindfish@gmail.com
In reply to: Alvaro Herrera (#10)
Re: signal 11 segfaults with parallel workers

Thanks! I've got the -dbg package installed and I've restarted the server
and the database this morning. We've continued to crash almost every
night, so if the restart doesn't do something strange, I should have a core
file within a day or two.

One thing that is bugging me is I think when the database crashes, it
doesn't clean up the temp_tablespace(s). I've noticed as I'm working
through this issue that the temp tablespace keeps creeping up in size and
there doesn't seem to be any obvious way to recover that space. I've been
keeping up with it for now by making the disk bigger, but obviously I can't
do that indefinitely.

I was debating making a new temp tablespace and then dropping the old one,
but there must be an easier, safe way to clear dangling temp tablespace
stuff? Google wasn't terribly helpful to uncover strategies for dealing
with temp tablespace bloat.

On Fri, Jul 28, 2017 at 3:02 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Rick Otten wrote:

I'm using the Ubuntu PostgreSQL 9.6.3 package from this repo:
deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main

It looks like there is a "-dbg" package available:
postgresql-9.6-dbg - debug symbols for postgresql-9.6

I'll give that a try when I get the restart opportunity.

You can install the -dbg package without waiting for a restart; it won't
disrupt anything. Also, if you already got a core from the last crash,
installing that package now would be enough to be able to extract info
from the core file, assuming the -dbg package is the same version as the
package version that was running when it crashed.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Otten (#11)
Re: signal 11 segfaults with parallel workers

Rick Otten <rottenwindfish@gmail.com> writes:

One thing that is bugging me is I think when the database crashes, it
doesn't clean up the temp_tablespace(s).

Hm, interesting, what do you see in there?

regards, tom lane


#13Rick Otten
rottenwindfish@gmail.com
In reply to: Tom Lane (#12)
Re: signal 11 segfaults with parallel workers

Well, I'm not sure how to inspect the temp tablespace other than from the
filesystem itself. I have it configured on its own disk. Usually the disk
space ebbs and flows with query activity. Since we've been crashing
however, it never reclaims the disk that was in use just before the crash.
So our temp space "floor" keeps getting higher and higher.

At least that is what it has been doing for the past week or two, and what
it looked like this morning. Now that the database has been back up for 8
or 9 hours following this controlled restart, I just went to look at it,
and all of the temp space has been reclaimed - for the first time since the
crashing started. ... Interesting...

On Sun, Jul 30, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

One thing that is bugging me is I think when the database crashes, it
doesn't clean up the temp_tablespace(s).

Hm, interesting, what do you see in there?

regards, tom lane

#14Rick Otten
rottenwindfish@gmail.com
In reply to: Rick Otten (#13)
Re: signal 11 segfaults with parallel workers

Ok, I got a core this time at 23:00 when the database went down.
Here is the basic backtrace:

$ gdb /usr/lib/postgresql/9.6/bin/postgres core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/9.6/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/32/108810b4ff9528a94d48315dd9333c501fc52d.debug...done.
done.
[New LWP 4294]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: parallel worker f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 MemoryContextAlloc (context=0x0, size=size@entry=1024) at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
761 /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:
No such file or directory.
(gdb) bt
#0 MemoryContextAlloc (context=0x0, size=size@entry=1024) at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
#1 0x0000560b7a518ec4 in SPI_connect () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
#2 0x00007fec467b9261 in _PG_init () from
/usr/lib/postgresql/9.6/lib/multicorn.so
#3 0x0000560b7a717cf2 in internal_load_library
(libname=libname@entry=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
#4 0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:741
#5 0x0000560b7a3ee4f7 in ParallelWorkerMain (main_arg=<optimized out>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/access/transam/parallel.c:1065
#6 0x0000560b7a59ae29 in StartBackgroundWorker () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/bgworker.c:742
#7 0x0000560b7a5a701b in do_start_bgworker (rw=<optimized out>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:5579
#8 maybe_start_bgworkers () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:5776
#9 0x0000560b7a5a7cd5 in sigusr1_handler (postgres_signal_arg=<optimized
out>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:4973
#10 <signal handler called>
#11 0x00007ff480425573 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:84
#12 0x0000560b7a3858ef in ServerLoop () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:1679
#13 0x0000560b7a5a9053 in PostmasterMain (argc=1, argv=<optimized out>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:1323
#14 0x0000560b7a387511 in main (argc=1, argv=0x560b7ba23630) at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/main/main.c:228
(gdb)

The query that took it down this time (based on the pid reported in the
stack trace) does indeed spin out a parallel plan, but it is a simple
query. I was surprised to see the multicorn library mentioned in this
trace; the query has nothing to do with the multicorn FDW installed on the
system.

I've run the query several times in the last few minutes and can't get it
to generate a core again.

On Sun, Jul 30, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com>
wrote:

Well, I'm not sure how to inspect the temp tablespace other than from the
filesystem itself. I have it configured on its own disk. Usually the disk
space ebbs and flows with query activity. Since we've been crashing
however, it never reclaims the disk that was in use just before the crash.
So our temp space "floor" keeps getting higher and higher.

At least that is what it has been doing for the past week or two, and what
it looked like this morning. Now that the database has been back up for 8
or 9 hours following this controlled restart, I just went to look at it,
and all of the temp space has been reclaimed - for the first time since the
crashing started. ... Interesting...

On Sun, Jul 30, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

One thing that is bugging me is I think when the database crashes, it
doesn't clean up the temp_tablespace(s).

Hm, interesting, what do you see in there?

regards, tom lane

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Rick Otten (#14)
Re: signal 11 segfaults with parallel workers

On Mon, Jul 31, 2017 at 6:35 AM, Rick Otten <rottenwindfish@gmail.com> wrote:

Ok, I got a core this time at 23:00 when the database went down.
Here is the basic backtrace:

The query that took it down this time (based on the pid reported in the
stack trace) does indeed spin out a parallel plan, but it is a simple query.
I was surprised to see the multicorn library mentioned in this trace; the
query has nothing to do with the multicorn FDW installed on the system.

We load, in the parallel workers, all of the libraries that were loaded
by the main backend. This is to ensure that the master and worker
backends have exactly the same GUCs defined.

I've run the query several times in the last few minutes and can't get it to
generate a core again.

Did the query take the parallel plan during execution? The above
symptom shows that it should crash if you run the same query after
restarting the server.
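A quick way to confirm whether a given query actually used a parallel plan is EXPLAIN ANALYZE. A sketch; the table and predicate here are placeholders:

```sql
-- A Gather node with "Workers Planned/Launched" in the output means
-- parallel workers were used for this execution.
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM some_big_table WHERE some_col = 42;
```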

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#16Andres Freund
andres@anarazel.de
In reply to: Rick Otten (#14)
Re: signal 11 segfaults with parallel workers

Hi,

On 2017-07-30 21:05:50 -0400, Rick Otten wrote:

Ok, I got a core this time at 23:00 when the database went down.
Here is the basic backtrace:

(gdb) bt
#0 MemoryContextAlloc (context=0x0, size=size@entry=1024) at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
#1 0x0000560b7a518ec4 in SPI_connect () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
#2 0x00007fec467b9261 in _PG_init () from
/usr/lib/postgresql/9.6/lib/multicorn.so
#3 0x0000560b7a717cf2 in internal_load_library
(libname=libname@entry=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
#4 0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at

Rick: Looks like a buglet in multicorn, which seems to expect to be
called in a valid memory context. Can you reproduce the bug if you use
multicorn, and then in the same session execute the problematic query?

Robert, was it intentional that we don't have a memory context defined
at this point?

Regards,

Andres


#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#16)
Re: signal 11 segfaults with parallel workers

On Mon, Jul 31, 2017 at 8:26 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2017-07-30 21:05:50 -0400, Rick Otten wrote:

Ok, I got a core this time at 23:00 when the database went down.
Here is the basic backtrace:

(gdb) bt
#0 MemoryContextAlloc (context=0x0, size=size@entry=1024) at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
#1 0x0000560b7a518ec4 in SPI_connect () at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
#2 0x00007fec467b9261 in _PG_init () from
/usr/lib/postgresql/9.6/lib/multicorn.so
#3 0x0000560b7a717cf2 in internal_load_library
(libname=libname@entry=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at
/build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
#4 0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8
<error: Cannot access memory at address 0x7ff48208dbf8>)
at

Rick: Looks like a buglet in multicorn, which seems to expect to be
called in a valid memory context. Can you reproduce the bug if you use
multicorn, and then in the same session execute the problematic query?

Robert, was it intentional that we don't have a memory context defined
at this point?

There is already a "Parallel Worker" memory context defined by that
time. I think the issue is that the multicorn library expects the
Transaction context to be defined by that time.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#17)
Re: signal 11 segfaults with parallel workers

Amit Kapila <amit.kapila16@gmail.com> writes:

There is already a "Parallel Worker" memory context defined by that
time. I think the issue is that the multicorn library expects the
Transaction context to be defined by that time.

It looks like multicorn supposes that a library's _PG_init function can
only be called inside a transaction. That is broken with a capital B.
We need not consider parallel query to find counterexamples: that
means you can't preload multicorn using shared_preload_libraries,
as that loads libraries into the postmaster, which never has and never
will run transactions.

Whatever it's trying to initialize in _PG_init needs to be done later.
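The shape of that fix is the usual deferred-initialization pattern: do nothing transaction-dependent at load time, and lazily initialize on first real use. This is a minimal sketch of the pattern in plain, self-contained C, not actual multicorn or PostgreSQL code; all names here are illustrative stand-ins:

```c
#include <stdbool.h>

/* Stand-ins: in a real extension these would be PostgreSQL's own
 * notion of "a transaction is active" and the library's init flag. */
bool transaction_ready = false;
bool initialized = false;

/* The _PG_init() analogue: runs at load time, possibly in the
 * postmaster or a freshly started parallel worker, where there is no
 * transaction. It must only do work that needs no transaction state. */
void pg_init_analogue(void)
{
    /* register hooks, define settings, etc. -- no SPI_connect()-style
     * work that presumes a transaction */
}

/* Lazily perform the transaction-dependent setup on first use,
 * when a transaction is guaranteed to exist. */
void ensure_initialized(void)
{
    if (initialized || !transaction_ready)
        return;
    /* ... the SPI_connect()-style setup would go here ... */
    initialized = true;
}

/* Every entry point that needs the setup calls ensure_initialized()
 * first instead of relying on load-time work having happened. */
int handler_analogue(void)
{
    ensure_initialized();
    return initialized ? 1 : 0;
}
```

The key property is that calling `handler_analogue()` before any transaction exists is harmless; the expensive setup simply waits until the first in-transaction call.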

regards, tom lane


#19Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#18)
Re: signal 11 segfaults with parallel workers

On Mon, Jul 31, 2017 at 5:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

There is already a "Parallel Worker" memory context defined by that
time. I think the issue is that the multicorn library expects the
Transaction context to be defined by that time.

It looks like multicorn supposes that a library's _PG_init function can
only be called inside a transaction. That is broken with a capital B.
We need not consider parallel query to find counterexamples: that
means you can't preload multicorn using shared_preload_libraries,
as that loads libraries into the postmaster, which never has and never
will run transactions.

Whatever it's trying to initialize in _PG_init needs to be done later.

Indeed, that's bad. I am adding in CC Ronan, who has been working on
multicorn. At this stage, I think that you would be better off
disabling parallelism.
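On 9.6, disabling parallelism cluster-wide is a one-line configuration change (shown here as a postgresql.conf sketch; it can also be applied per session or per role with SET):

```ini
# postgresql.conf: prevent the planner from launching parallel workers
max_parallel_workers_per_gather = 0
```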
--
Michael


#20Rick Otten
rottenwindfish@gmail.com
In reply to: Michael Paquier (#19)
Re: signal 11 segfaults with parallel workers

Just to follow up. The database has not crashed since I disabled
parallelism. As a result of that change, some of my queries are running
dramatically slower, I'm still working on doing what I can to get them back
up to reasonable performance. I look forward to a solution that allows
both FDW extensions and parallel queries to coexist in the same database.

On Mon, Jul 31, 2017 at 3:52 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Jul 31, 2017 at 5:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

There is already a "Parallel Worker" memory context defined by that
time. I think the issue is that the multicorn library expects the
Transaction context to be defined by that time.

It looks like multicorn supposes that a library's _PG_init function can
only be called inside a transaction. That is broken with a capital B.
We need not consider parallel query to find counterexamples: that
means you can't preload multicorn using shared_preload_libraries,
as that loads libraries into the postmaster, which never has and never
will run transactions.

Whatever it's trying to initialize in _PG_init needs to be done later.

Indeed, that's bad. I am adding in CC Ronan, who has been working on
multicorn. At this stage, I think that you would be better off
disabling parallelism.
--
Michael

#21Andres Freund
andres@anarazel.de
In reply to: Rick Otten (#20)
#22Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#22)
#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#23)