BUG #13490: Segmentation fault on pg_stat_activity

Started by Michael Bommarito · almost 11 years ago · 20 messages · bugs
#1Michael Bommarito
michael@bommaritollc.com

The following bug has been logged on the website:

Bug reference: 13490
Logged by: Michael Bommarito
Email address: michael@bommaritollc.com
PostgreSQL version: 9.5alpha1
Operating system: Ubuntu 14.04.2 LTS
Description:

Environment: Up-to-date instance running Trusty and using apt.postgresql.org
postgresql-9.5-alpha1. Database was upgraded using pg_dump/restore. All
other application behavior as expected.

Not sure if this is related to #13286 or #12918, but if so, still affecting
the binaries in apt. Pretty critical to fly blind without access to this.

Crash example 1:
2015-07-07 16:42:45 UTC [26785-2] LOG: server process (PID 11836) was
terminated by signal 11: Segmentation fault
2015-07-07 16:42:45 UTC [26785-3] DETAIL: Failed process was running:
select * from pg_stat_activity ;

Crash example 2:
https://github.com/ankane/pghero/issues/30

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Michael Paquier
michael@paquier.xyz
In reply to: Michael Bommarito (#1)
Re: BUG #13490: Segmentation fault on pg_stat_activity

On Wed, Jul 8, 2015 at 1:52 AM, <michael@bommaritollc.com> wrote:

Environment: Up-to-date instance running Trusty and using apt.postgresql.org
postgresql-9.5-alpha1. Database was upgraded using pg_dump/restore. All
other application behavior as expected.

Not sure if this is related to #13286 or #12918, but if so, still affecting
the binaries in apt. Pretty critical to fly blind without access to this.

Crash example 1:
2015-07-07 16:42:45 UTC [26785-2] LOG: server process (PID 11836) was
terminated by signal 11: Segmentation fault
2015-07-07 16:42:45 UTC [26785-3] DETAIL: Failed process was running:
select * from pg_stat_activity ;

Crash example 2:
https://github.com/ankane/pghero/issues/30

If this link goes away we would lose the test case, so it is better to
include it directly in the email:
SELECT pid, state, application_name AS source, age(now(), xact_start)
AS duration, waiting, query, xact_start AS started_at FROM
pg_stat_activity WHERE query <> '<insufficient privilege>' AND
state <> 'idle' AND pid <> pg_backend_pid() AND now() - query_start >
interval '60 seconds' ORDER BY query_start DESC;

This seems unrelated to #13286, which refers to pg_terminate_backend()
crashing; in case 1 above the crashing query is simpler. It may still be
related to #12918, though, so would it be possible to get at least a
stack trace showing the failing code path?
--
Michael


#3Michael Bommarito
michael@bommaritollc.com
In reply to: Michael Paquier (#2)
Re: BUG #13490: Segmentation fault on pg_stat_activity

I'll try to generate an isolated core dump and stack trace this weekend
when the system is not in use.

I also confirmed that the earlier patch
2897e069c1f4309a237a1fa4bf5d2fcd9c4b9868 for 12918 is applied in our
version, so either the patch didn't handle all cases or it's unrelated.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387


#4Michael Paquier
michael@paquier.xyz
In reply to: Michael Bommarito (#3)
Re: BUG #13490: Segmentation fault on pg_stat_activity

On Wed, Jul 8, 2015 at 10:22 AM, Michael Bommarito
<michael@bommaritollc.com> wrote:

I'll try to generate an isolated core dump and stack trace this weekend
when the system is not in use.

That would be nice. I have let pgbench -C run for one hour with select
* from pg_stat_activity running every second (\watch 1) in parallel
but I could not reproduce the issue on HEAD.
--
Michael
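
For anyone who wants to retry this, the reproduction attempt described above can be sketched as a small script. This is a sketch under assumptions: the scratch database name (`$DB`, defaulting to `postgres`) is illustrative, the one-hour run is shortened to keep the demo brief, and the script skips gracefully when pgbench or a reachable server is not available.

```shell
#!/bin/sh
# Sketch of the reproduction attempt above: pgbench -C (a fresh connection
# per transaction) while pg_stat_activity is polled once a second, as
# \watch 1 would do in psql. Duration is shortened from the one-hour run
# described above; the database name is an assumption.
DB="${DB:-postgres}"

if command -v pgbench >/dev/null 2>&1 \
   && psql -d "$DB" -c 'SELECT 1' >/dev/null 2>&1; then
    pgbench -i "$DB"                 # initialize the pgbench tables
    pgbench -C -T 10 "$DB" &        # reconnect per transaction, 10 seconds
    BENCH=$!
    # Poll pg_stat_activity once a second while pgbench is running.
    while kill -0 "$BENCH" 2>/dev/null; do
        psql -d "$DB" -c 'SELECT * FROM pg_stat_activity;' >/dev/null
        sleep 1
    done
else
    echo "skipped: pgbench or a reachable PostgreSQL server is not available"
fi
```

On a machine without a server the script only prints the "skipped" line; with one, it mimics the pgbench-plus-\watch workload for a short window.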


#5Michael Bommarito
michael@bommaritollc.com
In reply to: Michael Paquier (#4)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Hello Michael,
Here is the offending query and gdb session/stacktrace output. Please
let me know if we can provide anything else from gdb or logs that can be
anonymized.

*Query:*
2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
terminated by signal 11: Segmentation fault
2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

*GDB:*
root@hostname:/var/lib/postgresql/9.5/main# gdb -q -c
/var/lib/postgresql/9.5/main/core /usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
BIND '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
No such file or directory.
(gdb) bt
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1 0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=0x7fd0d5d9e958,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
#2 0x00007fd0d481c3ea in replace_rte_variables_mutator (node=<optimized
out>, context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1149
#3 0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
#4 0x00007fd0d47812b3 in expression_tree_mutator (node=<optimized out>,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2675
#5 0x00007fd0d481cc64 in replace_rte_variables (node=<optimized out>,
target_varno=<optimized out>, sublevels_up=sublevels_up@entry=0,
callback=callback@entry=0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg=callback_arg@entry=0x7fff521706c0,
outer_hasSubLinks=0x7fd0d5d30d6e "")
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1115
#6 0x00007fd0d47cd1c7 in pullup_replace_vars (context=0x7fff521706c0,
expr=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1982
#7 pull_up_simple_subquery (deletion_ok=<optimized out>,
containing_appendrel=0x0, lowest_nulling_outer_join=0x0,
lowest_outer_join=0x0, rte=0x7fd0d5d30ea8, jtnode=<optimized out>,
root=0x7fd0d5d9ee48) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8 pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9 0x00007fd0d47cc989 in pull_up_subqueries_recurse
(root=root@entry=0x7fd0d5d9ee48,
jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>,
deletion_ok@entry=0 '\000') at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8,
parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0,
subroot=subroot@entry=0x7fff52170908)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48,
cursorOptions=0, boundParams=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>,
cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries
(querytrees=querytrees@entry=0x7fd0d5d30cf8,
cursorOptions=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000')
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950,
dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop () at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387


#6Michael Paquier
michael@paquier.xyz
In reply to: Michael Bommarito (#5)
Re: BUG #13490: Segmentation fault on pg_stat_activity

On Sat, Jul 11, 2015 at 10:07 PM, Michael Bommarito <
michael@bommaritollc.com> wrote:

Here is the offending query and gdb session/stacktrace output. Please
let me know if we can provide anything else from gdb or logs that can be
anonymized.

Thanks.

2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
terminated by signal 11: Segmentation fault
2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

12803 is the parent process's PID, right?

[...]

Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
BIND '.

Interesting. So you are using the extended query protocol.

Program terminated with signal SIGSEGV, Segmentation fault.
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
No such file or directory.
(gdb) bt
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1 0x00007fd0d47cb9dd in pullup_replace_vars_callback
(var=0x7fd0d5d9e958, context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
[...]
#15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951

And the crash happens in the planner when building a cached plan. This may
be enough information for coming up with a reproducible test case. I'll get
a closer look on Monday.
--
Michael

#7Michael Bommarito
michael@bommaritollc.com
In reply to: Michael Paquier (#6)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Yes, 12803 was the PID of the parent postgres process.

While the extended query protocol was being used by this client and query,
we've also been able to trigger it via unprepared statements in `psql` or
pgAdmin.

Great, thank you! I'll keep the core around for now so we can provide
additional details; if we need to test another execution path for segfault,
it will need to wait until the weekend again.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387


#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Bommarito (#5)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Michael Bommarito <michael@bommaritollc.com> writes:

Hello Michael,
Here is the offending query and gdb session/stacktrace output. Please
let me know if we can provide anything else from gdb or logs that can be
anonymized.

*Query:*
2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
terminated by signal 11: Segmentation fault
2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

I tried tracing through the logic using the line numbers shown in this
stack trace, and soon decided that they didn't seem to match the query
you show above --- but on closer look, that's to be expected, because
this core is evidently from PID 16028:

[New LWP 16028]

If you can identify the exact query that 16028 was running from your logs,
that would be helpful. Or, if you still have this same core file laying
about, "p debug_query_string" in gdb would probably be a more trustworthy
guide than trying to match up log entries.

That would be nice. I have let pgbench -C run for one hour with select
* from pg_stat_activity running every second (\watch 1) in parallel
but I could not reproduce the issue on HEAD.

Yeah, I tried similar experiments without result. Presumably there is
some other triggering condition here, but it's hard to guess what.
I tried things like doing VACUUM FULL on pg_database and pg_authid
to force replan cycles on the view, but no crash.

regards, tom lane
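
Tom's `p debug_query_string` suggestion can also be run non-interactively against the saved core. A sketch, assuming the Debian-style paths from the gdb session earlier in this thread (adjust the binary and core locations to your installation); it skips when gdb or the core file is absent.

```shell
#!/bin/sh
# Print the query a crashed backend was running, straight from its core file,
# using gdb batch mode. The paths below match the apt.postgresql.org layout
# shown earlier in this thread and are assumptions; adjust for your system.
BIN=/usr/lib/postgresql/9.5/bin/postgres
CORE=/var/lib/postgresql/9.5/main/core

if command -v gdb >/dev/null 2>&1 && [ -f "$CORE" ] && [ -x "$BIN" ]; then
    gdb -q -batch \
        -ex 'printf "%s\n", debug_query_string' \
        "$BIN" "$CORE"
else
    echo "skipped: gdb, the postgres binary, or the core file is not available"
fi
```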


#9Michael Bommarito
michael@bommaritollc.com
In reply to: Tom Lane (#8)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Here's the session with debug_query_string:

ubuntu@hostname:~⟫ sudo gdb -q -c /var/lib/postgresql/9.5/main/core
/usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres database 127.0.0.1(42063) BIND
'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
No such file or directory.
(gdb) printf "%s\n", debug_query_string
SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC
(gdb)

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387


#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Bommarito (#9)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Michael Bommarito <michael@bommaritollc.com> writes:

Here's the session with debug_query_string:
(gdb) printf "%s\n", debug_query_string
SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC

Thanks. This still doesn't match the stack trace: in particular, this
stack frame

#3 0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769

indicates that we found a PlaceHolderInfo node in the expression tree that
pullup_replace_vars() was applied to, but so far as I can see no such node
should exist in the query tree generated by this query. The most likely
theory seems to be that something clobbered the query tree while it was
sitting in the plancache, causing this recursive function to follow a
bogus pointer. But that doesn't leave us with a lot to go on.

What can you tell us about the environment this is happening in?
How is the client-side code executing the failing queries? (We know
it's using extended query protocol, but is it preparing a statement
and then executing it repeatedly, or just using a one-shot unnamed
prepared statement?) What nondefault settings are in use on the
server side? Do you have any extensions loaded, such as
pg_stat_statements or auto_explain?

regards, tom lane


#11Michael Bommarito
michael@bommaritollc.com
In reply to: Tom Lane (#10)
Re: BUG #13490: Segmentation fault on pg_stat_activity

This particular instance is from pghero, a monitoring tool, but it can be
reproduced simply by querying pg_stat_activity in psql as well. From a quick
skim of their GitHub repo, pghero uses prepared statements via Ruby.

We have pg_stat_statements enabled, and can reproduce without the pghero
setup as well. No other extensions are loaded.

#12Michael Paquier
michael@paquier.xyz
In reply to: Michael Bommarito (#11)
Re: BUG #13490: Segmentation fault on pg_stat_activity


FWIW, I have been fooling around with the query reported in the backtrace
upthread, playing a bit with the extended query protocol to send BIND
messages with PQdescribePrepared and PQsendDescribePrepared, as well as
with psql. While I am able to reproduce stack traces close to what you had,
I am not seeing any crashes. I have also played a bit with pghero with
pgbench running in parallel and there were no problems, with and without
pg_stat_statements loaded.

In the backtrace you sent previously
(/messages/by-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com),
what is the value of MyProcPid? Is it 12803 or 20696? If it is the
former, do you have a backtrace for process 20696? What we may be
looking at now could be a side effect of the real problem, and as long
as we do not have a reproducible test case, I am afraid that finding the
root cause will be difficult.
--
Michael
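
Since Tom's theory upthread points at the query tree being clobbered while sitting in the plancache, the same BuildCachedPlan/GetCachedPlan path can also be poked at from plain SQL with a server-side prepared statement, without involving the extended protocol. Below is a sketch combining that with the VACUUM FULL invalidation trick Tom mentioned; the statement name is illustrative, connection defaults are assumed, and the script skips when no server is reachable.

```shell
#!/bin/sh
# Exercise the cached-plan path seen in the backtrace (BuildCachedPlan /
# GetCachedPlan) via SQL-level PREPARE/EXECUTE: the EXECUTE after a
# VACUUM FULL of a catalog the view depends on forces a replan of the
# cached statement. Statement name and connection defaults are assumptions.
if command -v psql >/dev/null 2>&1 && psql -c 'SELECT 1' >/dev/null 2>&1; then
    psql <<'SQL'
PREPARE watch_activity AS
  SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid();
EXECUTE watch_activity;
-- Invalidate cached plans that depend on the catalogs, as in the
-- VACUUM FULL experiment mentioned upthread (requires superuser).
VACUUM FULL pg_database;
EXECUTE watch_activity;
SQL
else
    echo "skipped: no reachable PostgreSQL server"
fi
```

Repeating the EXECUTE in a loop while other sessions churn connections would be one way to hunt for the replan-time crash described in this thread.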


#13Michael Bommarito
michael@bommaritollc.com
In reply to: Michael Paquier (#12)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Here are the locals per frame, all the way back up to BackendStartup, in
case that helps as well. I will try to reproduce this next weekend with a
"baseline" version of postgresql.conf and on another system with a clean
import.

(gdb) bt
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1 0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=0x7fd0d5d9e958,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
#2 0x00007fd0d481c3ea in replace_rte_variables_mutator (node=<optimized
out>, context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1149
#3 0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
#4 0x00007fd0d47812b3 in expression_tree_mutator (node=<optimized out>,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2675
#5 0x00007fd0d481cc64 in replace_rte_variables (node=<optimized out>,
target_varno=<optimized out>, sublevels_up=sublevels_up@entry=0,
callback=callback@entry=0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg=callback_arg@entry=0x7fff521706c0,
outer_hasSubLinks=0x7fd0d5d30d6e "")
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1115
#6 0x00007fd0d47cd1c7 in pullup_replace_vars (context=0x7fff521706c0,
expr=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1982
#7 pull_up_simple_subquery (deletion_ok=<optimized out>,
containing_appendrel=0x0, lowest_nulling_outer_join=0x0,
lowest_outer_join=0x0, rte=0x7fd0d5d30ea8, jtnode=<optimized out>,
root=0x7fd0d5d9ee48) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8 pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9 0x00007fd0d47cc989 in pull_up_subqueries_recurse
(root=root@entry=0x7fd0d5d9ee48,
jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>,
deletion_ok@entry=0 '\000') at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8,
parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0,
subroot=subroot@entry=0x7fff52170908)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48,
cursorOptions=0, boundParams=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>,
cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries
(querytrees=querytrees@entry=0x7fd0d5d30cf8,
cursorOptions=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000')
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950,
dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop () at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221
(gdb) info locals
tle = 0x0
l = 0x7fd0d5da2940
(gdb) select-frame 1
(gdb) info locals
tle = <optimized out>
rcon = 0x7fff521706c0
varattno = 6
newnode = <optimized out>
__func__ = "pullup_replace_vars_callback"
(gdb) select-frame 2
(gdb) info locals

newnode = <optimized out>
var = <optimized out>
__func__ = "replace_rte_variables_mutator"
(gdb) select-frame 3
(gdb) info locals

phinfo = <optimized out>
newnode = <optimized out>
__func__ = "expression_tree_mutator"
(gdb) select-frame 4
(gdb) info locals

resultlist = 0x0
temp = 0x7fd0d5d9e8e8
__func__ = "expression_tree_mutator"
(gdb) select-frame 5
(gdb) info locals

result = <optimized out>
context = {callback = 0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg = 0x7fff521706c0, target_varno = 1, sublevels_up = 0,
inserted_sublink = 0 '\000'}
__func__ = "replace_rte_variables"
(gdb) select-frame 6
(gdb) info locals

No locals.
(gdb) select-frame 7
(gdb) info locals

parse = 0x7fd0d5d30d48
subquery = <optimized out>
rvcontext = {root = 0x7fd0d5d9ee48, targetlist = 0x7fd0d5da27c0, target_rte
= 0x7fd0d5d30ea8, relids = 0x0, outer_hasSubLinks = 0x7fd0d5d30d6e "",
varno = 1, need_phvs = 0 '\000', wrap_non_vars = 0 '\000', rv_cache =
0x7fd0d5da2960}
varno = 1
subroot = 0x7fd0d5da2560
lc = <optimized out>
(gdb) select-frame 8
(gdb) info locals

varno = <optimized out>
rte = 0x7fd0d5d30ea8
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 9
(gdb) info locals

sub_deletion_ok = <optimized out>
f = 0x7fd0d5d9e6c0
have_undeleted_child = 0 '\000'
l = 0x7fd0d5d9e720
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 10
(gdb) info locals

No locals.
(gdb) select-frame 11
(gdb) info locals

root = 0x7fd0d5d9ee48
plan = <optimized out>
newWithCheckOptions = <optimized out>
newHaving = <optimized out>
hasOuterJoins = <optimized out>
l = <optimized out>
(gdb) select-frame 12
(gdb) info locals

result = <optimized out>
glob = 0x7fd0d5d9edb8
tuple_fraction = <optimized out>
root = 0x7fd0d5d9ed18
top_plan = <optimized out>
lp = <optimized out>
lr = <optimized out>
(gdb) select-frame 13
(gdb) info locals

plan = <optimized out>
(gdb) select-frame 14
(gdb) info locals

query = <optimized out>
stmt = <optimized out>
stmt_list = 0x0
query_list = 0x7fd0d5d30d28
(gdb) select-frame 15
(gdb) info locals

plan = <optimized out>
plist = <optimized out>
snapshot_set = 0 '\000'
spi_pushed = <optimized out>
plan_context = <optimized out>
oldcxt = 0x7fd0d5c8ccb0
(gdb) select-frame 16
(gdb) info locals

plan = <optimized out>
qlist = 0x0
customplan = 0 '\000'
__func__ = "GetCachedPlan"
(gdb) select-frame 17
(gdb) info locals

pformats = 0x0
psrc = 0x7fd0d5d7d940
portal = 0x7fd0d5ca6da0
query_string = 0x7fd0d5cd17a0 "SELECT application_name AS source,
client_addr AS ip, COUNT(*) AS total_connections FROM pg_stat_activity
WHERE pid <> pg_backend_pid() GROUP BY application_name, ip ORDER BY
COUNT(*) DESC, applicatio"...
portal_name = 0x7fd0d5d308d0 ""
stmt_name = 0x7fd0d5d308d1 ""
numPFormats = 0
saved_stmt_name = 0x0
rformats = 0x7fd0d5d30ce0
params = 0x0
save_log_statement_stats = 0 '\000'
msec_str = "SELECT
1\000\061\000\324\320\177\000\000\220\r\027R\377\177\000\000ᕔ\324\320\177\000"
numParams = 0
numRFormats = 1
cplan = <optimized out>
snapshot_set = <optimized out>
(gdb) select-frame 18
(gdb) info locals

firstchar = -707588896
input_message = {data = 0x7fd0d5d308d0 "", len = 10, maxlen = 1024, cursor
= 10}
local_sigjmp_buf = {{__jmpbuf = {140734570630176, -6783158886466612332,
140534916634656, 1, 0, 140534916918272, -6783158886368046188,
-6808758974801407084}, __mask_was_saved = 1, __saved_mask = {__val = {0,
140534916918272, 140534862700952, 140534916634800, 206158430256,
140734570630512, 140734570630304,
140534916918272, 52, 140734570630432, 140534895011200, 1024,
140734570630464, 140734570630628, 0, 140734570630400}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
(gdb) select-frame 19
(gdb) info locals

ac = 1
secs = 489934357
usecs = 656467
i = 1
av = 0x7fd0d5c8d950
maxac = <optimized out>

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Mon, Jul 13, 2015 at 2:54 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Jul 13, 2015 at 4:16 AM, Michael Bommarito
<michael@bommaritollc.com> wrote:

This particular instance is from pghero, which is a monitoring tool. It
can be reproduced simply by querying stat_activity in psql as well.

Pghero is using prepared statements via ruby from a quick skim on their
github repo.

We have pg_stat_statements enabled, and can reproduce without pghero setup
as well. No other extensions loaded.

On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Michael Bommarito <michael@bommaritollc.com> writes:

Here's the session with debug_query_string:
(gdb) printf "%s\n", debug_query_string
SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
GROUP BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC

Thanks. This still doesn't match the stack trace: in particular, this
stack frame

#3 0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at

/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769

indicates that we found a PlaceHolderInfo node in the expression tree that
pullup_replace_vars() was applied to, but so far as I can see no such node
should exist in the query tree generated by this query. The most likely
theory seems to be that something clobbered the query tree while it was
sitting in the plancache, causing this recursive function to follow a
bogus pointer. But that doesn't leave us with a lot to go on.

What can you tell us about the environment this is happening in?
How is the client-side code executing the failing queries? (We know
it's using extended query protocol, but is it preparing a statement
and then executing it repeatedly, or just using a one-shot unnamed
prepared statement?) What nondefault settings are in use on the
server side? Do you have any extensions loaded, such as
pg_stat_statements or auto_explain?

FWIW, I have been fooling around with the query reported in the back
trace upthread by playing a bit with the extended query protocol to
send BIND messages with PQdescribePrepared and PQsendDescribePrepared,
as well as with psql and while I am able to reproduce stack traces
close to what you had I am not seeing any crashes. I have as well
played a bit with pghero with pgbench running in parallel and there
were no problems, with and without pg_stat_statements loaded.

In the backtrace you send previously
(
/messages/by-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com
),
what is the value of MyProcPid? Is it 12803 or 20696? If it is the
former, do you have a backtrace for process 20696? What we may be
looking at now is actually a side effect of the real problem, and as
long as we do not have a real test case, I am afraid that finding the
root problem is rather difficult.
--
Michael

#14Michael Paquier
michael@paquier.xyz
In reply to: Michael Bommarito (#13)
Re: BUG #13490: Segmentation fault on pg_stat_activity

On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito
<michael@bommaritollc.com> wrote:

(gdb) bt
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
[...]
(gdb) info locals
tle = 0x0
l = 0x7fd0d5da2940

So basically this is crashing because this tle is NULL, and
get_tle_by_resno does not like that much when this item is extracted
from the list. I am wondering what is breaking that.
--
Michael


#15Michael Bommarito
michael@bommaritollc.com
In reply to: Michael Paquier (#14)
Re: BUG #13490: Segmentation fault on pg_stat_activity

If you can provide a patch that performs input validation in
get_tle_by_resno and logs the condition, I can compile and test with it.
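[For concreteness, a minimal standalone sketch of the kind of defensive check being requested here. This is a hypothetical illustration only: `List`, `ListCell`, and `TargetEntry` are stubbed out so it compiles in isolation, and it is not the actual PostgreSQL implementation of `get_tle_by_resno`; a real backend patch would report via `elog(ERROR, ...)` rather than `fprintf`.]

```c
/*
 * Sketch of a NULL-checking variant of get_tle_by_resno.
 * Types below are simplified stand-ins for the PostgreSQL originals.
 */
#include <stdio.h>
#include <assert.h>

typedef struct TargetEntry { int resno; } TargetEntry;

typedef struct ListCell
{
    void           *ptr_value;
    struct ListCell *next;
} ListCell;

typedef struct List { ListCell *head; } List;

/*
 * Scan tlist for an entry with the given resno, but report (instead of
 * dereferencing) a NULL cell such as one left behind by a corrupted
 * query tree.
 */
TargetEntry *
get_tle_by_resno_checked(List *tlist, int resno)
{
    ListCell *l;

    for (l = tlist ? tlist->head : NULL; l != NULL; l = l->next)
    {
        TargetEntry *tle = (TargetEntry *) l->ptr_value;

        if (tle == NULL)    /* would be elog(ERROR, ...) in the backend */
        {
            fprintf(stderr,
                    "NULL targetlist entry while looking for resno %d\n",
                    resno);
            return NULL;
        }
        if (tle->resno == resno)
            return tle;
    }
    return NULL;
}
```

In the crashing frame the cell's pointer was NULL, so the unchecked `tle->resno` dereference faulted; a check like this would turn the segfault into a logged error, though, as Tom Lane notes in the follow-up, it would only report the corruption rather than explain it.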

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 14, 2015 at 3:33 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito
<michael@bommaritollc.com> wrote:

(gdb) bt
#0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at

/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832

[...]
(gdb) info locals
tle = 0x0
l = 0x7fd0d5da2940

So basically this is crashing because this tle is NULL, and
get_tle_by_resno does not like that much when this item is extracted
from the list. I am wondering what is breaking that.
--
Michael

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Bommarito (#15)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Michael Bommarito <michael@bommaritollc.com> writes:

If you can provide a patch that performs input validation in
get_tle_by_resno and logs the condition, I can compile and test with it.

Wouldn't prove anything one way or another. Somehow, a corrupt query tree
is being fed to the planner; what we need to understand is what conditions
cause that to happen. I doubt that getting more details at the point
where the code trips over the corruption will teach us that.

regards, tom lane


#17Michael Bommarito
michael@bommaritollc.com
In reply to: Tom Lane (#16)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
-fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf.
Was able to generate segfaults repeatedly loading pghero dashboard. First,
some errors from the pg logs.
========================================================================
========================================================
2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute
number 2 exceeds number of columns 0

2015-07-18 15:07:58 UTC [27112-2] postgres@database STATEMENT: SELECT
application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC

2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid
varattno 66

2015-07-18 15:08:23 UTC [27112-4] postgres@database STATEMENT: SELECT
relname AS table, indexrelname AS index,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size, idx_scan as
index_scans FROM pg_stat_user_indexes ui INNER JOIN pg_index i ON
ui.indexrelid = i.indexrelid WHERE NOT indisunique AND idx_scan < 50 ORDER
BY pg_relation_size(i.indexrelid) DESC, relname ASC

2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc
reference 0x2bdd8a8 is not owned by resource owner Portal
========================================================================
========================================================

Next, attached to the backend PID under sudo gdb and `cont`:
========================================================================
========================================================
2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc
reference 0xf77248 is not owned by resource owner Portal
2015-07-18 15:48:54 UTC [8812-4] LOG: server process (PID 10538) was
terminated by signal 11: Segmentation fault
2015-07-18 15:48:54 UTC [8812-5] LOG: terminating any other active server
processes
2015-07-18 15:48:54 UTC [10523-1] postgres@database WARNING: terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10523-2] postgres@database DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10523-3] postgres@database HINT: In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10239-1] postgres@database WARNING: terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10239-2] postgres@database DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10239-3] postgres@database HINT: In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10522-1] postgres@database WARNING: terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10522-2] postgres@database DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10522-3] postgres@database HINT: In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10409-1] postgres@database WARNING: terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10409-2] postgres@database DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10409-3] postgres@database HINT: In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10408-1] postgres@database WARNING: terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10408-2] postgres@database DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10408-3] postgres@database HINT: In a moment you
should be able to reconnect to the database and repeat your command.

Program received signal SIGQUIT, Quit.
0x00007f84fe78c110 in __poll_nocancel () at
../sysdeps/unix/syscall-template.S:81
81 in ../sysdeps/unix/syscall-template.S
(gdb) bt full
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7ffc65188bd8:
#0 0x00007f84fe78c110 in __poll_nocancel () at
../sysdeps/unix/syscall-template.S:81
No locals.
Cannot access memory at address 0x7ffc65188bd8
========================================================================
========================================================

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 14, 2015 at 8:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Bommarito <michael@bommaritollc.com> writes:

If you can provide a patch that performs input validation in
get_tle_by_resno and logs the condition, I can compile and test with it.

Wouldn't prove anything one way or another. Somehow, a corrupt query tree
is being fed to the planner; what we need to understand is what conditions
cause that to happen. I doubt that getting more details at the point
where the code trips over the corruption will teach us that.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Bommarito (#17)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Michael Bommarito <michael@bommaritollc.com> writes:

Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
-fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf.
Was able to generate segfaults repeatedly loading pghero dashboard. First,
some errors from the pg logs.
2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute
number 2 exceeds number of columns 0
2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid
varattno 66
2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc
reference 0x2bdd8a8 is not owned by resource owner Portal
2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc
reference 0xf77248 is not owned by resource owner Portal

I'm starting to wonder about flaky hardware and/or compiler bugs. 9.5
seems completely unreliable for you, but we're not seeing any similar
reports from anyone else. Is 9.4 reliable on the same machine + same
build environment?

Program received signal SIGQUIT, Quit.

Catching a SIGQUIT isn't very helpful; that just means you're one of
the server processes that got force-terminated after some other one
crashed.

regards, tom lane


#19Michael Bommarito
michael@bommaritollc.com
In reply to: Tom Lane (#18)
Re: BUG #13490: Segmentation fault on pg_stat_activity

After exporting and re-importing (pg9.5 -> 9.5), this has gone away. We
changed quite a few variables in the import (non-parallel, letting full
index builds proceed, etc.), so we're not 100% sure what the issue was; as
you suggested, it doesn't look like pg9.5 itself was the problem (though
we're not sure about pg_dump/restore).

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Sat, Jul 18, 2015 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Bommarito <michael@bommaritollc.com> writes:

Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
-fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf.
Was able to generate segfaults repeatedly loading pghero dashboard. First,
some errors from the pg logs.
2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute
number 2 exceeds number of columns 0
2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid
varattno 66
2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc
reference 0x2bdd8a8 is not owned by resource owner Portal
2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc
reference 0xf77248 is not owned by resource owner Portal

I'm starting to wonder about flaky hardware and/or compiler bugs. 9.5
seems completely unreliable for you, but we're not seeing any similar
reports from anyone else. Is 9.4 reliable on the same machine + same
build environment?

Program received signal SIGQUIT, Quit.

Catching a SIGQUIT isn't very helpful; that just means you're one of
the server processes that got force-terminated after some other one
crashed.

regards, tom lane

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Bommarito (#19)
Re: BUG #13490: Segmentation fault on pg_stat_activity

Michael Bommarito wrote:

After exporting and re-importing (pg9.5 -> 9.5), this has gone away. We
changed quite a few variables in the import (non-parallel, letting full
index builds proceed, etc.), so not 100% sure what the issue was; as you
suggested, doesn't that pg9.5 itself was the problem (though not sure on
pg_dump/restore).

Maybe you had an inconsistent build.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
