postmaster core ( finally I have it )
Hi all,
since long time ( in the mean time I did Postgres upgrade four time and
now I'm using 7.3.3 ) I'm having, at least once in a week, a signal 11 on
a backend, and how you can immagine with the subseguent drop of all
connections, finally now I have the core.
The process killed made always the same select ( with different
id_package ):
SELECT id_publisher, publisher_name, id_package, package_name
FROM v_psl_package_info
WHERE id_package = 177;
Here it is:
bash-2.05$ file core
core: ELF 32-bit LSB core file of 'postmaster' (signal 11), Intel 80386,
version 1, from 'postmaster'
bash-2.05$ gdb /usr/bin/postmaster core
GNU gdb Red Hat Linux (5.2-2)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...(no debugging symbols
found)...
warning: core file may not match specified executable file.
Core was generated by `postgres: push empdb 192.168'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /lib/libssl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libssl.so.2
Reading symbols from /lib/libcrypto.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libcrypto.so.2
Reading symbols from /usr/kerberos/lib/libkrb5.so.3...(no debugging symbols
found)...done.
Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libreadline.so.4...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libreadline.so.4
Reading symbols from /lib/libtermcap.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libtermcap.so.2
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /usr/kerberos/lib/libcom_err.so.3...done.
Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0 0x08171fdd in RelationBuildRuleLock ()
(gdb) where
#0 0x08171fdd in RelationBuildRuleLock ()
#1 0x081722ab in RelationBuildDesc ()
#2 0x0817311c in RelationIdGetRelation ()
#3 0x08077b95 in relation_open ()
#4 0x08077d24 in heap_openrv ()
#5 0x080b1015 in addRangeTableEntry ()
#6 0x080ab6de in transformTableEntry ()
#7 0x080ab942 in transformFromClauseItem ()
#8 0x080ab321 in transformFromClause ()
#9 0x0809f639 in transformSelectStmt ()
#10 0x0809dcc4 in transformStmt ()
#11 0x0809da06 in parse_analyze ()
#12 0x08128a35 in pg_analyze_and_rewrite ()
#13 0x08128ce2 in pg_exec_query_string ()
#14 0x08129fc3 in PostgresMain ()
#15 0x08110344 in DoBackend ()
#16 0x0810fc2d in BackendStartup ()
#17 0x0810ed21 in ServerLoop ()
#18 0x0810e6f9 in PostmasterMain ()
#19 0x080e9cdd in main ()
#20 0x40249336 in __libc_start_main (main=0x80e9ab0 <main>, argc=3,
ubp_av=0xbffffb04, init=0x806b2ec <_init>,
fini=0x8188830 <_fini>, rtld_fini=0x4000d2fc <_dl_fini>,
stack_end=0xbffffafc) at ../sysdeps/generic/libc-start.c:129
(gdb) quit
and here all the activity performed by the process:
Jul 22 20:00:02 sdr2 postgres[25079]: [2896523] LOG: connection received:
host= xxx.xxx.xxx.xxx port=53809
Jul 22 20:00:02 sdr2 postgres[25079]: [2896524] LOG: connection authorized:
user=push database=empdb
Jul 22 20:00:02 sdr2 postgres[25079]: [2896525] DEBUG: /usr/bin/postmaster
child[25079]: starting with (
Jul 22 20:00:02 sdr2 postgres[25079]: [2896526] DEBUG: ^Ipostgres
Jul 22 20:00:02 sdr2 postgres[25079]: [2896527] DEBUG: ^I-v131072
Jul 22 20:00:02 sdr2 postgres[25079]: [2896528] DEBUG: ^I-p
Jul 22 20:00:02 sdr2 postgres[25079]: [2896529] DEBUG: ^Iempdb
Jul 22 20:00:02 sdr2 postgres[25079]: [2896530] DEBUG: )
Jul 22 20:00:02 sdr2 postgres[25079]: [2896531] DEBUG: InitPostgres
Jul 22 20:00:02 sdr2 postgres[25079]: [2896532] DEBUG:
StartTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896533-1] LOG: query: set datestyle
to 'ISO'; select version(), case when pg_encoding_to_char(1) = 'SQL_ASCII'
then 'UNKNOWN' else
Jul 22 20:00:02 sdr2 postgres[25079]: [2896533-2] getdatabaseencoding()
end;
Jul 22 20:00:02 sdr2 postgres[25079]: [2896534] DEBUG: ProcessUtility
Jul 22 20:00:02 sdr2 postgres[25079]: [2896535] DEBUG:
CommitTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896536] DEBUG:
StartTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896537] DEBUG: ProcessQuery
Jul 22 20:00:02 sdr2 postgres[25079]: [2896538] DEBUG:
CommitTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896539] LOG: duration: 0.035168 sec
Jul 22 20:00:02 sdr2 postgres[25079]: [2896540] DEBUG:
StartTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896541] LOG: query: set
client_encoding = 'UNICODE'; show autocommit
Jul 22 20:00:02 sdr2 postgres[25079]: [2896542] DEBUG: ProcessUtility
Jul 22 20:00:02 sdr2 postgres[25079]: [2896543] DEBUG:
CommitTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896544] DEBUG:
StartTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896545] DEBUG: ProcessUtility
Jul 22 20:00:02 sdr2 postgres[25079]: [2896546] DEBUG:
CommitTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896547] LOG: duration: 0.003609 sec
Jul 22 20:00:02 sdr2 postgres[25079]: [2896548] DEBUG:
StartTransactionCommand
Jul 22 20:00:02 sdr2 postgres[25079]: [2896549] LOG: query: SELECT
id_publisher, publisher_name, id_package, package_name FROM
v_psl_package_info WHERE id_package = 177
Do you need others informations or may I do something in order
to permit you to isolate the problem ?
Regards
Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes:
The process killed made always the same select ( with different
id_package ):
SELECT id_publisher, publisher_name, id_package, package_name
FROM v_psl_package_info
WHERE id_package = 177;
(gdb) where
#0 0x08171fdd in RelationBuildRuleLock ()
#1 0x081722ab in RelationBuildDesc ()
#2 0x0817311c in RelationIdGetRelation ()
#3 0x08077b95 in relation_open ()
#4 0x08077d24 in heap_openrv ()
#5 0x080b1015 in addRangeTableEntry ()
#6 0x080ab6de in transformTableEntry ()
#7 0x080ab942 in transformFromClauseItem ()
#8 0x080ab321 in transformFromClause ()
#9 0x0809f639 in transformSelectStmt ()
#10 0x0809dcc4 in transformStmt ()
#11 0x0809da06 in parse_analyze ()
Hmm. RelationBuildRuleLock is concerned with reading the rule
information for a table; specifically, it's evidently trying to
read the rules for v_psl_package_info. I suspect some form of
data corruption in the pg_rewrite row(s) for this table. Do you
see any misbehavior when you do
select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass
In particular, are any of the columns NULL in these rows?
regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>
"Mendola Gaetano" <mendola@bigfoot.com> writes:
The process killed made always the same select ( with different
id_package ):SELECT id_publisher, publisher_name, id_package, package_name
FROM v_psl_package_info
WHERE id_package = 177;(gdb) where
#0 0x08171fdd in RelationBuildRuleLock ()
#1 0x081722ab in RelationBuildDesc ()
#2 0x0817311c in RelationIdGetRelation ()
#3 0x08077b95 in relation_open ()
#4 0x08077d24 in heap_openrv ()
#5 0x080b1015 in addRangeTableEntry ()
#6 0x080ab6de in transformTableEntry ()
#7 0x080ab942 in transformFromClauseItem ()
#8 0x080ab321 in transformFromClause ()
#9 0x0809f639 in transformSelectStmt ()
#10 0x0809dcc4 in transformStmt ()
#11 0x0809da06 in parse_analyze ()Hmm. RelationBuildRuleLock is concerned with reading the rule
information for a table; specifically, it's evidently trying to
read the rules for v_psl_package_info. I suspect some form of
data corruption in the pg_rewrite row(s) for this table. Do you
see any misbehavior when you doselect * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass
All seems good.
In particular, are any of the columns NULL in these rows?
I did:
select *
from pg_rewrite
where ev_class = 'v_psl_package_info'::regclass and
( rulename is null or
ev_class is null or
ev_attr is null or
ev_type is null or
is_instead is null or
ev_qual is null or
ev_action is null );
and I had: No rows.
v_psl_package_info is a view, shall we look in another direction ?
thank you
Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
I suspect some form of
data corruption in the pg_rewrite row(s) for this table. Do you
see any misbehavior when you doselect * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass
All seems good.
I really don't see any other explanation for a crash in that routine
than problems with the pg_rewrite data.
Do you get this any time you try to do something with the
v_psl_package_info view, or is it a once-in-a-while problem?
If the latter, I wonder whether it could be an issue with a
flaky disk sector ...
regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>
"Mendola Gaetano" <mendola@bigfoot.com> writes:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
I suspect some form of
data corruption in the pg_rewrite row(s) for this table. Do you
see any misbehavior when you doselect * from pg_rewrite where ev_class =
'v_psl_package_info'::regclass
All seems good.
I really don't see any other explanation for a crash in that routine
than problems with the pg_rewrite data.Do you get this any time you try to do something with the
v_psl_package_info view, or is it a once-in-a-while problem?
If the latter, I wonder whether it could be an issue with a
flaky disk sector ...
Is once-in-a-while but always at 00 minutes. This select is performed each
20 minutes and
the core happen always at 00 never at 20 and never at 40!
I have scheduled also a vacuumdb -z each 15 minutes so only at
00 these are running together.
I had another core at 21:00 today and this is the log of the vacuum
( there was that error just during the pg_rewrite vacuum!):
INFO: --Relation pg_catalog.pg_description--
INFO: Pages 12: Changed 0, Empty 0; Tup 1390: Vac 0, Keep 0, UnUsed 1.
Total CPU 0.00s/0.00u sec elapsed 0.20 sec.
INFO: --Relation pg_toast.pg_toast_16416--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: Analyzing pg_catalog.pg_description
INFO: --Relation pg_catalog.pg_group--
INFO: Pages 1: Changed 0, Empty 0; Tup 3: Vac 0, Keep 0, UnUsed 9.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: --Relation pg_toast.pg_toast_1261--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: Analyzing pg_catalog.pg_group
INFO: --Relation pg_catalog.pg_proc--
INFO: Pages 111: Changed 0, Empty 0; Tup 1822: Vac 0, Keep 0, UnUsed 272.
Total CPU 0.02s/0.00u sec elapsed 0.29 sec.
INFO: --Relation pg_toast.pg_toast_1255--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: Analyzing pg_catalog.pg_proc
INFO: --Relation pg_catalog.pg_rewrite--
INFO: Pages 47: Changed 0, Empty 0; Tup 231: Vac 0, Keep 0, UnUsed 74.
Total CPU 0.00s/0.00u sec elapsed 0.07 sec.
INFO: --Relation pg_toast.pg_toast_16410--
INFO: Pages 22: Changed 0, Empty 0; Tup 77: Vac 0, Keep 0, UnUsed 33.
Total CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO: Analyzing pg_catalog.pg_rewrite
WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
vacuumdb: vacuum empdb failed
I think we are near the solution now.
Thank you
Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes:
Is once-in-a-while but always at 00 minutes. This select is performed each
20 minutes and
the core happen always at 00 never at 20 and never at 40!
Now that is very interesting ... why would that be?
Could we see the definition of this view?
I had another core at 21:00 today and this is the log of the vacuum
( there was that error just during the pg_rewrite vacuum!):
I think that's just coincidence.
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
"Mendola Gaetano" <mendola@bigfoot.com> writes:
Is once-in-a-while but always at 00 minutes. This select is performed
each
20 minutes and
the core happen always at 00 never at 20 and never at 40!Now that is very interesting ... why would that be?
Could we see the definition of this view?
Of course here it is:
CREATE OR REPLACE VIEW v_psl_package_info AS
SELECT pb.id_publisher AS id_publisher,
pb.name AS publisher_name,
pk.id_package AS id_package,
pk.name AS package_name
FROM v_packages pk JOIN v_publishers pb USING (id_publisher);
CREATE OR REPLACE VIEW v_packages AS
SELECT p.id_package AS id_package,
p.id_publisher AS id_publisher,
p.name AS name,
p.information AS information,
p.description AS description,
sp_lookup_key('package_type', p.id_package_type)
AS TYPE,
sp_lookup_key('target', p.id_target)
AS target,
p.port AS port,
p.priority AS priority,
sp_lookup_key('fec', p.id_fec)
AS fec,
p.input_group AS input_group,
p.output_group AS output_group,
CASE WHEN p.updatable
THEN 1
ELSE 0
END AS updatable,
p.checksum AS checksum,
p.version AS version,
p.start_file AS start_file,
p.view_target_group AS view_target_group,
p.target_group AS target_group,
CASE WHEN p.auto_listen
THEN 1
ELSE 0
END AS auto_listen,
CASE WHEN p.public_flag
THEN 1
ELSE 0
END AS public_flag,
p.needed_version AS needed_version,
p.logic_version AS logic_version,
p.package_size AS package_size,
ps.id_drm_process AS id_drm_process,
ps.id_cas_service AS id_cas_service,
ps.id_cas_settings AS id_cas_settings,
ps.id_drm_service AS id_drm_service
FROM packages p LEFT OUTER JOIN package_security ps USING (id_package)
ORDER BY p.id_publisher, p.name;
CREATE OR REPLACE VIEW v_publishers AS
SELECT p.id_publisher AS id_publisher,
p.login AS login,
p.password AS password,
p.ftp_password AS ftp_password,
p.name AS name,
p.address AS address,
p.city AS city,
sp_lookup_descr('country', p.id_country)
AS country,
p.zip AS zip,
p.phone_number AS phone_number,
p.fax_number AS fax_number,
p.email AS email,
p.web_site AS web_site,
p.description AS description,
v.id_pid AS id_pid,
v.id_transponder AS id_transponder,
v.transponder AS transponder,
v.pid AS pid,
v.satellite AS satellite,
v.frequency AS downlink_freq,
v.polarization AS polarization,
v.fec AS fec,
v.symbol_rate AS symbol_rate,
v.description AS txp_description,
a.multicast_ip AS multicast_ip
FROM publishers p JOIN v_pids v USING (id_pid)
JOIN addresses a USING (id_publisher)
WHERE id_publisher<>0
ORDER BY p.name;
CREATE OR REPLACE VIEW v_pids AS
SELECT p.id_pid AS id_pid,
t.id_transponder AS id_transponder,
t.name AS transponder,
p.pid AS pid,
sp_lookup_descr('orbital_ptn', t.orbital_pos)
AS satellite,
t.frequency AS frequency,
t.polarization AS polarization,
t.fec AS fec,
t.symbol_rate AS symbol_rate,
t.description AS description
FROM pids p JOIN transponders t USING (id_transponder);
and the code for the function sp_lookup_key:
CREATE OR REPLACE FUNCTION sp_lookup_key ( TEXT,INTEGER )
RETURNS TEXT AS'
DECLARE
lookup_name ALIAS FOR $1;
my_id_key ALIAS FOR $2;
my_id_lookup INTEGER;
my_key TEXT;
BEGIN
SELECT INTO my_id_lookup id_lookup
FROM v_lookup_tables
WHERE name = lookup_name::varchar;
IF NOT FOUND THEN
RETURN NULL;
END IF;
SELECT INTO my_key key
FROM lookup_tables
WHERE id_table = my_id_lookup
AND id_key = my_id_key;
IF NOT FOUND THEN
RETURN NULL;
END IF;
RETURN my_key;
END;
' LANGUAGE 'plpgsql'
WITH ( iscachable );
I had another core at 21:00 today and this is the log of the vacuum
( there was that error just during the pg_rewrite vacuum!):I think that's just coincidence.
regards, tom lane
May be was coincidence, I'm going to avoid the vacuum and that select
concurreny and see what happen in these next days.
Just to add some others information,
this is what happens 22 Jul at 20:00
( this time was during the pg_attribute vacuum)
INFO: --Relation pg_catalog.pg_description--
INFO: Pages 12: Changed 0, Empty 0; Tup 1390: Vac 0, Keep 0, UnUsed 1.
Total CPU 0.01s/0.00u sec elapsed 0.03 sec.
INFO: --Relation pg_toast.pg_toast_16416--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: Analyzing pg_catalog.pg_description
INFO: --Relation pg_catalog.pg_group--
INFO: Pages 1: Changed 0, Empty 0; Tup 3: Vac 0, Keep 0, UnUsed 9.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: --Relation pg_toast.pg_toast_1261--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: Analyzing pg_catalog.pg_group
INFO: --Relation pg_catalog.pg_proc--
INFO: Pages 110: Changed 0, Empty 0; Tup 1821: Vac 0, Keep 0, UnUsed 269.
Total CPU 0.01s/0.00u sec elapsed 0.22 sec.
INFO: --Relation pg_toast.pg_toast_1255--
INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: Analyzing pg_catalog.pg_proc
INFO: --Relation pg_catalog.pg_rewrite--
INFO: Pages 47: Changed 0, Empty 0; Tup 231: Vac 0, Keep 0, UnUsed 73.
Total CPU 0.01s/0.00u sec elapsed 0.18 sec.
INFO: --Relation pg_toast.pg_toast_16410--
INFO: Pages 22: Changed 0, Empty 0; Tup 77: Vac 0, Keep 0, UnUsed 33.
Total CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO: Analyzing pg_catalog.pg_rewrite
INFO: --Relation pg_catalog.pg_type--
INFO: Pages 10: Changed 0, Empty 0; Tup 523: Vac 0, Keep 0, UnUsed 27.
Total CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO: Analyzing pg_catalog.pg_type
INFO: --Relation pg_catalog.pg_attribute--
WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
vacuumdb: vacuum empdb failed
Thank you
Gaetano