backend died

Started by Brusser, Michael, about 21 years ago (6 messages)
#1Brusser, Michael
Michael.Brusser@matrixone.com

Our customer running Postgres v. 7.3.2 reported a problem occurring a
couple of times a week on three different servers, all on Solaris 9.
We enabled debugging in postgresql.conf, and now it has happened again;
here's the excerpt from the database log:

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was terminated by signal 10
2004-11-13 10:01:06 [10456] LOG: server process (pid 19285) was terminated by signal 10
2004-11-13 10:01:06 [10456] LOG: terminating any other active server processes
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10876
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10482
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10481
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10478
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10472
2004-11-13 10:01:06 [10876] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2004-11-13 10:01:06 [10478] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2004-11-13 10:01:06 [10482] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
... ...

There are no other references to process 19285 in the log file.
If it helps, the servers are configured to use UDS (Unix domain sockets).
The socket files are placed in different directories (each db's PGDATA).

Would it be helpful to change the debug level from DEBUG1 to a higher value?
What else should I look at?

Thank you,
Mike

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brusser, Michael (#1)
Re: backend died

"Brusser, Michael" <Michael.Brusser@matrixone.com> writes:

Our customer running Postgres v. 7.3.2 reported a problem, occurring
couple times a week on three different servers, all on Solaris 9.

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was
terminated by signal 10

SIGBUS iirc.
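
For reference, a minimal sketch of the number-to-name lookup behind that answer. The table is hardcoded from the classic System V numbering that Solaris uses; it is an illustration, not something read from the system, and other platforms number signals differently.

```python
# Classic System V / Solaris numbering for the first 15 signals.
# Assumption: hardcoded here for illustration. Other platforms differ
# (on Linux x86, for example, 10 is SIGUSR1 and SIGBUS is 7).
SOLARIS_SIGNALS = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 4: "SIGILL", 5: "SIGTRAP",
    6: "SIGABRT", 7: "SIGEMT", 8: "SIGFPE", 9: "SIGKILL", 10: "SIGBUS",
    11: "SIGSEGV", 12: "SIGSYS", 13: "SIGPIPE", 14: "SIGALRM", 15: "SIGTERM",
}


def solaris_signal_name(number):
    """Return the Solaris name for a signal number, or a placeholder."""
    return SOLARIS_SIGNALS.get(number, "unknown signal %d" % number)


print(solaris_signal_name(10))
```

The table is kept static on purpose: asking the local OS for the name of signal 10 would give a different answer on a non-Solaris workstation than on the server that produced the log.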

What else should I look at?

Find out what query is causing the crash --- enable query logging if you
have no other way. And get a debugger stack trace from the core file
that the crashed backend left behind.

regards, tom lane

#3Brusser, Michael
Michael.Brusser@matrixone.com
In reply to: Tom Lane (#2)
Re: backend died

"Brusser, Michael" <Michael.Brusser@matrixone.com> writes:

Our customer running Postgres v. 7.3.2 reported a problem, occurring
couple times a week on three different servers, all on Solaris 9.

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was
terminated by signal 10

SIGBUS iirc.

What else should I look at?

Find out what query is causing the crash --- enable query logging if you
have no other way. And get a debugger stack trace from the core file
that the crashed backend left behind.

regards, tom lane

==================================================
The log-statements option was already enabled,
here's what I see prior to the crash:

2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: SELECT SUM(C0),SUM(C1),SUM(C2),SUM(C3),SUM(C4),SUM(C5),SUM(C6),SUM(C7),SUM(C8),SUM(C9),SUM(C10) FROM cache_refreshes
2004-11-13 09:49:46 [10876] LOG: query: SELECT SUM(C0),SUM(C1),SUM(C2),SUM(C3),SUM(C4),SUM(C5),SUM(C6),SUM(C7),SUM(C8),SUM(C9),SUM(C10) FROM cache_refreshes
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: SELECT SUM(C0),SUM(C1),SUM(C2),SUM(C3),SUM(C4),SUM(C5),SUM(C6),SUM(C7),SUM(C8),SUM(C9),SUM(C10) FROM cache_refreshes
2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:49:46 [10876] LOG: query: commit
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: begin
2004-11-13 09:49:46 [10876] LOG: query: begin
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: begin
2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:49:46 [10876] LOG: query: commit
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: begin
2004-11-13 09:49:46 [10876] LOG: query: begin
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: begin
2004-11-13 09:49:46 [10876] DEBUG: StartTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:49:46 [10876] LOG: query: commit
2004-11-13 09:49:46 [10876] DEBUG: CommitTransactionCommand
2004-11-13 09:49:46 [10876] LOG: statement: commit
2004-11-13 09:54:21 [10456] DEBUG: child process (pid 19282) exited with exit code 0
2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was terminated by signal 10
2004-11-13 10:01:06 [10456] LOG: server process (pid 19285) was terminated by signal 10
2004-11-13 10:01:06 [10456] LOG: terminating any other active server processes
2004-11-13 10:01:06 [10456] DEBUG: CleanupProc: sending SIGQUIT to process 10876
... ... ...

The same app is running for other customers, seemingly without problems.
I will ask for the core file.
This is a brand new machine. Is it likely that a bad memory chip may cause
this?
Thank you,
Mike

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brusser, Michael (#3)
Re: backend died

"Brusser, Michael" <Michael.Brusser@matrixone.com> writes:

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was
terminated by signal 10

The log-statements option was already enabled,
here's what I see prior to the crash:

That's no help. What were the last few lines from process 19285?
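
One way to pull those lines out is sketched below. It assumes the entry format seen in the excerpts in this thread (timestamp, then the writing process's pid in brackets), and the function name `lines_from_pid` is made up for the example. Wrapped lines without a timestamp are attributed to the entry above them. Note that the bracketed pid is the process that wrote the line, so the postmaster's report about 19285 is filed under 10456, not 19285.

```python
import re

# Assumed entry format, taken from the excerpts in this thread:
#   2004-11-13 10:01:06 [10456] DEBUG: message text
ENTRY_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(\d+)\] ")


def lines_from_pid(log_lines, pid):
    """Return the lines written by process `pid`, including wrapped
    continuation lines that follow one of its entries."""
    selected = []
    current_pid = None  # pid of the entry we are currently inside
    for line in log_lines:
        match = ENTRY_RE.match(line)
        if match:
            current_pid = int(match.group(1))
        if current_pid == pid:
            selected.append(line.rstrip("\n"))
    return selected


sample = [
    "2004-11-13 10:01:06 [10456] LOG: server process (pid 19285) was terminated",
    "by signal 10",
    "2004-11-13 10:01:06 [10876] WARNING: Message from PostgreSQL backend:",
    "The Postmaster has informed me that some other backend",
]
print(lines_from_pid(sample, 10876))
print(lines_from_pid(sample, 19285))
```

On a real log, the same function can be fed an open file object, and slicing the result with `[-5:]` gives the last few lines from that backend.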

This is a brand new machine. Is it likely that a bad memory chip may cause
this?

Possibly, but it would not do to point fingers at the hardware when
you're running an obsolete version of PG ;-). At least get it updated
to 7.3.8.

regards, tom lane

#5Brusser, Michael
Michael.Brusser@matrixone.com
In reply to: Tom Lane (#4)
Re: backend died

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, November 15, 2004 3:34 PM
To: Brusser, Michael
Cc: Pgsql-Hackers (E-mail)
Subject: Re: [HACKERS] backend died

"Brusser, Michael" <Michael.Brusser@matrixone.com> writes:

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was
terminated by signal 10

The log-statements option was already enabled,
here's what I see prior to the crash:

That's no help. What were the last few lines from process 19285?

That's the strangest thing: the log file begins with
2004-11-12 11:49:14 [10456] DEBUG: FindExec: found
"/lsi/soft/synchronicity/latest/syncinc/bin.sol2/postgres" using argv[0]

- it continues with all SQL statements until it crashes; but the only
reference to pid 19285 is:

2004-11-13 10:01:06 [10456] DEBUG: child process (pid 19285) was terminated by signal 10
2004-11-13 10:01:06 [10456] LOG: server process (pid 19285) was terminated by signal 10

there are no prior occurrences of token 19285 in the file.

... you're running an obsolete version of PG ;-).
At least get it updated to 7.3.8.

I'd love to, but this is not something I can do. Have to live with that,
as well as with the fact that many of our customers are running on NFS
(yes, I know...)

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brusser, Michael (#5)
Re: backend died

"Brusser, Michael" <Michael.Brusser@matrixone.com> writes:

That's no help. What were the last few lines from process 19285?

there are no prior occurrences of token 19285 in the file.

Hmm, so it seems 19285 died during startup. That does make a hardware
problem seem a bit plausible --- the backend start sequence is pretty
well tested ;-).
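
A rough sketch of how that pattern could be checked across a whole log: collect every pid the postmaster reports as terminated, then keep those that never wrote an entry of their own. The log format and the "server process (pid N) was terminated" wording are taken from the excerpts in this thread; the function name `silent_crashes` is hypothetical. A hit is only suggestive, since a backend may also stay silent because it had nothing to log at the configured level.

```python
import re

# Assumed formats, both copied from the log excerpts in this thread.
ENTRY_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(\d+)\] ")
CRASH_RE = re.compile(r"server process \(pid (\d+)\) was terminated")


def silent_crashes(log_lines):
    """Return pids reported as terminated that never logged a line themselves."""
    logged = set()   # pids that wrote at least one entry
    crashed = []     # pids reported as terminated, in order of appearance
    for line in log_lines:
        entry = ENTRY_RE.match(line)
        if entry:
            logged.add(int(entry.group(1)))
        crash = CRASH_RE.search(line)
        if crash:
            pid = int(crash.group(1))
            if pid not in crashed:
                crashed.append(pid)
    return [pid for pid in crashed if pid not in logged]


sample = [
    "2004-11-13 09:49:46 [10876] LOG: statement: commit",
    "2004-11-13 10:01:06 [10456] LOG: server process (pid 19285) was terminated",
    "by signal 10",
]
print(silent_crashes(sample))
```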

regards, tom lane