what could cause this PANIC on enterprice 7.3.4 db?

Started by Andriy Tkachukabout 22 years ago5 messages
#1Andriy Tkachuk
ant@imt.com.ua

Usually db works fine & stable months, but couple days ago
it was PANIC, and i have no ideas why this happen.

Nov 5 20:22:30 monstr postgres[16071]: [1] LOG: connection received: host=[local]
Nov 5 20:22:30 monstr postgres[16071]: [2] LOG: connection authorized: user=ant database=tele
Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nov 5 20:22:42 monstr postgres[16071]: [4-1] LOG: statement: select
[...] skip the log of failed query (it was just usual query)
Nov 5 20:22:42 monstr postgres[1109]: [1] LOG: server process (pid 16071) was terminated by signal 6
Nov 5 20:22:42 monstr postgres[1109]: [2] LOG: terminating any other active server processes
Nov 5 20:22:42 monstr postgres[15913]: [3-1] WARNING: Message from PostgreSQL backend:
Nov 5 20:22:42 monstr postgres[1412]: [143-1] WARNING: Message from PostgreSQL backend:
Nov 5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend
Nov 5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend
Nov 5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory.
Nov 5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory.
Nov 5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am
Nov 5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am
Nov 5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit.
Nov 5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit.
Nov 5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query.
Nov 5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query.
Nov 5 20:22:42 monstr postgres[1109]: [3] LOG: all server processes terminated; reinitializing shared memory and semaphores
Nov 5 20:22:42 monstr postgres[16077]: [4] LOG: database system was interrupted at 2003-11-05 20:17:48 EET
Nov 5 20:22:42 monstr postgres[16077]: [5] LOG: checkpoint record is at 14/E956482C
Nov 5 20:22:42 monstr postgres[16077]: [6] LOG: redo record is at 14/E956482C; undo record is at 0/0; shutdown FALSE
Nov 5 20:22:42 monstr postgres[16077]: [7] LOG: next transaction id: 13580788; next oid: 190057632
Nov 5 20:22:42 monstr postgres[16077]: [8] LOG: database system was not properly shut down; automatic recovery in progress
Nov 5 20:22:42 monstr postgres[16077]: [9] LOG: redo starts at 14/E956486C
Nov 5 20:22:42 monstr postgres[16077]: [10] LOG: ReadRecord: record with zero length at 14/E95ACA14
Nov 5 20:22:42 monstr postgres[16077]: [11] LOG: redo done at 14/E95AC9F0
Nov 5 20:22:45 monstr postgres[16077]: [12] LOG: database system is ready

so db was restarted succesfully itself but all applications that had
static connection (it's tcl apps) where fell

the hard disk is fine (might be, because i've used to check it
at night time from cron like this: cat /dev/hda > /dev/null),
so i have no idea why pg didn't find this file (see error from log).
Does someone?

system is: Linux 2.4.18-5 (RH)

Thank you,
Andriy Tkachuk.

--
http://www.imt.com.ua

#2Andriy Tkachuk
ant@imt.com.ua
In reply to: Andriy Tkachuk (#1)
Re: what could cause this PANIC on enterprise 7.3.4 db?

sorry , guys, for my english

i meant "enterprise db" in subj, not enterprice :)

On Fri, 7 Nov 2003, Andriy Tkachuk wrote:

Usually db works fine & stable months, but couple days ago
it was PANIC, and i have no ideas why this happen.

Nov 5 20:22:30 monstr postgres[16071]: [1] LOG: connection received: host=[local]
Nov 5 20:22:30 monstr postgres[16071]: [2] LOG: connection authorized: user=ant database=tele
Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nov 5 20:22:42 monstr postgres[16071]: [4-1] LOG: statement: select
[...] skip the log of failed query (it was just usual query)
Nov 5 20:22:42 monstr postgres[1109]: [1] LOG: server process (pid 16071) was terminated by signal 6
Nov 5 20:22:42 monstr postgres[1109]: [2] LOG: terminating any other active server processes
Nov 5 20:22:42 monstr postgres[15913]: [3-1] WARNING: Message from PostgreSQL backend:
Nov 5 20:22:42 monstr postgres[1412]: [143-1] WARNING: Message from PostgreSQL backend:
Nov 5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend
Nov 5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend
Nov 5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory.
Nov 5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory.
Nov 5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am
Nov 5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am
Nov 5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit.
Nov 5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit.
Nov 5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query.
Nov 5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query.
Nov 5 20:22:42 monstr postgres[1109]: [3] LOG: all server processes terminated; reinitializing shared memory and semaphores
Nov 5 20:22:42 monstr postgres[16077]: [4] LOG: database system was interrupted at 2003-11-05 20:17:48 EET
Nov 5 20:22:42 monstr postgres[16077]: [5] LOG: checkpoint record is at 14/E956482C
Nov 5 20:22:42 monstr postgres[16077]: [6] LOG: redo record is at 14/E956482C; undo record is at 0/0; shutdown FALSE
Nov 5 20:22:42 monstr postgres[16077]: [7] LOG: next transaction id: 13580788; next oid: 190057632
Nov 5 20:22:42 monstr postgres[16077]: [8] LOG: database system was not properly shut down; automatic recovery in progress
Nov 5 20:22:42 monstr postgres[16077]: [9] LOG: redo starts at 14/E956486C
Nov 5 20:22:42 monstr postgres[16077]: [10] LOG: ReadRecord: record with zero length at 14/E95ACA14
Nov 5 20:22:42 monstr postgres[16077]: [11] LOG: redo done at 14/E95AC9F0
Nov 5 20:22:45 monstr postgres[16077]: [12] LOG: database system is ready

so db was restarted succesfully itself but all applications that had
static connection (it's tcl apps) where fell

the hard disk is fine (might be, because i've used to check it
at night time from cron like this: cat /dev/hda > /dev/null),
so i have no idea why pg didn't find this file (see error from log).
Does someone?

system is: Linux 2.4.18-5 (RH)

Thank you,
Andriy Tkachuk.

--
http://www.imt.com.ua

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

О©╫ О©╫О©╫О©╫О©╫О©╫О©╫О©╫О©╫О©╫,
О©╫О©╫О©╫О©╫О©╫О©╫ О©╫О©╫О©╫О©╫О©╫О©╫.

http://www.imt.com.ua

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andriy Tkachuk (#1)
Re: what could cause this PANIC on enterprice 7.3.4 db?

Andriy Tkachuk <ant@imt.com.ua> writes:

Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory

Could we see ls -l /usr/local/pgsql/data/pg_clog/

regards, tom lane

#4Andriy Tkachuk
ant@imt.com.ua
In reply to: Tom Lane (#3)
Re: what could cause this PANIC on enterprise 7.3.4 db?

shure, Tom

[10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog
total 40
drwx------ 2 pgsql postgres 4096 Nov 7 03:28 .
drwx------ 6 pgsql root 4096 Oct 23 10:45 ..
-rw------- 1 pgsql postgres 32768 Nov 10 10:47 000D

[10:49]/2:ant@monstr:~>date
Mon Nov 10 10:49:50 EET 2003

[10:49]/2:ant@monstr:~>uname -sr
Linux 2.4.18-5custom

On Fri, 7 Nov 2003, Tom Lane wrote:

Andriy Tkachuk <ant@imt.com.ua> writes:

Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory

Could we see ls -l /usr/local/pgsql/data/pg_clog/

regards, tom lane

regards,
Andriy Tkachuk
--
http://www.imt.com.ua

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andriy Tkachuk (#4)
Re: what could cause this PANIC on enterprise 7.3.4 db?

Andriy Tkachuk <ant@imt.com.ua> writes:

On Fri, 7 Nov 2003, Tom Lane wrote:

Andriy Tkachuk <ant@imt.com.ua> writes:

Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory

Could we see ls -l /usr/local/pgsql/data/pg_clog/

[10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog
total 40
drwx------ 2 pgsql postgres 4096 Nov 7 03:28 .
drwx------ 6 pgsql root 4096 Oct 23 10:45 ..
-rw------- 1 pgsql postgres 32768 Nov 10 10:47 000D

Okay, given that the file the code was trying to access is nowhere near
the current or past set of valid transaction numbers, it's pretty clear
that what you have is a corrupted transaction number in some tuple's
header. The odds are that not only the transaction number is affected;
usually when we see something like this, anywhere from dozens to
hundreds of bytes have been replaced by garbage data.

In the cases I've been able to study in the past, the cause seemed to
be faulty hardware or possibly kernel bugs --- for instance someone
recently reported a case where a whole kilobyte of a Postgres file had
been replaced with what seemed to be part of a mail message. I'd
ascribe that to either a disk drive writing a sector at the wrong place,
or the kernel getting confused about which buffer held which file.
So I'd recommend running some hardware diagnostics and checking to see
if there are errata available for your kernel.

As far as cleaning up the immediate damage is concerned, you'll probably
want to use pg_filedump or some such tool to get a better feeling for
the extent of the damage. There are descriptions of this process in the
archives --- try looking for recent references to pg_filedump.

regards, tom lane