Postmaster cannot start

Started by Chun Yit(Chronos)almost 20 years ago8 messagesgeneral
Jump to latest
#1Chun Yit(Chronos)
ivanbeh@chronos.com.my

Hi,

my postgresql version 7.4.5 cannot start until this morning, when i check the log file, it give me this error
this is part of my log file

DEBUG: vacuuming "pg_catalog.pg_class"
DEBUG: "pg_class": found 9823 removable, 1017 nonremovable row versions in 205 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 148 to 196 bytes long.
There were 20 unused item pointers.
Total free space (including removable row versions) is 1477460 bytes.
183 pages are or will become empty, including 183 at the end of the table.
9 pages containing 19260 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG: index "pg_class_oid_index" now contains 1017 row versions in 35 pages
DETAIL: 9823 index row versions were removed.
28 index pages have been deleted, 28 are currently reusable.
CPU 0.01s/0.00u sec elapsed 0.02 sec.
DEBUG: index "pg_class_relname_nsp_index" now contains 726 row versions in 197 pages
DETAIL: 6124 index row versions were removed.
126 index pages have been deleted, 126 are currently reusable.
CPU 0.00s/0.03u sec elapsed 0.18 sec.
WARNING: index "pg_class_relname_nsp_index" contains 726 row versions, but table contains 1017 row versions
HINT: Rebuild the index with REINDEX.
DEBUG: "pg_class": moved 57 row versions, truncated 205 to 20 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.34 sec.
DEBUG: index "pg_class_oid_index" now contains 1017 row versions in 35 pages
DETAIL: 57 index row versions were removed.
28 index pages have been deleted, 28 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
ERROR: could not open segment 1 of relation "pg_class_relname_nsp_index" (target block 1183847760): No such file or directory
DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 14135438, it was already committed
DEBUG: child process (PID 4834) was terminated by signal 6
LOG: server process (PID 4834) was terminated by signal 6
LOG: terminating any other active server processes
DEBUG: sending SIGQUIT to process 3337
DEBUG: sending SIGQUIT to process 3336
DEBUG: sending SIGQUIT to process 3335
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Can someone please explain to me what happen to my database, cause i not so clear,
i have a daily maintenance routine that always trigger at 4am morning.
1) will run end of day job. (every day)
2) will run vacuum full database. (every day)
3) will run analyze database. (every day)
4) will run reindex database. (every day)

saw from the log file, it's possible that server crash during vacuum process...

Question :
1) what happen to my database server? what the error meaning?
2) it is my daily maintenance routine causing that?
3) it is ok to run vacuum full every day ?
4) how can i solve this problem?

Thanks

Regards
Beh

#2Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Chun Yit(Chronos) (#1)
Re: Postmaster cannot start

""Chun Yit(Chronos)"" <ivanbeh@chronos.com.my> wrote

saw from the log file, it's possible that server crash during
vacuum process...

Question :
1) what happen to my database server? what the error meaning?

It looks like index "pg_class_relname_nsp_index" (which is an index on
pg_class) is corrupted. But not sure why it reports the following error
message (which looks like a post-commit cleanup caused error):

DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 14135438, it was already committed

2) it is my daily maintenance routine causing that?

No.

3) it is ok to run vacuum full every day ?

Yes. There is no apparent bad side effect of vacuum full except that it
would block other accesses on the target table. But since you do it per
day(guess you do it on a nightly schedule), so this should not be a
problem.

4) how can i solve this problem?

The base table pg_class should be ok(pg_class_oid_ind indicates both
have the same cardinality). Try to reindex pg_class as the superuser.

Regards,
Qingqing

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#2)
Re: Postmaster cannot start

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

But not sure why it reports the following error
message (which looks like a post-commit cleanup caused error):

DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 14135438, it was already committed

I think this is an artifact of the fact that VACUUM FULL commits its own
transaction before it starts the final index cleanup pass. We ought to
think of a better way to handle that sometime. I don't recall having
seen a PANIC like this reported before, but on reflection it seems like
this would be guaranteed to happen for any ERROR condition occurring
during that last pass. An error there would be pretty improbable, but
surely not impossible.

As for the OP's problem, it seems pretty suspicious that we got through
one cycle of vacuuming pg_class_relname_nsp_index and then the second
one failed with what seems to be a bad block link. If that bad link was
there before, why didn't it fail the first time through? I'm wondering
about flaky hardware ...

regards, tom lane

#4Chun Yit(Chronos)
ivanbeh@chronos.com.my
In reply to: Chun Yit(Chronos) (#1)
Re: Postmaster cannot start

4) how can i solve this problem?

The base table pg_class should be ok(pg_class_oid_ind indicates both have
the same cardinality). Try to reindex pg_class as the superuser.

but not i not be able to reindex the table because i cannot start the
postmaster.
postmaster give me error every time i try to start it
this is the error msg

DEBUG: found "/usr/local/pgsql/bin/postgres" using argv[0]
LOG: could not create IPv6 socket: Address family not supported by protocol
DEBUG: max_safe_fds = 985, usable_fds = 1019, already_open = 5
DEBUG: found "/usr/local/pgsql/bin/postmaster" using argv[0]
LOG: could not resolve "localhost": Temporary failure in name resolution
LOG: database system was interrupted while in recovery at 2006-05-15
04:14:39 MYT
HINT: This probably means that some data is corrupted and you will have to
use the last backup for recovery.
LOG: checkpoint record is at A/46315F10
LOG: redo record is at A/46315F10; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 14135366; next OID: 51809028
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at A/46315F50
PANIC: btree_delete_page_redo: uninitialized right sibling
LOG: startup process (PID 5043) was terminated by signal 6
LOG: aborting startup due to startup process failure

Show quoted text

Regards
Beh

#5Chun Yit(Chronos)
ivanbeh@chronos.com.my
In reply to: Chun Yit(Chronos) (#1)
Re: Postmaster cannot start
Show quoted text

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

But not sure why it reports the following error
message (which looks like a post-commit cleanup caused error):

DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 14135438, it was already committed

I think this is an artifact of the fact that VACUUM FULL commits its own
transaction before it starts the final index cleanup pass. We ought to
think of a better way to handle that sometime. I don't recall having
seen a PANIC like this reported before, but on reflection it seems like
this would be guaranteed to happen for any ERROR condition occurring
during that last pass. An error there would be pretty improbable, but
surely not impossible.

I have check my past postgresql log file, i did not see this error come
out (i did vacuum full every night).

As for the OP's problem, it seems pretty suspicious that we got through
one cycle of vacuuming pg_class_relname_nsp_index and then the second
one failed with what seems to be a bad block link. If that bad link was
there before, why didn't it fail the first time through? I'm wondering
about flaky hardware ...

Any tool to check am i having a flaky hardware?

Regards
Beh

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chun Yit(Chronos) (#4)
Re: Postmaster cannot start

"Chun Yit\(Chronos\)" <ivanbeh@chronos.com.my> writes:

PANIC: btree_delete_page_redo: uninitialized right sibling
LOG: startup process (PID 5043) was terminated by signal 6
LOG: aborting startup due to startup process failure

That's pretty ugly :-(. I think your only hope to get out of it is to
use pg_resetxlog. After that, you will undoubtedly need to reindex
pg_class in the problem database. After that, if you're very lucky,
you'll be able to pg_dump. If you can get a clean dump, then initdb
and reload --- it would not be wise to keep using the database.

I'd suggest first taking a filesystem backup of the $PGDATA directory as
it stands, so you can get back to where you are if this fails to work.

Assuming you can get back up and running, you should look into (a) why
you are running an old, known-buggy PG release, and (b) whether you have
flaky hardware, as per comments already made upthread.

regards, tom lane

#7Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Chun Yit(Chronos) (#1)
Re: Postmaster cannot start

""Chun Yit(Chronos)"" <ivanbeh@chronos.com.my> wrote

postmaster give me error every time i try to start it
LOG: redo starts at A/46315F50
PANIC: btree_delete_page_redo: uninitialized right sibling

So the last resort I can think of is to use pg_resetxlog to pass the startup
failure -- but no gaurantee that your database is still consistent after
that :-(. So before you do that, make a copy of your database for backup.

Maybe a deeper look into the problem is not needed, since 7.4.5 is too old,
and try to use some newer version.

Regards,
Qingqing

#8Chun Yit(Chronos)
ivanbeh@chronos.com.my
In reply to: Chun Yit(Chronos) (#1)
Re: Postmaster cannot start

What i done is i use the last backup file and restore it on new $PGData
Directory, now it
ok. just curious how this thing can happen? maybe i asking a stupid
question since i using
quite old Postgresql Version but hoping this problem won't come back
again.
Thanks for your help to clear some doubt on my mind.

Regards
Beh

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Chun Yit(Chronos)" <ivanbeh@chronos.com.my>
Cc: <pgsql-general@postgresql.org>
Sent: Tuesday, May 16, 2006 1:03 PM
Subject: Re: [GENERAL] Postmaster cannot start

Show quoted text

"Chun Yit\(Chronos\)" <ivanbeh@chronos.com.my> writes:

PANIC: btree_delete_page_redo: uninitialized right sibling
LOG: startup process (PID 5043) was terminated by signal 6
LOG: aborting startup due to startup process failure

That's pretty ugly :-(. I think your only hope to get out of it is to
use pg_resetxlog. After that, you will undoubtedly need to reindex
pg_class in the problem database. After that, if you're very lucky,
you'll be able to pg_dump. If you can get a clean dump, then initdb
and reload --- it would not be wise to keep using the database.

I'd suggest first taking a filesystem backup of the $PGDATA directory as
it stands, so you can get back to where you are if this fails to work.

Assuming you can get back up and running, you should look into (a) why
you are running an old, known-buggy PG release, and (b) whether you have
flaky hardware, as per comments already made upthread.

regards, tom lane