how to recover after harddisk error

Started by Peter Alberer about 23 years ago | 5 messages | general
#1 Peter Alberer
h9351252@obelix.wu-wien.ac.at

Hi,

Yesterday at about 8pm the hard disk subsystem of our web application
crashed because of a SCSI error. The system could be restarted this
morning, but the database would not come up again. The following
was found in the log file:

2003-02-26 09:03:06 [1291] DEBUG: database system was interrupted at 2003-02-25 20:19:22 CET
2003-02-26 09:03:06 [1291] DEBUG: open of /usr/local/pgsql/data/pg_xlog/0000001A000000C9 (log file 26, segment 201) failed: No such file or directory
2003-02-26 09:03:06 [1291] DEBUG: invalid primary checkpoint record
2003-02-26 09:03:06 [1291] DEBUG: open of /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment 200) failed: No such file or directory
2003-02-26 09:03:06 [1291] DEBUG: invalid secondary checkpoint record
2003-02-26 09:03:06 [1291] FATAL 2: unable to locate a valid checkpoint record
2003-02-26 09:03:06 [1277] DEBUG: startup process (pid 1291) exited with exit code 2
2003-02-26 09:03:06 [1277] DEBUG: aborting startup due to startup process failure
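
The file names in these messages appear to follow directly from the numbers in parentheses: 0000001A000000C9 is log file 26 (hex 1A) followed by segment 201 (hex C9), each padded to eight uppercase hex digits. A minimal Python sketch of that mapping, inferred only from the log lines above and not taken from the PostgreSQL sources; the function name is made up for illustration:

```python
def xlog_segment_name(log_id: int, segment: int) -> str:
    """Build a pg_xlog file name as the log messages above suggest:
    eight hex digits for the log file number, eight for the segment.
    (Naming scheme inferred from the log output, not from the source code.)"""
    return f"{log_id:08X}{segment:08X}"


if __name__ == "__main__":
    # Values taken from the startup log in this post.
    print(xlog_segment_name(26, 201))  # segment holding the primary checkpoint
    print(xlog_segment_name(26, 200))  # segment holding the secondary checkpoint
```

This makes it easy to work out which files the startup process was looking for when comparing against what is actually in pg_xlog.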

I did the following steps to get the system running again:

- a new initdb in another data-directory
- create the database again
- restore the data from the last available nightly dump

Is there a better way to get the system running again? Would there have
been any way to access the old system again? The steps I did took about
45 min, which is quite long (because the db dump is rather large), and if
there had been any important data, it would have been lost...

TIA, peter

#2 Bjoern Metzdorf
bm@turtle-entertainment.de
In reply to: Peter Alberer (#1)
Re: how to recover after harddisk error

> 2003-02-26 09:03:06 [1291] DEBUG: invalid primary checkpoint record
> 2003-02-26 09:03:06 [1291] DEBUG: open of /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment 200) failed
>
> I did the following steps to get the system running again:
>
> - a new initdb in another data-directory
> - create the database again
> - restore the data from the last available nightly dump
>
> Is there a better way to get the system running again? Had there been
> any way to access the old system again? The steps I did took about 45
> min which is quite long (cause the db-dump is rather large) and if there
> had been some important data it had been lost...

pg_resetxlog from contrib

Regards,
Bjoern

#3 Peter Alberer
h9351252@obelix.wu-wien.ac.at
In reply to: Bjoern Metzdorf (#2)
Re: how to recover after harddisk error

Thanks a lot Bjoern.

Just wanted to mention that I found pg_resetxlog to be available by
default in PostgreSQL 7.3.2.

-----Original Message-----
From: Björn Metzdorf [mailto:bm@turtle-entertainment.de]
Sent: Wednesday, 26 February 2003 10:25
To: Peter Alberer; pgsql-general@postgresql.org
Subject: Re: [GENERAL] how to recover after harddisk error


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Alberer (#1)
Re: how to recover after harddisk error

"Peter Alberer" <h9351252@obelix.wu-wien.ac.at> writes:

> 2003-02-26 09:03:06 [1291] DEBUG: open of /usr/local/pgsql/data/pg_xlog/0000001A000000C9 (log file 26, segment 201) failed: No such file or directory
> 2003-02-26 09:03:06 [1291] DEBUG: invalid primary checkpoint record
> 2003-02-26 09:03:06 [1291] DEBUG: open of /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment 200) failed: No such file or directory
> 2003-02-26 09:03:06 [1291] DEBUG: invalid secondary checkpoint record
> 2003-02-26 09:03:06 [1291] FATAL 2: unable to locate a valid checkpoint record

Assuming you haven't wiped the old database directory yet...

What file name(s) are actually present in
/usr/local/pgsql/data/pg_xlog/? What does pg_controldata show --- do the
other fields of pg_control look sane?
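
A rough Python sketch of those two checks, for anyone in the same situation who still has the old data directory. It assumes pg_controldata is on the PATH and accepts the data directory as its argument; the helper names are invented for illustration, and the data directory path and the two segment names are taken from the log quoted above.

```python
import os
import shutil
import subprocess


def list_xlog_segments(data_dir):
    """Return the sorted file names present in <data_dir>/pg_xlog."""
    return sorted(os.listdir(os.path.join(data_dir, "pg_xlog")))


def show_control_data(data_dir):
    """Return pg_controldata's output for data_dir, or None if the tool
    is not found on PATH (assumes it takes the data directory as argument)."""
    exe = shutil.which("pg_controldata")
    if exe is None:
        return None
    result = subprocess.run([exe, data_dir], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    data_dir = "/usr/local/pgsql/data"  # path from the quoted log
    if os.path.isdir(os.path.join(data_dir, "pg_xlog")):
        present = list_xlog_segments(data_dir)
        print("segments present:", present)
        # The two segments the startup process failed to open.
        for wanted in ("0000001A000000C8", "0000001A000000C9"):
            print(wanted, "present" if wanted in present else "MISSING")
        print(show_control_data(data_dir))
```

Both functions only read; nothing in the old data directory is modified.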

pg_resetxlog would have allowed you to restart, but at the price of
losing any consistency guarantees about the results of
recently-committed transactions. So I consider it a very last resort.
What I'd like to understand first is why the system couldn't restart
normally.

regards, tom lane

#5 Peter Alberer
h9351252@obelix.wu-wien.ac.at
In reply to: Tom Lane (#4)
Re: how to recover after harddisk error

Too bad. I had intended to keep the old database instance around, but I
had to remove the files a few hours ago after running low on hard disk
capacity...

ciao, peter
