Recover from corrupted database due to failing disk
Dear all,
some days ago I was tasked with recovering a production database from a
failing single-disk (!) system. I initially planned to restore from
backups but, due to the bad disk, backups (done via pg_dumpall) were
failing and nobody had noticed (!!). Bottom line: the system was failing
and no backups were in place (!!!). I perfectly understand this is
(very) bad management, but I am now tasked with somehow recovering from
this situation.
This is a very old installation. Some details:
- o.s. CentOS 5.6 x86_64
- postgresql version postgresql-8.1.22-1.el5_5.1
Using ddrescue, I successfully migrated the entire old failing disk to
a healthy one, solving the first problem (the failing disk).
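For reference, the clone can be done roughly like this (device names and
the map-file path are placeholders, not the actual ones used):

```shell
# Pass 1: grab everything readable quickly, skipping the slow scraping phase
ddrescue -f -n /dev/sdOLD /dev/sdNEW rescue.map
# Pass 2: go back and retry the bad areas a few times, reusing the same map
ddrescue -f -r3 /dev/sdOLD /dev/sdNEW rescue.map
```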
However, backups continue to fail with an "invalid page header in
block" message. Moreover, I am very near the xid wraparound limit and,
as vacuum fails due to the invalid blocks, I expect a database shutdown
(triggered by the 1M-transactions-before-wraparound protection) within a
few days.
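For reference, how close each database is to the wraparound limit can be
checked with a query along these lines (it works on 8.1):

```sql
-- age(datfrozenxid) counts transactions since the database was last
-- fully frozen; the safety shutdown triggers as it nears ~2 billion
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```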
From my understanding, both problems *should* be solved by enabling
"zero_damaged_pages" and executing a "vacuumdb -a". Is this expectation
correct? Will a "reindexdb -a" be necessary?
This is my current action plan:
- stop postgresql;
- make a backup copy of the entire /var/lib/pgsql directory;
- set zero_damaged_pages=on and increase maintenance_work_mem (to let
vacuum operate faster);
- start postgresql;
- execute a "vacuumdb -a";
- if necessary, execute a "reindexdb -a".
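Roughly, something like this (paths, service commands and the
maintenance_work_mem value assume the stock CentOS 5 packaging and are
only examples):

```shell
# 1. Stop the server and take a raw file-level copy first
service postgresql stop
cp -a /var/lib/pgsql /var/lib/pgsql.bak

# 2. In /var/lib/pgsql/data/postgresql.conf set:
#      zero_damaged_pages = on
#      maintenance_work_mem = 256MB   # value is just an example

# 3. Restart and vacuum every database
service postgresql start
vacuumdb -a
# reindexdb -a   # only if index corruption is suspected
```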
Does this sound reasonable? Do you have any suggestions/warnings?
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 11/2/16 2:02 PM, Gionatan Danti wrote:
However, backups continue to fail with an "invalid page header in
block" message. Moreover, I am very near the xid wraparound limit and,
as vacuum fails due to the invalid blocks, I expect a database shutdown
(triggered by the 1M-transactions-before-wraparound protection) within a
few days.
That means at least some of the Postgres files have been damaged
(possibly due to the failing disk). Postgres will complain when it sees
internal data structures that don't make sense, but it has no way to
know if any of the user data has been screwed up.
From my understanding, both problems *should* be solved by enabling
"zero_damaged_pages" and executing a "vacuumdb -a". Is this expectation
correct? Will a "reindexdb -a" be necessary?
I wouldn't trust the existing cluster that far. Since it sounds like you
have no better options, you could use zero_damaged_pages to allow a
pg_dumpall to complete, but you're going to end up with missing data. So
what I'd suggest would be:
- stop Postgres;
- make a copy of the cluster;
- start with zero_damaged_pages;
- pg_dumpall;
- stop and remove the cluster (make sure you've got that backup);
- create a new cluster and load the dump.
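That sequence might look like this (paths, the superuser name and the
dump location are assumptions based on the stock CentOS 5 / 8.1 layout,
not taken from the original post):

```shell
service postgresql stop
cp -a /var/lib/pgsql /var/lib/pgsql.bak        # keep this copy safe

# Let the dump get past damaged pages (their contents are lost)
echo "zero_damaged_pages = on" >> /var/lib/pgsql/data/postgresql.conf
service postgresql start
pg_dumpall -U postgres > /root/full_dump.sql

# Re-create the cluster and reload the dump
service postgresql stop
rm -rf /var/lib/pgsql/data
su - postgres -c "initdb -D /var/lib/pgsql/data"
service postgresql start
psql -U postgres -f /root/full_dump.sql postgres
```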
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 11/2/16 6:21 PM, Jim Nasby wrote:
I wouldn't trust the existing cluster that far. Since it sounds like you
have no better options, you could use zero_damaged_pages to allow a
pg_dumpall to complete, but you're going to end up with missing data. So
what I'd suggest would be:
- stop Postgres;
- make a copy of the cluster;
- start with zero_damaged_pages;
- pg_dumpall;
- stop and remove the cluster (make sure you've got that backup);
- create a new cluster and load the dump.
Oh, and while you're at it, upgrade to a version that's supported. 8.1
has been out of support for 5+ years.
Il 03-11-2016 00:21 Jim Nasby ha scritto:
On 11/2/16 2:02 PM, Gionatan Danti wrote:
That means at least some of the Postgres files have been damaged
(possibly due to the failing disk). Postgres will complain when it
sees internal data structures that don't make sense, but it has no way
to know if any of the user data has been screwed up.
I understand that (unfortunately) user data *will* be corrupted/lost.
However, having no backup, I think the customer *must* accept that...
I wouldn't trust the existing cluster that far. Since it sounds like you
have no better options, you could use zero_damaged_pages to allow a
pg_dumpall to complete, but you're going to end up with missing data. So
what I'd suggest would be:
- stop Postgres;
- make a copy of the cluster;
- start with zero_damaged_pages;
- pg_dumpall;
- stop and remove the cluster (make sure you've got that backup);
- create a new cluster and load the dump.
The whole dump/restore approach is surely the most sensible one.
However, I am concerned that if the dump has some undetected problem
leading to a failed restore, I would have to recover from the raw files
(which I would like to avoid). Moreover, the expected remaining lifetime
of this database is only 2-3 months, as a new production system should
be installed shortly. This is why I would prefer to use vacuum/reindex
and avoid a full dump/restore.
Thank you very much Jim.
On 11/02/2016 11:18 PM, Gionatan Danti wrote:
Il 03-11-2016 00:21 Jim Nasby ha scritto:
On 11/2/16 2:02 PM, Gionatan Danti wrote:
That means at least some of the Postgres files have been damaged
(possibly due to the failing disk). Postgres will complain when it
sees internal data structures that don't make sense, but it has no way
to know if any of the user data has been screwed up.
I understand that (unfortunately) user data *will* be corrupted/lost.
However, having no backup, I think the customer *must* accept that...
I wouldn't trust the existing cluster that far. Since it sounds like
you have no better options, you could use zero_damaged_pages to allow
a pg_dumpall to complete, but you're going to end up with missing
data. So what I'd suggest would be:
- stop Postgres;
- make a copy of the cluster;
- start with zero_damaged_pages;
- pg_dumpall;
- stop and remove the cluster (make sure you've got that backup);
- create a new cluster and load the dump.
The whole dump/restore approach is surely the most sensible one.
However, I am concerned that if the dump has some undetected problem
leading to a failed restore, I would have to recover from the raw files
(which I would like to avoid). Moreover, the expected remaining lifetime
of this database is only 2-3 months, as a new production system should
be installed shortly. This is why I would prefer to use vacuum/reindex
and avoid a full dump/restore.
The above does not make sense. You are having to recover because there
was no backup and now you want to go forward without doing a backup?
Thank you very much Jim.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 03/11/2016 14:20, Adrian Klaver wrote:
The above does not make sense. You are having to recover because there
was no backup and now you want to go forward without doing a backup?
Hi Adrian, no, I don't want to go forward without backups ;)
Actually, the *first* thing I did after the vacuum completed was a full
cluster backup (via pg_dumpall), and I scheduled nightly backups as well.
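A nightly dump can be as simple as a one-line script run from cron (the
backup path and rotation scheme here are hypothetical):

```shell
# Compressed dump of the whole cluster, one file per weekday, so a
# week of history is kept before files are overwritten
pg_dumpall -U postgres | gzip > /backup/pg_dumpall_$(date +%a).sql.gz
```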
Problem is this customer does not have another server where backups can
be restored and the entire production database migrated. In short, the
two possibilities I have are:
1) execute the vacuum (done), schedule regular dumps (done) and, if
something goes wrong, recover from backups;
2) execute the vacuum (done), do a manual backup (done), reinit
(remove/recreate) the entire cluster (not done) and restore from backups
(not done).
I strongly prefer to execute n.2 on another machine, so that production
is not impacted while the recovered backup can be thoroughly tested.
If/when the backups are validated, I want to migrate all clients to the
new server (with RAID1 in place), and dismiss the old one.
Unfortunately I am working with incredible constraints from the
customer side; even buying two SAS disks seems to be a problem.
Moreover, as an external consultant, I have basically no
decision/buying power :|
What I can do (and I did) is to raise a very big red flag and let others
decide what to do.
The good thing is that zero_damaged_pages and vacuum did their work, as
now the database can be dumped and vacuumed with no (apparent) problems.
Thanks.
On 11/04/2016 03:20 AM, Gionatan Danti wrote:
On 03/11/2016 14:20, Adrian Klaver wrote:
The above does not make sense. You are having to recover because there
was no backup and now you want to go forward without doing a backup?
Hi Adrian, no, I don't want to go forward without backups ;)
Actually, the *first* thing I did after the vacuum completed was a full
cluster backup (via pg_dumpall), and I scheduled nightly backups as
well.
Problem is this customer does not have another server where backups can
be restored and the entire production database migrated. In short, the
two possibilities I have are:
1) execute the vacuum (done), schedule regular dumps (done) and, if
something goes wrong, recover from backups;
2) execute the vacuum (done), do a manual backup (done), reinit
(remove/recreate) the entire cluster (not done) and restore from backups
(not done).
I strongly prefer to execute n.2 on another machine, so that production
is not impacted while the recovered backup can be thoroughly tested.
If/when the backups are validated, I want to migrate all clients to the
new server (with RAID1 in place), and dismiss the old one.
Unfortunately I am working with incredible constraints from the
customer side; even buying two SAS disks seems to be a problem.
Moreover, as an external consultant, I have basically no
decision/buying power :|
What I can do (and I did) is to raise a very big red flag and let others
decide what to do.
Ouch, understood. Good luck!
The good thing is that zero_damaged_pages and vacuum did their work, as
now the database can be dumped and vacuumed with no (apparent) problems.
Thanks.
On 4 November 2016 at 11:20, Gionatan Danti <g.danti@assyoma.it> wrote:
Unfortunately I am working with incredible constraints from the
customer side; even buying two SAS disks seems to be a problem.
Moreover, as an external consultant, I have basically no
decision/buying power :|
What I can do (and I did) is to raise a very big red flag and let others
decide what to do.
It seems to me that your customer doesn't realise how expensive it
would be if their server were unavailable for any length of time, or if
they actually lost the data it contains. That, or your customer's data
isn't valuable enough to be worth your time.
We've been fighting a somewhat similar fight internally here, where
management wasn't prepared to spend € 30,000 once on a server plus
software licenses, while they pay that much to one of our new managers
every month.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.