pgsql 9.0.1 table corruption

Started by Dan Biaginiabout 15 years ago5 messagesgeneral

dbiagini@gmail.com

about 15 years ago

I have a 9.0.1 database with two corrupted tables (one table has 20
rows, the other 140). The tables *seem* fine for read/select
operations, but updating certain rows in the table produce error
messages:

update media set updated_at = now() at time zone 'UTC';
ERROR: could not read block 2 in file "base/16384/16485": read only 0
of 8192 bytes

update media_status set updated_at = now() at time zone 'UTC';
2011-04-14 00:15:15 UTC ERROR: could not read block 3 in file
"base/16384/16543": read only 0 of 8192 bytes
2011-04-14 00:15:15 UTC STATEMENT: update media_status set updated_at
= now() at time zone 'UTC';

Examining the corrupted files in the filesystem (linux), they aren't
zero bytes:

ll base/16384/16485
-rwx------ 1 postgres postgres 16384 2011-04-07 09:43 base/16384/16485

I ran a "vacuum(FULL, VERBOSE) <table>" command and the corruption (or
at least the errors on update) has disappeared. Is it expected that a
"vacuum(FULL)" command would/could fix table corruption? Does that
providing any clues as to what may have happened?

Is there any way to determine how/when this corruption may have occurred?

I suspect that it may have occurred during a filesystem level backup
(ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
a backup and moved the database to a different system. After
restoring the files and starting postgres I began getting these
errors. I have tried restoring multiple times with the same tar
archive with the same results (on different systems).

Thanks,
Dan

Alan Hodgson

ahodgson@simkin.ca

about 15 years ago

In reply to: Dan Biagini (#1)

Re: pgsql 9.0.1 table corruption

On April 14, 2011 08:10:47 am Dan Biagini wrote:

I suspect that it may have occurred during a filesystem level backup
(ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
a backup and moved the database to a different system. After
restoring the files and starting postgres I began getting these
errors. I have tried restoring multiple times with the same tar
archive with the same results (on different systems).

Did you perform a PITR restore using that tar as a base backup? Do any errors
occur? The tar file alone will not be an intact backup, as I'm sure you're
aware.

http://www.postgresql.org/docs/9.0/static/continuous-archiving.html#BACKUP-
PITR-RECOVERY

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Dan Biagini (#1)

Re: pgsql 9.0.1 table corruption

Dan Biagini <dbiagini@gmail.com> writes:

I have a 9.0.1 database with two corrupted tables (one table has 20
rows, the other 140). The tables *seem* fine for read/select
operations, but updating certain rows in the table produce error
messages:

update media set updated_at = now() at time zone 'UTC';
ERROR: could not read block 2 in file "base/16384/16485": read only 0
of 8192 bytes

The UPDATE scan itself wouldn't read off the end of the table.
I speculate that what is happening here is that you have a unique
index, so it's trying to verify that the uniqueness constraint isn't
violated, which means fetching via the index, and there are corrupt
TID pointers in the index (pointing to blocks that don't exist).

I ran a "vacuum(FULL, VERBOSE) <table>" command and the corruption (or
at least the errors on update) has disappeared. Is it expected that a
"vacuum(FULL)" command would/could fix table corruption?

In 9.0, vacuum full would rebuild the index(es), so that would fix it
according to this theory.

I suspect that it may have occurred during a filesystem level backup
(ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
a backup and moved the database to a different system. After
restoring the files and starting postgres I began getting these
errors. I have tried restoring multiple times with the same tar
archive with the same results (on different systems).

If you simply unpacked the tar archive and started a postmaster on that,
you'd be pretty much guaranteed to get a corrupt database. The tar
archive is *not* a valid snapshot by itself --- you have to replay
whatever WAL was generated during the archiving sequence in order to get
to a consistent database state. So the whole thing sounds like a pretty
foreseeable consequence of that. You've probably got some other issues
in that database :-(. VACUUM FULL would be enough to rebuild indexes,
but it cannot fix inconsistencies in the heap data.

regards, tom lane

Benjamin Smith

lists@benjamindsmith.com

about 15 years ago

In reply to: Tom Lane (#3)

Re: pgsql 9.0.1 table corruption

On Friday, April 15, 2011 09:50:57 AM Tom Lane wrote:

If you simply unpacked the tar archive and started a postmaster on that,
you'd be pretty much guaranteed to get a corrupt database. The tar
archive is not a valid snapshot by itself --- you have to replay
whatever WAL was generated during the archiving sequence in order to get
to a consistent database state.

I have, more than once, "moved" a PG instance from one machine to another with
the following sequence, without apparent issue. is there anything I'm missing
and/or need to be concerned with?

1) service postgresql stop;
2) rsync -vaz /var/lib/pgsql root@newserver:/var/lib/pgsql;
3) ssh root@newserver;
4) (edit postgresql.conf, set IP addresses, memory, etc)
5) service postgresql start;

This is a "done at 10 PM on Friday night" kind of process.

-Ben

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Benjamin Smith (#4)

Re: pgsql 9.0.1 table corruption

Benjamin Smith <lists@benjamindsmith.com> writes:

On Friday, April 15, 2011 09:50:57 AM Tom Lane wrote:

If you simply unpacked the tar archive and started a postmaster on that,
you'd be pretty much guaranteed to get a corrupt database. The tar
archive is not a valid snapshot by itself --- you have to replay
whatever WAL was generated during the archiving sequence in order to get
to a consistent database state.

I have, more than once, "moved" a PG instance from one machine to another with
the following sequence, without apparent issue. is there anything I'm missing
and/or need to be concerned with?

1) service postgresql stop;
2) rsync -vaz /var/lib/pgsql root@newserver:/var/lib/pgsql;
3) ssh root@newserver;
4) (edit postgresql.conf, set IP addresses, memory, etc)
5) service postgresql start;

That way is fine --- what you can't do is take a tarball while the
server is running, unless you do the full nine yards with WAL replay.

regards, tom lane