multiple tables got corrupted

Started by Vasu Madhineni · over 5 years ago · 9 messages · general
#1 Vasu Madhineni
vasumdba1515@gmail.com

Hi All,

In one of my Postgres databases, multiple tables got corrupted. I followed
the steps below, but I still get the same error:

1. SET zero_damaged_pages = on
2. VACUUM ANALYZE, then VACUUM FULL

moh_fa=# VACUUM FULL;
ERROR: could not read block 9350 in file "base/1156523/1270812":
Input/output error

I tried to take a backup of the tables with pg_dump but got the same error.
The files exist physically in the base location.

How should I proceed? There is no backup to restore.

Thanks in advance

Regards,
Vasu Madhineni

#2 Magnus Hagander
magnus@hagander.net
In reply to: Vasu Madhineni (#1)
Re: multiple tables got corrupted

On Tue, Sep 15, 2020 at 11:15 AM Vasu Madhineni <vasumdba1515@gmail.com>
wrote:

> In one of my Postgres databases, multiple tables got corrupted. I followed
> the steps below, but I still get the same error:
>
> 1. SET zero_damaged_pages = on
> 2. VACUUM ANALYZE, then VACUUM FULL

That is a very destructive first attempt. I hope you took a full disk-level
backup of the database before you did that, as it can ruin your chances for
forensics and data recovery for other issues.

> moh_fa=# VACUUM FULL;
> ERROR: could not read block 9350 in file "base/1156523/1270812":
> Input/output error
>
> I tried to take a backup of the tables with pg_dump but got the same
> error. The files exist physically in the base location.
>
> How should I proceed? There is no backup to restore.

This is clearly some sort of disk error, and with no backups to restore you
will definitely be losing data.

I'd start by figuring out which tables have no corruption and do work, and
back those up (with pg_dump for example) as soon as possible to a different
machine -- since it's not exactly unlikely that further disk errors will
appear.

Once you've done that, identify the damaged tables and try to do partial
recovery. For example, if you look at the file 1270812, how big is it?
PostgreSQL is failing to read block 9350, which starts 76595200 bytes
(9350 * 8192) into the file. If that is at the very end of the file, you
can, for example, try to get the data out up to that point with LIMIT. If
it's in the middle of the file, it gets more tricky, but similar approaches
can be used.
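
A quick sketch of that check from psql (assuming the table lives in the
database's default tablespace; a tablespace argument of 0 means the
default):

-- map the file name from the error message back to a relation
SELECT pg_filenode_relation(0, 1270812);

-- byte offset where the failing block starts: 9350 * 8192 = 76595200
SELECT 9350 * current_setting('block_size')::int AS failing_byte_offset;

-- compare with the relation's on-disk size
SELECT pg_relation_size(pg_filenode_relation(0, 1270812));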

Also, unless you are running with data checksums enabled, I wouldn't fully
trust the data in the tables that you *can* read either. Since you clearly
have disk issues, they may have caused corruption elsewhere as well, so
whatever verification you can do against other tables, you should do as
well.

You'll of course also want to check any kernel logs or storage system logs
to see if they can give you a hint as to what happened, but they are
unlikely to actually give you something that will help you fix the problem.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#3 Vasu Madhineni
vasumdba1515@gmail.com
In reply to: Magnus Hagander (#2)
Re: multiple tables got corrupted

Is it possible to identify which rows are corrupted in particular tables?

#4 Magnus Hagander
magnus@hagander.net
In reply to: Vasu Madhineni (#3)
Re: multiple tables got corrupted

Try reading them "row by row" until it breaks. That is, SELECT * FROM ...
LIMIT 1, then LIMIT 2, etc. To make it faster, use a binary search instead,
starting at what seems like a reasonable place given the size of the table
versus the first failed block; the principle is the same. Once a probe
fails, you've found a corrupt block...
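
Something along these lines (a sketch; mytable stands in for whichever
corrupted table you are probing):

-- bisect on the LIMIT value; the first failing probe brackets the bad block
SELECT count(*) FROM (SELECT * FROM mytable LIMIT 500000) q;   -- succeeds
SELECT count(*) FROM (SELECT * FROM mytable LIMIT 1000000) q;  -- I/O error
SELECT count(*) FROM (SELECT * FROM mytable LIMIT 750000) q;   -- narrow further...

A sequential scan reads pages on demand, so a LIMIT that is satisfied
before the scan reaches the damaged page never touches it.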

//Magnus

#5 Vasu Madhineni
vasumdba1515@gmail.com
In reply to: Magnus Hagander (#4)
Re: multiple tables got corrupted

I can see block read I/O errors in /var/log/syslog. If the OS team fixes
those errors, will the database still require recovery?

Also, can I use LIMIT and OFFSET to locate corrupted rows?

Thanks in advance

Regards,
Vasu Madhineni

#6 Magnus Hagander
magnus@hagander.net
In reply to: Vasu Madhineni (#5)
Re: multiple tables got corrupted

That depends on what the problem is and how they fix it. Most likely yes --
especially since, if you haven't enabled data checksums, you won't *know*
whether things are OK or not. So I'd definitely recommend it even if things
*look* OK.
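
You can check that setting from psql:

SHOW data_checksums;  -- 'on' only if the cluster was initialized with checksums

(On v12 and later, the offline pg_checksums tool can also verify checksums
while the cluster is shut down.)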

//Magnus

#7 Vasu Madhineni
vasumdba1515@gmail.com
In reply to: Magnus Hagander (#6)
Re: multiple tables got corrupted

Hi Magnus,

Thanks for your update.

If I run the command below to identify which tables in the database are
corrupted, will it have any impact on other tables in the production
environment?

"pg_dump -f /dev/null database"

Thanks in advance.

Regards,
Vasu Madhineni

#8 Jerry Sievers
gsievers19@comcast.net
In reply to: Vasu Madhineni (#7)
Re: multiple tables got corrupted

Vasu Madhineni <vasumdba1515@gmail.com> writes:

> Hi Magnus,
>
> Thanks for your update.
>
> If I run the command below to identify which tables in the database are
> corrupted, will it have any impact on other tables in the production
> environment?
>
> "pg_dump -f /dev/null database"

Consider using pg_dump or any other means to dump *each* table
individually.

pg_dump will abort at the first table whose corruption causes a read error
during the full scan, so in a scenario where multiple corrupt tables are
likely, you're not going to get far with a monolithic approach.
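
One way to script that (a sketch; it generates one pg_dump command per user
table, which you can then run from a shell and simply note which ones fail):

SELECT format('pg_dump -t %I.%I -f %s.dump moh_fa',
              schemaname, tablename, tablename)
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');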

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net

#9 Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#4)
Re: multiple tables got corrupted

On Tue, Sep 15, 2020 at 07:58:39PM +0200, Magnus Hagander wrote:

> Try reading them "row by row" until it breaks. That is, SELECT * FROM ...
> LIMIT 1, then LIMIT 2, etc. To make it faster, use a binary search
> instead, starting at what seems like a reasonable place given the size of
> the table versus the first failed block; the principle is the same. Once
> a probe fails, you've found a corrupt block...

You can also include the hidden 'ctid' system column, which shows the block
number of each row, e.g.:

SELECT ctid, relname FROM pg_class LIMIT 2;
  ctid  |   relname
--------+--------------
 (0,46) | pg_statistic
 (0,47) | pg_type

The format is (page number, item number within the page).
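
Building on that, here is a rough PL/pgSQL sketch (untested; mytable is a
placeholder) that probes one page at a time. Equality on ctid uses a TID
scan, so each probe should read only the single page named in the ctid, and
failing blocks get logged instead of aborting the whole scan:

DO $$
DECLARE
  nblocks bigint := pg_relation_size('mytable')
                    / current_setting('block_size')::int;
BEGIN
  FOR blk IN 0 .. nblocks - 1 LOOP
    BEGIN
      -- a TID scan touches only the page named by the ctid
      EXECUTE format(
        'SELECT 1 FROM mytable WHERE ctid = ''(%s,1)''::tid', blk);
    EXCEPTION WHEN OTHERS THEN
      RAISE NOTICE 'block % is unreadable: %', blk, SQLERRM;
    END;
  END LOOP;
END $$;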

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee