Would like to below scenario is possible for getting page/block corruption

Started by Sreekanth Palluru, over 9 years ago, 9 messages, general

sree4pg@gmail.com

over 9 years ago

Hi,
I am working on a page corruption issue and want to know whether the
scenario below is possible.

1) The client issues an INSERT; as I understand it, heap_insert() in
heapam.c is called.
2) Say the table is full, so the relation is extended and a new block is
added.
3) The tuple is placed on the new page of that block by
RelationPutHeapTuple() (hio.c).
4) A WAL record is then inserted: recptr = XLogInsert(RM_HEAP_ID, info);
5) The backend then updates the page header with the WAL LSN:
PageSetLSN(page, recptr);

If my server crashes after step 4), is it possible that after the database
restarts I get the error below when I access the relation, when vacuum runs
on it, or when I take a backup with pg_dump?
*ERROR: invalid page header in block 204 of relation base/16413/16900*

Or can Postgres recover the page automatically without throwing any error?

I would appreciate your response on this.

--
Regards
Sreekanth

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Sreekanth Palluru (#1)

Re: Would like to below scenario is possible for getting page/block corruption

On Fri, Dec 9, 2016 at 9:46 AM, Sreekanth Palluru <sree4pg@gmail.com> wrote:

> Hi,
> I am working on a page corruption issue and want to know whether the
> scenario below is possible.
>
> 1) The client issues an INSERT; as I understand it, heap_insert() in
> heapam.c is called.
> 2) Say the table is full, so the relation is extended and a new block is
> added.
> 3) The tuple is placed on the new page of that block by
> RelationPutHeapTuple() (hio.c).
> 4) A WAL record is then inserted: recptr = XLogInsert(RM_HEAP_ID, info);
> 5) The backend then updates the page header with the WAL LSN:
> PageSetLSN(page, recptr);
>
> If my server crashes after step 4), is it possible that after the database
> restarts I get the error below when I access the relation, when vacuum runs
> on it, or when I take a backup with pg_dump?
> ERROR: invalid page header in block 204 of relation base/16413/16900

So the block is corrupted. You may want to move to another server.

> Or can Postgres recover the page automatically without throwing any error?

At crash recovery, Postgres redoes changes starting from a point where
everything was consistent on disk. If this corrupted page made it to
disk, there is not much that can be done except restoring from a
backup. You could also enable zero_damaged_pages to help here: you
would lose the data on this page, but you would still be able to run
pg_dump and get back as much data as you can. At the same time, if
this is a hardware problem the corruption can spread, so you may just
be seeing the beginning of a series of problems.
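
The redo idea can be sketched with a toy Python model. This is not PostgreSQL
code: `ToyCluster`, its fields and its methods are hypothetical, pages are
plain dicts, and a WAL append is assumed to be durable immediately. It only
illustrates why a crash between steps 4) and 5) should lose nothing: the
in-memory buffer is discarded anyway, and replay rebuilds the page from the
log and stamps its LSN.

```python
import copy

class ToyCluster:
    """Hypothetical model: durable disk pages and WAL, volatile buffers."""

    def __init__(self):
        self.disk = {}      # block -> page dict; survives a crash
        self.wal = []       # (lsn, block, tuple) records; survives a crash
        self.buffers = {}   # in-memory page copies; lost in a crash

    def _buffer(self, block):
        # Load the page into memory on first use (an empty page if absent).
        if block not in self.buffers:
            on_disk = self.disk.get(block, {"lsn": 0, "tuples": []})
            self.buffers[block] = copy.deepcopy(on_disk)
        return self.buffers[block]

    def insert(self, block, tup, crash_after_wal=False):
        page = self._buffer(block)
        page["tuples"].append(tup)           # step 3: put the tuple on the page
        lsn = len(self.wal) + 1
        self.wal.append((lsn, block, tup))   # step 4: log the change
        if crash_after_wal:
            self.crash()                     # crash between steps 4 and 5
            return
        page["lsn"] = lsn                    # step 5: stamp the page with the LSN

    def crash(self):
        self.buffers = {}                    # memory is gone; disk and WAL remain

    def recover(self):
        # Replay every record the on-disk page has not seen yet.
        for lsn, block, tup in self.wal:
            page = self.disk.setdefault(block, {"lsn": 0, "tuples": []})
            if page["lsn"] < lsn:
                page["tuples"].append(tup)
                page["lsn"] = lsn

cluster = ToyCluster()
cluster.insert(204, "row-1")
cluster.insert(204, "row-2", crash_after_wal=True)
cluster.recover()
print(cluster.disk[204])   # {'lsn': 2, 'tuples': ['row-1', 'row-2']}
```

The model leaves out locking entirely. As far as I understand the real code,
steps 3) to 5) run while the backend holds the buffer's exclusive lock inside
a critical section, so a half-updated page should not be written out between
them.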
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Sreekanth Palluru

sree4pg@gmail.com

over 9 years ago

In reply to: Michael Paquier (#2)

Re: Would like to below scenario is possible for getting page/block corruption

Michael,
Can I generalize that if, after step 4), the page (new or old) gets written
to disk from the buffer and a crash happens between steps 4) and 5), we
always get block corruption in Postgres, which can only be recovered by
setting zero_damaged_pages if we only have pg_dump backups and are OK with
losing the data in the affected blocks?

I am also looking for ways to reproduce the issue. I would appreciate your
advice on that.

--
Regards
Sreekanth

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Sreekanth Palluru (#3)

Re: Would like to below scenario is possible for getting page/block corruption

(Please don't top-post, that's annoying.)

On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru <sree4pg@gmail.com> wrote:

> Can I generalize that if, after step 4), the page (new or old) gets written
> to disk from the buffer and a crash happens between steps 4) and 5), we
> always get block corruption in Postgres, which can only be recovered by
> setting zero_damaged_pages if we only have pg_dump backups and are OK with
> losing the data in the affected blocks?
>
> I am also looking for ways to reproduce the issue. I would appreciate your
> advice on that.

Postgres is designed to avoid such corruption problems if
full_page_writes and fsync are enabled; that's a cornerstone of its
reliability. If you can create a self-contained scenario that
reproduces a failure, that could be treated as a Postgres bug, but you
are giving no evidence that this is the case.
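
What full_page_writes protects against can be sketched in Python. This is a
toy model under stated assumptions, not PostgreSQL code: the 512-byte atomic
write unit, `torn_write` and `redo_with_full_page_image` are all made up for
illustration. A power failure in the middle of an 8 kB block write can leave
a mix of old and new sectors on disk; replaying a full-page image from the
WAL overwrites the whole block, so the torn state does not matter.

```python
BLCKSZ = 8192   # PostgreSQL's default block size
SECTOR = 512    # assumed size of an atomic disk write in this toy model

def torn_write(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Return what is on disk if power fails after only some sectors landed."""
    cut = sectors_written * SECTOR
    return new_page[:cut] + old_page[cut:]

def redo_with_full_page_image(disk_page: bytes, full_page_image: bytes) -> bytes:
    """Recovery ignores the on-disk content and restores the logged image."""
    return full_page_image

old_page = bytes([0xAA]) * BLCKSZ
new_page = bytes([0xBB]) * BLCKSZ

# Only 3 of the 16 sectors reach the disk before the crash.
on_disk = torn_write(old_page, new_page, sectors_written=3)
print(on_disk == old_page, on_disk == new_page)   # False False: a torn page

# In this model the image stored with the WAL record is the updated page.
recovered = redo_with_full_page_image(on_disk, full_page_image=new_page)
print(recovered == new_page)                      # True
```

The sketch only holds if the WAL itself reached stable storage, which is
where fsync and an honest disk write cache come in.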
--
Michael

Sreekanth Palluru

sree4pg@gmail.com

over 9 years ago

In reply to: Michael Paquier (#4)

Re: Would like to below scenario is possible for getting page/block corruption

Michael,
Thanks for your prompt reply.

In my environment those two parameters are enabled. Here is a brief
description of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems whose
software image contains PG. A few of the customers, around 5%, are facing
corruption issues.
We are in the process of reproducing the issue. Since many variables are
involved (Dell hardware, software image versions, application versions,
RAID/disk write-cache settings, RAID controllers with no backup, power
failures, etc.), I am trying to understand whether PG can end up with
corrupted blocks because of a system crash.

1) As I understand it, fsync writes the block from memory to disk, so the
block as it was just after step 4) would have been written to disk, assuming
the disk cache did not lie.
2) Assume that full_page_writes=on has dumped the whole 8k block into the
WAL before the block is updated, i.e. after step 2) and before step 3).
3) If a crash happens after step 4), then since there is no PageHeader data,
PG will complain after the system restarts that the block is corrupted or
has an invalid header.

Please correct me if my understanding of the roles of fsync and
full_page_writes is wrong. If it is right, I see a possibility of corruption
whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page),
where a PageHeader is available as part of the full-page write. Can the same
corruption happen there, or can PG recover the database? I am not sure
whether the recovery process can update the PageHeader with the recptr from
the WAL record written in step 4).

-Sreekanth

--
Regards
Sreekanth

Sreekanth Palluru

sree4pg@gmail.com

over 9 years ago

In reply to: Sreekanth Palluru (#5)

Re: Would like to below scenario is possible for getting page/block corruption

Correcting typos:
Michael,
Thanks for your prompt reply.

My customer builds cancer-related systems, and we ship Dell systems whose
software image contains PG. A few of the customers, around 5%, are facing
corruption issues.
We are in the process of reproducing the issue. Since many variables are
involved (Dell hardware, software image versions, application versions,
RAID/disk write-cache settings, RAID controllers with no battery backup,
power failures, etc.), I am trying to understand whether PG can end up with
corrupted blocks because of a system crash even though we set these
parameters.

a) As I understand it, fsync writes the block from memory to disk, so the
block as it was just after step 4) would have been written to disk, assuming
the disk cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into the
WAL before the block is updated, i.e. after step 2) and before step 3).
c) If a crash happens after step 4), then since there is no PageHeader data,
PG will complain after the system restarts that the block is corrupted or
has an invalid header.

-Sreekanth

--
Regards
Sreekanth

Shreeyansh Dba

shreeyansh2014@gmail.com

over 9 years ago

In reply to: Sreekanth Palluru (#1)

Re: [ADMIN] Would like to below scenario is possible for getting page/block corruption

Hi Sreekanth,

I doubt that automatic recovery of the page is possible, since the page
header is corrupted and no longer valid, and it is not clear whether the
corruption occurred in a data block or an index block.

We have seen occurrences like this before that were fixed by running
REINDEX and VACUUM FULL on the index or the entire table.

If the corrupted block is a data block and reindexing does not help, then
depending on your current backup strategy, a logical restore (pg_dump and
restore) or PITR may help you recover from the corruption, provided you have
intact, valid backups taken before you hit this error.

Hope this helps you find the solution you need.

Please feel free to reach out to us if you have any questions.

Sreekanth Palluru

sree4pg@gmail.com

over 9 years ago

In reply to: Shreeyansh Dba (#7)

Re: [ADMIN] Would like to below scenario is possible for getting page/block corruption

Shreeyansh,
We had an issue with a relation and fixed it by setting
zero_damaged_pages and then running VACUUM FULL on the relation.

I am looking at whether PG itself could introduce corruption if a relation
is extended and a crash happens before the new page's page header is updated
in memory.

Is this possible? Does PG update the page header when a relation is extended?
If so, what details does it write? Or will it be null?

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Sreekanth Palluru (#8)

Re: [GENERAL] Re: Would like to below scenario is possible for getting page/block corruption

On Sun, Dec 11, 2016 at 12:00 PM, Sreekanth Palluru <sree4pg@gmail.com> wrote:

> I am looking at whether PG itself could introduce corruption if a relation
> is extended and a crash happens before the new page's page header is
> updated in memory.
>
> Is this possible?

No.

> Does PG update the page header when a relation is extended?

You need to look at smgrextend(), which is used when extending an on-disk
relation file. The page is written in a correct shape.
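
What "a correct shape" can mean for a brand-new block is sketched below in
Python. This is a simplified model, not the server's actual check: the
24-byte layout is loosely modelled on the 9.2 page header, and
`header_looks_valid` and `init_page` are hypothetical names. The assumption
being illustrated is that a freshly extended block is all zeros on disk and
that an all-zero page is accepted as an empty page, so only a header with
inconsistent offsets would be reported as invalid.

```python
import struct

BLCKSZ = 8192
# Assumed layout: lsn (2 x uint32), tli, flags, lower, upper, special,
# pagesize_version (uint16 each), prune_xid (uint32) = 24 bytes.
HEADER = struct.Struct("<IIHHHHHHI")

def header_looks_valid(page: bytes) -> bool:
    """Simplified sanity check on a page header (illustrative only)."""
    if page == bytes(BLCKSZ):
        return True   # an all-zero page is treated as new and empty
    (_lsn_hi, _lsn_lo, _tli, _flags,
     lower, upper, special, _size_version, _prune_xid) = HEADER.unpack_from(page)
    return HEADER.size <= lower <= upper <= special <= BLCKSZ

def init_page() -> bytes:
    """Build an empty, initialized page: free space spans header to page end."""
    page = bytearray(BLCKSZ)
    HEADER.pack_into(page, 0, 0, 0, 0, 0, HEADER.size, BLCKSZ, BLCKSZ, BLCKSZ | 4, 0)
    return bytes(page)

zero_page = bytes(BLCKSZ)      # assumed on-disk state right after extension
good_page = init_page()        # state after the page has been initialized
garbage = b"\xff" * BLCKSZ     # e.g. a block overwritten by faulty hardware

print(header_looks_valid(zero_page))   # True
print(header_looks_valid(good_page))   # True
print(header_looks_valid(garbage))     # False
```

Under these assumptions, a crash right after extension would leave a zero
page that still passes the check, which is consistent with the "No" above.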

> If so, what details does it write? Or will it be null?

--
Michael

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin