ERROR: invalid page in block 1226710 of relation base/16750/27244

Started by bricklen over 10 years ago · 9 messages · general
#1bricklen
bricklen@gmail.com

Hi,

We have run into some corruption in one of our production tables. We know
the cause (a compute node was moved), but now we need to fix the data. We
have backups, but at this point they are nearly a day old, so recovering
from them is a last resort that would incur significant downtime.
We are running PostgreSQL 9.3.9.

Following the steps at
http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

I get the following output for ctid, id, other_id, tstamp:

(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244

It appears 690652 is what would be dd'd if that's the route we take. Is
that accurate?

Because the message indicates the corruption is in the table's page, not
the page header, according to the docs zero_damaged_pages probably won't
work.

What are my options?

1). Enable zero_damaged_pages and execute VACUUM FREEZE (and hope).
2). dd the block(s) using the output of the ctid query above.

It is a multi-gigabyte table that is extremely heavily used (hundreds to
thousands of queries per second), so VACUUM FULL or CLUSTER are options we'd
really like to avoid if possible. The database is about 250GB, not huge, but
big enough that slaves and backups are time consuming to redo, or recover from.

Will attempting zero_damaged_pages cause any harm as the first step (other
than the obvious destruction of any bad pages)?

Is this the correct command if option #2 is chosen? Can it be executed
against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc

Thanks,

Bricklen

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: bricklen (#1)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

bricklen <bricklen@gmail.com> writes:

We have run into some corruption in one of our production tables. We know
the cause (a compute node was moved), but now we need to fix the data. We
have backups, but at this point they are nearly a day old, so recovering
from them is a last-resort and will incur significant downtime.
We are running 9.3.9

Following the steps at
http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

I get the following output for ctid, id, other_id, tstamp:

(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244

It appears 690652 is what would be dd'd if that's the route we take. Is
that accurate?

I'm confused by the block mentioned in the error message not having
anything to do with the TID sequence. I wonder whether it refers to an
index not the table proper. What query were you using to get this output,
exactly? Have you confirmed which relation has relfilenode 27244?

Because the message indicates the corruption is in the table's page, not
the page header, according to the docs zero_damaged_pages probably won't
work.

I should think that zero_damaged_pages would work fine, if the problem
is indeed in the base table. But if it's in an index, a REINDEX would be
a better answer.
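For reference, the session-scoped form of these two options might look like the following sketch; the table and index names are placeholders, not the poster's actual objects:

```sql
-- zero_damaged_pages is superuser-only and session-local here; any page it
-- zeroes is destroyed for good, so take a filesystem copy of the file first.
SET zero_damaged_pages = on;
VACUUM FREEZE mytable;          -- full-table scan; damaged pages get zeroed
RESET zero_damaged_pages;

-- If the damage turns out to be in an index, rebuilding it is the safer fix:
REINDEX INDEX mytable_pkey;
```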

Another thing to keep in mind here, if you've got replication slaves,
is that I'm not sure whether the effects of zero_damaged_pages would
propagate to slaves. Have you investigated the possibility that some
slave has an uncorrupt copy that you could dd into place in the master?

Is this the correct command if option #2 is chosen? Can it be executed
against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc

Uh, no, you're not accounting for the fact that such an offset wouldn't be
in the first segment file of the relation.
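The segment arithmetic can be sketched as follows, assuming the default 8 kB block size and 1 GiB segment size (both are compile-time options, so verify them on the actual build):

```shell
# PostgreSQL splits a relation into 1 GiB segment files: 27244, 27244.1, ...
# Each segment therefore holds 1 GiB / 8 kB = 131072 blocks.
BLOCK=1226710
BLOCKS_PER_SEG=$((1024 * 1024 * 1024 / 8192))   # 131072
SEG=$((BLOCK / BLOCKS_PER_SEG))                  # which segment file
SEEK=$((BLOCK % BLOCKS_PER_SEG))                 # block offset within it
echo "segment=$SEG seek=$SEEK"                   # → segment=9 seek=47062
# So the target file would be base/16750/27244.9, not the bare base file:
# dd if=/dev/zero of=base/16750/27244.$SEG bs=8192 seek=$SEEK count=1 conv=notrunc
```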

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3bricklen
bricklen@gmail.com
In reply to: Tom Lane (#2)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

Hi Tom,

On Wed, Oct 21, 2015 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

bricklen <bricklen@gmail.com> writes:

I get the following output for ctid, id, other_id, tstamp:

(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244

It appears 690652 is what would be dd'd if that's the route we take. Is
that accurate?

I'm confused by the block mentioned in the error message not having
anything to do with the TID sequence. I wonder whether it refers to an
index not the table proper. What query were you using to get this output,
exactly? Have you confirmed which relation has relfilenode 27244?

Yes, it is definitely a table. There was originally an index on that table
which threw the original error (about sibling mismatch). I dropped the
index and attempted to recreate it, which failed. Further investigation led
to discovery of corruption in the table.

Because the message indicates the corruption is in the table's page, not
the page header, according to the docs zero_damaged_pages probably won't
work.

I should think that zero_damaged_pages would work fine, if the problem
is indeed in the base table.

I will make note of that.

Another thing to keep in mind here, if you've got replication slaves,
is that I'm not sure whether the effects of zero_damaged_pages would
propagate to slaves. Have you investigated the possibility that some
slave has an uncorrupt copy that you could dd into place in the master?

We do have one uncorrupted slave, and one corrupted. I have a 4 hour
delayed WAL-apply script that runs on the primary slaves in the disaster
recovery data centres, and I stopped that process as soon as I saw the
error about the sibling mismatch on the master. It is a viable candidate to
fail over to, if we can swing a 20+ hour window of data loss. Right now
that is an undesirable option.

Is this the correct command if option #2 is chosen? Can it be executed
against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc

Uh, no, you're not accounting for the fact that such an offset wouldn't be
in the first segment file of the relation.

Hmm, I wasn't sure about that. Thanks for confirming that.

As it stands, my next step is going to be a pg_dump of one of the
up-to-date slaves (with corruption) but I will exclude the bad table. Given
that I know the PK id range, I can COPY out the table's contents before and
after the affected data. This way we can at least recover from backup if
things get entirely borked.

The next part of the plan is to create a temporary version of the table
with all data other than the corrupted range, then do some transaction-fu
to rename the tables.
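A sketch of that salvage-and-swap plan, with hypothetical table names, paths, and id boundaries (the real boundaries would come from the ctid scan, and the upper bound here is purely illustrative):

```sql
-- COPY out everything on either side of the damaged range (server-side paths,
-- superuser required; id cutoffs are placeholders).
COPY (SELECT * FROM mytable WHERE id < 318698970) TO '/backup/before.copy';
COPY (SELECT * FROM mytable WHERE id > 318700000) TO '/backup/after.copy';

-- Build the replacement and swap names in one transaction.
BEGIN;
CREATE TABLE mytable_new (LIKE mytable INCLUDING ALL);
COPY mytable_new FROM '/backup/before.copy';
COPY mytable_new FROM '/backup/after.copy';
ALTER TABLE mytable RENAME TO mytable_corrupt;
ALTER TABLE mytable_new RENAME TO mytable;
COMMIT;
```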

Thank you for your response, and any other insights are gratefully received.

Cheers,

Bricklen

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: bricklen (#3)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

bricklen <bricklen@gmail.com> writes:

On Wed, Oct 21, 2015 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm confused by the block mentioned in the error message not having
anything to do with the TID sequence. I wonder whether it refers to an
index not the table proper. What query were you using to get this output,
exactly? Have you confirmed which relation has relfilenode 27244?

Yes, it is definitely a table. There was originally an index on that table
which threw the original error (about sibling mismatch). I dropped the
index and attempted to recreate it, which failed. Further investigation led
to discovery of corruption in the table.

Hm. There's still something weird about this though. Maybe there is no
data at all between pages 1226710 and 690651? Might be worth doing some
poking around with contrib/pageinspect/.
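That poking around might look like this, with a placeholder table name; get_raw_page reads the block without the usual validity checks, so it can show what a damaged header actually contains:

```sql
-- Requires superuser; available as a contrib extension on 9.3.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The block right after the last readable TID (690651,44):
SELECT * FROM page_header(get_raw_page('mytable', 690652));

-- And the block named in the error message:
SELECT * FROM page_header(get_raw_page('mytable', 1226710));
```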

As it stands, my next step is going to be a pg_dump of one of the
up-to-date slaves (with corruption) but I will exclude the bad table. Given
that I know the PK id range, I can COPY out the table's contents before and
after the affected data. This way we can at least recover from backup if
things get entirely borked.

Agreed, if you're gonna mess with the table files directly, it's always
smart to have a fallback plan in case you make things worse.

regards, tom lane


#5bricklen
bricklen@gmail.com
In reply to: Tom Lane (#4)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

On Wed, Oct 21, 2015 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

bricklen <bricklen@gmail.com> writes:

Yes, it is definitely a table. There was originally an index on that table
which threw the original error (about sibling mismatch). I dropped the
index and attempted to recreate it, which failed. Further investigation led
to discovery of corruption in the table.

Hm. There's still something weird about this though. Maybe there is no
data at all between pages 1226710 and 690651? Might be worth doing some
poking around with contrib/pageinspect/.

Ah, good idea.

Thanks again!

#6bricklen
bricklen@gmail.com
In reply to: bricklen (#5)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

A follow-up question if I may,

bricklen <bricklen@gmail.com> writes:

Yes, it is definitely a table. There was originally an index on that table
which threw the original error (about sibling mismatch). I dropped the
index and attempted to recreate it, which failed. Further investigation led
to discovery of corruption in the table.

There are several hot standby servers attached to the master, some
streaming, and one in a different data centre that is using WAL shipping
only.
The streaming slave IIRC got the corruption from the master (I can't check
now, it was rebuilt).
What would have happened to the WAL-shipping-only standby if the WALs were
all applied? Would it have balked at applying a WAL containing bad data
from the master, or would it have applied the WAL and continued on? For the
latter, would physical corruption on the master even transfer via WAL?

I didn't get a chance to answer those questions because we promoted the DR
WAL-shipping standby before it got to the corrupted section.

Thanks,

Bricklen

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: bricklen (#6)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

bricklen <bricklen@gmail.com> writes:

There are several hot standby servers attached to the master, some
streaming, and one in a different data centre that is using WAL shipping
only.
The streaming slave IIRC got the corruption from the master (I can't check
now, it was rebuilt).
What would have happened to the WAL-shipping-only standby if the WALs were
all applied? Would it have balked at applying a WAL containing bad data
from the master, or would it have applied the WAL and continued on? For the
latter, would physical corruption on the master even transfer via WAL?

Hard to tell. I'd have guessed that corruption that made a page
unreadable would not transfer across WAL (streaming or otherwise), because
the master could not have read it in to apply an update to it. However,
we don't know the exact sequence of events here; there may have been more
than one step on the way to disaster.

regards, tom lane


#8bricklen
bricklen@gmail.com
In reply to: Tom Lane (#7)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

On Thu, Oct 22, 2015 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

bricklen <bricklen@gmail.com> writes:

What would have happened to the WAL-shipping-only standby if the WALs were
all applied? Would it have balked at applying a WAL containing bad data
from the master, or would it have applied the WAL and continued on? For the
latter, would physical corruption on the master even transfer via WAL?

Hard to tell. I'd have guessed that corruption that made a page
unreadable would not transfer across WAL (streaming or otherwise), because
the master could not have read it in to apply an update to it. However,
we don't know the exact sequence of events here; there may have been more
than one step on the way to disaster.

regards, tom lane

I would have liked to have had the opportunity to answer those questions
myself but alas, in the heat of the moment some of the data useful for
forensics was lost.

Thanks again!

#9Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: bricklen (#8)
Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

On 10/22/15 11:25 AM, bricklen wrote:

I would have liked to have had the opportunity to answer those questions
myself but alas, in the heat of the moment some of the data useful for
forensics was lost.

You could always roll WAL forward from the previous base backup and see
what happens.

FWIW, most times that I've experienced corruption it's percolated
through the WAL stream as well, presumably due to full_page_writes. It's
why I like making londiste part of the DR configuration for really
high-value data. Of course, if you're having corruption problems then
it's also very possible that your user data is getting crapped on too,
and logical replication won't help you terribly much there. I
investigated adding user-space row-level checksums but never actually
rolled that out. Of course now you'd just use page level checksums.
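For anyone following along: page-level checksums have to be chosen at cluster creation time (9.3 and later); the data directory below is a placeholder.

```shell
# Enable data checksums when initializing a new cluster (cannot be turned
# on later for an existing 9.3 cluster).
initdb --data-checksums -D /var/lib/pgsql/data
```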
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
