locate DB corruption

Started by Dave Peticolasover 7 years ago10 messagesgeneral

dave@krondo.com

over 7 years ago

Hello, I'm running into the following error running a large query on a
database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory

Searches of this mailing list seem to indicate this means a database row
has been corrupted. How would I go about locating the point of corruption
in order to fix?

thanks,
dave

Adrian Klaver

adrian.klaver@aklaver.com

over 7 years ago

In reply to: Dave Peticolas (#1)

Re: locate DB corruption

On 08/31/2018 08:02 AM, Dave Peticolas wrote:

Hello, I'm running into the following error running a large query on a
database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory

Postgres version?

Where is the replay coming from?

Searches of this mailing list seem to indicate this means a database row
has been corrupted. How would I go about locating the point of
corruption in order to fix?

To me it looks like what it says, the transaction file could not be
found. From Postgres 9.6 --> 10 pg_clog became pg_xact.

Are you sure you are not working across versions?

If not do pg_clog/ and 0C68 actually exist?

thanks,
dave

--
Adrian Klaver
adrian.klaver@aklaver.com

Dave Peticolas

dave@krondo.com

over 7 years ago

In reply to: Adrian Klaver (#2)

Re: locate DB corruption

On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 08/31/2018 08:02 AM, Dave Peticolas wrote:

Hello, I'm running into the following error running a large query on a
database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory

Postgres version?

Right! Sorry, that original email didn't have a lot of info. This is 9.6.9
restoring a backup from 9.6.8.

Where is the replay coming from?

From a snapshot and WAL files stored in Amazon S3.

Searches of this mailing list seem to indicate this means a database row
has been corrupted. How would I go about locating the point of
corruption in order to fix?

To me it looks like what it says, the transaction file could not be
found. From Postgres 9.6 --> 10 pg_clog became pg_xact.

Are you sure you are not working across versions?

I am sure, they are all 9.6.

If not do pg_clog/ and 0C68 actually exist?

pg_clog definitely exists, but 0C68 does not. I think I have subsequently
found the precise row in the specific table that seems to be the problem.
Specifically I can select * from TABLE where id = BADID - 1 or id = BADID +
1 and the query returns. I get the error if I select the row with the bad
ID.

Now what I'm not sure of is how to fix.

Adrian Klaver

adrian.klaver@aklaver.com

over 7 years ago

In reply to: Dave Peticolas (#3)

Re: locate DB corruption

On 08/31/2018 08:51 AM, Dave Peticolas wrote:

On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 08/31/2018 08:02 AM, Dave Peticolas wrote:

Hello, I'm running into the following error running a large query

on a

database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory

Postgres version?

Right! Sorry, that original email didn't have a lot of info. This is
9.6.9 restoring a backup from 9.6.8.

Where is the replay coming from?

From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

How are they being generated?

Are you sure you are not working across versions?

I am sure, they are all 9.6.

If not do pg_clog/ and 0C68 actually exist?

pg_clog definitely exists, but 0C68 does not. I think I have
subsequently found the precise row in the specific table that seems to
be the problem. Specifically I can select * from TABLE where id = BADID
- 1 or id = BADID + 1 and the query returns. I get the error if I select
the row with the bad ID.

Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.

There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.

--
Adrian Klaver
adrian.klaver@aklaver.com

Dave Peticolas

dave@krondo.com

over 7 years ago

In reply to: Adrian Klaver (#4)

Re: locate DB corruption

On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 08/31/2018 08:51 AM, Dave Peticolas wrote:

On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 08/31/2018 08:02 AM, Dave Peticolas wrote:

Hello, I'm running into the following error running a large query

on a

database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or

directory

Postgres version?

Right! Sorry, that original email didn't have a lot of info. This is
9.6.9 restoring a backup from 9.6.8.

Where is the replay coming from?

From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

This time, yes. This setup has been working for almost two years with
probably hundreds of restores in that time. But nothing's perfect I guess :)

How are they being generated?

The snapshots are sent to S3 via a tar process after calling the start
backup function. I am following the postgres docs here. The WAL files are
just copied to S3.

Are you sure you are not working across versions?

I am sure, they are all 9.6.

If not do pg_clog/ and 0C68 actually exist?

pg_clog definitely exists, but 0C68 does not. I think I have
subsequently found the precise row in the specific table that seems to
be the problem. Specifically I can select * from TABLE where id = BADID
- 1 or id = BADID + 1 and the query returns. I get the error if I select
the row with the bad ID.

Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.

Yes, I think that's a good idea, I'm trying that.

There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.

Thanks, I didn't know about that. The primary DB seems OK so hopefully it
won't be needed.

Dave Peticolas

dave@krondo.com

over 7 years ago

In reply to: Dave Peticolas (#5)

Re: locate DB corruption

On Fri, Aug 31, 2018 at 8:48 PM Dave Peticolas <dave@krondo.com> wrote:

On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 08/31/2018 08:51 AM, Dave Peticolas wrote:

On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <

adrian.klaver@aklaver.com

<mailto:adrian.klaver@aklaver.com>> wrote:

On 08/31/2018 08:02 AM, Dave Peticolas wrote:

Hello, I'm running into the following error running a large query

on a

database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or

directory

Postgres version?

Right! Sorry, that original email didn't have a lot of info. This is
9.6.9 restoring a backup from 9.6.8.

Where is the replay coming from?

From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

This time, yes. This setup has been working for almost two years with
probably hundreds of restores in that time. But nothing's perfect I guess :)

How are they being generated?

The snapshots are sent to S3 via a tar process after calling the start
backup function. I am following the postgres docs here. The WAL files are
just copied to S3.

Are you sure you are not working across versions?

I am sure, they are all 9.6.

If not do pg_clog/ and 0C68 actually exist?

pg_clog definitely exists, but 0C68 does not. I think I have
subsequently found the precise row in the specific table that seems to
be the problem. Specifically I can select * from TABLE where id = BADID
- 1 or id = BADID + 1 and the query returns. I get the error if I

select

the row with the bad ID.

Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.

Yes, I think that's a good idea, I'm trying that.

There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.

Thanks, I didn't know about that. The primary DB seems OK so hopefully it
won't be needed.

Well restoring from a backup of the primary does seem to have fixed the
issue with the corrupt table.

Adrian Klaver

adrian.klaver@aklaver.com

over 7 years ago

In reply to: Dave Peticolas (#6)

Re: locate DB corruption

On 09/01/2018 04:45 PM, Dave Peticolas wrote:

Well restoring from a backup of the primary does seem to have fixed the
issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?

--
Adrian Klaver
adrian.klaver@aklaver.com

Dave Peticolas

dave@krondo.com

over 7 years ago

In reply to: Adrian Klaver (#7)

Re: locate DB corruption

On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 09/01/2018 04:45 PM, Dave Peticolas wrote:

Well restoring from a backup of the primary does seem to have fixed the
issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?

Well the interesting thing about that is that although the bad table was
originally discovered in a DB restored from a snapshot, I subsequently
discovered it in the real-time clone of the primary from which the backups
are made. So somehow the clone's table became corrupted. The same table was
not corrupt on the primary, but I have discovered an error on the primary
-- it's in the thread I posted today. These events seem correlated in time,
I'll have to mine the logs some more.

Stephen Frost

sfrost@snowman.net

over 7 years ago

In reply to: Dave Peticolas (#8)

Re: locate DB corruption

Greetings,

* Dave Peticolas (dave@krondo.com) wrote:

On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 09/01/2018 04:45 PM, Dave Peticolas wrote:

Well restoring from a backup of the primary does seem to have fixed the
issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?

Well the interesting thing about that is that although the bad table was
originally discovered in a DB restored from a snapshot, I subsequently
discovered it in the real-time clone of the primary from which the backups
are made. So somehow the clone's table became corrupted. The same table was
not corrupt on the primary, but I have discovered an error on the primary
-- it's in the thread I posted today. These events seem correlated in time,
I'll have to mine the logs some more.

Has this primary been the primary since inception, or was it promoted to
be one at some point after first being built as a replica..?

Thanks!

Stephen

#10

Dave Peticolas

dave@krondo.com

over 7 years ago

In reply to: Stephen Frost (#9)

Re: locate DB corruption

On Sun, Sep 2, 2018 at 4:51 AM Stephen Frost <sfrost@snowman.net> wrote:

Greetings,

* Dave Peticolas (dave@krondo.com) wrote:

On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 09/01/2018 04:45 PM, Dave Peticolas wrote:

Well restoring from a backup of the primary does seem to have fixed

the

issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?

Well the interesting thing about that is that although the bad table was
originally discovered in a DB restored from a snapshot, I subsequently
discovered it in the real-time clone of the primary from which the

backups

are made. So somehow the clone's table became corrupted. The same table

was

not corrupt on the primary, but I have discovered an error on the primary
-- it's in the thread I posted today. These events seem correlated in

time,

I'll have to mine the logs some more.

Has this primary been the primary since inception, or was it promoted to
be one at some point after first being built as a replica..?

It was the primary since inception. All the problems now appear to have
stemmed from the primary due to a bug in 9.6.8 (see other thread). I've
since upgraded to 9.6.10.