corrupted files

Started by Klaus Ita, over 12 years ago. 10 messages. Tags: bugs, general
#1 Klaus Ita
klaus@worstofall.com
bugs, general

Hi list!

Depressed, I am getting error messages like these:

2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> ERROR: could not access
status of transaction 8393477
2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> DETAIL: Could not open
file "pg_clog/0008": No such file or directory.

combined with the error output of queries that do not work.

I looked in pg_clog and, indeed, 0008 is missing.
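For reference, the missing segment name follows directly from the transaction id in the error above: each byte in pg_clog holds four 2-bit transaction statuses, so each 256 kB segment covers 1,048,576 xids. A small sketch of the mapping (plain Python, purely illustrative, not part of PostgreSQL):

```python
# Each pg_clog segment is 256 kB, and each byte stores the 2-bit
# commit status of four transactions, so one segment file covers
# 256 * 1024 * 4 = 1,048,576 transaction ids.
XIDS_PER_SEGMENT = 256 * 1024 * 4

def clog_segment_for_xid(xid: int) -> str:
    """Name of the pg_clog segment file holding this xid's status."""
    return format(xid // XIDS_PER_SEGMENT, "04X")

# Transaction 8393477 from the ERROR message lands in segment 0008:
print(clog_segment_for_xid(8393477))  # -> 0008
```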

On this Linux machine (3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64
GNU/Linux) I am using XFS on RAID1 on a MegaCLI RAID controller with 16
disks, no battery, which is why write-through is enabled, no caching.

My feeling is that 'something' got confused between hot standby and
WAL archiving, leading to this situation, which seems to be partially XFS
caused? xfs_repair mentioned some missing files that pg did not
expect to have (maybe truncated tables?).

I quite extensively created indices in transactions and removed them again
within those transactions to do fast deletes (foreign key constraints)
before I got the error.

* I tried to get one of the warm standbys up, but one complains about not
being from the same pg cluster as the 'WAL files'. The other hot standby won't
start for some locale reason.
(It's not that I did not have backups ;) ).

The cluster is 'working'; I get the error around once per second, but the other
clients seem fine, so it's really only a few tables that are corrupted. I
cannot really take down the machine, as it's quite a busy cluster with a few
million queries a day.

Before the current error, I got an error that XXXXX.1 was missing, which
was (luckily) an index file that I could recreate via REINDEX, but I fear
we're now at table/transaction corruption which I cannot just 'rewrite'.

I would not at all mind just discarding all the transactions that have
accumulated in pg_clog:

postgres@pgmaster:~/9.1/main/pg_clog$ ls -alrt | wc -l
180

quite desperate...

postgres@[local]:5432 [postgres] # select version();
version

----------------------------------------------------------------------------------------------
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-5) 4.7.2, 64-bit
(1 row)

Customized options:

#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = '' # list of custom variable class
names

listen_addresses = '*' # what IP address(es) to listen on;
max_connections = 320 # (change requires restart)
timezone = 'Etc/UTC'

shared_buffers = 2GB # min 128kB
maintenance_work_mem = 250MB
checkpoint_completion_target = 0.9
effective_cache_size = 20GB
effective_io_concurrency = 6 # 1-1000. 0 disables prefetching

archive_mode = on
wal_level = 'hot_standby' #
http://www.postgresql.org/docs/9.1/static/runtime-config-wal.html#GUC-WAL-LEVEL

archive_command = '/opt/postgres_archive_command.pl --file_path=%p --file_name=%f --work_dir=/var/tmp/ --destination_hosts=va-pg-backups@dx.ipv6.ex.net --destination_sftp_hosts=u671@ipv6.u71.y --destination_hosts=va-pg-backups@y7.ipv6.ex.net'

max_wal_senders = 3 # max number of walsender processes
wal_keep_segments = 50 # in logfile segments, 16MB each; 0 disables

thx in advance,

klaus

#2 Greg Stark
stark@mit.edu
In reply to: Klaus Ita (#1)
bugs, general
Re: corrupted files

On Mon, Jul 29, 2013 at 10:19 PM, Klaus Ita <klaus@worstofall.com> wrote:

My feeling is, that 'something' got confused with hot-standby and
wal_archiving leading to this situation, that seems to be partially xfs
caused?????? xfs_repair mentionned some missing files, that pg did not
expect to have (maybe truncated tables??).

This doesn't sound like a problem with WAL archiving or hot standby.
It doesn't sound like a Postgres bug at all. It sounds like your
filesystem deleted some files that Postgres needs. pg_clog
contains critical data that you're not going to be able to get by
without.

I suggest posting to pgsql-general and include more information to
help people help you. You say xfs_repair mentioned some missing files
but don't include the actual error messages for example.

--
greg

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3 Klaus Ita
klaus@worstofall.com
In reply to: Greg Stark (#2)
bugs, general
Re: corrupted files

Hi Greg!

Thank you for your immediate response. I cannot tell you which files
xfs_repair mangled, but I did the repair on an LVM snapshot. The files
were not listed in 'select * from pg_class'; might that not be THE source for
the inodes needed by the cluster?

I agree, it does not really sound like a pg bug, rather pg corruption due
to whatever else. I cannot imagine the XFS / RAID / LSI combination being the
problem. I rather guess that some RAM might have been corrupted.

I am moving the cluster to other hardware and hope to have better insight
there.

I am sorry for being partially vague; I cannot really grasp the problem
myself, which is why my description might be lacking detail.

I will cross-post to 'general'

thx,k

On Mon, Jul 29, 2013 at 11:38 PM, Greg Stark <stark@mit.edu> wrote:


#4 Klaus Ita
klaus@worstofall.com
In reply to: Klaus Ita (#1)
bugs, general
Fwd: corrupted files

Sorry for cross-posting; I read that pgsql-bugs was not the right place for
this email.

Hi list!

Depressed, I am getting error messages like these:

2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> ERROR: could not access
status of transaction 8393477
2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> DETAIL: Could not open
file "pg_clog/0008": No such file or directory.

combined with the error output of queries that do not work.

I looked in pg_clog and, indeed, 0008 is missing.

On this Linux machine (3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64
GNU/Linux) I am using XFS on RAID1 on a MegaCLI RAID controller with 16
disks, no battery, which is why write-through is enabled, no caching.

I quite extensively created indices in transactions and removed them again
within those transactions to do fast deletes (foreign key constraints)
before I got the error.

Now it might be that the memory on the server is corrupt? Dunno, but I
think it's the only 'cheap' part in the whole game.

* I tried to get one of the warm standbys up, but one complains about not
being from the same pg cluster as the 'WAL files'. The other hot standby won't
start for some locale reason.
(It's not that I did not have backups ;) ).

The cluster is 'working'; I get the error around once per second, but the other
clients seem fine, so it's really only a few tables that are corrupted. I
cannot really take down the machine, as it's quite a busy cluster with a few
million queries a day.

Before the current error, I got an error that XXXXX.1 was missing, which
was (luckily) an index file that I could recreate via REINDEX, but I fear
we're now at table/transaction corruption which I cannot just 'rewrite'.

I would not at all mind just discarding all the transactions that have
accumulated in pg_clog:

postgres@pgmaster:~/9.1/main/pg_clog$ ls -alrt | wc -l
180

Is there any way, even with data loss, to get rid of those transactions and
just let the cluster behave again? It's serving some web apps for users, so
some minor data loss will not be an issue.

quite desperate...

postgres@[local]:5432 [postgres] # select version();
version

----------------------------------------------------------------------------------------------
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-5) 4.7.2, 64-bit
(1 row)

Customized options:

#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = '' # list of custom variable class
names

listen_addresses = '*' # what IP address(es) to listen on;
max_connections = 320 # (change requires restart)
timezone = 'Etc/UTC'

shared_buffers = 2GB # min 128kB
maintenance_work_mem = 250MB
checkpoint_completion_target = 0.9
effective_cache_size = 20GB
effective_io_concurrency = 6 # 1-1000. 0 disables prefetching

archive_mode = on
wal_level = 'hot_standby' #
http://www.postgresql.org/docs/9.1/static/runtime-config-wal.html#GUC-WAL-LEVEL

archive_command = '/opt/postgres_archive_command.pl --file_path=%p --file_name=%f --work_dir=/var/tmp/ --destination_hosts=va-pg-backups@dx.ipv6.ex.net --destination_sftp_hosts=u671@ipv6.u71.y --destination_hosts=va-pg-backups@y7.ipv6.ex.net'

max_wal_senders = 3 # max number of walsender processes
wal_keep_segments = 50 # in logfile segments, 16MB each; 0 disables

thx in advance,

klaus

#5 raghu ram
raghuchennuru@gmail.com
In reply to: Klaus Ita (#4)
bugs, general
Re: Fwd: corrupted files

On Tue, Jul 30, 2013 at 4:07 AM, Klaus Ita <klaus@worstofall.com> wrote:

Sorry for cross-posting, i read that pg-bug was not the right place for
this email

Hi list!

depressed me gets error messages like these:

2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> ERROR: could not access
status of transaction 8393477
2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> DETAIL: Could not open
file "pg_clog/0008": No such file or directory.

combined with the error output of queries that do not work.

I looked in pg_clog and correct, 0008 is missing.

You can recreate a missing pg_clog file with the command below:

dd if=/dev/zero of=~/9.1/main/pg_clog/0008 bs=256k count=1

(The zeroed file marks those transactions as uncommitted, as if they had
never been committed.)

Then try to start the cluster.
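For context on why zero-filling works: pg_clog stores two status bits per transaction, and a status of 0 means the transaction is treated as never committed. A rough sketch of the lookup (plain Python mirroring the on-disk layout; the status values follow PostgreSQL's clog.h, but this is an illustration, not server code):

```python
# pg_clog keeps 2 bits of commit status per transaction id (xid).
# Status values as defined in PostgreSQL's clog.h:
#   0 = IN_PROGRESS, 1 = COMMITTED, 2 = ABORTED, 3 = SUB_COMMITTED
XIDS_PER_BYTE = 4
SEGMENT_BYTES = 256 * 1024

def clog_status(segment: bytes, xid: int) -> int:
    """Extract the 2-bit status of an xid from its clog segment's bytes."""
    offset = xid % (SEGMENT_BYTES * XIDS_PER_BYTE)   # xid's slot within this segment
    shift = (offset % XIDS_PER_BYTE) * 2             # 2 bits per xid within a byte
    return (segment[offset // XIDS_PER_BYTE] >> shift) & 0b11

# A zero-filled segment, as produced by the dd command above, reports
# every transaction as 0 (IN_PROGRESS), i.e. never committed:
zeroed = bytes(SEGMENT_BYTES)
print(clog_status(zeroed, 8393477))  # -> 0
```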

Thanks & Regards
Raghu Ram

#6 Klaus Ita
klaus@worstofall.com
In reply to: raghu ram (#5)
bugs, general
Re: Fwd: corrupted files

Hi!

Thank you, I actually tried that, and it seems that only led to even more
corrupted data. I am currently trying to recover the 'hot standby' host,
which is also unhappy about one of the WAL files. I am looking at the WAL
with less and see only data I do not care about in it (mostly
session-logging/statistics data).

I am trying to remember: there was a tool that printed the contents of the
WAL files in a more readable format...

lg,k

On Tue, Jul 30, 2013 at 8:23 AM, raghu ram <raghuchennuru@gmail.com> wrote:


#7 bricklen
bricklen@gmail.com
In reply to: Klaus Ita (#6)
bugs, general
Re: Fwd: corrupted files

On Mon, Jul 29, 2013 at 11:50 PM, Klaus Ita <klaus@worstofall.com> wrote:

I am trying to remember, there was a tool that plotted the contents of the
wal_files in a more readable format ...

xlogdump?

https://github.com/snaga/xlogdump

#8 Klaus Ita
klaus@worstofall.com
In reply to: bricklen (#7)
bugs, general
Re: Fwd: corrupted files

Yes, that's it!

Thank you! It turned out that there really was corruption in the main pg
server, which was 'virally' propagated to:

1. the streaming replica
2. the replaying WAL receiver
3. an old backup that tried to replay the WALs

I really thought that with a master and 3 backups I'd be safe.

lg,k

On Tue, Jul 30, 2013 at 5:13 PM, bricklen <bricklen@gmail.com> wrote:


#9 bricklen
bricklen@gmail.com
In reply to: Klaus Ita (#8)
bugs, general
Re: Fwd: corrupted files

On Tue, Jul 30, 2013 at 8:18 AM, Klaus Ita <klaus@worstofall.com> wrote:


Physical corruption in the master, or logical?

#10 Klaus Ita
klaus@worstofall.com
In reply to: bricklen (#9)
bugs, general
Re: Fwd: corrupted files

I guess logical, caused by whatever. I really cannot say; the WAL files all
*look* OK, yet they lead to a situation that's a definite dead end.
We did have a hard-drive failure (one of 13) at the time, but due to RAID5
+ hot spare no data should have been corrupted. I mean, it's an LSI
controller... not fond of it, but it's not bad stuff.

lg,k

On Tue, Jul 30, 2013 at 5:29 PM, bricklen <bricklen@gmail.com> wrote:
