BUG #3484: Missing pg_clog file / corrupt index
The following bug has been logged online:
Bug reference: 3484
Logged by: Marc Schablewski
Email address: ms@clickware.de
PostgreSQL version: 8.1.8
Operating system: SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
Description: Missing pg_clog file / corrupt index
Details:
Our application receives and processes payment information that comes in
plain text files and stores the processed data into different tables in our
database. There are about 5 tables involved, three of them with 35 million
records so far. We get approximately 150000 payments a day. Each payment is
handled in a single transaction, because in case of an error, we want to
store as many payments as possible. We have about 500000 INSERT and a few
UPDATE statements each day. The whole application runs on two servers (see
specs below) which are nearly identical. One is the production system, the
other is for testing.
A few months ago we had some trouble with the test system. Postgres
complained about a missing pg_clog file during nightly routine
VACUUM/ANALYZE. Some days later, the same error occurred on the production
system, even on the same table! The corrupted table is one of those bigger
ones involved in the file processing. After searching the web, we found a
hint that this problem could be related to a bug in 8.1.3, so we upgraded to
8.1.8 and restored the databases on both servers. This was around May, but
now we discovered the same problem on our production system again. Actually,
the error occurred four weeks ago, but it was not discovered until now –
if you do error logging, you should look at it from time to time ;)
When trying to backup or vacuum the database, we receive one of the
following error messages:
**BACKUP**
pg_dump: ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
gefunden
pg_dump: SQL command to dump the contents of table "dateiblock" failed:
PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not access status of
transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
gefunden
pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id, lfdnr,
binaer_offset, laenge) TO stdout;
**VACUUM**
INFO: vacuuming "public.dateiblock"
ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
gefunden
(For those not familiar with the German language: ‘Datei oder Verzeichnis
nicht gefunden’ means ‘file or directory not found’).
Current pg_clog files range from 005A to 008F. The oldest one is dated April
30th.
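For reference, the stopgap usually suggested on the lists for this symptom (not advice given in this thread, and it deliberately sacrifices the affected rows) is to recreate the missing segment filled with zeroes. A minimal sketch, writing to a scratch directory rather than a live cluster:

```shell
# A zero clog status byte means "transaction in progress", so tuples stamped
# with XIDs in the recreated segment are treated as never committed; VACUUM
# and pg_dump stop erroring, at the cost of those rows.  On a real 8.1
# cluster the target would be $PGDATA/pg_clog/0002, with the server stopped
# and a filesystem backup taken first; a scratch directory is used here.
CLOGDIR=$(mktemp -d)/pg_clog
mkdir -p "$CLOGDIR"
# one clog segment in 8.1 is 32 pages of 8192 bytes = 262144 bytes
dd if=/dev/zero of="$CLOGDIR/0002" bs=256k count=1 2>/dev/null
ls -l "$CLOGDIR/0002"
```

After this, the previously erroring rows simply disappear from query results, which is why keeping a copy of the original files matters.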
We narrowed it down to a few records in that table. Some records contain
unreasonable values, others produce the same message about the missing
pg_clog file when selected and some are simply missing. But they must have
existed, because there are still records in a second table referencing them.
One strange thing is that the referencing records are about two and a half
months old and shouldn’t have been touched since then.
We don’t think this is a hardware issue, because we had it on two
different servers and within a short period of time.
Luckily, the loss of data is minimal. There are only about 30 records
affected. Otherwise this would have been fatal, because as said before, our
backup was not working either.
In addition, we had another problem today. One of the indexes on a second
table became corrupted, causing the database backend to roll back all
transactions and restart the processes. A REINDEX fixed it, but it leaves me
with a bad feeling about what will break next.
2007-07-25 08:07:00 CEST PANIC: right sibling's left-link doesn't match
2007-07-25 08:07:00 CEST STATEMENT: insert into transaktion (kaz_id,
dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status,
kartennr
, wae_id, kassenschnitt, freigabe, flo_id, importdatum) VALUES ($1, $2, $3,
$4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
2007-07-25 08:07:00 CEST LOG: server process (PID 5699) was terminated by
signal 6
2007-07-25 08:07:00 CEST LOG: terminating any other active server
processes
2007-07-25 08:07:00 CEST WARNING: terminating connection because of crash
of another server process
2007-07-25 08:07:00 CEST DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
Kind regards,
Marc Schablewski
System:
OS: SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
System: 2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP ProLiant
server)
3Ware 9500S-4LP, 2xRAID1 (one for OS & one for database)
Postgres 8.1.8
Postgres was compiled manually but with default parameters. The
configuration has been tuned to improve performance.
shared_buffers = 30000
maintenance_work_mem = 131072
max_fsm_pages = 500000
default_statistics_target = 200
redirect_stderr = on
log_directory = '/var/log/pg_log'
log_min_duration_statement = 10
log_duration = on
log_line_prefix = '%t '
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off
VACUUM ANALYZE runs once a day; autovacuum is off.
On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:
[...]
Actually, this does sound like a hardware problem to me. You only
have 5 tables that get hit heavily, so you've likely got somewhere
around a 20% chance that corruption would hit the same table on two
different machines.
So far you haven't said anything that sounds unusual about how you're
using the database, and the hardware certainly seems pretty commonplace,
so I'm rather doubtful that it's software. But if you still have copies
of the bad database, someone might be able to help you.
--
Decibel!, aka Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
I kept a copy of the data files in case it is needed, but I have to
check first whether I am allowed to give away that information. Some of the
data is confidential. If you just need the files containing the damaged
table, this won't be a big problem, because it does not contain any
confidential information (as long as one data file only contains the
data of one table). The other problem is the size of the files. The
whole database is about 60GB and the files belonging to that table are
about 2.5GB. Maybe there is a way to pre-select the data you need?
Decibel! wrote:
[...]
--
Marc Schablewski
click:ware Informationstechnik GmbH
"Marc Schablewski" <ms@clickware.de> writes:
I kept a copy of the data files in case it is needed, but I have to
check first whether I am allowed to give away that information. Some of the
data is confidential. If you just need the files containing the damaged
table, this won't be a big problem, because it does not contain any
confidential information (as long as one data file only contains the
data of one table). The other problem is the size of the files. The
whole database is about 60GB and the files belonging to that table are
about 2.5GB. Maybe there is a way to pre-select the data you need?
Perhaps. You could find the records with unreasonable values. But I don't
think there's any convenient way to find the records which produce the clog
error or which are missing unless they turn out to be on the same page.
Out of curiosity, what do the unreasonable values look like?
Earlier you said:
We narrowed it down to a few records in that table. Some records contain
unreasonable values, others produce the same message about the missing
pg_clog file when selected and some are simply missing. But they must have
existed, because there are still records in a second table referencing them.
If you still have a live database with this data then if you can do
SELECT ctid FROM tab WHERE ...
for the records with unreasonable values that might tell you what blocks are
corrupted. The value before the comma is the block number, which, when
multiplied by 8192 (assuming you're using 8k blocks), gives the file
offset at which to look for the page.
To find the file to look for the block in do:
postgres=# select relfilenode from pg_class where relname = 'tablename';
relfilenode
-------------
16384
(1 row)
Note that if the file offset is over 1G then you would be looking for a file
named 16384.N, where N is the gigabyte chunk number.
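The arithmetic above can be scripted; a minimal sketch, using a hypothetical block number and the relfilenode 16384 from the example query:

```shell
# Map a block number (the first half of a ctid) to the segment file and byte
# offset that hold the page.  The block number is hypothetical; 16384 is the
# relfilenode from the query above; 8192 bytes per block and 131072 blocks
# per 1GB segment file are the defaults.
BLOCK=346965
BLKSZ=8192
SEG=$(( BLOCK / 131072 ))      # which file: "16384" if 0, else "16384.$SEG"
SKIP=$(( BLOCK % 131072 ))     # block offset within that segment file
echo "file 16384.$SEG, byte offset $(( SKIP * BLKSZ ))"
# the page could then be dumped for inspection with something like:
#   dd if=16384.$SEG bs=$BLKSZ skip=$SKIP count=1 2>/dev/null | od -A x -t x1
```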
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
We also have the exact same problem: every 5 to 10 days, when the data
grows to a certain size, PostgreSQL complains about missing pg_clog files and
invalid page headers during either vacuum or reindex operations.
The problem happens on different customer sites with Linux 2.6.11.
There is one particular table that is much more heavily used (99%
inserts, some selects), and this table turns out to be the one having
problems.
If this happens at different companies so frequently, I'd doubt it's a
hardware problem. And we did check the hardware but did not find any
problems.
Here are some sample messages:
2007-07-07T10:11:38+00:00 ERROR: could not access status of transaction
842085945
2007-07-07T10:11:38+00:00 DETAIL: could not open file "pg_clog/0323":
No such file or directory
2007-08-07T08:10:23+00:00 ERROR: could not access status of transaction
1481866610
2007-08-07T08:10:24+00:00 DETAIL: could not open file "pg_clog/0585":
No such file or directory
2007-08-07T08:13:55+00:00 ERROR: invalid page header in block 346965 of
relation "sipmessage"
2007-08-07T08:30:16+00:00 ERROR: could not access status of transaction
1481866610
2007-08-07T08:30:16+00:00 DETAIL: could not open file "pg_clog/0585":
No such file or directory
2007-08-07T08:34:08+00:00 ERROR: invalid page header in block 346965 of
relation "sipmessage"
2007-08-07T08:51:02+00:00 ERROR: could not access status of transaction
1481866610
2007-08-07T08:51:02+00:00 DETAIL: could not open file "pg_clog/0585":
No such file or directory
2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of
relation "calllegstart_sessionid"
2007-08-13T10:12:15+00:00 ERROR: could not access status of
transaction 0
2007-08-13T10:12:15+00:00 DETAIL: could not create file
"pg_subtrans/0201": File exists
-----Original Message-----
From: pgsql-bugs-owner@postgresql.org
[mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Gregory Stark
Sent: Tuesday, July 31, 2007 7:00 AM
To: Marc Schablewski
Cc: Decibel!; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index
[...]
On Fri, Aug 24, 2007 at 09:19:49AM -0400, Feng Chen wrote:
The problem happens on different customer sites with Linux 2.6.11.
Well, at kernel.org in the changelog for 2.6.12, I see this:
[PATCH] PCI: don't override drv->shutdown unconditionally
There are many drivers that have been setting the generic driver
model level shutdown callback, and pci thus must not override it.
Without this patch we can have really bad data loss on various
raid controllers.
Could it be related to your problem? (I'm not trying to dismiss --
I've just been bitten by too many Linux corner cases not to worry
about the kernel.)
A
--
Andrew Sullivan | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.
--H.W. Fowler
Feng Chen wrote:
We also have the exact same problem: every 5 to 10 days, when the data
grows to a certain size, PostgreSQL complains about missing pg_clog files and
invalid page headers during either vacuum or reindex operations.
The problem happens on different customer sites with Linux 2.6.11.
There is one particular table that is much more heavily used (99%
inserts, some selects), and this table turns out to be the one having
problems. If this happens at different companies so frequently, I'd doubt
it's a hardware problem. And we did check the hardware but did not find any
problems.
[...]
2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of
relation "calllegstart_sessionid"
2007-08-13T10:12:15+00:00 ERROR: could not access status of
transaction 0
2007-08-13T10:12:15+00:00 DETAIL: could not create file
"pg_subtrans/0201": File exists
What version of PostgreSQL is this, exactly? There is a problem in older
8.1 versions that could cause this error under high transaction rates.
Stefan
Feng Chen wrote:
We also have the same exact problem - every 5 to 10 days when the data
get to some size, PostgreSQL complains about missing pg_clog files, and
invalid page headers during either vacuum or reindex operations.
What PG version is this?
2007-08-13T10:12:15+00:00 ERROR: could not access status of
transaction 0
2007-08-13T10:12:15+00:00 DETAIL: could not create file
"pg_subtrans/0201": File exists
Huh???
--
Alvaro Herrera Developer, http://www.PostgreSQL.org/
"Aprende a avergonzarte más ante ti que ante los demás" (Demócrito)
VERSION = PostgreSQL 8.1.2
-----Original Message-----
From: Alvaro Herrera [mailto:alvherre@commandprompt.com]
Sent: Friday, August 24, 2007 10:02 AM
To: Feng Chen
Cc: Gregory Stark; Marc Schablewski; Decibel!; pgsql-bugs@postgresql.org
Subject: [SPAM] Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index
[...]
On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
VERSION = PostgreSQL 8.1.2
You know that the project doesn't put out maintenance releases for
the fun of it, right? The latest is 8.1.9 in that series. You need
to upgrade.
A
--
Andrew Sullivan | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
--Scott Morris
Feng Chen wrote:
VERSION = PostgreSQL 8.1.2
this could be
http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php
which is fixed in 8.1.3 and later - so you really should look into
upgrading to 8.1.9 as soon as possible.
Stefan
On Fri, Aug 24, 2007 at 10:11:06AM -0400, Andrew Sullivan wrote:
On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
VERSION = PostgreSQL 8.1.2
You know that the project doesn't put out maintenance releases for
the fun of it, right? The latest is 8.1.9 in that series. You need
to upgrade.
Indeed, looking at the release notes:
#
Fix race condition that could lead to "file already exists" errors
during pg_clog and pg_subtrans file creation (Tom)
#
--
Andrew Sullivan | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
--Damien Katz
Thanks for the replies! I guess it's time for 8.1.9 then. Will keep you
informed afterwards!
-----Original Message-----
From: Andrew Sullivan [mailto:ajs@crankycanuck.ca]
Sent: Friday, August 24, 2007 10:14 AM
To: Andrew Sullivan
Cc: Feng Chen; Alvaro Herrera; Gregory Stark; Marc Schablewski;
Decibel!; pgsql-bugs@postgresql.org
Subject: [SPAM] Re: [SPAM] Re: [BUGS] BUG #3484: Missing pg_clog file /
corrupt index
[...]
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Feng Chen wrote:
VERSION = PostgreSQL 8.1.2
this could be
http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php
Note, however, that that only explains the "could not create ... File
exists" complaint, which is hardly the main problem here.
The combination of "invalid page header" and "could not access status of
transaction" messages looks like nothing so much as severe data
corruption. Remember that the xmin/xmax fields are basically the first
thing we can check with any degree of strictness when examining a
tuple. This means that if a page is partially clobbered, but not in a
way that sets off the invalid-page-header checks, then the odds are very
high that the first detectable sign of trouble will be references to
transaction numbers that are far away from what the system is really
using. (Is 1481866610 anywhere near the current XID counter reported
by pg_controldata?)
I concur with the upthread comment to check into kernel bugs,
particularly if all of your machines are using the same old kernel
release.
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Remember that the xmin/xmax fields are basically the first thing we can
check with any degree of strictness when examining a tuple. This means that
if a page is partially clobbered, but not in a way that sets off the
invalid-page-header checks, then the odds are very high that the first
detectable sign of trouble will be references to transaction numbers that
are far away from what the system is really using.
I'm increasingly thinking that one of the first things I'll suggest putting
into 8.4 is a per-page checksum after all. It was talked about a while back
and people thought it was pointless but I think the number of reports of
hardware and kernel bugs resulting in zeroed and corrupted pages has been
steadily going up. If not in total, then as a percentage of the total problems.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Yes, most of the time PostgreSQL first complains about missing pg_clog
files and then about invalid page headers. The "could not create
... File exists" message was only seen a couple of times.
I don't have a system that exhibits this problem right now. I can check
the XID counter when it happens again.
Thanks!
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, August 24, 2007 12:11 PM
To: Stefan Kaltenbrunner
Cc: Feng Chen; Alvaro Herrera; Gregory Stark; Marc Schablewski;
Decibel!; pgsql-bugs@postgresql.org
Subject: [SPAM] Re: [SPAM] Re: [BUGS] BUG #3484: Missing pg_clog file /
corrupt index
[...]
Gregory Stark <stark@enterprisedb.com> writes:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Remember that the xmin/xmax fields are basically the first thing we can
check with any degree of strictness when examining a tuple. This means that
if a page is partially clobbered, but not in a way that sets off the
invalid-page-header checks, then the odds are very high that the first
detectable sign of trouble will be references to transaction numbers that
are far away from what the system is really using.
I'm increasingly thinking that one of the first things I'll suggest putting
into 8.4 is a per-page checksum after all. It was talked about a while back
and people thought it was pointless but I think the number of reports of
hardware and kernel bugs resulting in zeroed and corrupted pages has been
steadily going up. If not in total than as a percentage of the total problems.
It's still pointless; a checksum does nothing to prevent data
corruption. The error report might be slightly more obvious to a novice
but it doesn't bring your data back.
Something we could possibly do now is to modify these error messages:
if the transaction number we're trying to check is obviously bogus
(beyond the current XID counter or older than the current freeze
horizon) we could report it as a corrupted XID rather than exposing
the "no such clog segment" condition.
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
It's still pointless; a checksum does nothing to prevent data
corruption. The error report might be slightly more obvious to a novice
but it doesn't bring your data back.
Well if it's a single bit error from bad memory or a torn page from having
full_page_writes turned off then the resulting page could be entirely valid.
The user might not find out about the corrupt data before it's had time to
migrate elsewhere.
Also, the sooner the corrupt data is reported, the sooner the user can restore
from backups and avoid further data loss. The last discussion of this feature
concentrated on being able to detect torn-page corruption with
full_page_writes turned off during recovery.
Something we could possibly do now is to modify these error messages:
if the transaction number we're trying to check is obviously bogus
(beyond the current XID counter or older than the current freeze
horizon) we could report it as a corrupted XID rather than exposing
the "no such clog segment" condition.
That would be clever. I take it you mean the invalid values would be those
values older than the actual relfrozenxid. Certainly they should throw some
sort of error instead of trying to find the transaction in the clog.
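The proposed classification can be sketched in a few lines; the counters here are hypothetical (and wraparound is ignored), but the suspect XID is the one from Marc's errors:

```shell
# Sketch of the proposed check: an XID is only plausibly valid if it lies
# between the freeze horizon (relfrozenxid) and the next XID to be assigned.
# FROZEN and NEXT are hypothetical counter values; 2134240 is the XID from
# the "could not access status of transaction" errors earlier in the thread.
XID=2134240
FROZEN=95000000   # hypothetical freeze horizon
NEXT=150000000    # hypothetical next-XID counter
if [ "$XID" -lt "$FROZEN" ] || [ "$XID" -ge "$NEXT" ]; then
    echo "XID $XID is outside [$FROZEN, $NEXT): report as corrupted XID"
else
    echo "XID $XID may be valid: consult pg_clog"
fi
```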
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Andrew Sullivan wrote:
On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
VERSION = PostgreSQL 8.1.2
You know that the project doesn't put out maintenance releases for
the fun of it, right? The latest is 8.1.9 in that series. You need
to upgrade.
We moved the "you should upgrade for all minor releases" from FAQ text
to a link to our web site:
http://www.postgresql.org/support/versioning
The problem is that instead of those words being in the FAQ, they are
now linked from the FAQ, and I am concerned that fewer people are seeing
that recommendation.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
We've got the invalid page header again:
ERROR: invalid page header in block 18223 of relation "calllegstart"
bash-3.00# pg_controldata /var/lib/pgsql/data
pg_control version number: 812
Catalog version number: 200510211
Database system identifier: 5101280766299435989
Database cluster state: in production
pg_control last modified: Mon Aug 27 17:32:12 2007
Current log file ID: 6
Next log file segment: 242
Latest checkpoint location: 6/EFB4C658
Prior checkpoint location: 6/EDC84A9C
Latest checkpoint's REDO location: 6/EFB04720
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 23661755
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Time of latest checkpoint: Mon Aug 27 17:29:17 2007
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C
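Answering Tom's earlier question mechanically: the suspect XID from the August errors is wildly beyond this NextXID. The two readings come from different incidents, so this is only illustrative; a sketch of the comparison, with both numbers copied from the thread and wraparound ignored:

```shell
# Compare the XID from the error messages against the cluster's XID counter.
# 1481866610 is from the "could not access status of transaction" errors;
# the NextXID line mirrors the pg_controldata output above.  A plausible XID
# should be below NextXID (this rough check ignores XID wraparound).
SUSPECT_XID=1481866610
CONTROL="Latest checkpoint's NextXID:          23661755"
NEXT_XID=$(printf '%s\n' "$CONTROL" | awk -F': *' '/NextXID/ { print $2 }')
if [ "$SUSPECT_XID" -ge "$NEXT_XID" ]; then
    echo "XID $SUSPECT_XID is far beyond NextXID $NEXT_XID: looks like corruption"
fi
```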
-----Original Message-----
From: Feng Chen
Sent: Friday, August 24, 2007 1:42 PM
To: 'Tom Lane'; Stefan Kaltenbrunner
Cc: Alvaro Herrera; Gregory Stark; Marc Schablewski; Decibel!;
pgsql-bugs@postgresql.org
Subject: RE: [SPAM] Re: [SPAM] Re: [BUGS] BUG #3484: Missing pg_clog
file / corrupt index
[...]