corrupted item pointer:???
hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks, 2g ram,
AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:
./configure \
--prefix=/home/pgdba/work \
--without-debug \
--disable-debug \
--with-pgport=5810 \
--with-tcl \
--with-perl \
--with-python \
--without-krb4 \
--without-krb5 \
--without-pam \
--without-rendezvous \
--with-openssl \
--with-readline \
--with-zlib \
--with-gnu-ld
version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 3.3.5 (Debian 1:3.3.5-13)
on this machine i have copy of 40G database from production servers.
i was testing migration to hstore (
http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)
at some point of the tests it failed, saying "corrupted item pointer".
migration was done using:
LOOP:
select data from test_table where primary_key = constant;
select data from secondary_table where unique_key = constant;
update test_table set hstore_field = some_value where primary_key = constant;
REPEAT FOR ALL items;
this loop was made in external code (perl, using dbi + dbd::pg).
we did about 37 such loops every second.
everything went fine.
then (every 30 minutes) i ran vacuum to reclaim space from test_table.
during one of these vacuums it panicked, showing the aforementioned error and
killing all connections.
is there any way i can check what went wrong?
i dont need to recover the data.
i just need to know whether the problem is hstore-related, hardware-related
or just a random thing happening for no reason.
i tried:
hand-vacuum
vacuum analyze
vacuum full analyze
reindex the table
another vacuum
none of this worked.
i still get the corrupted item pointer message.
any clues on how can i check what went wrong?
depesz
hubert depesz lubaczewski wrote:
hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks, 2g ram,
AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:
...
version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 3.3.5 (Debian 1:3.3.5-13)
OK - nothing unusual there.
on this machine i have copy of 40G database from production servers.
i was testing migration to hstore (
http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)
at some point of the tests it failed, saying "corrupted item pointer".
is there any way i can check what went wrong?
i dont need to recover the data.
i just need to know whether the problem is hstore-related,
hardware-related or just a random thing happening for no reason.
Hmm - I believe that means a data/index block was corrupted.
Have you seen any crashes, or hardware-related errors in your logs?
What are your config settings, particularly the first three here:
http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes
--
Richard Huxton
Archonet Ltd
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
Hmm - I believe that means a data/index block was corrupted.
indices were recreated (reindex table), so i think this is data related
problem.
Have you seen any crashes, or hardware-related errors in your logs?
nope. uptime is over 40 days.
the machine is not used for anything else so i can't tell anything, but i
didn't see any problems with it.
What are your config settings, particularly the first three here:
http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes
sure:
irr=# show fsync;
fsync
-------
on
(1 row)
irr=# show wal_sync_method;
wal_sync_method
-----------------
fdatasync
(1 row)
irr=# show full_page_writes;
full_page_writes
------------------
on
(1 row)
depesz
hubert depesz lubaczewski wrote:
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
Hmm - I believe that means a data/index block was corrupted.
indices were recreated (reindex table), so i think this is data related
problem.
Have you seen any crashes, or hardware-related errors in your logs?
nope. uptime is over 40 days.
the machine is not used for anything else so i can't tell anything, but i
didn't see any problems with it.
What are your config settings, particularly the first three here:
http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes
sure:
fsync
-------
on
wal_sync_method
-----------------
fdatasync
full_page_writes
------------------
on
All looks fine. Can you isolate the row(s) in question that seem to be
the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html
--
Richard Huxton
Archonet Ltd
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
Hmm - I believe that means a data/index block was corrupted.
indices were recreated (reindex table), so i think this is data related
problem.
AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.
regards, tom lane
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
All looks fine. Can you isolate the row(s) in question that seem to be
the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html
i ran the test to find it. as soon as i will get it (probably tomorrow) i
will mail it to the list.
hubert
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
Hmm - I believe that means a data/index block was corrupted.
indices were recreated (reindex table), so i think this is data related
problem.
AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.
i'm not familiar with this utility.
i can of course find it using google, but how do i check what is wrong?
i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...
pg_relation_size says that the table is about 3g in size
depesz
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.
i'm not familiar with this utility.
http://sources.redhat.com/rhdb/
i can of course find it using google, but how do i check what is wrong?
pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond block")
i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...
I don't think we want to see the whole thing! But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.
regards, tom lane
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.
i'm not familiar with this utility.
http://sources.redhat.com/rhdb/
i can of course find it using google, but how do i check what is wrong?
pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond
block")
i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...
I don't think we want to see the whole thing! But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.
if i understand correctly i have to do pg_filedump <relfilenode> of table,
check output for errors, and make pg_filedump -i -f of problematic blocks.
if that's ok - i'm running it.
as soon as i have some info - i'll let you know.
depesz
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond
block")
the problematic table spans over 3 files (18026 18026.1 and 18026.2).
i made pg_filedump _FILE_ > ~/_FILE_.dump
it went fine
grep -i error ~/*.dump also didn't show anything.
the dumps are quite large:
pgdba@lab02:~$ ls -l *.dump
-rw-r--r-- 1 pgdba pgdba 154631630 2006-04-14 18:03 18026.1.dump
-rw-r--r-- 1 pgdba pgdba 108808017 2006-04-14 18:03 18026.2.dump
-rw-r--r-- 1 pgdba pgdba 161625849 2006-04-14 18:01 18026.dump
what else can i look in it for?
best regards
hubert
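A note on the three files: with a default build, a table is stored as 1 GB segment files of 8 kB blocks, so a table-wide block number (for example the first half of a ctid) has to be translated into a segment name and a block within that segment before it can be looked up in a per-file dump. The sketch below shows that arithmetic in Python. The 131072-blocks-per-segment and 8192-byte block size are the stock defaults and are assumed here, not confirmed for this build; locate_block is a hypothetical helper name.

```python
# Assumed stock defaults: 8 kB blocks, 1 GB segment files (131072 blocks each).
BLOCK_SIZE = 8192
BLOCKS_PER_SEGMENT = 131072


def locate_block(relfilenode, block_number):
    """Map a table-wide block number to (segment file name,
    block number inside that file, byte offset inside that file)."""
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    # The first segment has no suffix; later ones are "<relfilenode>.<n>".
    if segment == 0:
        name = str(relfilenode)
    else:
        name = f"{relfilenode}.{segment}"
    return name, block_in_segment, block_in_segment * BLOCK_SIZE


# Example with the relfilenode from this thread and a made-up block number.
print(locate_block(18026, 262145))
```

Going the other way, a block reported at position N in the dump of 18026.2 would be table block 2 * 131072 + N under the same assumptions.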
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
i made pg_filedump _FILE_ > ~/_FILE_.dump
it went fine
grep -i error ~/*.dump also didn't show anything.
Oh, that's interesting. Looking more closely, the test in
PageRepairFragmentation()
    if (itemidptr->itemoff < (int) pd_upper ||
        itemidptr->itemoff >= (int) pd_special)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted item pointer: %u",
                        itemidptr->itemoff)));
is slightly tighter than what pg_filedump does:
    // Make sure the item can physically fit on this block before
    // formatting
    if ((itemOffset + itemSize > blockSize) ||
        (itemOffset + itemSize > bytesToFormat))
        printf (" Error: Item contents extend beyond block.\n"
                " BlockSize<%d> Bytes Read<%d> Item Start<%d>.\n",
                blockSize, bytesToFormat, itemOffset + itemSize);
I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong. Do you want to add one
and try again?
regards, tom lane
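To make the difference between the two tests concrete, here is a small Python sketch that restates each condition as a function and feeds both the same item. The function names and the sample numbers are made up for illustration; only the two comparisons are taken from the snippets quoted above.

```python
def filedump_accepts(item_off, item_len, block_size, bytes_read):
    """Restates the pg_filedump test: only the END of the item is checked
    against the block size and the number of bytes read."""
    end = item_off + item_len
    return not (end > block_size or end > bytes_read)


def vacuum_accepts(item_off, pd_upper, pd_special):
    """Restates the PageRepairFragmentation test: the START of the item
    must lie in [pd_upper, pd_special)."""
    return pd_upper <= item_off < pd_special


# Hypothetical page: 8192-byte block, free space ends at 4000,
# no special space (pd_special == block size).
BLOCK_SIZE, PD_UPPER, PD_SPECIAL = 8192, 4000, 8192

# An item whose offset points below pd_upper, i.e. into the line pointer
# array or the free space, but which still ends inside the block.
item_off, item_len = 100, 60

print("pg_filedump test passes:", filedump_accepts(item_off, item_len, BLOCK_SIZE, BLOCK_SIZE))
print("VACUUM test passes:     ", vacuum_accepts(item_off, PD_UPPER, PD_SPECIAL))
```

An item like this one ends well inside the block, so the first test has nothing to complain about, while the second rejects it because its start is below pd_upper.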
On 4/14/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong. Do you want to add one
and try again?
sure. but could you please tell me what to change? c is not my favourite
language and i'd like not to damage something else while trying to change it
myself.
hubert
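For anyone who would rather not patch the C source, the same tighter test can be approximated outside pg_filedump. What follows is a rough Python sketch, not a replacement for the real tool. It assumes the 8.1 on-disk layout as I understand it (a 20-byte page header with pd_lower, pd_upper and pd_special at byte offsets 12, 14 and 16, followed by 4-byte line pointers with lp_off in the low 15 bits and lp_flags in the next 2), a little-endian machine such as the i686 box in this thread, and 8192-byte blocks. All of those are assumptions to verify against the server's own headers; bad_items and scan_file are hypothetical names.

```python
import struct

BLOCK_SIZE = 8192  # assumed default block size
# Assumed 8.1 page header, little-endian: pd_lsn (2 x uint32), pd_tli (uint32),
# pd_lower, pd_upper, pd_special, pd_pagesize_version (uint16 each).
HEADER = struct.Struct("<IIIHHHH")
HEADER_SIZE = HEADER.size  # 20 bytes under the assumption above
LP_USED = 0x01             # assumed flag bit for a line pointer in use


def bad_items(page):
    """Return [(item_number, lp_off), ...] for used line pointers whose
    offset falls outside [pd_upper, pd_special)."""
    _, _, _, pd_lower, pd_upper, pd_special, _ = HEADER.unpack_from(page, 0)
    # Skip empty (all-zero) pages and headers too broken to walk safely.
    if pd_lower < HEADER_SIZE or pd_lower > len(page):
        return []
    found = []
    for i in range((pd_lower - HEADER_SIZE) // 4):
        (raw,) = struct.unpack_from("<I", page, HEADER_SIZE + 4 * i)
        lp_off = raw & 0x7FFF          # low 15 bits
        lp_flags = (raw >> 15) & 0x3   # next 2 bits
        if lp_flags & LP_USED and not (pd_upper <= lp_off < pd_special):
            found.append((i + 1, lp_off))
    return found


def scan_file(path):
    """Print every suspicious line pointer in one segment file."""
    with open(path, "rb") as f:
        block = 0
        while True:
            page = f.read(BLOCK_SIZE)
            if len(page) < BLOCK_SIZE:
                break
            for item, off in bad_items(page):
                print(f"{path}: block {block} item {item}: lp_off {off}")
            block += 1
```

Calling scan_file on each of 18026, 18026.1 and 18026.2 (ideally on copies, or with the server stopped) should narrow things down to a few block numbers, which can then be dumped with "pg_filedump -i -f" as requested earlier in the thread.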