corrupted item pointer:???

Started by hubert depesz lubaczewski about 20 years ago | 12 messages | general

hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks. 2g
ram. AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:
./configure \
--prefix=/home/pgdba/work \
--without-debug \
--disable-debug \
--with-pgport=5810 \
--with-tcl \
--with-perl \
--with-python \
--without-krb4 \
--without-krb5 \
--without-pam \
--without-rendezvous \
--with-openssl \
--with-readline \
--with-zlib \
--with-gnu-ld

version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 3.3.5 (Debian 1:3.3.5-13)

on this machine i have copy of 40G database from production servers.
i was testing migration to hstore (
http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)

at some point of the tests it failed saying corrupted item pointer.

migration was done using:
LOOP:
select data from test_table where primary_key = constant;
select data from secondary_table where unique_key = constant;
update test_table set hstore_field = some_value where primary_key = constant;
REPEAT FOR ALL items;
this loop was made in external code (perl, using dbi + dbd::pg).
we did about 37 such loops every second.
everything went fine.
then (every 30 minutes) i ran vacuum to reclaim space from test_table.
at one of such vacuums it panicked showing me the aforementioned error and killing
all connections.

is there any way i can check what went wrong?
i don't need to recover the data.
i just need to know whether the problem is hstore-related, hardware-related
or just random thing happening because of nothing.

i tried:
hand-vacuum
vacuum analyze
vacuum full analyze
reindex the table
another vacuum
none of this worked.
i still get the corrupted item pointer message.

any clues on how i can check what went wrong?

depesz

#2 Richard Huxton
dev@archonet.com
In reply to: hubert depesz lubaczewski (#1)
Re: corrupted item pointer:???

hubert depesz lubaczewski wrote:

hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks. 2g
ram.AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:

...

version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 3.3.5 (Debian 1:3.3.5-13)

OK - nothing unusual there.

on this machine i have copy of 40G database from production servers.
i was testing migration to hstore (
http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)

at some point of the tests it failed saying corrupted item pointer.

is there any way i can check what went wrong?
i don't need to recover the data.
i just need to know whether the problem is hstore-related,
hardware-related or just random thing happening because of nothing.

Hmm - I believe that means a data/index block was corrupted.

Have you seen any crashes, or hardware-related errors in your logs?

What are your config settings, particularly the first three here:
http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes

--
Richard Huxton
Archonet Ltd

In reply to: Richard Huxton (#2)
Re: corrupted item pointer:???

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is data related
problem.

Have you seen any crashes, or hardware-related errors in your logs?

nope. uptime is over 40 days.

the machine is not used for anything else so i can't tell anything, but i
didn't see any problems with it.

What are your config settings, particularly the first three here:

http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes

sure:
irr=# show fsync;
fsync
-------
on
(1 row)

irr=# show wal_sync_method;
wal_sync_method
-----------------
fdatasync
(1 row)

irr=# show full_page_writes;
full_page_writes
------------------
on
(1 row)

depesz

#4 Richard Huxton
dev@archonet.com
In reply to: hubert depesz lubaczewski (#3)
Re: corrupted item pointer:???

hubert depesz lubaczewski wrote:

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is data related
problem.

Have you seen any crashes, or hardware-related errors in your logs?

nope. uptime is over 40 days.

the machine is not used for anything else so i can't tell anything, but i
didn't see any problems with it.

What are your config settings, particularly the first three here:

http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes

sure:
fsync
-------
on

wal_sync_method
-----------------
fdatasync

full_page_writes
------------------
on

All looks fine. Can you isolate the row(s) in question that seem to be
the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

--
Richard Huxton
Archonet Ltd

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#3)
Re: corrupted item pointer:???

"hubert depesz lubaczewski" <depesz@gmail.com> writes:

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is data related
problem.

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

regards, tom lane

In reply to: Richard Huxton (#4)
Re: corrupted item pointer:???

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

All looks fine. Can you isolate the row(s) in question that seem to be
the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

i ran the test to find it. as soon as i get it (probably tomorrow) i
will mail it to the list.

hubert

In reply to: Tom Lane (#5)
Re: corrupted item pointer:???

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"hubert depesz lubaczewski" <depesz@gmail.com> writes:

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is data related
problem.

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

i'm not familiar with this utility.
i can of course find it using google, but how do i check what is wrong?
i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...
pg_relation_size says that the table is about 3g in size

depesz

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#7)
Re: corrupted item pointer:???

"hubert depesz lubaczewski" <depesz@gmail.com> writes:

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

i'm not familiar with this utility.

http://sources.redhat.com/rhdb/

i can of course find it using google, but how do i check what is wrong?

pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond block")

i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...

I don't think we want to see the whole thing! But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.

regards, tom lane

In reply to: Tom Lane (#8)
Re: corrupted item pointer:???

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"hubert depesz lubaczewski" <depesz@gmail.com> writes:

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

i'm not familiar with this utility.

http://sources.redhat.com/rhdb/

i can of course find it using google, but how do i check what is wrong?

pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond
block")

i am even willing to upload the dump file, but with 4 million records in
table, it is going to be rather large...

I don't think we want to see the whole thing! But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.

if i understand correctly i have to do pg_filedump <relfilenode> of table,
check output for errors, and make pg_filedump -i -f of problematic blocks.
if that's ok - i'm running it.
as soon as i have some info - i'll let you know.

depesz

In reply to: Tom Lane (#8)
Re: corrupted item pointer:???

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond
block")

the problematic table spans over 3 files (18026 18026.1 and 18026.2).
i made pg_filedump _FILE_ > ~/_FILE_.dump
it went fine
grep -i error ~/*.dump also didn't show anything.
the dumps are quite large:
pgdba@lab02:~$ ls -l *.dump
-rw-r--r-- 1 pgdba pgdba 154631630 2006-04-14 18:03 18026.1.dump
-rw-r--r-- 1 pgdba pgdba 108808017 2006-04-14 18:03 18026.2.dump
-rw-r--r-- 1 pgdba pgdba 161625849 2006-04-14 18:01 18026.dump

what else can i look in it for?

best regards

hubert
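
A note for readers following along: the three files reflect how PostgreSQL splits a large relation into segment files, so a block number has to be mapped to a file and an offset before a single block can be inspected. Below is a minimal sketch of that arithmetic in C, assuming the default 8 kB block size and 1 GB segments (both are compile-time settings, so a custom build may differ). The names `locate_block` and `segment_file_name` are hypothetical helpers for illustration, not part of PostgreSQL or pg_filedump.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: default build settings, 8 kB blocks and 1 GB segment files. */
#define BLOCK_SIZE      8192u
#define BLOCKS_PER_SEG  131072u   /* 1 GB / 8 kB */

/* Where a given block of a relation lives on disk. */
typedef struct {
    uint32_t segment;       /* 0 -> "18026", 1 -> "18026.1", ... */
    uint32_t block_in_seg;  /* block index inside that segment file */
    uint64_t byte_offset;   /* byte offset inside that segment file */
} BlockLocation;

/* Map a relation-wide block number to its segment file and offset. */
BlockLocation locate_block(uint32_t block_number)
{
    BlockLocation loc;
    loc.segment = block_number / BLOCKS_PER_SEG;
    loc.block_in_seg = block_number % BLOCKS_PER_SEG;
    loc.byte_offset = (uint64_t) loc.block_in_seg * BLOCK_SIZE;
    return loc;
}

/* Build the segment file name for a relfilenode such as 18026:
 * the first segment has no suffix, later ones get ".1", ".2", ... */
void segment_file_name(char *buf, size_t buflen,
                       unsigned relfilenode, uint32_t segment)
{
    assert(buf != NULL && buflen > 0);
    if (segment == 0)
        snprintf(buf, buflen, "%u", relfilenode);
    else
        snprintf(buf, buflen, "%u.%u", relfilenode, (unsigned) segment);
}
```

Under these assumptions, block 131073 of the table would be block 1 of the file 18026.1, starting at byte 8192 of that file.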

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#10)
Re: corrupted item pointer:???

"hubert depesz lubaczewski" <depesz@gmail.com> writes:

i made pg_filedump _FILE_ > ~/_FILE_.dump
it went fine
grep -i error ~/*.dump also didn't show anything.

Oh, that's interesting. Looking more closely, the test in
PageRepairFragmentation()

    if (itemidptr->itemoff < (int) pd_upper ||
        itemidptr->itemoff >= (int) pd_special)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted item pointer: %u",
                        itemidptr->itemoff)));

is slightly tighter than what pg_filedump does:

    // Make sure the item can physically fit on this block before
    // formatting
    if ((itemOffset + itemSize > blockSize) ||
        (itemOffset + itemSize > bytesToFormat))
        printf (" Error: Item contents extend beyond block.\n"
                " BlockSize<%d> Bytes Read<%d> Item Start<%d>.\n",
                blockSize, bytesToFormat, itemOffset + itemSize);

I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong. Do you want to add one
and try again?

regards, tom lane
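
To make the difference between the two tests concrete, here is a small sketch that restates both conditions as standalone C predicates. The function names `backend_rejects` and `filedump_rejects` and the sample numbers are made up for illustration; only the comparisons themselves come from the two excerpts quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Restates the condition quoted from PageRepairFragmentation():
 * an item must start inside [pd_upper, pd_special). */
bool backend_rejects(unsigned item_off, unsigned pd_upper, unsigned pd_special)
{
    return item_off < pd_upper || item_off >= pd_special;
}

/* Restates the condition quoted from pg_filedump: the item only has
 * to end within the block and within the bytes actually read. */
bool filedump_rejects(unsigned item_off, unsigned item_size,
                      unsigned block_size, unsigned bytes_to_format)
{
    return item_off + item_size > block_size ||
           item_off + item_size > bytes_to_format;
}

/* Worked case (hypothetical numbers): an 8 kB block whose tuple data
 * starts at offset 4000, and a line pointer that claims offset 100. */
void check_example(void)
{
    assert(backend_rejects(100, 4000, 8192));        /* 100 < pd_upper */
    assert(!filedump_rejects(100, 60, 8192, 8192));  /* 160 <= 8192    */
}
```

An offset that lands below pd_upper, that is, inside the page header, the line pointer array or the free space, is rejected by the first predicate but accepted by the second, which would be consistent with VACUUM failing while the dumps show no "Error" lines.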

In reply to: Tom Lane (#11)
Re: corrupted item pointer:???

On 4/14/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong. Do you want to add one
and try again?

sure. but could you please tell me what to change? c is not my favourite
language and i'd like not to damage something else while trying to change it
myself.

hubert
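
The thread as archived here does not include the actual pg_filedump change, and the quoted excerpt does not show how pg_filedump names the page header fields, so what follows is not a patch. It is a standalone sketch, in C, of the tighter test applied to one raw page buffer. It rests on several assumptions that should be checked against the real headers: an 8.1-era page layout (pd_lower, pd_upper and pd_special at byte offsets 12, 14 and 16, line pointers starting at byte 20), little-endian byte order as on i686, and a 4-byte line pointer packed as lp_off (low 15 bits), lp_flags (2 bits), lp_len (15 bits) with LP_USED being 0x01. The function name `count_bad_item_pointers` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout (8.1-era, little-endian): 20-byte page header,
 * then an array of 4-byte line pointers up to pd_lower. */
enum { HEADER_SIZE = 20, LP_SIZE = 4, LP_USED = 0x01 };

/* Read a little-endian 16-bit value. */
static unsigned read_u16(const unsigned char *p)
{
    return (unsigned) p[0] | ((unsigned) p[1] << 8);
}

/* Read a little-endian 32-bit value. */
static uint32_t read_u32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Count in-use line pointers whose offset lies outside
 * [pd_upper, pd_special), i.e. the range test quoted from
 * PageRepairFragmentation(). Returns -1 when pd_lower is implausible
 * (this includes all-zero pages), so the array cannot be walked. */
int count_bad_item_pointers(const unsigned char *page, size_t page_size)
{
    assert(page != NULL && page_size >= HEADER_SIZE);

    unsigned pd_lower   = read_u16(page + 12);
    unsigned pd_upper   = read_u16(page + 14);
    unsigned pd_special = read_u16(page + 16);
    int bad = 0;

    if (pd_lower < HEADER_SIZE || pd_lower > page_size)
        return -1;

    size_t nitems = (pd_lower - HEADER_SIZE) / LP_SIZE;
    for (size_t i = 0; i < nitems; i++) {
        uint32_t lp = read_u32(page + HEADER_SIZE + i * LP_SIZE);
        unsigned lp_off   = lp & 0x7FFF;        /* low 15 bits  */
        unsigned lp_flags = (lp >> 15) & 0x3;   /* next 2 bits  */

        if (!(lp_flags & LP_USED))
            continue;   /* unused slots are not range-checked */
        if (lp_off < pd_upper || lp_off >= pd_special)
            bad++;
    }
    return bad;
}
```

One way to use such a function would be to read each segment file in 8192-byte chunks, call it on every chunk, and then run "pg_filedump -i -f" only on the blocks where it returns a nonzero count.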