Re: Strange database corruption with PostgreSQL 7.4.x on Debian Sarge

Started by Noname over 19 years ago | 8 messages | general
#1Noname
Matthias.Pitzl@izb.de

Hello Scott!

Thank you for the quick answer. I'll try to check our hardware, which is a
Compaq DL380 G4 with a battery-backed write cache on the RAID controller.
As the system is otherwise running stably, I don't think it's the CPU or
memory. At the moment I lean more toward a bad disk or SCSI controller, but
even then I don't get any messages in my logs...
Any ideas how I could check the hardware?

Best regards,
Matthias

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Marlowe
Sent: Wednesday, September 20, 2006 2:56 PM
To: Matthias.Pitzl@izb.de
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Strange database corruption with PostgreSQL 7.4.x on

On Wed, 2006-09-20 at 14:34 +0200, Matthias.Pitzl@izb.de wrote:

Hello!

We're running the latest 7.4 release, PostgreSQL 7.4.13, on a Debian Sarge
machine. We compiled Postgres ourselves.
We have a pretty big database running on this machine; it is about 6.4 GB
in size. One table contains about 55 million rows.
Into this table we insert about 500000 rows each day. Our problem is that,
without any obvious reason, the database gets corrupted. The messages we get
are:
invalid page header in block 437702 of relation "xxxx"
We have already tried some other versions of 7.4. On another machine
running Debian Woody with PostgreSQL 7.4.10 we don't have any problems.
Kernels are 2.4.33 on the Sarge machine, 2.4.28 on the Woody machine. Both
are SMP kernels.
Does anyone have a hint as to what's going wrong here?

The most likely causes in these cases tend to be bad memory, a bad hard
drive, a bad CPU, a bad RAID / IDE / SCSI controller, or loss of power while
writing to IDE drives / RAID controllers that have a cache but no battery
backup.

I.e., check your hardware.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

#2Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Noname (#1)
Re: Strange database corruption with PostgreSQL 7.4.x on

On Wed, 2006-09-20 at 15:14 +0200, Matthias.Pitzl@izb.de wrote:

Hello Scott!

Thank you for the quick answer. I'll try to check our hardware, which is a
Compaq DL380 G4 with a battery-backed write cache on the RAID controller.
As the system is otherwise running stably, I don't think it's the CPU or
memory. At the moment I lean more toward a bad disk or SCSI controller, but
even then I don't get any messages in my logs...
Any ideas how I could check the hardware?

Keep in mind, a single bad memory location is all it takes to cause data
corruption, so it could well be memory. CPU is less likely if the
machine is otherwise running stable.

The standard tool on x86 hardware is memtest86 (www.memtest86.com).

So you'd have to schedule a maintenance window for the test, since you
basically have to take the machine down and run nothing but memtest86. I
think a few live Linux distros have it built in (FC has a memtest label in
some versions, I think).

My first suspicion is always memory. We ordered a batch of memory from
a very off-brand supplier, and over 75% tested bad. And it took more than
24 hours to find some of the bad memory.

Good luck with your testing, and let us know how it goes.

#3Noname
Matthias.Pitzl@izb.de
In reply to: Noname (#1)

Hello Scott!

Thank you. I know memtest86; I think we will use it for testing our
hardware too.
Meanwhile I got some other useful information from someone who also runs a
DL380 server and had a defective backplane causing similar issues.
He also gave me the hint that there is a test suite CD from Compaq for
running hardware diagnostics on our machine. I will try it out as soon as
possible.
I will let you know when I know more :)

-- Matthias

#4Tomasz Ostrowski
tometzky@batory.org.pl
In reply to: Noname (#1)

On Wed, 20 Sep 2006, Matthias.Pitzl@izb.de wrote:

Any ideas how I could check the hardware?

1. memtest86 or memtest86+, for at least 8 hours

2. CPU Burn-in
http://users.bigpond.net.au/cpuburn/ for at least 8 hours

3. badblocks -w -s -v -t random /dev/sd%
WARNING: this write-mode test (-w) will destroy your data!

4. smartctl -a /dev/sd%
May not work out of the box; sometimes it needs some hacking to get it
working.
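
The idea behind a destructive pattern test like item 3 can be sketched in a
few lines of Python: write pseudo-random blocks, force them out to disk, read
them back and compare digests. This is only an illustration on a scratch
file, not a replacement for badblocks; the function name
verify_scratch_file, the block size and the block count are made up for the
example. A read-back right after writing may also be served from the OS page
cache, so a clean result here says little about the disk itself.

```python
import hashlib
import os
import tempfile


def verify_scratch_file(path, blocks=64, block_size=8192):
    """Write random blocks to `path`, read them back, return mismatching block numbers."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(blocks):
            data = os.urandom(block_size)
            digests.append(hashlib.sha256(data).digest())
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    bad = []
    with open(path, "rb") as f:
        for n, expected in enumerate(digests):
            data = f.read(block_size)
            if hashlib.sha256(data).digest() != expected:
                bad.append(n)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: a temporary scratch file, never a live data file.
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    try:
        print("mismatching blocks:", verify_scratch_file(scratch))
    finally:
        os.remove(scratch)
```

An empty list means every block read back as written.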

Dave "KernelSlacker" Jones wrote a very good article about hardware
problems:
http://people.redhat.com/davej/hardware-problems.txt

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh

#5Noname
Matthias.Pitzl@izb.de
In reply to: Tomasz Ostrowski (#4)

Hello Tom!

Not yet, but I will try this one too. Is there anything special I should
look for when dumping out the bad pages?

-- Matthias

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, September 20, 2006 4:32 PM
To: Matthias.Pitzl@izb.de
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Strange database corruption with PostgreSQL 7.4.x on Debian Sarge

Matthias.Pitzl@izb.de writes:

invalid page header in block 437702 of relation "xxxx"

I concur with Scott that this sounds suspiciously like a hardware
problem ... but have you tried dumping out the bad pages with
pg_filedump or even just od? The pattern of damage would help to
confirm or disprove the theory.

You can find pg_filedump source code at
http://sources.redhat.com/rhdb/
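
To dump one page with od, you first need to know which file and byte offset
the reported block number maps to. A minimal sketch of that arithmetic,
assuming the default 8 kB block size and 1 GB segment files (a self-compiled
server may have been built differently). locate_block and read_page are
hypothetical helper names, and base_path stands for the relation's first
data file:

```python
BLOCK_SIZE = 8192        # assumption: default block size
SEGMENT_SIZE = 1 << 30   # assumption: 1 GB per segment file


def locate_block(block, block_size=BLOCK_SIZE, segment_size=SEGMENT_SIZE):
    """Return (segment_number, byte_offset_within_segment) for a block number."""
    blocks_per_segment = segment_size // block_size
    segment, block_in_segment = divmod(block, blocks_per_segment)
    return segment, block_in_segment * block_size


def read_page(base_path, block, block_size=BLOCK_SIZE, segment_size=SEGMENT_SIZE):
    """Read one raw page; segment 0 is base_path, later ones base_path.1, .2, ..."""
    segment, offset = locate_block(block, block_size, segment_size)
    path = base_path if segment == 0 else f"{base_path}.{segment}"
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(block_size)


if __name__ == "__main__":
    # Block number taken from the error message in this thread.
    print(locate_block(437702))
```

Under these assumptions, block 437702 is block 44486 of the fourth segment
file (suffix .3), at byte offset 364429312. The same page could then be
dumped with something like
od -A d -t x1 -j 364429312 -N 8192 <file>.3
where <file> is a placeholder for the relation's data file.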

regards, tom lane

#6Noname
Matthias.Pitzl@izb.de
In reply to: Noname (#5)

Hello all!

OK, I found out some more information. According to
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=397634&prodTypeId=15351&prodSeriesId=397634&objectID=PSD_EX050119_CW01
one of the four disks in our server has a firmware issue.
The problem is incomplete writes to disk under high I/O load...
We will check this one first. If that doesn't help, we will try the hardware
diagnostics and some other tests...
Meanwhile, thank you all for your suggestions :)

-- Matthias

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#5)
Re: Strange database corruption with PostgreSQL 7.4.x on Debian Sarge

Matthias.Pitzl@izb.de writes:

Not yet, but I will try this one too. Is there anything special I should
look for when dumping out the bad pages?

"If we knew what it was we would learn, it wouldn't be research" ...

regards, tom lane
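
There is no fixed checklist, but one simple way to look at the pattern of
damage is to check whether it lines up with disk sector boundaries. A small
sketch, assuming 512-byte sectors and treating all-zero sectors as suspicious
purely for illustration; summarize_page is a hypothetical helper that takes
the raw bytes of one page:

```python
def summarize_page(page, sector_size=512):
    """Return the indexes of sectors in `page` that consist only of zero bytes."""
    zero_sectors = []
    for start in range(0, len(page), sector_size):
        sector = page[start:start + sector_size]
        if sector.count(0) == len(sector):
            zero_sectors.append(start // sector_size)
    return zero_sectors


if __name__ == "__main__":
    # Synthetic 8 kB page filled with 0xAB, with sectors 3 and 4 zeroed out.
    page = bytearray(b"\xab" * 8192)
    page[3 * 512:5 * 512] = bytes(2 * 512)
    print(summarize_page(bytes(page)))
```

Damage that starts and stops exactly on sector boundaries is the kind of
pattern one might expect from incomplete writes, while isolated flipped bits
would look different. Treat this as a rough heuristic, not a diagnosis.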

#8Noname
Matthias.Pitzl@izb.de
In reply to: Noname (#6)

Hello everyone!

Small update on this issue:
Our server has four 146 GB disks set up as two RAID 1 pairs, and one of the
disks is affected by the bug described on the HP support page.
As a quick fix I moved our database to the other RAID device, which is built
from unaffected disks. So far I haven't had any new database corruption, so
I think the one disk with the firmware bug is the cause of our problems.
Since only the database does a lot of I/O on the disks, this will help us
for the next few days until we can upgrade or replace the buggy disk.
Thank you all for your hints and suggestions!

-- Matthias
