Re: Strange database corruption with PostgreSQL 7.4.x on Debian Sarge
Hello Scott!
Thank you for the quick answer. I'll try to check our hardware, which is a
Compaq DL380 G4 with a battery-buffered write cache on our RAID controller.
As the system is otherwise running stably, I think it's not the CPU or memory.
At the moment I lean more towards a bad disk or SCSI controller, but even then I
don't get any messages in my logs...
Any ideas how I could check the hardware?
Best regards,
Matthias
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Marlowe
Sent: Wednesday, September 20, 2006 2:56 PM
To: Matthias.Pitzl@izb.de
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Strange database corruption with PostgreSQL 7.4.x on
On Wed, 2006-09-20 at 14:34 +0200, Matthias.Pitzl@izb.de wrote:
Hello!
We're running the latest release of PostgreSQL 7.4.13 on a Debian Sarge
machine. Postgres has been compiled by ourselves.
We have a pretty big database running on this machine; it is about 6.4 GB
in size. One table contains about 55 million rows.
Into this table we insert about 500000 rows each day. Our problem is that
without any obvious reason the database gets corrupted. The messages we get
are:
invalid page header in block 437702 of relation "xxxx"
We have already tried some other versions of 7.4. On another machine
running Debian Woody with PostgreSQL 7.4.10 we don't have any problems.
Kernels are 2.4.33 on the Sarge machine and 2.4.28 on the Woody machine. Both
are SMP kernels.
Does anyone perhaps have some hints about what's going wrong here?
The most likely causes in these cases tend to be bad memory, a bad hard
drive, a bad CPU, a bad RAID / IDE / SCSI controller, or loss of power when
writing to IDE drives / RAID controllers with a cache and no battery
backup.
I.e. check your hardware.
On Wed, 2006-09-20 at 15:14 +0200, Matthias.Pitzl@izb.de wrote:
Hello Scott!
Thank you for the quick answer. I'll try to check our hardware, which is a
Compaq DL380 G4 with a battery-buffered write cache on our RAID controller.
As the system is otherwise running stably, I think it's not the CPU or memory.
At the moment I lean more towards a bad disk or SCSI controller, but even then I
don't get any messages in my logs...
Any ideas how i could check the hardware?
Keep in mind, a single bad memory location is all it takes to cause data
corruption, so it could well be memory. CPU is less likely if the
machine is otherwise running stable.
The standard tool on x86 hardware is memtest86 www.memtest86.com
So, you'd have to schedule a maintenance window to run the test in since
you have to basically down the machine and run just memtest86. I think
a few live linux distros have it built in (FC has a memtest label in
some versions I think)
My first suspicion is always memory. We ordered a batch of memory from
a very off brand supplier, and over 75% tested bad. And it took >24
hours to find some of the bad memory.
Good luck with your testing, and let us know how it goes.
Hello Scott!
Thank you. I know memtest86; I think we will use it for testing our
hardware too.
Meanwhile I got some other useful information from someone who also runs a DL380
server, which had a defective backplane causing similar issues.
He also gave me the hint that there's a test suite CD from Compaq for running
hardware diagnostic checks on our machine. I will try this out as soon as
possible.
I will let you know when I know more :)
-- Matthias
On Wed, 20 Sep 2006, Matthias.Pitzl@izb.de wrote:
Any ideas how i could check the hardware?
1. memtest86 or memtest86+, for at least 8 hours
2. CPU burn-in, for at least 8 hours:
   http://users.bigpond.net.au/cpuburn/
3. badblocks -s -v -t random /dev/sd%
   WARNING: this will destroy your data!
4. smartctl -a /dev/sd%
   This may not work out of the box; sometimes it needs some hacking to make it work.
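As a rough illustration of the write-then-verify idea behind a random-pattern disk test, here is a minimal Python sketch. It is not a replacement for badblocks: the function names (`pattern_block`, `write_pattern`, `verify_pattern`) are made up for this example, it works on a scratch file rather than a raw device, and the operating system's cache may satisfy the reads, so a clean result here does not prove the disk is healthy.

```python
import os
import random


def pattern_block(rng, block_size):
    """Return one block of pseudo-random bytes drawn from rng."""
    return rng.getrandbits(block_size * 8).to_bytes(block_size, "little")


def write_pattern(path, n_blocks, block_size=8192, seed=0):
    """Write n_blocks of a reproducible pseudo-random pattern to a scratch file."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(pattern_block(rng, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_pattern(path, n_blocks, block_size=8192, seed=0):
    """Regenerate the same pattern and return the block numbers that differ."""
    rng = random.Random(seed)
    bad_blocks = []
    with open(path, "rb") as f:
        for block_no in range(n_blocks):
            expected = pattern_block(rng, block_size)
            if f.read(block_size) != expected:
                bad_blocks.append(block_no)
    return bad_blocks
```

Only point this at a scratch file you can afford to lose, for example `write_pattern("/tmp/scratch.bin", 1024)` followed by `verify_pattern("/tmp/scratch.bin", 1024)`; an empty list means every block read back as written.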
Dave "KernelSlacker" Jones wrote a very good article about hardware
problems:
http://people.redhat.com/davej/hardware-problems.txt
Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh
Hello Tom!
Not yet, but I will try this one too. Is there anything special I should look for
when dumping out the bad pages?
-- Matthias
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, September 20, 2006 4:32 PM
To: Matthias.Pitzl@izb.de
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Strange database corruption with PostgreSQL 7.4.x on Debian Sarge
Matthias.Pitzl@izb.de writes:
invalid page header in block 437702 of relation "xxxx"
I concur with Scott that this sounds suspiciously like a hardware
problem ... but have you tried dumping out the bad pages with
pg_filedump or even just od? The pattern of damage would help to
confirm or disprove the theory.
You can find pg_filedump source code at
http://sources.redhat.com/rhdb/
regards, tom lane
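For anyone wanting to look at the raw page by hand, the block number in the error message has to be turned into a file segment and a byte offset first. The Python sketch below shows that arithmetic, assuming the default 8 kB block size and 1 GB segment files (`<relfilenode>`, `<relfilenode>.1`, ...); a self-compiled server may use other values. The helper names `locate_block`, `read_block` and `hexdump` are invented for this example, and the script only reads files.

```python
BLOCK_SIZE = 8192          # assumption: default PostgreSQL block size
SEGMENT_BYTES = 1 << 30    # assumption: default 1 GB segment files


def locate_block(block_no, block_size=BLOCK_SIZE, segment_bytes=SEGMENT_BYTES):
    """Return (segment_number, byte_offset_within_segment) for a block number."""
    blocks_per_segment = segment_bytes // block_size
    segment_no, block_in_segment = divmod(block_no, blocks_per_segment)
    return segment_no, block_in_segment * block_size


def read_block(path, offset, block_size=BLOCK_SIZE):
    """Read one block from an already-located segment file (read-only)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(block_size)


def hexdump(data, width=16, limit=64):
    """Format the first `limit` bytes as offset-prefixed hex lines, od-style."""
    lines = []
    for i in range(0, min(len(data), limit), width):
        chunk = data[i:i + width]
        lines.append(f"{i:06x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


if __name__ == "__main__":
    segment_no, offset = locate_block(437702)
    # With the assumed defaults this is segment 3, byte offset 364429312.
    print(f"segment {segment_no}, byte offset {offset}")
```

Under those assumptions block 437702 lies in the fourth segment file (suffix `.3`), 44486 blocks in. A page that is all zeros, or that contains what looks like data from an unrelated file, would each point in a different direction, which is why the pattern of damage is worth looking at.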
Hello all!
OK, I found out some more information. According to
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=397634&prodTypeId=15351&prodSeriesId=397634&objectID=PSD_EX050119_CW01
one of the four disks in our server has a firmware issue.
The problem is incomplete writes to disk under high I/O load...
We will check this one first. If that doesn't help, we will try the hardware
diagnostics and some other tests...
Meanwhile thank you all for your suggestions :)
-- Matthias
Matthias.Pitzl@izb.de writes:
Not yet, but I will try this one too. Is there anything special I should look for
when dumping out the bad pages?
"If we knew what it was we would learn, it wouldn't be research" ...
regards, tom lane
Hello everyone!
Small update on this issue:
Our server has four 146 GB disks arranged as two RAID 1 pairs, and one of these disks is
affected by the bug mentioned on the HP support page.
As a quick fix I moved our database to the other RAID device, which is built from
unaffected disks. So far I have not had any new database corruption, so I
think the one disk with the firmware bug is the cause of our problems. Since
only the database does a lot of I/O on the disks, this will help us for
the next few days until we can upgrade or replace the affected disk.
Thank you all for your hints and suggestions!
-- Matthias