PostgreSQL corruption

Started by James Sewell, about 9 years ago, 9 messages, general

james.sewell@jirotech.com

about 9 years ago

Hello All,

I am working with a client who is facing issues with database corruption
after a physical hard power off (the machines are at remote sites, this
could be a power outage or user error).

They have an environment made up of many of the following consumer-grade
standalone machines:

- Windows 7 SP1
- PostgreSQL 9.2.4
- Integrated RAID controller
  - Managed by Intel Rapid Storage Technology
  - RAID 1 over two disks
  - Disk caching disabled
  - Not battery backed
  - Disk cache disabled
- 2x Seagate SATA disk drives (st500lm021-1kj152
  <http://www.seagate.com/www-content/product-content/momentus-fam/momentus-thin/en-us/docs/100737930b.pdf>)

PostgreSQL is configured as follows:

- fsync=on
- full_page_writes=on
- wal_sync_method=fsync_writethrough

Windows is configured as follows:

- Disk caching disabled for the RAID1 set

They have already shown that the corruption is reproducible in a testbed,
both with and without OS/RAID controller caching, but I am working with them
to define the test process in more detail.

The new process will be:

1. Power on machine
2. If PostgreSQL doesn't start, archive $PGDATA and run initdb
3. Perform a pg_dumpall to test for corruption
4. If pg_dumpall fails, archive $PGDATA and run initdb
5. Start the test suite (which mimics high load from their application);
it INSERTs and DELETEs records both inside and outside explicit transactions
6. After 15 minutes, cut power and repeat the process
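
The boot-time part of this process (steps 2 to 5) could be sketched roughly as follows. This is only an illustration of the decision flow, not the client's actual harness: the four callables are hypothetical stand-ins for whatever scripts start the server, run pg_dumpall, archive $PGDATA plus initdb, and launch the load test. Restarting PostgreSQL after initdb is an assumption, since the list above does not spell it out.

```python
from typing import Callable, List


def run_boot_checks(
    start_postgres: Callable[[], bool],
    run_pg_dumpall: Callable[[], bool],
    archive_and_initdb: Callable[[], None],
    start_test_suite: Callable[[], None],
) -> List[str]:
    """Run steps 2-5 once after power-on and return a log of what happened.

    All four callables are hypothetical hooks supplied by the test harness.
    archive_and_initdb is assumed to stop the server if it is running.
    """
    events: List[str] = []
    if not start_postgres():
        # Step 2: the cluster would not start, so keep the evidence and reset.
        events.append("startup_failed")
        archive_and_initdb()
        start_postgres()  # assumption: a fresh cluster starts cleanly
    elif not run_pg_dumpall():
        # Steps 3-4: the server started but a full dump hit corruption.
        events.append("dump_failed")
        archive_and_initdb()
        start_postgres()
    else:
        events.append("clean")
    # Step 5: begin the load that will be interrupted by the power cut.
    start_test_suite()
    events.append("load_started")
    return events


# Dry run with stand-in callables: server starts, but the dump fails.
print(run_boot_checks(lambda: True, lambda: False, lambda: None, lambda: None))
```

Recording one event per boot makes it easy to count corruption rates per scenario afterwards.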

We are hoping to get about 20 machines in this testbed, giving us around
1500 power cycles per day.
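
As a rough sanity check on that figure (the variable names below are illustrative, and the boot overhead is inferred rather than stated anywhere in this post): 20 machines cycling every 15 minutes would give an upper bound of 1920 cycles per day, so 1500 per day implies a little over 4 minutes per cycle for boot, pg_dumpall, and any re-initialisation.

```python
MACHINES = 20
LOAD_MINUTES = 15          # time under load before power is cut (step 6)
MINUTES_PER_DAY = 24 * 60
TARGET_CYCLES = 1500       # figure quoted in the post

# Upper bound: no time at all spent on boot or the pg_dumpall check.
upper_bound = MACHINES * MINUTES_PER_DAY / LOAD_MINUTES

# Cycle length that would yield the quoted 1500 cycles/day.
implied_cycle_minutes = MACHINES * MINUTES_PER_DAY / TARGET_CYCLES
implied_overhead_minutes = implied_cycle_minutes - LOAD_MINUTES

print(f"upper bound: {upper_bound:.0f} cycles/day")
print(f"implied cycle length: {implied_cycle_minutes:.1f} min "
      f"({implied_overhead_minutes:.1f} min overhead)")
```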

Test scenarios which have been floated so far:

- As described above, all caching off
- As described above, all caching off, 9.2 stable
- As described above, all caching off, 9.5 stable with checksums

Can anyone think of anything else we should be considering / testing /
factoring in?

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099


Scott Marlowe

scott.marlowe@gmail.com

about 9 years ago

In reply to: James Sewell (#1)

Re: PostgreSQL corruption

On Mon, Feb 13, 2017 at 9:21 PM, James Sewell <james.sewell@jirotech.com> wrote:

> Hello All,
>
> I am working with a client who is facing issues with database corruption
> after a physical hard power off (the machines are at remote sites, this
> could be a power outage or user error).
>
> They have an environment made up of many of the following consumer grade
> stand alone machines:
>
> - Windows 7 SP1
> - PostgreSQL 9.2.4
> - Integrated Raid Controller
>   - Managed by Intel Rapid Storage Technology
>   - RAID 1 over two disks
>   - Disk caching disabled
>   - Not battery backed
>   - Disk cache disabled

Some part of your OS or hardware is lying to postgres about fsyncs.
There are a few test suites out there that can test this independently
of postgresql btw, but it's been many years since I cranked one up.
Here's a web page from 2005 describing the problem and using an fsync
tester written in Perl.

Try to see if you can get the same types of fsync errors out of your
hardware. If you can, stop, figure out how to fix that, and then get back
in the game etc. Until then, try not to lose power under load.
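
One rough way to check this without PostgreSQL is simply to time synchronous writes on the volume in question. The sketch below is not the Perl tester mentioned above, just a minimal stand-in: the function name and defaults are made up for illustration, and it relies on Python's `os.fsync`, which the Python documentation says calls the C runtime's `_commit()` on Windows. A rate far above what the disks could physically deliver suggests a cache is acknowledging writes early.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory: str, writes: int = 200, block_size: int = 8192) -> float:
    """Return write+fsync operations per second for a temp file in `directory`.

    Hypothetical helper: rewrites the same block repeatedly and forces it
    to stable storage after every write.
    """
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.lseek(fd, 0, os.SEEK_SET)  # always overwrite the first block
            os.write(fd, block)
            os.fsync(fd)                  # ask the OS to flush to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed


# Example: measure the system temp directory; point it at the data volume instead.
rate = measure_fsync_rate(tempfile.gettempdir(), writes=50)
print(f"{rate:.0f} fsync ops/sec")
```

This only measures speed. It cannot prove data survives a power cut, so it complements rather than replaces a pull-the-plug test.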

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Scott Marlowe

scott.marlowe@gmail.com

about 9 years ago

In reply to: Scott Marlowe (#2)

Re: PostgreSQL corruption

http://brad.livejournal.com/2116715.html


Magnus Hagander

magnus@hagander.net

about 9 years ago

In reply to: James Sewell (#1)

Re: PostgreSQL corruption

On Tue, Feb 14, 2017 at 5:21 AM, James Sewell <james.sewell@jirotech.com> wrote:

> Hello All,
>
> I am working with a client who is facing issues with database corruption
> after a physical hard power off (the machines are at remote sites, this
> could be a power outage or user error).
>
> They have an environment made up of many of the following consumer grade
> stand alone machines:
>
> - Windows 7 SP1
> - PostgreSQL 9.2.4

If you're using 9.2.4, you are missing about 4 years' worth of bugfixes.
While what you're talking about sounds like other issues, you should really
upgrade that to something that doesn't have loads of known bugs and then
re-run the tests.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

James Sewell

james.sewell@jirotech.com

about 9 years ago

In reply to: Magnus Hagander (#4)

Re: PostgreSQL corruption

That's the plan, but it's essentially a client-managed embedded database, so
small steps are needed. If I can prove it's the hardware first, that would be
preferable.

It looks like diskchecker.pl doesn't work on Windows (no IO::Handle::sync).
Does anybody know of an alternative testkit? A C one would be best, I
suppose, as it could exactly mimic PostgreSQL.
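
For illustration, the general idea behind such a testkit can be sketched in a few lines of Python. This is an assumption-laden outline, not a port of the Perl script and not a C mimic of PostgreSQL: it appends numbered records, fsyncs each one, and only then reports it as durable through a hypothetical `on_ack` callback (in a real test that acknowledgement would go to another machine). After the power cut, any acknowledged record missing from disk means something lied about the flush.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<Q")  # one 8-byte little-endian sequence number per record


def write_records(path, count, on_ack):
    """Append `count` sequence numbers, fsync each, then report it as durable."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for seq in range(1, count + 1):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)
            on_ack(seq)  # hypothetical hook: a real test would notify a remote host
    finally:
        os.close(fd)


def last_durable(path):
    """Return n for the longest unbroken run 1..n of records found on disk."""
    expected = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            (seq,) = RECORD.unpack(chunk)
            if seq != expected + 1:
                break
            expected = seq
    return expected


def lost_writes(path, last_acked):
    """Count acknowledged records that are missing after restart."""
    return max(0, last_acked - last_durable(path))


# Demo: write five records, then chop off the last two to simulate a lying cache.
acked = []
demo_path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(demo_path, 5, acked.append)
with open(demo_path, "r+b") as f:
    f.truncate(3 * RECORD.size)
print("lost acknowledged writes:", lost_writes(demo_path, acked[-1]))
```

The acknowledgements have to be kept somewhere that survives the power cut, which is why tools of this kind typically involve a second machine.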

Cheers,

James Sewell

James Sewell

james.sewell@jirotech.com

about 9 years ago

In reply to: James Sewell (#5)

Re: PostgreSQL corruption

OK,

So with some help from the IRC channel (thanks macdice and JanniCash) it has
come to light that my RAID1, made up of 2 x 7200 RPM disks, is reporting
~500 ops/sec in pg_test_fsync.

This is higher than the ~120 ops/sec which you would expect from 7200 RPM
disks - therefore something is lying.
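
The ~120 figure comes from rotation speed: a write that must really reach the platter before it is acknowledged can complete at most about once per revolution when the same location is rewritten, and 7200 revolutions per minute is 120 per second. A small sketch of that plausibility check follows; the function names and the 25% tolerance are illustrative assumptions.

```python
def max_sync_ops_per_sec(rpm: float) -> float:
    """Rough ceiling on honest single-threaded sync writes: one per revolution."""
    return rpm / 60.0


def looks_cached(measured_ops_per_sec: float, rpm: float, tolerance: float = 1.25) -> bool:
    """True if the measured rate exceeds the rotational ceiling by more than `tolerance`."""
    return measured_ops_per_sec > max_sync_ops_per_sec(rpm) * tolerance


print(max_sync_ops_per_sec(7200))  # ceiling for a 7200 RPM disk
print(looks_cached(500, 7200))     # the RAID1 result from pg_test_fsync
print(looks_cached(50, 7200))      # the JBOD result
```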

Breaking up the RAID and re-imaging with JBOD dropped this to 50 ops/sec.
That raises another question, but it still looks like a real result.

So in this case it looks like the RAID controller wasn't disabling caching
as advertised.

Cheers,

James Sewell

Merlin Moncure

mmoncure@gmail.com

about 9 years ago

In reply to: James Sewell (#6)

Re: PostgreSQL corruption

On Tue, Feb 14, 2017 at 7:23 PM, James Sewell <james.sewell@jirotech.com> wrote:

> OK,
>
> So with some help from the IRC channel (thanks macdice and JanniCash)
> it's come to light that my RAID1 comprised of 2 * 7200RPM disks is
> reporting ~500 ops/sec in pg_test_fsync.
>
> This is higher than the ~120 ops/sec which you would expect from 7200RPM
> disks - therefore something is lying.
>
> Breaking up the RAID and re-imaging with JBOD dropped this to 50 ops/sec -
> another question but still looking like a real result.
>
> So in this case it looks like the RAID controller wasn't disabling caching
> as advertised.

yup -- that's the thing. Performance numbers really tell the whole story (or
at least most of it). If it's too good to be true, it is. These days,
honestly, I'd just throw out the RAID controller and install some Intel SSD
drives.

merlin

James Sewell

james.sewell@jirotech.com

about 9 years ago

In reply to: Merlin Moncure (#7)

Re: PostgreSQL corruption

Sadly this is for a customer who has 3000 of these in the field, and the
RAID controller is on the motherboard.

At least they know where to point the finger now!

Cheers,

James Sewell

John R Pierce

pierce@hogranch.com

about 9 years ago

In reply to: James Sewell (#8)

Re: PostgreSQL corruption

On 2/16/2017 6:48 PM, James Sewell wrote:

> Sadly this is for a customer who has 3000 of these in the field, the
> raid controller is on the motherboard.

If that's the usual Intel "Matrix" RAID, that's not actually a RAID
controller. It's Intel SATA in fake RAID mode; the RAID is done entirely
in host software.

--
john r pierce, recycling bits in santa cruz
