Lost rows/data corruption?

Started by Andrew Hall · about 21 years ago · 24 messages · general
#1Andrew Hall
temp02@bluereef.com.au

Hello,

We have a long-running DB application using PG 7.4.6. We do a VACUUM FULL
every night and a normal 'maintenance' VACUUM every hour. We do nothing with
any indexes. Every now and then we get errors from the database whereby an
update will fail on a table, saying that there is a duplicate violation on a
primary-keyed row. Theoretically this is impossible, as the constraint should
prevent any duplication, ever. When the next vacuum is run, we see an error
saying that there are more rows than the database can account for (or words
to that effect) and that we should either REINDEX or TRUNCATE the table.

Firstly, how and why could this be happening, and above all, how do we
prevent it?

Thanks and regards,

Andrew.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Hall (#1)
Re: Lost rows/data corruption?

"Andrew Hall" <temp02@bluereef.com.au> writes:

We have a long running DB application using PG7.4.6. We do a VACUUM FULL
every night and a normal 'maintenance' VACUUM every hour. We do nothing with
any indexes. Every now and then we get errors from the database whereby an
update will fail on a table saying that there is duplicate violation on a
primary keyed row. Theoretically this is impossible as the constraint should
prevent any duplication, ever. When the next vacuum is run, we see an error
saying that there are more rows than the database can account for (or words
to that effect) and we should either REINDEX or TRUNCATE the table.

Could we see the exact log trace, rather than handwaving? Also the
schemas of the table(s)/index(es) involved might be useful.

regards, tom lane

#3Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#1)
Re: Lost rows/data corruption?

Tom,

Here is the data you requested. It took a little while to gather, as this
kind of corruption doesn't happen all the time.

The first sign that we know something is wrong is our application freezing
when communicating with the DB.

If we then issue a vacuum, here's the output:

sonar=# VACUUM FULL;
WARNING: index "user_session_pkey" contains 85613 row versions, but table
contains 85513 row versions
HINT: Rebuild the index with REINDEX.
WARNING: index "user_session_pkey" contains 85613 row versions, but table
contains 85513 row versions
HINT: Rebuild the index with REINDEX.
VACUUM

If we then try a reindex, here's the output:

sonar=# REINDEX TABLE user_session;
ERROR: could not create unique index
DETAIL: Table contains duplicated values.

There are duplicates and massive amounts of erroneous data in most columns
in the table. From here, no amount of row deletion seems to clear the issue;
the only remedy is a truncate on the table and a repopulate. In fact, if we
try even a select we get this:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Here is the schema for the table:

sonar=# \d user_session
              Table "public.user_session"
     Column     |            Type             |                     Modifiers
----------------+-----------------------------+----------------------------------------------------
 user_id        | integer                     | not null
 session_id     | integer                     | not null
 group_id       | integer                     |
 src            | character varying(22)       | not null
 hit_bytes_in   | bigint                      |
 miss_bytes_in  | bigint                      |
 logon_time     | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 logoff_time    | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 account_active | character(1)                | default 'Y'::bpchar
 hit_bytes_out  | bigint                      |
 miss_bytes_out | bigint                      |
 cost_bytes_in  | double precision            |
 cost_bytes_out | double precision            |
 time_cost      | double precision            |
Indexes:
    "user_session_pkey" primary key, btree (user_id, session_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (user_id) REFERENCES user_table(user_id) ON DELETE CASCADE
Triggers:
    delsessionusagetrigger AFTER DELETE ON user_session FOR EACH ROW EXECUTE PROCEDURE delsessionusagefunc()

I'm most concerned that a primary constraint is being actively violated, and
I don't understand how this is possible.

Any help appreciated.
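[Editor's note: one hedged way to at least enumerate the duplicates, assuming the heap itself is still readable, is to disable index scans for the session so the damaged index is never consulted, then group on the primary-key columns (names taken from the schema above). The `ctid` system column identifies each physical row version, so surplus copies can be deleted by ctid before retrying REINDEX. This is a cleanup sketch, not a fix for whatever is corrupting the index:]

```sql
-- Session-local: stop the planner from using the damaged index,
-- so every lookup below reads the heap directly.
SET enable_indexscan = off;

-- List every physical row version whose primary-key value occurs
-- more than once. ctid gives the on-disk location of each version,
-- so surplus copies can be removed individually, e.g.
--   DELETE FROM user_session WHERE ctid = '(1234,5)';
SELECT ctid, user_id, session_id
FROM user_session
WHERE (user_id, session_id) IN (
    SELECT user_id, session_id
    FROM user_session
    GROUP BY user_id, session_id
    HAVING count(*) > 1
)
ORDER BY user_id, session_id;
```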

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Hall (#3)
Re: Lost rows/data corruption?

"Andrew Hall" <temp02@bluereef.com.au> writes:

Here is the data you requested. It took little while to gather it as this
kind of corruption doesn't happen all the time.

It sounds like a mess, all right. Do you have a procedure to follow to
replicate this havoc? Are you sure there's not a hardware problem
underlying it all?

regards, tom lane

#5Andrew Hall
andrew.hall@bluereef.com.au
In reply to: Andrew Hall (#1)
Re: Lost rows/data corruption?


"Andrew Hall" <temp02@bluereef.com.au> writes:

Here is the data you requested. It took little while to gather it as this
kind of corruption doesn't happen all the time.

It sounds like a mess, all right. Do you have a procedure to follow to
replicate this havoc? Are you sure there's not a hardware problem
underlying it all?

regards, tom lane

We haven't been able to isolate what causes it, but it's unlikely to be
hardware, as it happens on quite a few of our customers' boxes. We also use
XFS on Linux 2.6 as the file system, so the FS should be fairly tolerant of
power outages. Any ideas as to how I might go about isolating this? Have you
heard any other reports of this kind, and suggested remedies?

#6Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#5)
Re: Lost rows/data corruption?

It sounds like a mess, all right. Do you have a procedure to follow to
replicate this havoc? Are you sure there's not a hardware problem
underlying it all?

regards, tom lane

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customer's boxes. We also use
XFS on linux 2.6 as a file system, so the FS should be fairly tolerant to
power-outages. Any ideas as to how I might go about isolating this? Have
you
heard any other reports of this kind and suggested remedies?

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

"Andrew Hall" <temp02@bluereef.com.au> writes:

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customer's boxes.

Okay, then not hardware; but it seems like you ought to be in a position
to create a test case for other people to poke at. I don't insist on
a 100% reproducible case, but something that will show the problem if
run for awhile would be a great help.

regards, tom lane

#8Geoffrey
esoteric@3times25.net
In reply to: Tom Lane (#7)
Re: Lost rows/data corruption?

Tom Lane wrote:

"Andrew Hall" <temp02@bluereef.com.au> writes:

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customer's boxes.

Okay, then not hardware; but it seems like you ought to be in a position
to create a test case for other people to poke at. I don't insist on
a 100% reproducible case, but something that will show the problem if
run for awhile would be a great help.

His original statement prompts a question in my mind. I may be wrong
here, but when he noted:

'We also use XFS on linux 2.6 as a file system, so the FS should be
fairly tolerant to power-outages.'

Is Andrew indicating here that there might be some issues with power
loss on some of these boxes? If so, is it reasonable to assume that the
filesystem is able to maintain the database integrity in such a power
loss? I understand that XFS is quite a robust file system, but I can't
see relying on such robustness for database integrity (or any file
integrity for that matter). UPS's might be a better solution.

So the actual question in my mind that I didn't see anyone touch on is,
is it safe to assume that a power outage will not affect the database
integrity based on the robustness of the file system type?

Personally, I would not rely on such, but I'd like to hear what the
Postgresql experts think about this issue.

Then again, I may have read too much into Andrew's post. Andrew, do you
assume there have been power issues with any of these machines? Are you
comfortable relying on the filesystem to deal with such issues?

Ideally, I would research any correlation between power outages and the
database problems. If there are no power outages to speak of, then
sorry for yapping up the wrong woody perennial plant.

Really just fishing for some insights here folks.

--
Until later, Geoffrey

#9Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Geoffrey (#8)
Re: Lost rows/data corruption?

On Tue, 2005-02-15 at 04:56, Geoffrey wrote:

Tom Lane wrote:

"Andrew Hall" <temp02@bluereef.com.au> writes:

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customer's boxes.

Okay, then not hardware; but it seems like you ought to be in a position
to create a test case for other people to poke at. I don't insist on
a 100% reproducible case, but something that will show the problem if
run for awhile would be a great help.

His original statement prompts a question in my mind. I may be wrong
here, but when he noted:

'We also use XFS on linux 2.6 as a file system, so the FS should be
fairly tolerant to power-outages.'

Is Andrew indicating here that there might be some issues with power
loss on some of these boxes? If so, is it reasonable to assume that the
filesystem is able to maintain the database integrity in such a power
loss? I understand that XFS is quite a robust file system, but I can't
see relying on such robustness for database integrity (or any file
integrity for that matter). UPS's might be a better solution.

If I were him I'd try running my database on a different file system to
see if his version of XFS might be causing these problems.

While I agree that frequent power loss is NOT something a database
should be exposed to, a properly setup machine with a properly
functioning journalling file system should not experience these
problems. Might be time to check the drive subsystem to make sure it's
properly fsyncing data.
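[Editor's note: the fsync check Scott suggests can be approximated without special tools by timing a loop of one-byte write+fsync cycles. A minimal sketch in Python; `fsync_rate` is a made-up helper name, and the ~120/second figure assumes a 7200 rpm rotating disk:]

```python
import os
import tempfile
import time

def fsync_rate(directory=None, iterations=200):
    """Time repeated one-byte write+fsync cycles; return cycles/second.

    A rotating disk that honestly flushes its cache pays at least one
    platter revolution per fsync, so a 7200 rpm drive tops out around
    120 cycles/second. Rates in the thousands suggest the drive (or
    something between the application and the platter) acknowledges
    fsync before the data is physically durable.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b"x")
            os.fsync(fd)
        elapsed = time.perf_counter() - start
        return iterations / elapsed
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    # Point `directory` at the filesystem under test for a real check.
    print("fsync cycles/second: %.0f" % fsync_rate())
```

Run it on the data directory's filesystem; a result wildly above the drive's rotational limit points at a lying write cache rather than at PostgreSQL.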

#10Marco Colombo
pgsql@esiway.net
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

On Tue, 15 Feb 2005, Andrew Hall wrote:

It sounds like a mess, all right. Do you have a procedure to follow to
replicate this havoc? Are you sure there's not a hardware problem
underlying it all?

regards, tom lane

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customer's boxes. We also use
XFS on linux 2.6 as a file system, so the FS should be fairly tolerant to
power-outages. Any ideas as to how I might go about isolating this? Have you
heard any other reports of this kind and suggested remedies?

Are you running with fsync = off? and did the hosts experience any
power-outage recently?

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo@ESI.it

#11Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

fsync is on for all these boxes. Our customers run their own hardware, with
many different specifications in use. Many of our customers don't have a UPS,
although their power is probably pretty reliable (normal city utilities),
but of course I can't guarantee they don't get an outage once in a while
with a thunderstorm etc.

The problem here is that we are consistently seeing the same kind of
corruption and symptoms across a fairly large number of customers (52 have
reported this problem), so there is something endemic happening here that,
to be honest, I'm surprised no one else is seeing. Fundamentally there is
nothing particularly abnormal about our application or data, but regardless,
I would have thought these kinds of things (application design, data
representation, etc.) irrelevant to the database's ability to reject
duplicate data on a primary key. Something is causing this corruption, and
one thing we do know is that it doesn't happen immediately with a new
installation; it takes time (several months of usage) before we start to see
this condition. I'd be really surprised if XFS were the problem, as I know
there are plenty of other people across the world using it reliably with PG.

We're going to see if we can build a test environment that can forcibly
cause this, but I don't hold out much hope, as we've tried to isolate it
before with little success. Here's what we tried changing when we originally
went searching for the problem, and it's still here:

- the hardware (tried single CPU instead of dual - though that may be an
issue with the OS)
- the OS version (tried Linux 2.6.5, 2.6.6, 2.6.7, 2.6.8.1, 2.6.10 and
2.4.22) - all using XFS
- the database table layout (tried changing the way the data is stored)
- the version of Jetty (servlet engine)
- the DB pool manager and PG JDBC driver versions
- the version of PG (tried two or three back from the latest)
- various vacuum regimes


#12Marco Colombo
pgsql@esiway.net
In reply to: Andrew Hall (#11)
Re: Lost rows/data corruption?

On Wed, 16 Feb 2005, Andrew Hall wrote:

fsync is on for all these boxes. Our customers run their own hardware with
many different specification of hardware in use. Many of our customers don't
have UPS, although their power is probably pretty reliable (normal city based
utilities), but of course I can't guarantee they don't get an outage once in
a while with a thunderstorm etc.

I see. Well, I can't help much then; I don't run PG on XFS. I suggest testing
on a different FS, to exclude XFS problems. But with fsync on, the FS has
very little to do with reliability, unless it _lies_ about fsync(). Any
FS should return from fsync() only after the data is on disc, journal or not
(there may be issues with metadata, but that's hardly a problem for PG).

It's more likely that the hardware (IDE disks) lies about the data being on
the platter. But again, that only matters in the case of sudden power-offs.

[...]

this condition. I'd be really surprised if XFS is the problem as I know there
are plenty of other people across the world using it reliability with PG.

This is kind of OT, but I don't follow your logic here.

I don't see why plenty of XFS+PG success stories suggest to you that
the culprit is PG. To me it's still 50%-50%. :-)

Moreover, XFS is continuously updated (it follows the normal fast Linux
kernel release cycle, like any other Linux FS), so it's hard to draw a
conclusion unless someone else is using _exactly_ the same versions as
you do.

For example, in kernel changelog from 2.6.7 to 2.6.10 you can read:

"[XFS] Fix a race condition in the undo-delayed-write buffer routine."

"[XFS] Fix up memory allocators to be more resilient."

"[XFS] Fix a possible data loss issue after an unaligned unwritten
extent write."

"[XFS] handle inode creating race"

(only a few of them)

Now, I don't have even the faintest idea whether any of those might have
affected you, but the point remains that the Linux kernel changes a lot,
and vendors tend to customize their kernels a lot, too. On the PostgreSQL
side, releases are slower-paced, so it's easier to compare notes.

Anyway, I agree that your problem is weird, and that it must be something
on the server side.
No matter what you do on the client side (pool manager, JDBC driver,
servlet engine), in no way should the DB end up corrupted with duplicated
primary keys.

I know this is a silly question, but when you write 'We do nothing with
any indexes', do you mean the indexes are never, _never_ touched (I mean
explicitly, as in drop/create index), i.e. they are created at schema
creation time and then left alone? Just to make sure...

.TM.
--
Marco Colombo

#13Alban Hertroys
alban@magproductions.nl
In reply to: Marco Colombo (#12)
Re: Lost rows/data corruption?

Marco Colombo wrote:

On Wed, 16 Feb 2005, Andrew Hall wrote:

fsync is on for all these boxes. Our customers run their own hardware
with many different specification of hardware in use. Many of our
customers don't have UPS, although their power is probably pretty
reliable (normal city based utilities), but of course I can't
guarantee they don't get an outage once in a while with a thunderstorm
etc.

I see. Well I can't help much, then, I don't run PG on XFS. I suggest
testing
on a different FS, to exclude XFS problems. But with fsync on, the FS has
very little to do with reliability, unless it _lies_ about fsync(). Any
FS should return from fsync only after data is on disc, journal or not
(there might be issues with meta-data, but it's hardly a problem with PG).

It's more likely the hardware (IDE disks) lies about data being on plate.
But again that's only in case of sudden poweroffs.

Do you happen to have the same type of disks in all these systems? That
could point to a disk-cache "problem" (e.g. the disks lying about having
written data from the cache to the platter).

Or do you use the same disk parameters on all these machines? Have you
tried using the disks without write caching and/or in synchronous mode
(as opposed to "async")?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

#14Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Alban Hertroys (#13)
Re: Lost rows/data corruption?

On Wed, 2005-02-16 at 07:14, Alban Hertroys wrote:

Marco Colombo wrote:

On Wed, 16 Feb 2005, Andrew Hall wrote:

fsync is on for all these boxes. Our customers run their own hardware
with many different specification of hardware in use. Many of our
customers don't have UPS, although their power is probably pretty
reliable (normal city based utilities), but of course I can't
guarantee they don't get an outage once in a while with a thunderstorm
etc.

I see. Well I can't help much, then, I don't run PG on XFS. I suggest
testing
on a different FS, to exclude XFS problems. But with fsync on, the FS has
very little to do with reliability, unless it _lies_ about fsync(). Any
FS should return from fsync only after data is on disc, journal or not
(there might be issues with meta-data, but it's hardly a problem with PG).

It's more likely the hardware (IDE disks) lies about data being on plate.
But again that's only in case of sudden poweroffs.

Do you happen to have the same type disks in all these systems? That
could point to a disk cache "problem" (f.e. the disks lying about having
written data from the cache to disk).

Or do you use the same disk parameters on all these machines? Have you
tried using the disks w/o write caching and/or in synchronous mode
(contrary to "async").

I was wondering whether this problem had ever shown up on a machine that
HADN'T lost power abruptly. If, and only if, the machines that experience
corruption have all lost power at some point beforehand, then I would
look toward the drives, the controller, the file system, or somewhere
in that stack.

I know there are write modes in ext3 that will allow corruption on power
loss (I think it's writeback). I know little of XFS in a production
environment, as I run ext3, warts and all.

#15Marco Colombo
pgsql@esiway.net
In reply to: Scott Marlowe (#14)
Re: Lost rows/data corruption?

On Wed, 16 Feb 2005, Scott Marlowe wrote:

I know there are write modes in ext3 that will allow corruption on power
loss (I think it's writeback). I know little of XFS in a production
environment, as I run ext3, warts and all.

Yeah, but even in writeback mode, ext3 doesn't lie on fsync. No FS does.

Since PG can't expect any data to be on disk _before_ fsync completes,
it doesn't really make a difference. You can lose data in writeback mode
_if_ the application is not fsync-ing it (XFS's only "mode" is similar to
writeback). I'm not aware of any case in which the system can lie about
fsync(), unless the hardware is lying in turn.

One question for the gurus: does PG fsync() dirty data pages when they
are flushed to disk at checkpoint time? Does it fsync() the directory
in the case of file creation/deletion/rename?

.TM.
--
Marco Colombo

#16Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

I know this is a silly question, but when you write 'We do nothing with
any indexes' do you mean indeces are never, _never_ touched (I mean
explicitly, as in drop/create index), i.e. they are created at schema
creation time and then left alone? Just to make sure...

Hi and thanks for your feedback,

Yes, we never touch them; they are implicitly created at schema creation
time and never touched afterwards.

#17Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

Do you happen to have the same type disks in all these systems? That could
point to a disk cache "problem" (f.e. the disks lying about having written
data from the cache to disk).

Or do you use the same disk parameters on all these machines? Have you
tried using the disks w/o write caching and/or in synchronous mode
(contrary to "async").

It's all pretty common stuff; quite a few customers use standard IDE
(various flavours of controller/disk), some now use SATA (again, various
brands), and the rest use SCSI. The kernel we use is the standard Linus-
approved kernel with the in-built drivers. We don't supply any non-default
parameters to the disk controllers.

Thanks for the suggestion on write caching; I'll look into it. I'm also
tempted to try a different journalling FS.
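[Editor's note: on IDE disks of that era, the on-drive write cache can be inspected and disabled with hdparm. The device name below is just an example, and the setting does not survive a reboot, so it would need to go in an init script:]

```shell
# Check whether the drive's write cache is currently enabled
# (look for the "write cache" line in the features list):
hdparm -I /dev/hda | grep -i 'write cache'

# Disable the on-drive write cache so a completed fsync really
# means "on the platter". Repeat for each disk in the box:
hdparm -W 0 /dev/hda
```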

#18Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)
Re: Lost rows/data corruption?

I was wondering if this problem had ever shown up on a machine that
HADN'T lost power abrubtly or not. IFF the only machines that
experience corruption have lost power beforehand sometime, then I would
look towards either the drives, controller or file system or somewhere
in there.

I can't be sure. We have an automated maintenance process that reboots all
our customers' machines every 10 days at 2am. Having said this, I'm now
wondering if this may have something to do with the issue. This automated
process issues a 'shutdown' to the database (and all other processes), waits
20 seconds, and then issues a 'reboot' to the kernel. If the database is
still processing, the active postmaster process may wait for the client to
complete its query before closing, but I'm assuming that if this exceeds
20 seconds, the kernel will issue a SIGQUIT to the process tree and reboot
immediately. Could this cause corruption?
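[Editor's note: a stop-with-timeout step avoids yanking the process out mid-write: ask for a clean exit first and escalate only after the grace period. A minimal sketch, using a dummy `sleep` in place of the real postmaster; with PostgreSQL itself one would normally run `pg_ctl stop -m fast` and check its exit status rather than signalling by hand:]

```shell
sleep 100 &                 # stand-in for the long-running server process
pid=$!

kill -TERM "$pid"           # polite request to exit (not SIGQUIT/SIGKILL)

grace=20                    # seconds to wait before escalating
while [ "$grace" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
    sleep 1
    grace=$((grace - 1))
done

if kill -0 "$pid" 2>/dev/null; then
    echo "still alive after grace period; escalating to SIGKILL"
    kill -KILL "$pid"
else
    echo "process exited within the grace period"
fi
```

The key difference from the 20-second hard reboot described above is that the escalation happens only after confirming the process is actually gone, never while it may still be flushing data.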

#19Michael Fuhr
mike@fuhr.org
In reply to: Andrew Hall (#18)
Re: Lost rows/data corruption?

On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:

We have an automated maintenance process that reboots all our
customers machines every 10 days at 2am.

What's the purpose of doing this? If it's necessary then the reboots
aren't really fixing anything. Is whatever problem that prompted
this procedure being investigated so a permanent fix can be applied?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Hall (#18)
Re: Lost rows/data corruption?

"Andrew Hall" <temp02@bluereef.com.au> writes:

I can't be sure. We have an automated maintenance process that reboots all
our customers machines every 10 days at 2am.

Why? Sounds like a decision made by someone who is used to Windows.
I've never seen any variant of Unix that needed that.

regards, tom lane

#21Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)
#22Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Andrew Hall (#21)
#23Keith C. Perry
netadmin@vcsn.com
In reply to: Andrew Hall (#17)
#24Andrew Hall
temp02@bluereef.com.au
In reply to: Andrew Hall (#6)