Index corruption

Started by Marc Munro · 38 messages · pgsql-hackers
#1Marc Munro
marc@bloodnok.com

We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.

The latest occurrence was on our production cluster yesterday. This has
only happened since we performed kernel upgrades and we are uncertain
whether this represents a kernel bug, or a postgres bug exposed by
different timings in the new kernel.

Our systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

In the latest episode, the index corruption was coincident with a
slony-induced vacuum. I don't know if this was the case with our test
system failures.

What can we do to help identify the cause of this? I believe we will be
able to reproduce this on a test system if there is some useful
investigation we can perform.

__
Marc

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#1)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.

Please be more specific about what you mean by "index corruption"
... what were the exact symptoms?

postgresql-8.0.3-1PGDG

The *first* thing you should do is update to 8.0.8 and see if the
problem is still there.

regards, tom lane

#3Marc Munro
marc@bloodnok.com
In reply to: Marc Munro (#1)
Re: Index corruption

On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded
slony to 1.1.5, due to some rpm issues. Apart from that everything is
as described below.

We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.

What can we do to help identify the cause of this? We have a test
system that seems able to reproduce this fairly easily.

__
Marc


On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote:

We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.

The latest occurrence was on our production cluster yesterday. This has
only happened since we performed kernel upgrades and we are uncertain
whether this represents a kernel bug, or a postgres bug exposed by
different timings in the new kernel.

Our systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

In the latest episode, the index corruption was coincident with a
slony-induced vacuum. I don't know if this was the case with our test
system failures.

__
Marc

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#3)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

On Tom Lane's advice, we upgraded to Postgres 8.0.8.

OK, so it's not an already-known problem.

We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.

You still haven't given any details. What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?

regards, tom lane

#5Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#4)
Re: Index corruption

On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:

OK, so it's not an already-known problem.

We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.

You still haven't given any details. What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?

We have a two-node slony cluster with load tests running against the
provider node.

After resyncing the subscriber, and running load tests for about an
hour, the slony subscriber begins to fail. From the log file:

2006-06-28 17:58:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22001','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22010','0');
" ERROR: duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.

__
Marc

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#5)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.
Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.

Good. How big is the log table --- would it be possible for you to send
me the physical table and index files? (Not a pg_dump, the actual disk
files)

regards, tom lane

#7Marc Munro
marc@bloodnok.com
In reply to: Marc Munro (#1)
Re: Index corruption

We have reproduced the problem again. This time it looks like vacuum is
not part of the picture. From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread: 101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374740387','22008','4000');

If my maths is correct and the logs are honest, the vacuum would have
started at 14:01:00 CST (13:01 PDT), about 20 seconds after we first
encounter the problem. The clocks on the two machines, though in
different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes.
Should be quite a bit smaller than the previous one.

__
Marc

#8Chris Browne
cbbrowne@acm.org
In reply to: Marc Munro (#1)
Re: Index corruption

marc@bloodnok.com (Marc Munro) writes:

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.

He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release. We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
--
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There is no psychiatrist in the world like a puppy licking your
face." -- Ben Williams

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#1)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

Tom,
we have a newer and much smaller (35M) file showing the same thing:

Thanks. Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:

idx1:

Item 190 -- Length: 24 Offset: 3616 (0x0e20) Flags: USED
Block Id: 27806 linp Index: 2 Size: 24
Has Nulls: 0 Has Varwidths: 0

0e20: 00009e6c 02001800 01000000 0057d123 ...l.........W.#
0e30: 1a781200 00000000 .x......

Item 191 -- Length: 24 Offset: 3592 (0x0e08) Flags: USED
Block Id: 27806 linp Index: 2 Size: 24
Has Nulls: 0 Has Varwidths: 0

0e08: 00009e6c 02001800 01000000 0057d123 ...l.........W.#
0e18: 2e781200 00000000 .x......

idx2:

Item 127 -- Length: 16 Offset: 6144 (0x1800) Flags: USED
Block Id: 27806 linp Index: 2 Size: 16
Has Nulls: 0 Has Varwidths: 0

1800: 00009e6c 02001000 0057d123 00000000 ...l.....W.#....

Item 128 -- Length: 16 Offset: 6128 (0x17f0) Flags: USED
Block Id: 27806 linp Index: 2 Size: 16
Has Nulls: 0 Has Varwidths: 0

17f0: 00009e6c 02001000 0057d123 00000000 ...l.....W.#....

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...

regards, tom lane

#10Hannu Krosing
hannu@tm.ee
In reply to: Chris Browne (#8)
Re: Index corruption

On Thu, 2006-06-29 at 16:42, Chris Browne wrote:

marc@bloodnok.com (Marc Munro) writes:

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.

He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release. We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.

How well did you check the C-language triggers and special slony
functions for possibly corrupting some backend/shared-mem structures ?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#11Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#9)
Re: Index corruption

On Thu, 2006-06-29 at 17:23, Tom Lane wrote:

Marc Munro <marc@bloodnok.com> writes:

Tom,
we have a newer and much smaller (35M) file showing the same thing:

Thanks. Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:

...

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables ?

I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers.

The thing that seemed to mess something up was when a change on a parent
row fired a trigger that changed child table rows, and those rows fired
another trigger that changed the same parent row again.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#12Marc Munro
marc@bloodnok.com
In reply to: Hannu Krosing (#11)
Re: Index corruption

On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:

Marc: do you have triggers on some replicated tables ?

We have a non-slony trigger on only 2 tables, neither of them involved
in this transaction. We certainly have no circular trigger structures.

I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers.

The thing that seemed to mess something up was when a change on a parent
row fired a trigger that changed child table rows, and those rows fired
another trigger that changed the same parent row again.

__
Marc

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: Index corruption

I wrote:

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...

Further study makes this look more probable. It seems that most of the
activity around the trouble spot consists of transactions inserting
exactly 13 rows into the table; for instance here are traces of two
successive transactions:

Block Item XMIN CMIN

27800 7 600921860 3
27800 8 600921860 7
27800 9 600921860 11
27800 10 600921860 15
27800 11 600921860 19
27800 12 600921860 23
27800 13 600921860 27
27800 14 600921860 31
27800 15 600921860 35
27800 16 600921860 39
27800 17 600921860 43
27800 18 600921860 47
27800 19 600921860 51

27800 20 600921870 3
27800 21 600921870 7
27800 22 600921870 11
27800 23 600921870 15
27800 24 600921870 19
27800 25 600921870 23
27800 26 600921870 27
27800 27 600921870 31
27800 28 600921870 35
27800 29 600921870 39
27800 30 600921870 43
27800 31 600921870 47
27800 32 600921870 51

The pattern of CMIN values is the same for all these transactions.
But look at the rows inserted by 600921856 and 600921858, the two
transactions that seem to be involved with the problem on page
27806:

27787 50 600921856 3
27795 41 600921858 3
27795 42 600921858 7
27795 43 600921858 11
27795 44 600921858 15
27795 45 600921858 19
27795 46 600921858 23
27795 47 600921858 27
27795 48 600921858 31
27795 49 600921858 35
27795 50 600921858 39
27795 51 600921858 43
27806 1 600921858 47
27806 2 600921856 15
27806 3 600921856 19
27806 4 600921856 23
27806 5 600921856 27
27806 6 600921856 31
27806 7 600921856 35
27806 8 600921856 39
27806 9 600921856 43
27806 10 600921856 47
27806 11 600921856 51
27806 12 600921858 51

(27787,50) and (27795,51) are both the last rows on their pages. What's
evidently happened is that the two transactions filled those pages and
then both seized on 27806 as the next page to insert into. I think that
600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on
that page, and then something wiped the page, then 600921858 inserted
its CMIN 47 row as item 1, and then 600921856 got control back and
finished inserting its rows. Further study of the indexes shows that
there are two entries in each index pointing to each of (27806,1) and
(27806,2) --- but since the xxid values are different for the two
(27806,1) entries, those didn't show up as duplicates in my first look.

Given the apparent connection to vacuuming, this is looking a WHOLE
lot like this bug fixed in 8.0.3:

2005-05-07 17:32 tgl

* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c: Repair very-low-probability race condition
between relation extension and VACUUM: in the interval between
adding a new page to the relation and formatting it, it was
possible for VACUUM to come along and decide it should format the
page too. Though not harmful in itself, this would cause data loss
if a third transaction were able to insert tuples into the vacuumed
page before the original extender got control back.

Are you *certain* this slave isn't running 8.0.2 or older? If you can
verify that, then I guess we need to look for another mechanism that
could cause the same kind of thing.

regards, tom lane

#14Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#13)
Re: index corruption

On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:

Ummm ... you did restart the server? "select version();" would be
the definitive test.

Can't say I blame you for the skepticism but I have confirmed it again.

test=# select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 8.0.8 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
3.4.6 20060404 (Red Hat 3.4.6-3)
(1 row)

__
Marc

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#1)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

If there's anything we can do to help debug this we will. We can apply
patches, different build options, etc.

One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case. I don't have a lot of hope for that, but it's
something that would require only machine time not people time.

A couple other things to try, given that you can provoke the failure
fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked. So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal(). Then rebuild with asserts enabled
and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:

/*
* If the FSM knows nothing of the rel, try the last page before
* we give up and extend. This avoids one-tuple-per-page syndrome
* during bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

if (nblocks > 0)
targetBlock = nblocks - 1;
}

If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it. AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?

A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it. More eyeballs on the
problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure. This would give us an idea of what was
happening concurrently with the processes that seem directly involved.

regards, tom lane

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#1)
Re: Index corruption

[ back to the start of the thread... ]

Marc Munro <marc@bloodnok.com> writes:

We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
...
NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.
All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone. I suppose
that this would result in the slave missing some rows that are present
on the master. Have you tried comparing slave and master databases to
see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned? The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.

regards, tom lane

#17Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#15)
Re: Index corruption

On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:

One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case. I don't have a lot of hope for that, but it's
something that would require only machine time not people time.

I'll try this tomorrow.

A couple other things to try, given that you can provoke the failure
fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked. So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal(). Then rebuild with asserts enabled
and see if the failure case provokes that assert.

I'll try this too.

2. I'm also eyeing this bit of code in hio.c:

/*
* If the FSM knows nothing of the rel, try the last page before
* we give up and extend. This avoids one-tuple-per-page syndrome
* during bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

if (nblocks > 0)
targetBlock = nblocks - 1;
}

If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it. AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?

By dike out, you mean remove? Please confirm and I'll try it.

A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it. More eyeballs on the
problem are always better.

Can't really see this being possible. This is clearly a very unusual
problem and without similar hardware I doubt that anyone else will
trigger it. We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter). Also the load is provided by a bunch of servers and
robots simulating rising and falling load.

Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure. This would give us an idea of what was
happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

__
Marc

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#17)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

By dike out, you mean remove? Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter).

I'm starting to think that a kernel bug is a realistic explanation,
see other message. Whose kernel build is this exactly?

regards, tom lane

#19Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#16)
Re: Index corruption

On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:

[ back to the start of the thread... ]

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone. I suppose
that this would result in the slave missing some rows that are present
on the master. Have you tried comparing slave and master databases to
see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as
the new provider and rebuild the cluster. I'll check the logs from our
production failure to figure out what to compare and see what I can
discover.

* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned? The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.

Unfortunately we haven't got any local storage that can stand the sort
of loads we are putting through. With slower storage the CPUs mostly
sit idle and we are very unlikely to trigger a timing-based bug if
that's what it is.

I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.

Many thanks for your efforts so far.
__
Marc

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#19)
Re: Index corruption

Marc Munro <marc@bloodnok.com> writes:

I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in
Linux NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related
...

regards, tom lane

#21Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#19)
#24Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#22)
#25Marko Kreen
markokr@gmail.com
In reply to: Jan Wieck (#24)
#26Jan Wieck
JanWieck@Yahoo.com
In reply to: Marko Kreen (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#26)
#28Brad Nicholson
bnichols@ca.afilias.info
In reply to: Tom Lane (#20)
#29Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#27)
#30Brad Nicholson
bnichols@ca.afilias.info
In reply to: Brad Nicholson (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brad Nicholson (#28)
#32Brad Nicholson
bnichols@ca.afilias.info
In reply to: Tom Lane (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#29)
#34Hannu Krosing
hannu@tm.ee
In reply to: Jan Wieck (#29)
#35Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#15)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Munro (#35)
#37Marc Munro
marc@bloodnok.com
In reply to: Tom Lane (#36)
#38mdean
mdean@xn1.com
In reply to: Marc Munro (#37)