Index corruption
We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.
The latest occurrence was on our production cluster yesterday. This has
only happened since we performed kernel upgrades and we are uncertain
whether this represents a kernel bug, or a postgres bug exposed by
different timings in the new kernel.
Our systems are:
Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0
NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.
All postgres data and logs are stored on the netapp.
In the latest episode, the index corruption was coincident with a
slony-induced vacuum. I don't know if this was the case with our test
system failures.
What can we do to help identify the cause of this? I believe we will be
able to reproduce this on a test system if there is some useful
investigation we can perform.
__
Marc
Marc Munro <marc@bloodnok.com> writes:
We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.
Please be more specific about what you mean by "index corruption"
... what were the exact symptoms?
postgresql-8.0.3-1PGDG
The *first* thing you should do is update to 8.0.8 and see if the
problem is still there.
regards, tom lane
On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded
slony to 1.1.5, due to some rpm issues. Apart from that everything is
as described below.
We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.
What can we do to help identify the cause of this? We have a test
system that seems able to reproduce this fairly easily.
__
Marc
Marc Munro <marc@bloodnok.com> writes:
On Tom Lane's advice, we upgraded to Postgres 8.0.8.
OK, so it's not an already-known problem.
We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.
You still haven't given any details. What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?
regards, tom lane
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
OK, so it's not an already-known problem.
We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at the
time the subscriber failed.
You still haven't given any details. What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?
We have a two node slony cluster with load tests running against the
provider node.
After resyncing the subscriber, and running load tests for about an
hour, the slony subscriber begins to fail. From the log file:
2006-06-28 17:58:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22001','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22010','0');
" ERROR: duplicate key violates unique constraint "table_trans_attr_pk"
As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.
Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.
__
Marc
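A quick way to check whether that duplication is visible in the sl_log_1 heap
itself, rather than only through its indexes, is sketched below. The Slony schema
name "_mycluster" is an assumption; substitute the real cluster name.

-- On the subscriber: which copy of the conflicting key is already present?
SELECT ctid, xmin, *
  FROM public.table_trans_attribute
 WHERE table_transaction_id = '374520943'
   AND attribute_type = '22007';

-- On the provider: discourage index scans so duplicated index entries cannot
-- inflate the counts, then look for log rows that describe the same action
-- more than once ((log_origin, log_actionseq) should be unique).
SET enable_indexscan = off;
SELECT log_origin, log_actionseq, count(*)
  FROM "_mycluster".sl_log_1
 GROUP BY log_origin, log_actionseq
HAVING count(*) > 1;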
Marc Munro <marc@bloodnok.com> writes:
As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.
Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.
Good. How big is the log table --- would it be possible for you to send
me the physical table and index files? (Not a pg_dump, the actual disk
files)
regards, tom lane
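For reference, a sketch of how to locate those physical files, assuming default
tablespaces and, again, a hypothetical Slony schema named "_mycluster":

-- The heap and index files live under $PGDATA/base/<database_oid>/<relfilenode>.
SELECT c.relname, c.relfilenode, d.oid AS database_oid
  FROM pg_class c, pg_namespace n, pg_database d
 WHERE c.relnamespace = n.oid
   AND n.nspname = '_mycluster'
   AND c.relname IN ('sl_log_1', 'sl_log_1_idx1', 'sl_log_1_idx2')
   AND d.datname = current_database();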
We have reproduced the problem again. This time it looks like vacuum is
not part of the picture. From the provider's log:
2006-06-29 14:02:41 CST DEBUG2 cleanupThread: 101.057 seconds for vacuuming
And from the subscriber's:
2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374740387','22008','4000');
If my maths is correct and the logs are honest, the vacuum would have
started at 14:01:00 CST (13:01 PDT), about 20 seconds after we first
encountered the problem. The clocks on the two machines, though in
different timezones, are currently synced.
Tom, I will create another tarball of the sl_log_1 table and indexes.
Should be quite a bit smaller than the previous one.
__
Marc
marc@bloodnok.com (Marc Munro) writes:
As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.
Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.
We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.
He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release. We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
--
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There is no psychiatrist in the world like a puppy licking your
face." -- Ben Williams
Marc Munro <marc@bloodnok.com> writes:
Tom,
we have a newer and much smaller (35M) file showing the same thing:
Thanks. Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:
idx1:
Item 190 -- Length: 24 Offset: 3616 (0x0e20) Flags: USED
Block Id: 27806 linp Index: 2 Size: 24
Has Nulls: 0 Has Varwidths: 0
0e20: 00009e6c 02001800 01000000 0057d123 ...l.........W.#
0e30: 1a781200 00000000 .x......
Item 191 -- Length: 24 Offset: 3592 (0x0e08) Flags: USED
Block Id: 27806 linp Index: 2 Size: 24
Has Nulls: 0 Has Varwidths: 0
0e08: 00009e6c 02001800 01000000 0057d123 ...l.........W.#
0e18: 2e781200 00000000 .x......
idx2:
Item 127 -- Length: 16 Offset: 6144 (0x1800) Flags: USED
Block Id: 27806 linp Index: 2 Size: 16
Has Nulls: 0 Has Varwidths: 0
1800: 00009e6c 02001000 0057d123 00000000 ...l.....W.#....
Item 128 -- Length: 16 Offset: 6128 (0x17f0) Flags: USED
Block Id: 27806 linp Index: 2 Size: 16
Has Nulls: 0 Has Varwidths: 0
17f0: 00009e6c 02001000 0057d123 00000000 ...l.....W.#....
However, the two entries in idx1 contain different data!!
What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...
regards, tom lane
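One way to see what now occupies those line pointers in the live table is a TID
scan; a sketch, with the "_mycluster" schema name again an assumption:

-- xmin and cmin show which transaction and command most recently wrote each slot.
SELECT ctid, xmin, cmin, log_origin, log_xid, log_actionseq
  FROM "_mycluster".sl_log_1
 WHERE ctid = '(27806,1)' OR ctid = '(27806,2)';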
On Thu, 2006-06-29 at 16:42, Chris Browne wrote:
marc@bloodnok.com (Marc Munro) writes:
As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.
Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem. I have not reindexed
the table this time as I do not want to destroy any usable evidence.
We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.
He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release. We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
How well did you check the C-language triggers and special slony
functions for possibly corrupting some backend/shared-mem structures ?
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thu, 2006-06-29 at 17:23, Tom Lane wrote:
Marc Munro <marc@bloodnok.com> writes:
Tom,
we have a newer and much smaller (35M) file showing the same thing:
Thanks. Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:
...
However, the two entries in idx1 contain different data!!
What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...
Marc: do you have triggers on some replicated tables ?
I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers.
The thing that seemed to mess something up inside there was when a change
on a parent row fired a trigger that changed child table rows, and those
rows fired another trigger that changed the same parent row again.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
Marc: do you have triggers on some replicated tables ?
We have a non-slony trigger on only 2 tables, neither of them involved
in this transaction. We certainly have no circular trigger structures.
I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers.
The thing that seemed to mess something up inside there was when a change
on a parent row fired a trigger that changed child table rows, and those
rows fired another trigger that changed the same parent row again.
__
Marc
I wrote:
What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row. We fixed one bug like this before ...
Further study makes this look more probable. It seems that most of the
activity around the trouble spot consists of transactions inserting
exactly 13 rows into the table; for instance here are traces of two
successive transactions:
Block Item XMIN CMIN
27800 7 600921860 3
27800 8 600921860 7
27800 9 600921860 11
27800 10 600921860 15
27800 11 600921860 19
27800 12 600921860 23
27800 13 600921860 27
27800 14 600921860 31
27800 15 600921860 35
27800 16 600921860 39
27800 17 600921860 43
27800 18 600921860 47
27800 19 600921860 51
27800 20 600921870 3
27800 21 600921870 7
27800 22 600921870 11
27800 23 600921870 15
27800 24 600921870 19
27800 25 600921870 23
27800 26 600921870 27
27800 27 600921870 31
27800 28 600921870 35
27800 29 600921870 39
27800 30 600921870 43
27800 31 600921870 47
27800 32 600921870 51
The pattern of CMIN values is the same for all these transactions.
But look at the rows inserted by 600921856 and 600921858, the two
transactions that seem to be involved with the problem on page
27806:
27787 50 600921856 3
27795 41 600921858 3
27795 42 600921858 7
27795 43 600921858 11
27795 44 600921858 15
27795 45 600921858 19
27795 46 600921858 23
27795 47 600921858 27
27795 48 600921858 31
27795 49 600921858 35
27795 50 600921858 39
27795 51 600921858 43
27806 1 600921858 47
27806 2 600921856 15
27806 3 600921856 19
27806 4 600921856 23
27806 5 600921856 27
27806 6 600921856 31
27806 7 600921856 35
27806 8 600921856 39
27806 9 600921856 43
27806 10 600921856 47
27806 11 600921856 51
27806 12 600921858 51
(27787,50) and (27795,51) are both the last rows on their pages. What's
evidently happened is that the two transactions filled those pages and
then both seized on 27806 as the next page to insert into. I think that
600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on
that page, and then something wiped the page, then 600921858 inserted
its CMIN 47 row as item 1, and then 600921856 got control back and
finished inserting its rows. Further study of the indexes shows that
there are two entries in each index pointing to each of (27806,1) and
(27806,2) --- but since the xxid values are different for the two
(27806,1) entries, those didn't show up as duplicates in my first look.
Given the apparent connection to vacuuming, this is looking a WHOLE
lot like this bug fixed in 8.0.3:
2005-05-07 17:32 tgl
* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c: Repair very-low-probability race condition
between relation extension and VACUUM: in the interval between
adding a new page to the relation and formatting it, it was
possible for VACUUM to come along and decide it should format the
page too. Though not harmful in itself, this would cause data loss
if a third transaction were able to insert tuples into the vacuumed
page before the original extender got control back.
Are you *certain* this slave isn't running 8.0.2 or older? If you can
verify that, then I guess we need to look for another mechanism that
could cause the same kind of thing.
regards, tom lane
On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:
Ummm ... you did restart the server? "select version();" would be
the definitive test.
Can't say I blame you for the skepticism but I have confirmed it again.
test=# select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 8.0.8 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
3.4.6 20060404 (Red Hat 3.4.6-3)
(1 row)
__
Marc
Marc Munro <marc@bloodnok.com> writes:
If there's anything we can do to help debug this we will. We can apply
patches, different build options, etc.
One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case. I don't have a lot of hope for that, but it's
something that would require only machine time not people time.
A couple other things to try, given that you can provoke the failure
fairly easily:
1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked. So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal(). Then rebuild with asserts enabled
and see if the failure case provokes that assert.
2. I'm also eyeing this bit of code in hio.c:
    /*
     * If the FSM knows nothing of the rel, try the last page before
     * we give up and extend. This avoids one-tuple-per-page syndrome
     * during bootstrapping or in a recently-started system.
     */
    if (targetBlock == InvalidBlockNumber)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

        if (nblocks > 0)
            targetBlock = nblocks - 1;
    }
If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it. AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?
A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it. More eyeballs on the
problem are always better.
Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure. This would give us an idea of what was
happening concurrently with the processes that seem directly involved.
regards, tom lane
[ back to the start of the thread... ]
Marc Munro <marc@bloodnok.com> writes:
We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.
Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
...
NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.
All postgres data and logs are stored on the netapp.
BTW, a couple of thoughts here:
* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone. I suppose
that this would result in the slave missing some rows that are present
on the master. Have you tried comparing slave and master databases to
see if you can find any discrepancies?
* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned? The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.
regards, tom lane
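On the first point, a low-tech sketch for the master/slave comparison: run the
same bucketed count on both databases and diff the output. The bucket size is
arbitrary, and it assumes table_trans_attribute's key is an integer type.

-- Any bucket whose count differs between master and slave narrows down where
-- replicated rows went missing.
SELECT table_transaction_id / 100000 AS bucket,
       count(*) AS rows_in_bucket
  FROM public.table_trans_attribute
 GROUP BY 1
 ORDER BY 1;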
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case. I don't have a lot of hope for that, but it's
something that would require only machine time not people time.
I'll try this tomorrow.
A couple other things to try, given that you can provoke the failure
fairly easily:
1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked. So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal(). Then rebuild with asserts enabled
and see if the failure case provokes that assert.
I'll try this too.
2. I'm also eyeing this bit of code in hio.c:
/*
* If the FSM knows nothing of the rel, try the last page before
* we give up and extend. This avoids one-tuple-per-page syndrome
* during bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it. AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?
By dike out, you mean remove? Please confirm and I'll try it.
A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it. More eyeballs on the
problem are always better.
Can't really see this being possible. This is clearly a very unusual
problem and without similar hardware I doubt that anyone else will
trigger it. We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter). Also the load is provided by a bunch of servers and
robots simulating rising and falling load.
Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure. This would give us an idea of what was
happening concurrently with the processes that seem directly involved.
Next time we reproduce it, I'll take a copy of the WAL files too.
__
Marc
Marc Munro <marc@bloodnok.com> writes:
By dike out, you mean remove? Please confirm and I'll try it.
Right, just remove (or comment out) the lines I quoted.
We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter).
I'm starting to think that a kernel bug is a realistic explanation,
see other message. Whose kernel build is this exactly?
regards, tom lane
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
[ back to the start of the thread... ]
BTW, a couple of thoughts here:
* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone. I suppose
that this would result in the slave missing some rows that are present
on the master. Have you tried comparing slave and master databases to
see if you can find any discrepancies?
Haven't done that yet - in test we tend to restart the old subscriber as
the new provider and rebuild the cluster. I'll check the logs from our
production failure to figure out what to compare and see what I can
discover.
* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned? The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.
Unfortunately we haven't got any local storage that can stand the sort
of loads we are putting through. With slower storage the CPUs mostly
sit idle and we are very unlikely to trigger a timing-based bug if
that's what it is.
I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.
Many thanks for your efforts so far.
__
Marc
Marc Munro <marc@bloodnok.com> writes:
I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.
Some googling turned up recent discussions about race conditions in
Linux NFS code:
http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381
I don't know the kernel nearly well enough to guess if these are related
...
regards, tom lane
On 6/30/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't know the kernel nearly well enough to guess if these are related
...
The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable. When having high enough activity on
one node or having nodes with XIDs on different enough positions
funny things happen.
Have you already excluded this as a probable reason?
--
marko
"Marko Kreen" <markokr@gmail.com> writes:
The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable. When having high enough activity on
one node or having nodes with XIDs on different enough positions
funny things happen.
Yeah, that was one of the first things I thought about, but the range of
XIDs involved in these test cases isn't large enough to involve a
wraparound, and anyway it's now looking like the problem is loss of heap
entries, not index corruption at all.
Slony's use of XID comparison semantics for indexes is definitely pretty
scary though. If I were them I'd find a way to get rid of it. In
theory, since the table is only supposed to contain "recent" XIDs,
as long as you keep it vacuumed the index should never contain any
inconsistently-comparable XIDs ... but in a big index, the boundary keys
for upper-level index pages might hang around an awful long time ...
regards, tom lane
I trawled through the first, larger dump you sent me, and found multiple
index entries pointing to quite a few heap tuples:
Occurrences block item
2 43961 1
2 43961 2
2 43961 3
2 43961 4
2 43961 5
2 43961 6
2 43961 7
2 43961 8
2 43961 9
2 119695 1
2 119695 2
2 119695 3
2 126029 1
2 126029 2
2 126029 3
2 166694 1
2 166865 1
2 166865 2
2 166865 3
2 166865 4
2 166865 5
2 166865 6
2 166865 7
2 206221 1
2 247123 1
2 327775 1
2 327775 2
2 327775 3
2 327775 4
2 327775 5
2 327775 6
2 327775 7
2 327775 8
2 327775 9
2 327775 10
2 327775 11
Both indexes show identical sets of duplicates, which makes it pretty
hard to credit that it's a within-index problem.
You mentioned that the test had been allowed to run for a good while
after the first slave error was noted. So it seems there's no question
that we are looking at some mechanism that allows the first few entries
on a heap page to be lost and overwritten :-(, and that this happened
several times over the course of the larger run.
regards, tom lane
On 6/30/2006 9:55 AM, Tom Lane wrote:
"Marko Kreen" <markokr@gmail.com> writes:
The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable. When having high enough activity on
one node or having nodes with XIDs on different enough positions
funny things happen.
Yeah, that was one of the first things I thought about, but the range of
XIDs involved in these test cases isn't large enough to involve a
wraparound, and anyway it's now looking like the problem is loss of heap
entries, not index corruption at all.
Slony's use of XID comparison semantics for indexes is definitely pretty
scary though. If I were them I'd find a way to get rid of it. In
theory, since the table is only supposed to contain "recent" XIDs,
as long as you keep it vacuumed the index should never contain any
inconsistently-comparable XIDs ... but in a big index, the boundary keys
for upper-level index pages might hang around an awful long time ...
With the final implementation of log switching, the problem of xxid
wraparound will be avoided entirely. Every now and then slony will
switch from one to another log table and when the old one is drained and
logically empty, it is truncated, which should reinitialize all indexes.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
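For what it's worth, the index-reinitialising effect of a truncate can be seen
directly: TRUNCATE gives a table and its indexes brand-new storage files (new
relfilenodes). A small sketch on a scratch table, since truncating a live
sl_log_1 by hand would of course destroy replication data:

CREATE TABLE scratch_log (id int4 PRIMARY KEY);
SELECT relname, relfilenode FROM pg_class
 WHERE relname IN ('scratch_log', 'scratch_log_pkey');
TRUNCATE scratch_log;
SELECT relname, relfilenode FROM pg_class
 WHERE relname IN ('scratch_log', 'scratch_log_pkey');  -- both relfilenodes change
DROP TABLE scratch_log;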
On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote:
With the final implementation of log switching, the problem of xxid
wraparound will be avoided entirely. Every now and then slony will
switch from one to another log table and when the old one is drained and
logically empty, it is truncated, which should reinitialize all indexes.
If the xxid-s come from different DB-s, then there can still be problems.
--
marko
On 6/30/2006 11:17 AM, Marko Kreen wrote:
On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote:
With the final implementation of log switching, the problem of xxid
wraparound will be avoided entirely. Every now and then slony will
switch from one to another log table and when the old one is drained and
logically empty, it is truncated, which should reinitialize all indexes.
If the xxid-s come from different DB-s, then there can still be problems.
How so? They are always part of a multi-key index having the
originating node ID first.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
On 6/30/2006 11:17 AM, Marko Kreen wrote:
If the xxid-s come from different DB-s, then there can still be problems.
How so? They are always part of a multi-key index having the
originating node ID first.
Really?
create table @NAMESPACE@.sl_log_1 (
log_origin int4,
log_xid @NAMESPACE@.xxid,
log_tableid int4,
log_actionseq int8,
log_cmdtype char,
log_cmddata text
);
create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
(log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
(log_xid @NAMESPACE@.xxid_ops);
sl_log_1_idx2 doesn't seem to have any such protection. When I was
poking at Marc's example, though, it seemed that the numbers going
into the table were all *locally generated* XIDs, in fact the same
as the XID doing the insertions. If this is only true on the master,
and slaves can be inserting XIDs coming from different masters,
then I think it will break.
regards, tom lane
Tom Lane wrote:
Marc Munro <marc@bloodnok.com> writes:
I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.
Some googling turned up recent discussions about race conditions in
Linux NFS code:
http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381
I don't know the kernel nearly well enough to guess if these are related
...
It may or may not be the same issue, but for what it's worth, we've seen
the same sl_log_1 corruption on AIX 5.1 and 5.3
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
On 6/30/2006 11:55 AM, Tom Lane wrote:
Jan Wieck <JanWieck@Yahoo.com> writes:
On 6/30/2006 11:17 AM, Marko Kreen wrote:
If the xxid-s come from different DB-s, then there can still be problems.
How so? They are always part of a multi-key index having the
originating node ID first.
Really?
create table @NAMESPACE@.sl_log_1 (
log_origin int4,
log_xid @NAMESPACE@.xxid,
log_tableid int4,
log_actionseq int8,
log_cmdtype char,
log_cmddata text
);
create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
(log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
(log_xid @NAMESPACE@.xxid_ops);
You're right ... forgot about that one. And yes, there can be
transactions originating from multiple origins (masters) in the same
log. The thing is, the index is only there because in a single origin
situation (most installations are), the log_origin is always the same.
The optimizer therefore sometimes didn't think using an index at all
would be good.
However, transactions from different origins are NEVER selected together
and it wouldn't make sense to compare their xid's anyway. So the index
might return index tuples for rows from another origin, but the
following qualifications against the log_origin in the heap tuple will
filter them out.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Brad Nicholson wrote:
Tom Lane wrote:
Marc Munro <marc@bloodnok.com> writes:
I'll get back to you with kernel build information tomorrow. We'll also
try to talk to some kernel hackers about this.
Some googling turned up recent discussions about race conditions in
Linux NFS code:
http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381
I don't know the kernel nearly well enough to guess if these are related
...
It may or may not be the same issue, but for what it's worth, we've seen
the same sl_log_1 corruption on AIX 5.1 and 5.3
On 7.4.12, I should add.
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Brad Nicholson <bnichols@ca.afilias.info> writes:
It may or may not be the same issue, but for what it's worth, we've seen
the same sl_log_1 corruption on AIX 5.1 and 5.3
Hm, on what filesystem, and what PG version(s)?
I'm not completely satisfied by the its-a-kernel-bug theory, because if
it were then ISTM extending an index would be subject to the same risks
as extending a table; but I see no evidence of index page lossage in
Marc's dump. OTOH the usage patterns are different, so maybe PG isn't
stressing the write-to-lseek path quite as hard for indexes.
regards, tom lane
Tom Lane wrote:
Brad Nicholson <bnichols@ca.afilias.info> writes:
It may or may not be the same issue, but for what it's worth, we've seen
the same sl_log_1 corruption on AIX 5.1 and 5.3
Hm, on what filesystem, and what PG version(s)?
I'm not completely satisfied by the its-a-kernel-bug theory, because if
it were then ISTM extending an index would be subject to the same risks
as extending a table; but I see no evidence of index page lossage in
Marc's dump. OTOH the usage patterns are different, so maybe PG isn't
stressing the write-to-lseek path quite as hard for indexes.
regards, tom lane
jfs2 in all cases. I don't recall the PG version for 5.1, but it was
greater than 7.4.8. For 5.3, it was 7.4.12.
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Jan Wieck <JanWieck@Yahoo.com> writes:
You're right ... forgot about that one.
However, transactions from different origins are NEVER selected together
and it wouldn't make sense to compare their xid's anyway. So the index
might return index tuples for rows from another origin, but the
following qualifications against the log_origin in the heap tuple will
filter them out.
No, that's not the point here. The point here is that if the xids that
are simultaneously present in the index span more than a 2G-XID range,
btree *will fail* because it will be dealing with keys that do not obey
the transitive law. You do have a problem --- it doesn't explain Marc's
troubles, but sl_log_1_idx2 is broken for multi master situations. All
you need is masters with sufficiently different XID counters.
regards, tom lane
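That transitive-law point can be made concrete. Assuming the xxid operators
compare XIDs circularly, the way TransactionIdPrecedes does, three values spread
around the 32-bit XID circle compare like this (substitute the real cluster
schema for @NAMESPACE@):

SELECT '1'::@NAMESPACE@.xxid          < '1610612737'::@NAMESPACE@.xxid AS a_lt_b,
       '1610612737'::@NAMESPACE@.xxid < '3221225473'::@NAMESPACE@.xxid AS b_lt_c,
       '3221225473'::@NAMESPACE@.xxid < '1'::@NAMESPACE@.xxid          AS c_lt_a;
-- All three come out true: a < b and b < c, yet also c < a, which is exactly
-- the property a btree index cannot tolerate.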
On Fri, 2006-06-30 at 12:05, Jan Wieck wrote:
On 6/30/2006 11:55 AM, Tom Lane wrote:
Jan Wieck <JanWieck@Yahoo.com> writes:
On 6/30/2006 11:17 AM, Marko Kreen wrote:
If the xxid-s come from different DB-s, then there can still be problems.
How so? They are always part of a multi-key index having the
originating node ID first.
Really?
create table @NAMESPACE@.sl_log_1 (
log_origin int4,
log_xid @NAMESPACE@.xxid,
log_tableid int4,
log_actionseq int8,
log_cmdtype char,
log_cmddata text
);
create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
(log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
(log_xid @NAMESPACE@.xxid_ops);
You're right ... forgot about that one. And yes, there can be
transactions originating from multiple origins (masters) in the same
log. The thing is, the index is only there because in a single origin
situation (most installations are), the log_origin is always the same.
The optimizer therefore sometimes didn't think using an index at all
would be good.
However, transactions from different origins are NEVER selected together
and it wouldn't make sense to compare their xid's anyway. So the index
might return index tuples for rows from another origin, but the
following qualifications against the log_origin in the heap tuple will
filter them out.
The problem was not only with returning too many rows, but just as much
with returning too few. When too few rows are returned, some actions are
simply left out of replication and thus go missing from the slaves.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
. . .
A couple other things to try, given that you can provoke the failure
fairly easily:
. . .
1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
. . .
2. I'm also eyeing this bit of code in hio.c:
If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
. . .
We tried all of these suggestions and still get the problem. Nothing
interesting in the log file so I guess the Asserts did not fire.
We are going to try experimenting with different kernels now. Unless
anyone has any other suggestions.
__
Marc
Marc Munro <marc@bloodnok.com> writes:
We tried all of these suggestions and still get the problem. Nothing
interesting in the log file so I guess the Asserts did not fire.
Not surprising, it was a long shot that any of those things were really
broken. But worth testing.
We are going to try experimenting with different kernels now. Unless
anyone has any other suggestions.
Right at the moment I have no better ideas :-(
regards, tom lane
For the record, here are the results of our (ongoing) investigation into
the index/heap corruption problems I reported a couple of weeks ago.
We were able to trigger the problem with kernels 2.6.16, 2.6.17 and
2.6.18-rc1, with 2.6.16 seeming to be the most flaky.
By replacing the NFS-mounted netapp with a fibre-channel SAN, we have
eliminated the problem on all kernels.
From this, it would seem to be an NFS bug introduced post 2.6.14, though
we cannot rule out a postgres bug exposed by unusual timing issues.
Our starting systems are:
Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.8-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.5
NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.
All postgres data and logs are stored on the netapp.
All test results were reproduced with postgres 8.0.8
__
Marc