Re: index corruption?

Started by Ed L. · almost 23 years ago · 6 messages

pgsql@bluepolka.net

almost 23 years ago

On Feb 13, 2003, Tom Lane wrote:

Laurette Cisneros <laurette@nextbus.com> writes:

This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree

This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?
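
For illustration, the kind of check described above can be sketched as follows. This is a minimal sketch, not PostgreSQL's actual code: the magic value and field layout are recalled from nbtree.h, while the 24-byte page header and the accepted version numbers are assumptions that differ between releases (the 2003-era layout in particular may not match). The function name is hypothetical.

```python
import struct

BLCKSZ = 8192             # default PostgreSQL block size
BTREE_MAGIC = 0x053162    # btree metapage magic, as recalled from nbtree.h
PAGE_HEADER_SIZE = 24     # ASSUMPTION: page header size; varies by release
KNOWN_VERSIONS = (2, 3, 4)  # ASSUMPTION: acceptable btm_version values


def looks_like_btree_metapage(block0: bytes) -> bool:
    """Return True if the two fixed fields in the first block look sane.

    The metapage data is assumed to start right after the page header
    with two 32-bit integers: the magic number and the version.
    """
    if len(block0) < PAGE_HEADER_SIZE + 8:
        return False
    magic, version = struct.unpack_from("=II", block0, PAGE_HEADER_SIZE)
    return magic == BTREE_MAGIC and version in KNOWN_VERSIONS


def make_fake_metapage(version: int = 4) -> bytearray:
    """Build a zero-filled block with only the two checked fields set."""
    page = bytearray(BLCKSZ)
    struct.pack_into("=II", page, PAGE_HEADER_SIZE, BTREE_MAGIC, version)
    return page


if __name__ == "__main__":
    good = make_fake_metapage()
    print(looks_like_btree_metapage(bytes(good)))

    # Simulate a stray write that clobbers one byte of the magic number.
    bad = make_fake_metapage()
    bad[PAGE_HEADER_SIZE] ^= 0xFF
    print(looks_like_btree_metapage(bytes(bad)))
```

Because both fields are written when the index is created and are not supposed to change afterwards, a mismatch means something overwrote the first block; the check itself says nothing about what did it.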

I am seeing this same problem on two separate machines, one brand new, one
older. Not sure yet what is causing it, but seems pretty unlikely that it
is hardware-related.


Ed L.

pgsql@bluepolka.net

almost 23 years ago

In reply to: Ed L. (#1)

On Monday March 31 2003 3:38, Ed L. wrote:

On Feb 13, 2003, Tom Lane wrote:

Laurette Cisneros <laurette@nextbus.com> writes:

This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree

This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?

I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.

I am dabbling for the first time with a (crashing) C trigger, so that may be
the culprit here.

Tom Lane

tgl@sss.pgh.pa.us

almost 23 years ago

In reply to: Ed L. (#2)

"Ed L." <pgsql@bluepolka.net> writes:

I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.

I am dabbling for the first time with a (crashing) C trigger, so that may be
the culprit here.

Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)
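
As a rough back-of-envelope for the first point, the chance that a single wild store lands in a shared buffer can be estimated as the fraction of the address space the buffers occupy. The numbers below are illustrative assumptions only (64 buffers of 8 kB, a 32-bit address space), and the model of a uniformly random address is crude: real stray pointers are not uniformly distributed, and most addresses are unmapped, so such a store would crash the backend rather than corrupt anything.

```python
def wild_store_hit_probability(shared_buffers: int,
                               block_size: int = 8192,
                               address_space_bytes: int = 2**32) -> float:
    """Fraction of the address space covered by shared disk buffers.

    ASSUMPTION: the stray write goes to a uniformly random address,
    which is a simplification used only to get an order of magnitude.
    """
    return shared_buffers * block_size / address_space_bytes


if __name__ == "__main__":
    # Hypothetical settings, not taken from the thread.
    for n_buffers in (64, 1000, 100_000):
        p = wild_store_hit_probability(n_buffers)
        print(f"{n_buffers:>7} buffers: {p:.6f}")
```

Under this model the odds grow linearly with the number of buffers, which is the reason for keeping shared buffers a small fraction of the address space. The second condition, getting the damaged buffer written out before the crash, lowers the odds further and is not modelled here.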

When you find the problem, please take note of whether there's something
involved that increases the chances of corruption getting to disk. We
might want to try to do something about it ...

regards, tom lane

Ed L.

pgsql@bluepolka.net

almost 23 years ago

In reply to: Tom Lane (#3)

On Monday March 31 2003 3:54, Tom Lane wrote:

"Ed L." <pgsql@bluepolka.net> writes:

I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.

I am dabbling for the first time with a (crashing) C trigger, so that
may be the culprit here.

Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)

When you find the problem, please take note of whether there's something
involved that increases the chances of corruption getting to disk. We
might want to try to do something about it ...

It is definitely due to some rogue trigger code. Not sure what exactly, but
if I remove a certain code segment the problem disappears.

scott.marlowe

scott.marlowe@ihs.com

almost 23 years ago

In reply to: Ed L. (#1)

On Mon, 31 Mar 2003, Ed L. wrote:

On Feb 13, 2003, Tom Lane wrote:

Laurette Cisneros <laurette@nextbus.com> writes:

This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree

This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?

I am seeing this same problem on two separate machines, one brand new, one
older. Not sure yet what is causing it, but seems pretty unlikely that it
is hardware-related.

Until you've tested them, the likelihood is unimportant. If you've tested
the boxes, and the memory tests good and the hard drives test good, then
there is still likely to be another explanation, like a runaway kernel bug
that writes somewhere it shouldn't every fifth eon or two.

If you haven't tested the boxes, their reliability is part of the NULL
set. :-)

Ed L.

pgsql@bluepolka.net

almost 23 years ago

In reply to: Ed L. (#4)

On Monday March 31 2003 4:15, Ed L. wrote:

On Monday March 31 2003 3:54, Tom Lane wrote:

"Ed L." <pgsql@bluepolka.net> writes:

I am seeing this same problem on two separate machines, one brand
new, one older. Not sure yet what is causing it, but seems pretty
unlikely that it is hardware-related.

I am dabbling for the first time with a (crashing) C trigger, so that
may be the culprit here.

Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)

When you find the problem, please take note of whether there's
something involved that increases the chances of corruption getting to
disk. We might want to try to do something about it ...

Well, I fixed it but cannot now remember exactly what change did it amidst a
bunch of rewrites of some existing stuff, and I cannot get back to that
state from here. :( It was definitely arising from some funky C trigger
code of my own making.