Re: index corruption?
On Feb 13, 2003, Tom Lane wrote:
Laurette Cisneros <laurette@nextbus.com> writes:
This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree
This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?
I am seeing this same problem on two separate machines, one brand new, one
older. Not sure yet what is causing it, but seems pretty unlikely that it
is hardware-related.
Ed
On Monday March 31 2003 3:38, Ed L. wrote:
On Feb 13, 2003, Tom Lane wrote:
Laurette Cisneros <laurette@nextbus.com> writes:
This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree
This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?
I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.
I am dabbling for the first time with a (crashing) C trigger, so that may be
the culprit here.
Ed
"Ed L." <pgsql@bluepolka.net> writes:
I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.
I am dabbling for the first time with a (crashing) C trigger, so that may be
the culprit here.
Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)
When you find the problem, please take note of whether there's something
involved that increases the chances of corruption getting to disk. We
might want to try to do something about it ...
regards, tom lane
On Monday March 31 2003 3:54, Tom Lane wrote:
"Ed L." <pgsql@bluepolka.net> writes:
I am seeing this same problem on two separate machines, one brand new,
one older. Not sure yet what is causing it, but seems pretty unlikely
that it is hardware-related.I am dabbling for the first time with a (crashing) C trigger, so that
may be the culprit here.Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)
When you find the problem, please take note of whether there's something
involved that increases the chances of corruption getting to disk. We
might want to try to do something about it ...
It is definitely due to some rogue trigger code. Not sure what exactly, but
if I remove a certain code segment the problem disappears.
Ed
On Mon, 31 Mar 2003, Ed L. wrote:
On Feb 13, 2003, Tom Lane wrote:
Laurette Cisneros <laurette@nextbus.com> writes:
This is the error in the pgsql log:
2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
not a btree
This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values. I am not aware of any PG bugs that could overwrite those
fields. I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?
I am seeing this same problem on two separate machines, one brand new, one
older. Not sure yet what is causing it, but seems pretty unlikely that it
is hardware-related.
Until you've tested them, the likelihood is unimportant. If you've tested
the boxes, and the memory tests good and the hard drives test good, then
there is still likely to be another explanation, like a runaway kernel bug
that writes somewhere it shouldn't every fifth eon or two.
If you haven't tested the boxes, their reliability is part of the NULL
set. :-)
On Monday March 31 2003 4:15, Ed L. wrote:
On Monday March 31 2003 3:54, Tom Lane wrote:
"Ed L." <pgsql@bluepolka.net> writes:
I am seeing this same problem on two separate machines, one brand
new, one older. Not sure yet what is causing it, but seems pretty
unlikely that it is hardware-related.
I am dabbling for the first time with a (crashing) C trigger, so that
may be the culprit here.
Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption. (First, the bogus code has to
overwrite a shared disk buffer. If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high. Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)
When you find the problem, please take note of whether there's
something involved that increases the chances of corruption getting to
disk. We might want to try to do something about it ...
Well, I fixed it but cannot now remember exactly what change did it amidst a
bunch of rewrites of some existing stuff, and I cannot get back to that
state from here. :( It was definitely arising from some funky C trigger
code of my own making.
Ed