TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
I've got a customer who was having a problem with a backend running away
with memory. It would hit 46G before finally being running the box
completely out of memory. It didn't appear to be related to hashjoin/agg
or pending triggers, so I had them recompile with debug and assert
turned on. A bit after restarting we get:
TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
Looking through the code I see that's something to do with indexes, but
I'm not sure what. Is this likely a corrupted index? If so, is there
some way I could identify which index?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
Looking through the code I see that's something to do with indexes, but
I'm not sure what. Is this likely a corrupted index?
Sounds that way.
If so, is there some way I could identify which index?
gdb the core file and do "p rel->rd_rel->relname" in the _bt_search
stack frame (which is not going to be the top of stack but should be
close to the top).
regards, tom lane
On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
Looking through the code I see that's something to do with indexes, but
I'm not sure what. Is this likely a corrupted index?Sounds that way.
Is this something that should be logged better than it is? Is there any
value in trying to save the index/table that's busted?
This is 8.0.3, btw.
If so, is there some way I could identify which index?
gdb the core file and do "p rel->rd_rel->relname" in the _bt_search
stack frame (which is not going to be the top of stack but should be
close to the top).
Yeah, looks like no core files yet (they would be in $PGDATA somewhere,
correct?), so I had them hack their startup script to include ulimit -c
unlimited. Luckily this is pretty reproducable, so hopefully we'll have
a core soon.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote:
Sounds that way.
Is this something that should be logged better than it is?
We don't even know what it is yet, so that question seems a bit premature.
Is there any value in trying to save the index/table that's busted?
It'd probably be worth making a copy for forensic purposes before you
REINDEX it.
regards, tom lane
On Wed, Oct 26, 2005 at 04:03:41PM -0500, Jim C. Nasby wrote:
Is this something that should be logged better than it is? Is there any
value in trying to save the index/table that's busted?
Unfortuatly, it's one of those Assert(expr) lines, they don't generally
give the opportunity for extra debug data. They're usually not even
compiled in...
Yeah, looks like no core files yet (they would be in $PGDATA somewhere,
correct?), so I had them hack their startup script to include ulimit -c
unlimited. Luckily this is pretty reproducable, so hopefully we'll have
a core soon.
Well, you know the query that generates it? EXPLAIN should tell you the
indexes it's using, so you could just REINDEX those...
Or REINDEX them all... :)
Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote:
On Wed, Oct 26, 2005 at 04:03:41PM -0500, Jim C. Nasby wrote:
Is this something that should be logged better than it is? Is there any
value in trying to save the index/table that's busted?Unfortuatly, it's one of those Assert(expr) lines, they don't generally
give the opportunity for extra debug data. They're usually not even
compiled in...Yeah, looks like no core files yet (they would be in $PGDATA somewhere,
correct?), so I had them hack their startup script to include ulimit -c
unlimited. Luckily this is pretty reproducable, so hopefully we'll have
a core soon.Well, you know the query that generates it? EXPLAIN should tell you the
indexes it's using, so you could just REINDEX those...
Well, since query logging only logs when a query finishes...
Or REINDEX them all... :)
242G database. That would take quite a while... :)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote:
Well, you know the query that generates it?
Well, since query logging only logs when a query finishes...
Don't forget to look at debug_query_string when you get the core file.
regards, tom lane
On Wed, Oct 26, 2005 at 05:47:15PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote:
Well, you know the query that generates it?
Well, since query logging only logs when a query finishes...
Don't forget to look at debug_query_string when you get the core file.
Didn't know about that, it will certainly help.
There's nothing in the FAQ's about what to look for in a coredump is
there?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
Looking through the code I see that's something to do with indexes, but
I'm not sure what. Is this likely a corrupted index?Sounds that way.
If so, is there some way I could identify which index?
gdb the core file and do "p rel->rd_rel->relname" in the _bt_search
stack frame (which is not going to be the top of stack but should be
close to the top).
Reproduced the crash, but still no core file... where exactly should it
have been put? Would in be outside of $PGDATA?
Is there any way to verify the ulimit setings that the backend is
running under?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Reproduced the crash, but still no core file... where exactly should it
have been put? Would in be outside of $PGDATA?
In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly
core.NNNN). There are some platforms like Darwin that tend to put
core files in a fixed directory such as /cores, though.
regards, tom lane
On Wed, Oct 26, 2005 at 06:38:45PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Reproduced the crash, but still no core file... where exactly should it
have been put? Would in be outside of $PGDATA?In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly
core.NNNN). There are some platforms like Darwin that tend to put
core files in a fixed directory such as /cores, though.
Grr... /etc/profile had
ulimit -S -c 0 > /dev/null 2>&1
Is there any way to verify what limits are in place for a running
backend?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Oct 26, 2005 at 06:06:19PM -0500, Jim C. Nasby wrote:
On Wed, Oct 26, 2005 at 06:38:45PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Reproduced the crash, but still no core file... where exactly should it
have been put? Would in be outside of $PGDATA?In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly
core.NNNN). There are some platforms like Darwin that tend to put
core files in a fixed directory such as /cores, though.Grr... /etc/profile had
ulimit -S -c 0 > /dev/null 2>&1Is there any way to verify what limits are in place for a running
backend?
Also, is an assert guaranteed to dump core?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Is there any way to verify what limits are in place for a running
backend?
Damifino.
Also, is an assert guaranteed to dump core?
Yup ... at least, it will call abort().
regards, tom lane
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Is there anything in particular you'd like to see from the index file? I
made a copy of it before reindexing...
Could you send me the whole file (off-list)?
regards, tom lane
Import Notes
Reply to msg id not found: 20051027012807.GA16682@pervasive.com
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Oct 26, 2005 at 09:29:23PM -0400, Tom Lane wrote:
Could you send me the whole file (off-list)?
Ok, will send URL as soon as I have it from client.
Well, the answer is that there's nothing wrong with that index except
that four consecutive pages near the end (32K total) have been zeroed
out :-(
In the situation where _bt_search descends into one of these pages,
we'd have the following behavior:
* P_RIGHTMOST() appears true because btpo_next is zero, therefore
_bt_moveright does nothing.
* P_ISLEAF() appears false because the flags word is zero, so we
don't fall out of the loop.
* _bt_binsrch finds high < low and returns low (ie, 1) rather than complaining.
* If Asserts are on then the internal ItemIdIsUsed assert in PageGetItem
triggers, resulting in the behavior mentioned in $subject.
* If not, then we stupidly fetch a zero block number out of the
nonexistent item, and iterate to page zero, ie, the metapage.
The above observations still hold true for the metapage, so
it's an infinite loop --- or would be except that we're building
a stack entry each time around the loop, and so we gradually
exhaust memory. This matches the other behavior Jim saw.
Bottom line is that index searches probably ought to have some
non-Assert defenses against zeroed-out pages. Obviously we can't
expect to catch every flavor of data corruption, but this particular
one has been seen before...
BTW, Jim, any thoughts about how the index got corrupted? Have you
had any crashes on that machine lately?
regards, tom lane
Import Notes
Reply to msg id not found: 20051027020026.GB16682@pervasive.com
On Thu, Oct 27, 2005 at 11:53:01PM -0400, Tom Lane wrote:
BTW, Jim, any thoughts about how the index got corrupted? Have you
had any crashes on that machine lately?
Write-through cache on drive array that's not battery backed. Plus, the
backend has been crashing on a sig 11 about once a week for
who-knows-how-long. You do the math...
BTW, now that the backend is compiled with --enable-debug I hope to find
out the reason for the random crashes.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, 27 Oct 2005, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Oct 26, 2005 at 09:29:23PM -0400, Tom Lane wrote:
Could you send me the whole file (off-list)?
Ok, will send URL as soon as I have it from client.
Well, the answer is that there's nothing wrong with that index except
that four consecutive pages near the end (32K total) have been zeroed
out :-(
[snip]
Bottom line is that index searches probably ought to have some
non-Assert defenses against zeroed-out pages. Obviously we can't
expect to catch every flavor of data corruption, but this particular
one has been seen before...
Definately. I've seen faulty hardware somehow zero blocks where I would
have expected random data. I wonder if we can test with PageIsNew(), which
is very inexpensive. The question is: what do we do when we detect this?
BTW, Jim, any thoughts about how the index got corrupted? Have you
had any crashes on that machine lately?
Have spoken with Jim on IRC, he says that there have been several crashes
recently due to a faulty disk array. I guess the zeroing could be an
outcome of the faulty disk. I wonder if the crash the faulty disk resulted
in could have been caused some where around mdextend() where we create a
zero'd page but before we could have written out the initialised page.
If this happened 4 times in a row it could account for the problem. It
does seem a bit unlikely thought.
That being said, is there any reason where don't extend the file with a
PageInit()'d block instead of a zero'd file?
Thanks,
Gavin
Gavin Sherry <swm@linuxworld.com.au> writes:
Definately. I've seen faulty hardware somehow zero blocks where I would
have expected random data. I wonder if we can test with PageIsNew(), which
is very inexpensive. The question is: what do we do when we detect this?
I think erroring out with a message along the line of "corrupted data in
index foo" is plenty. The recovery action is easy: reindex. So all we
have to do is not crash and deliver a useful error message.
We know that the page has already gotten by PageHeaderIsValid, so either
it's in reasonable condition or it's all zeroes. So a quick check for
some nonzero header fields is enough ... PageIsNew is as good as
anything.
The next question is what's the minimal number of places we have to add
it to to cover all paths in the index modules?
That being said, is there any reason where don't extend the file with a
PageInit()'d block instead of a zero'd file?
Yeah: this scenario is exactly it. PageInit doesn't make for a valid
index page.
regards, tom lane
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Thu, Oct 27, 2005 at 11:53:01PM -0400, Tom Lane wrote:
BTW, Jim, any thoughts about how the index got corrupted? Have you
had any crashes on that machine lately?
Write-through cache on drive array that's not battery backed. Plus, the
backend has been crashing on a sig 11 about once a week for
who-knows-how-long. You do the math...
My guess is that this could not have been caused by a backend crash.
I did some analysis and found that one of the four pages had been a
level-1 internal page, not a leaf page (there is a sibling link from
a level-1 page and a downlink from the level-2 root, QED). It's
very hard to see how to explain the condition of the index as a
result of a single backend crash, and even if you posit one crash
for each page, it's tough to believe that there wouldn't be additional
corruption elsewhere (eg, from incomplete page split operations). But
AFAICT there's nothing at all wrong anywhere else. And if there were
multiple crashes, why are the affected pages contiguous?
I think the data was dropped at the disk or filesystem level. If you
have had power failures then that's certainly not hard to believe.
BTW, now that the backend is compiled with --enable-debug I hope to find
out the reason for the random crashes.
Yeah, that is an interesting question.
regards, tom lane
On Fri, Oct 28, 2005 at 02:26:31PM +1000, Gavin Sherry wrote:
Have spoken with Jim on IRC, he says that there have been several crashes
recently due to a faulty disk array. I guess the zeroing could be an
outcome of the faulty disk. I wonder if the crash the faulty disk resulted
in could have been caused some where around mdextend() where we create a
zero'd page but before we could have written out the initialised page.
Just to clarify, there's no evidence that the array is faulty. I do know
that they were using write-back with a non-battery-backed cache though.
What has been happening is periodic random crashes, around 1 a week. I
now have a good core for one, as well as an assert:
TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File:
"slru.c", Line: 308)
I haven't looked at that code yet, so I have no idea what that actually
means. Let me know what info y'all would like to see out of the core.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
What has been happening is periodic random crashes, around 1 a week. I
now have a good core for one, as well as an assert:
TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File:
"slru.c", Line: 308)
I haven't looked at that code yet, so I have no idea what that actually
means. Let me know what info y'all would like to see out of the core.
The whole contents of *shared and the local variables of
SimpleLruReadPage would be good for starters.
Also, what PG version is this exactly, again?
regards, tom lane
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
"Jim C. Nasby" <jnasby@pervasive.com> writes:What has been happening is periodic random crashes, around
1 a week. I
now have a good core for one, as well as an assert:
TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File:
"slru.c", Line: 308)I haven't looked at that code yet, so I have no idea what
that actually
means. Let me know what info y'all would like to see out of
the core.
The whole contents of *shared and the local variables of
SimpleLruReadPage would be good for starters.
I know how to get to the SimpleLruReadPage frame, but what commands do I need to use after that?
Also, what PG version is this exactly, again?
8.0.3.
Import Notes
Resolved by subject fallback
"Jim Nasby" <jnasby@pervasive.com> writes:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
The whole contents of *shared and the local variables of
SimpleLruReadPage would be good for starters.
I know how to get to the SimpleLruReadPage frame, but what commands do I need to use after that?
p *shared
info locals
regards, tom lane
Here's the full info from 2 different cores:
[root@pg8 coredumps]# cat slru.gdb
f 3
p *shared
p pageno
p slotno
p ok
p xid
quit
[root@pg8 coredumps]# gdb -x slru.gdb /usr/bin/postmaster core.25146 |tail -n 13
warning: svr4_current_sos: Can't read pathname for load map: Input/output error
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=162932, xid=0) at slru.c:307
307 Assert(shared->page_number[slotno] == pageno &&
$1 = {ControlLock = SubtransControlLock, page_buffer = {0x2a98298380 "", 0x2a9829a380 "",
0x2a9829c380 "", 0x2a9829e380 "", 0x2a982a0380 "", 0x2a982a2380 "", 0x2a982a4380 "",
0x2a982a6380 ""}, page_status = {SLRU_PAGE_CLEAN, SLRU_PAGE_READ_IN_PROGRESS,
SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN, SLRU_PAGE_DIRTY, SLRU_PAGE_READ_IN_PROGRESS,
SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_CLEAN}, page_number = {162878, 162877, 163050,
162883, 163270, 162761, 162980, 162797}, page_lru_count = {8, 2, 5, 1, 139, 4, 0, 3},
buffer_locks = {24, 25, 26, 27, 28, 29, 30, 31}, latest_page_number = 163270}
$2 = 162932
$3 = 1
$4 = 1 '\001'
$5 = 0
[root@pg8 coredumps]# gdb -x slru.gdb /usr/bin/postmaster core.32555 |tail -n 13
warning: svr4_current_sos: Can't read pathname for load map: Input/output error
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=164152, xid=0) at slru.c:307
307 Assert(shared->page_number[slotno] == pageno &&
$1 = {ControlLock = SubtransControlLock, page_buffer = {0x2a98298380 "", 0x2a9829a380 "",
0x2a9829c380 "", 0x2a9829e380 "", 0x2a982a0380 "", 0x2a982a2380 "", 0x2a982a4380 "",
0x2a982a6380 ""}, page_status = {SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_CLEAN,
SLRU_PAGE_CLEAN, SLRU_PAGE_DIRTY, SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN,
SLRU_PAGE_CLEAN}, page_number = {164145, 164146, 164147, 164153, 164148, 164150, 164151,
164149}, page_lru_count = {0, 1, 2, 106, 5, 7, 8, 6}, buffer_locks = {24, 25, 26, 27, 28,
29, 30, 31}, latest_page_number = 164153}
$2 = 164152
$3 = 0
$4 = 1 '\001'
$5 = 0
[root@pg8 coredumps]#
Also, here's the trace from a 3rd core:
[root@pg8 coredumps]# gdb /usr/bin/postgres core.13897
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
warning: core file may not match specified executable file.
Core was generated by `gdb -q -fullname /usr/bin/postmaster core.25146'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000003b894688e3 in ?? ()
(gdb) bt
#0 0x0000003b894688e3 in ?? ()
#1 0x00000000004f4f20 in ExecReScanHashJoin ()
#2 0x00000000004b593c in DoCopy (stmt=Variable "stmt" is not available.
) at copy.c:767
#3 0x0000000000447190 in _hash_log2 () at hashutil.c:107
#4 0x0000000000000000 in ?? ()
(gdb)
-rw------- 1 root root 29179904 Oct 28 10:08 core.13897
-rw------- 1 root root 1166159872 Oct 28 07:13 core.25146
-rw------- 1 root root 1167413248 Oct 28 09:05 core.32555
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
BTW, what's the stack trace in those two core files?
regards, tom lane
On Fri, Oct 28, 2005 at 03:04:02PM -0400, Tom Lane wrote:
BTW, what's the stack trace in those two core files?
From 25146:
#0 0x0000003b8942e37d in raise () from /lib64/tls/libc.so.6
#1 0x0000003b8942faae in abort () from /lib64/tls/libc.so.6
#2 0x00000000005d36f8 in ExceptionalCondition (
conditionName=0x623a <Address 0x623a out of bounds>,
errorType=0x623a <Address 0x623a out of bounds>,
fileName=0x623a <Address 0x623a out of bounds>, lineNumber=-1) at assert.c:51
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=162932, xid=0) at slru.c:307
#4 0x0000000000473863 in SlruSelectLRUPage (ctl=0x7d9f40, pageno=163131) at slru.c:753
#5 0x0000000000473439 in SimpleLruReadPage (ctl=0x7d9f40, pageno=163131, xid=334094300)
at slru.c:254
#6 0x0000000000473eeb in SubTransGetParent (xid=334094300) at subtrans.c:116
#7 0x0000000000473f61 in SubTransGetTopmostTransaction (xid=Variable "xid" is not available.
) at subtrans.c:153
#8 0x00000000005efa38 in HeapTupleSatisfiesSnapshot (tuple=0x2ac2fb04b0, snapshot=0x88ab98,
buffer=86685) at tqual.c:967
#9 0x0000000000447d7a in heapgettup (relation=0x2add22b960, dir=1, tuple=0x8c4a90,
buffer=0x8c4ab0, snapshot=0x88ab98, nkeys=0, key=0x0, pages=597) at heapam.c:305
#10 0x0000000000448b53 in heap_getnext (scan=0x8c4a68, direction=Variable "direction" is not available.
) at heapam.c:832
#11 0x00000000004f7f86 in SeqNext (node=Variable "node" is not available.
) at nodeSeqscan.c:102
#12 0x00000000004eec2e in ExecScan (node=0x8c3b68, accessMtd=0x4f7f20 <SeqNext>)
at execScan.c:98
#13 0x00000000004e9c9d in ExecProcNode (node=0x8c3b68) at execProcnode.c:303
#14 0x00000000004f2a75 in ExecAgg (node=0x8c3610) at nodeAgg.c:783
#15 0x00000000004e9bea in ExecProcNode (node=0x8c3610) at execProcnode.c:353
#16 0x00000000004e8ccd in ExecutorRun (queryDesc=Variable "queryDesc" is not available.
) at execMain.c:1060
#17 0x000000000056968e in PortalRunSelect (portal=0x8acb38, forward=Variable "forward" is not available.
) at pquery.c:746
#18 0x0000000000569caf in PortalRun (portal=0x8acb38, count=9223372036854775807,
dest=0x8bbae0, altdest=0x8bbae0, completionTag=0x7fbfffdfd0 "") at pquery.c:561
#19 0x0000000000565f12 in exec_simple_query (
query_string=0x89e0b8 "SELECT count(*) as cnt FROM queue where machineindex = '32'")
at postgres.c:933
#20 0x0000000000567b33 in PostgresMain (argc=4, argv=0x846368, username=0x846328 "iacm")
at postgres.c:3007
#21 0x000000000053ac70 in ServerLoop () at postmaster.c:2836
#22 0x000000000053c374 in PostmasterMain (argc=5, argv=0x843500) at postmaster.c:918
#23 0x0000000000507fef in main (argc=5, argv=0x843500) at main.c:268
And 32555:
#0 0x0000003b8942e37d in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003b8942e37d in raise () from /lib64/tls/libc.so.6
#1 0x0000003b8942faae in abort () from /lib64/tls/libc.so.6
#2 0x00000000005d36f8 in ExceptionalCondition (
conditionName=0x7f2b <Address 0x7f2b out of bounds>,
errorType=0x7f2b <Address 0x7f2b out of bounds>,
fileName=0x7f2b <Address 0x7f2b out of bounds>, lineNumber=-1) at assert.c:51
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=164152, xid=0) at slru.c:307
#4 0x0000000000473863 in SlruSelectLRUPage (ctl=0x7d9f40, pageno=164037) at slru.c:753
#5 0x0000000000473439 in SimpleLruReadPage (ctl=0x7d9f40, pageno=164037, xid=335949336)
at slru.c:254
#6 0x0000000000473eeb in SubTransGetParent (xid=335949336) at subtrans.c:116
#7 0x0000000000473f61 in SubTransGetTopmostTransaction (xid=Variable "xid" is not available.
) at subtrans.c:153
#8 0x00000000005ef963 in HeapTupleSatisfiesSnapshot (tuple=0x2abc81e0b8, snapshot=0x8788d8,
buffer=73427) at tqual.c:905
#9 0x0000000000448dc6 in heap_release_fetch (relation=0x2add227130, snapshot=0x8788d8,
tuple=0x8d2460, userbuf=0x8d2480, keep_buf=1 '\001', pgstat_info=0x8d24b8) at heapam.c:979
#10 0x0000000000450c8f in index_getnext (scan=0x8d2418, direction=ForwardScanDirection)
at indexam.c:528
#11 0x00000000004f5012 in IndexNext (node=0x8d18c0) at nodeIndexscan.c:316
#12 0x00000000004eec2e in ExecScan (node=0x8d18c0, accessMtd=0x4f4f20 <IndexNext>)
at execScan.c:98
#13 0x00000000004e9c8d in ExecProcNode (node=0x8d18c0) at execProcnode.c:307
#14 0x00000000004e8ccd in ExecutorRun (queryDesc=Variable "queryDesc" is not available.
) at execMain.c:1060
#15 0x000000000056968e in PortalRunSelect (portal=0x8add58, forward=Variable "forward" is not available.
) at pquery.c:746
#16 0x0000000000569caf in PortalRun (portal=0x8add58, count=9223372036854775807,
dest=0x8e5c48, altdest=0x8e5c48, completionTag=0x7fbfffdfd0 "") at pquery.c:561
#17 0x0000000000565f12 in exec_simple_query (
query_string=0x89f2d8 "select index from daily_reports where accountindex = '3034' and date = '1130040000'") at postgres.c:933
#18 0x0000000000567b33 in PostgresMain (argc=4, argv=0x846368, username=0x846328 "iacm")
at postgres.c:3007
#19 0x000000000053ac70 in ServerLoop () at postmaster.c:2836
#20 0x000000000053c374 in PostmasterMain (argc=5, argv=0x843500) at postmaster.c:918
#21 0x0000000000507fef in main (argc=5, argv=0x843500) at main.c:268
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Here's another core... (pid 805 for reference)
#0 0x0000003b8942e37d in raise () from /lib64/tls/libc.so.6
#0 0x0000003b8942e37d in raise () from /lib64/tls/libc.so.6
#1 0x0000003b8942faae in abort () from /lib64/tls/libc.so.6
#2 0x00000000005d36f8 in ExceptionalCondition (
conditionName=0x325 <Address 0x325 out of bounds>,
errorType=0x325 <Address 0x325 out of bounds>,
fileName=0x325 <Address 0x325 out of bounds>, lineNumber=-1) at assert.c:51
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=169039, xid=0) at slru.c:307
#4 0x0000000000473863 in SlruSelectLRUPage (ctl=0x7d9f40, pageno=169162) at slru.c:753
#5 0x0000000000473439 in SimpleLruReadPage (ctl=0x7d9f40, pageno=169162, xid=346445732)
at slru.c:254
#6 0x0000000000473eeb in SubTransGetParent (xid=346445732) at subtrans.c:116
#7 0x0000000000473f61 in SubTransGetTopmostTransaction (xid=Variable "xid" is not available.
) at subtrans.c:153
#8 0x00000000005efa38 in HeapTupleSatisfiesSnapshot (tuple=0x2ac4b87dd0, snapshot=0x877908,
buffer=90248) at tqual.c:967
#9 0x0000000000447d7a in heapgettup (relation=0x2add20fd98, dir=1, tuple=0x8e7210,
buffer=0x8e7230, snapshot=0x877908, nkeys=0, key=0x0, pages=435) at heapam.c:305
#10 0x0000000000448b53 in heap_getnext (scan=0x8e71e8, direction=Variable "direction" is not available.
) at heapam.c:832
#11 0x00000000004f7f86 in SeqNext (node=Variable "node" is not available.
) at nodeSeqscan.c:102
#12 0x00000000004eec2e in ExecScan (node=0x8b7c38, accessMtd=0x4f7f20 <SeqNext>)
at execScan.c:98
#13 0x00000000004e9c9d in ExecProcNode (node=0x8b7c38) at execProcnode.c:303
#14 0x00000000004f7431 in ExecNestLoop (node=0x8b64b0) at nodeNestloop.c:135
#15 0x00000000004e9c4d in ExecProcNode (node=0x8b64b0) at execProcnode.c:326
#16 0x00000000004f89f9 in ExecSort (node=0x8b6398) at nodeSort.c:102
#17 0x00000000004e9c0a in ExecProcNode (node=0x8b6398) at execProcnode.c:345
#18 0x00000000004f9048 in ExecLimit (node=0x8b6150) at nodeLimit.c:87
#19 0x00000000004e9bb4 in ExecProcNode (node=0x8b6150) at execProcnode.c:369
#20 0x00000000004e8ccd in ExecutorRun (queryDesc=Variable "queryDesc" is not available.
) at execMain.c:1060
#21 0x000000000056968e in PortalRunSelect (portal=0x8ad5a8, forward=Variable "forward" is not available.
) at pquery.c:746
#22 0x0000000000569caf in PortalRun (portal=0x8ad5a8, count=9223372036854775807,
dest=0x8e1870, altdest=0x8e1870, completionTag=0x7fbfffdfd0 "") at pquery.c:561
#23 0x0000000000565f12 in exec_simple_query (
query_string=0x89f168 ' ' <repeats 71 times>, "SELECT a.index,a.jobtype,a.machineindex,a.pid,a.data,a.status,a.starttime,a.ranby,a.clientindex,a.parentindex,a.output_data,a.per"...)
at postgres.c:933
#24 0x0000000000567b33 in PostgresMain (argc=4, argv=0x846368, username=0x846328 "iacm")
at postgres.c:3007
#25 0x000000000053ac70 in ServerLoop () at postmaster.c:2836
#26 0x000000000053c374 in PostmasterMain (argc=5, argv=0x843500) at postmaster.c:918
#27 0x0000000000507fef in main (argc=5, argv=0x843500) at main.c:268
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=169039, xid=0) at slru.c:307
307 Assert(shared->page_number[slotno] == pageno &&
$1 = {ControlLock = SubtransControlLock, page_buffer = {0x2a98298380 "", 0x2a9829a380 "",
0x2a9829c380 "", 0x2a9829e380 "", 0x2a982a0380 "", 0x2a982a2380 "", 0x2a982a4380 "",
0x2a982a6380 ""}, page_status = {SLRU_PAGE_DIRTY, SLRU_PAGE_CLEAN,
SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN, SLRU_PAGE_READ_IN_PROGRESS,
SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN}, page_number = {169452, 169351, 169163, 169238, 169236,
169328, 169233, 169239}, page_lru_count = {17108, 4, 1, 3, 5, 0, 6, 2}, buffer_locks = {
24, 25, 26, 27, 28, 29, 30, 31}, latest_page_number = 169452}
$2 = 169039
$3 = 2
$4 = 1 '\001'
$5 = 0
-
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-946
Jim C. Nasby wrote:
Here's another core... (pid 805 for reference)
All of them have in common that the slotno being passed ($3 below) is in
SLRU_PAGE_READ_IN_PROGRESS state ... could it be a problem with lock
reordering? Maybe somebody is trying to read in a page, and somebody
else steals the buffer from under them. Not sure how likely is that.
BTW what's the relationship with the other assertion failure (the one in
the subject)?
#3 0x000000000047365f in SimpleLruReadPage (ctl=0x7d9f40, pageno=169039, xid=0) at slru.c:307
307 Assert(shared->page_number[slotno] == pageno &&$1 = {ControlLock = SubtransControlLock, page_buffer = {0x2a98298380 "", 0x2a9829a380 "",
0x2a9829c380 "", 0x2a9829e380 "", 0x2a982a0380 "", 0x2a982a2380 "", 0x2a982a4380 "",
0x2a982a6380 ""}, page_status = {SLRU_PAGE_DIRTY, SLRU_PAGE_CLEAN,
SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN, SLRU_PAGE_READ_IN_PROGRESS,
SLRU_PAGE_CLEAN, SLRU_PAGE_CLEAN}, page_number = {169452, 169351, 169163, 169238, 169236,
169328, 169233, 169239}, page_lru_count = {17108, 4, 1, 3, 5, 0, 6, 2}, buffer_locks = {
24, 25, 26, 27, 28, 29, 30, 31}, latest_page_number = 169452}
$2 = 169039
$3 = 2
$4 = 1 '\001'
$5 = 0
--
Alvaro Herrera Valdivia, Chile ICBM: S 39� 49' 17.7", W 73� 14' 26.8"
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
All of them have in common that the slotno being passed ($3 below) is in
SLRU_PAGE_READ_IN_PROGRESS state ... could it be a problem with lock
reordering? Maybe somebody is trying to read in a page, and somebody
else steals the buffer from under them. Not sure how likely is that.
It's even more interesting than that: in all three cases,
SlruSelectLRUPage has selected a "least recently used" page that is
still in READ_IN_PROGRESS state (ie, we haven't finished faulting it in)
and is recursively calling SimpleLruReadPage to wait for that condition
to terminate.
Apparently, Jim's setup could desperately do with a larger SLRU arena
for pg_subtrans, because this is supposed to be a never-happen path ---
if you can't finish loading a page before you need its slot for
something else, you are thrashing with a capital T.
I suppose there's a bug in this path, but I'm darned if I can see what
it is. There are a number of obvious inefficiencies, but those
shouldn't be important given that this isn't supposed to happen much.
But how's it getting to the Assert failure?
regards, tom lane
On Fri, Oct 28, 2005 at 04:58:56PM -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
All of them have in common that the slotno being passed ($3 below) is in
SLRU_PAGE_READ_IN_PROGRESS state ... could it be a problem with lock
reordering? Maybe somebody is trying to read in a page, and somebody
else steals the buffer from under them. Not sure how likely is that.It's even more interesting than that: in all three cases,
SlruSelectLRUPage has selected a "least recently used" page that is
still in READ_IN_PROGRESS state (ie, we haven't finished faulting it in)
and is recursively calling SimpleLruReadPage to wait for that condition
to terminate.Apparently, Jim's setup could desperately do with a larger SLRU arena
for pg_subtrans, because this is supposed to be a never-happen path ---
if you can't finish loading a page before you need its slot for
something else, you are thrashing with a capital T.I suppose there's a bug in this path, but I'm darned if I can see what
it is. There are a number of obvious inefficiencies, but those
shouldn't be important given that this isn't supposed to happen much.
But how's it getting to the Assert failure?
If it helps, this is a ~250G database that's (now) on an 8-way (opteron
I think) machine with 32G. shared_buffers is set to 1G. My client also
has a 4-way machine with 16G, although it seemed to be having some
issues with producing cores that were useful.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
I wrote:
I suppose there's a bug in this path, but I'm darned if I can see what
it is. There are a number of obvious inefficiencies, but those
shouldn't be important given that this isn't supposed to happen much.
But how's it getting to the Assert failure?
While I'm disinclined to change anything until we can explain why it's
crashing, I suspect that the solution may be to avoid the recursive call
of SimpleLruReadPage, as in the attached patch. Jim, are you interested
in seeing if this patch makes the problem go away for you?
regards, tom lane
On Fri, Oct 28, 2005 at 05:45:51PM -0400, Tom Lane wrote:
I wrote:
I suppose there's a bug in this path, but I'm darned if I can see what
it is. There are a number of obvious inefficiencies, but those
shouldn't be important given that this isn't supposed to happen much.
But how's it getting to the Assert failure?While I'm disinclined to change anything until we can explain why it's
crashing, I suspect that the solution may be to avoid the recursive call
of SimpleLruReadPage, as in the attached patch. Jim, are you interested
in seeing if this patch makes the problem go away for you?
Well, this is a production system... what's the risk with that patch?
BTW, is it typical to see a 10 difference between asserts on and off? My
client has a process that was doing 10-20 records/sec with asserts on
and 90-110 with asserts off.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Fri, Oct 28, 2005 at 05:45:51PM -0400, Tom Lane wrote:
Jim, are you interested
in seeing if this patch makes the problem go away for you?
Well, this is a production system... what's the risk with that patch?
Well, it's utterly untested, which means it might crash your system,
which is where you are now, no?
BTW, is it typical to see a 10 difference between asserts on and off? My
client has a process that was doing 10-20 records/sec with asserts on
and 90-110 with asserts off.
Not typical, but I can believe there are some code paths like that.
regards, tom lane
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
"Jim C. Nasby" <jnasby@pervasive.com> writes:On Fri, Oct 28, 2005 at 05:45:51PM -0400, Tom Lane wrote:
Jim, are you interested
in seeing if this patch makes the problem go away for you?Well, this is a production system... what's the risk with
that patch?
Well, it's utterly untested, which means it might crash your system,
which is where you are now, no?
Yes, but the crashes are somewhat sporadic and most importantly they don't appear to involve any data loss/corruption. I just don't want to make matters any worse.
In any case, my client's gone home for the weekend, so I doubt anything would happen until Monday.
BTW, is it typical to see a 10 difference between asserts
on and off? My
client has a process that was doing 10-20 records/sec with
asserts on
and 90-110 with asserts off.
Not typical, but I can believe there are some code paths like that.
Yeah, they're doing some not-so-good things like row-by-row operations, so that's probably what the issue is. I seem to recall 20% being the penalty that's normally thrown around, so I was just surprised by such a huge difference.
Import Notes
Resolved by subject fallback
OK, I think I see it. The problem is that the code in slru.c is careful
about not modifying state when it doesn't hold the proper lock, but not
so careful about not *inspecting* state without the proper lock. In
particular consider these lines in SimpleLruReadPage (line numbers are
as in CVS tip):
/* Mark the slot read-busy (no-op if it already was) */
277 shared->page_number[slotno] = pageno;
278 shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
...
/* Release shared lock, grab per-buffer lock instead */
287 LWLockRelease(shared->ControlLock);
288 LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);
/*
* Check to see if someone else already did the read, or took
* the buffer away from us. If so, restart from the top.
*/
294 if (shared->page_number[slotno] != pageno ||
295 shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
...
Suppose that we arrive at line 277 when the page is currently being
faulted in by another process (ie, its state is already
read-in-progress). The assignments at lines 277 & 278 are then no-ops,
and we'll block at line 288 waiting for the other guy to finish the I/O
and release the per-buffer lock.
Suppose that after the I/O finishes and before we get to run again, this
buffer sinks back to the bottom of the LRU list and gets chosen to be
replaced with another page. Some other process will then start
executing this code and will change the page number (line 277) and
change the state from clean to read-in-progress (line 278).
The race condition is that this could happen between the tests at lines
294 and 295 in our original process. In that case the original process
would think that the page still needed to be read in, and would set
about doing so. It would discover its error at the Assert at line
308, exactly where Jim reports a problem.
The association we thought we'd noted to recursion through
SlruSelectLRUPage is in part a red herring: the race can occur without
that. However, it's perhaps a bit more probable to occur in that path,
because when SlruSelectLRUPage recurses back to SimpleLruReadPage, we
*know* that there is another process currently doing read-in, and so
the first coincidence is already satisfied.
Still, the race condition window is extremely narrow, only a couple of
instructions. I looked at the assembly output for the compiler
Jim is using, and it looks like this:
cmpl %r13d, 104(%rbp,%r12,4)
je .L155
.L116:
... code for if() body here
...
.L155:
cmpl $1, 72(%rbp,%r12,4)
jne .L116
... code for subsequent lines here
However, if the processor predicts the forward branch not to be taken,
it could waste a few cycles recovering from its mistake, so the window
maybe is a little wider than it appears.
I'd like Jim to test this theory by seeing if it helps to reverse the
order of the if-test elements at lines 294/295, ie make it look like
if (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||
shared->page_number[slotno] != pageno)
This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.
I'm still thinking about how to make a real fix without introducing
another lock/unlock cycle --- we can do that if we have to, but I think
maybe it's possible to fix it without.
SimpleLruWritePage has an identical problem BTW :-(
regards, tom lane
I wrote:
OK, I think I see it. The problem is that the code in slru.c is careful
about not modifying state when it doesn't hold the proper lock, but not
so careful about not *inspecting* state without the proper lock.
...
I'm still thinking about how to make a real fix without introducing
another lock/unlock cycle --- we can do that if we have to, but I think
maybe it's possible to fix it without.
Attached is a proposed patch to fix up slru.c so that it's not playing
any games by either reading or writing shared state without holding
the ControlLock for the SLRU set.
The main problem with the existing code, as I now see it, was a poor
choice of representation of page state: we had states CLEAN, DIRTY, and
WRITE_IN_PROGRESS, which meant that re-dirtying a page while a write was
in progress required setting the state back to DIRTY, which hid the fact
that indeed a write was in progress. So the I/O code was designed to
cope with not knowing whether another write was already in progress. We
can make it a whole lot cleaner by changing the state representation so
that we can tell the difference --- then, we can know before releasing
the ControlLock whether we need to write the page or just wait for
someone else to finish writing it. And that means we can do all the
state-manipulation before releasing the lock.
I could have just added an additional state WRITE_IN_PROGRESS_REDIRTIED,
or some such, but it seemed notationally cleaner to create a separate
"page_dirty" boolean, and reduce the number of states to four (empty,
read-in-progress, valid, write-in-progress). This lets outside code
such as clog.c just set "page_dirty = true" rather than doing a complex
bit of logic to change the state value properly.
The patch is a bit bulky because of the representation change, but the
key changes are localized in SimpleLruReadPage and SimpleLruWritePage.
I think this code is a whole lot cleaner than it was before, but it's
a bit of a large change to be making so late in the 8.1 cycle (not to
mention that we really ought to back-patch similar changes all the way
back, because the bug exists as far back as 7.2). I am going to take
another look to see if there is a less invasive change that will fix
the problem at some performance cost; if so, we might want to do it that
way in 8.1 and back branches.
Any comments on the patch, or thoughts on how to proceed?
regards, tom lane
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote:
I'd like Jim to test this theory by seeing if it helps to reverse the
order of the if-test elements at lines 294/295, ie make it look likeif (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||
shared->page_number[slotno] != pageno)This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.
Given your proposed fix on -patches, do you still need me to test this?
Also, is there any heap corruption risk associated with this patch?
I'm also wondering what the effect of this is when assertions are turned
off. My client had to go back to running with assertions turned off
because of the performance impact. Are they now risking data corruption?
Is there a way to turn on the assertion just in this code segment?
This incident has made me wonder if it's worth creating two classes of
assertions. The (hopefully more common) set of assertions would be for
things that shouldn't happen, but if go un-caught won't result in heap
corruption. A new set (well, existing asserts, but just re-classified)
would be for things that if uncaught could result in heap corruption. My
hope is that the set of critical assertions could be turned on by
default, helping to identify race conditions and other bugs that
conventional testing is unlikely to find.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Sorry, two more things...
Will increasing shared_buffers make this less likely to occur? Or is
this just something that's likely to happen when there are things like
seqscans that are putting buffers near the front of the LRU? (The 8.0.3
buffer manager does something like that, right?)
Is this something that a test case can be created for? I know someone
submitted a framework for doing concurrent testing...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Good analysis. I guess the question is what patch would we put into a
subrelease? If you go for a new state code, rather than a separate
boolean, does it reduce the size of the patch?
---------------------------------------------------------------------------
Tom Lane wrote:
I wrote:
OK, I think I see it. The problem is that the code in slru.c is careful
about not modifying state when it doesn't hold the proper lock, but not
so careful about not *inspecting* state without the proper lock.
...
I'm still thinking about how to make a real fix without introducing
another lock/unlock cycle --- we can do that if we have to, but I think
maybe it's possible to fix it without.Attached is a proposed patch to fix up slru.c so that it's not playing
any games by either reading or writing shared state without holding
the ControlLock for the SLRU set.The main problem with the existing code, as I now see it, was a poor
choice of representation of page state: we had states CLEAN, DIRTY, and
WRITE_IN_PROGRESS, which meant that re-dirtying a page while a write was
in progress required setting the state back to DIRTY, which hid the fact
that indeed a write was in progress. So the I/O code was designed to
cope with not knowing whether another write was already in progress. We
can make it a whole lot cleaner by changing the state representation so
that we can tell the difference --- then, we can know before releasing
the ControlLock whether we need to write the page or just wait for
someone else to finish writing it. And that means we can do all the
state-manipulation before releasing the lock.I could have just added an additional state WRITE_IN_PROGRESS_REDIRTIED,
or some such, but it seemed notationally cleaner to create a separate
"page_dirty" boolean, and reduce the number of states to four (empty,
read-in-progress, valid, write-in-progress). This lets outside code
such as clog.c just set "page_dirty = true" rather than doing a complex
bit of logic to change the state value properly.The patch is a bit bulky because of the representation change, but the
key changes are localized in SimpleLruReadPage and SimpleLruWritePage.I think this code is a whole lot cleaner than it was before, but it's
a bit of a large change to be making so late in the 8.1 cycle (not to
mention that we really ought to back-patch similar changes all the way
back, because the bug exists as far back as 7.2). I am going to take
another look to see if there is a less invasive change that will fix
the problem at some performance cost; if so, we might want to do it that
way in 8.1 and back branches.Any comments on the patch, or thoughts on how to proceed?
regards, tom lane
Content-Description: slru-race-1.patch
[ Type application/octet-stream treated as attachment, skipping... ]
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
If you go for a new state code, rather than a separate
boolean, does it reduce the size of the patch?
No, and it certainly wouldn't improve my level of confidence in it ...
regards, tom lane
Jim C. Nasby wrote:
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote:
I'd like Jim to test this theory by seeing if it helps to reverse the
order of the if-test elements at lines 294/295, ie make it look likeif (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||
shared->page_number[slotno] != pageno)This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.Given your proposed fix on -patches, do you still need me to test this?
Also, is there any heap corruption risk associated with this patch?
Because it is a test, I am not sure there is any way to know what the
possible impact of a bug is. If we knew there were bug in the patch,
it would have been fixed already.
I'm also wondering what the effect of this is when assertions are turned
off. My client had to go back to running with assertions turned off
because of the performance impact. Are they now risking data corruption?
Is there a way to turn on the assertion just in this code segment?This incident has made me wonder if it's worth creating two classes of
assertions. The (hopefully more common) set of assertions would be for
things that shouldn't happen, but if go un-caught won't result in heap
corruption. A new set (well, existing asserts, but just re-classified)
would be for things that if uncaught could result in heap corruption. My
hope is that the set of critical assertions could be turned on by
default, helping to identify race conditions and other bugs that
conventional testing is unlikely to find.
That is probably overkill. Running with test patches isn't something we
expect folks to do often.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
If you go for a new state code, rather than a separate
boolean, does it reduce the size of the patch?No, and it certainly wouldn't improve my level of confidence in it ...
Well, then what real options do we have? It seems the patch is just
required for all branches.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote:
This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.
Given your proposed fix on -patches, do you still need me to test this?
Yes; we still need to verify that my theory actually explains your
problem. Given that I'm positing that you can repeatedly hit a
two-instruction window, this is by no means a sure thing. We need
it tested (and with asserts on, so that we can tell if it's fixed
the problem or not).
Also, is there any heap corruption risk associated with this patch?
Look, Jim, I'm trying to help you fix this. Are you going to help or not?
If you want some kind of written guarantee, you're not going to get one.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Well, then what real options do we have? It seems the patch is just
required for all branches.
I think it would be possible to fix it in a less invasive way by taking
and releasing the ControlLock an extra time in SimpleLruReadPage and
SimpleLruWritePage. What's indeterminate about that is the performance
cost. In situations where there's not a lot of SLRU I/O traffic it's
presumably negligible, but in a case like Jim's where there's evidently
a *whole* lot of traffic, it might be a killer.
regards, tom lane
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote:
This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.Given your proposed fix on -patches, do you still need me to test this?
Yes; we still need to verify that my theory actually explains your
problem. Given that I'm positing that you can repeatedly hit a
two-instruction window, this is by no means a sure thing. We need
it tested (and with asserts on, so that we can tell if it's fixed
the problem or not).Also, is there any heap corruption risk associated with this patch?
Look, Jim, I'm trying to help you fix this. Are you going to help or not?
If you want some kind of written guarantee, you're not going to get one.
I think we can say Jim gets his money back if he finds a bug. :-)
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Oct 31, 2005 at 01:05:06PM -0500, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote:
This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.Given your proposed fix on -patches, do you still need me to test this?
Yes; we still need to verify that my theory actually explains your
problem. Given that I'm positing that you can repeatedly hit a
two-instruction window, this is by no means a sure thing. We need
it tested (and with asserts on, so that we can tell if it's fixed
the problem or not).
Ok, I'll work on getting this tested. Just to clarify, if this fixes it
then the problem wouldn't occur, or would we just see a different
assert?
Also, is there any heap corruption risk associated with this patch?
Look, Jim, I'm trying to help you fix this. Are you going to help or not?
If you want some kind of written guarantee, you're not going to get one.
Of course not, and I'm not looking for one. On the otherhand, I don't
want to recommend something on a production system without understanding
what kind of risks are involved, and unfortunately much of this is still
over my head. I would really like to have a better idea of what the
impact of this bug is.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Well, then what real options do we have? It seems the patch is just
required for all branches.I think it would be possible to fix it in a less invasive way by taking
and releasing the ControlLock an extra time in SimpleLruReadPage and
SimpleLruWritePage. What's indeterminate about that is the performance
cost. In situations where there's not a lot of SLRU I/O traffic it's
presumably negligible, but in a case like Jim's where there's evidently
a *whole* lot of traffic, it might be a killer.
To me a performance problem is much harder get reports on and to locate
than a real fix to the problem. I think if a few people eyeball the
patch, it is OK for application. Are backpatches significantly
different?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Oct 31, 2005 at 01:01:14PM -0500, Bruce Momjian wrote:
This incident has made me wonder if it's worth creating two classes of
assertions. The (hopefully more common) set of assertions would be for
things that shouldn't happen, but if go un-caught won't result in heap
corruption. A new set (well, existing asserts, but just re-classified)
would be for things that if uncaught could result in heap corruption. My
hope is that the set of critical assertions could be turned on by
default, helping to identify race conditions and other bugs that
conventional testing is unlikely to find.That is probably overkill. Running with test patches isn't something we
expect folks to do often.
I wasn't thinking about test patches.
My assumption is that the asserts that are currently in place fall into
one of two categories: either they check for something that if false
could result in data corruption in the heap, or they check for something
that shouldn't happen, but if it does it can't corrupt the heap. If that
assumption is correct then seperating them might make it easier to run
with the set of critical asserts turned on. Currently, there can be a
substantial performance penalty with all asserts turned on, but I
suspect a lot of that penalty is from asserts in things like parsing and
planning code; code that pretty much couldn't corrupt data.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Bruce Momjian <pgman@candle.pha.pa.us> writes:
To me a performance problem is much harder get reports on and to locate
than a real fix to the problem. I think if a few people eyeball the
patch, it is OK for application. Are backpatches significantly
different?
Well, the logic is the same all the way back, but the code has changed
textually quite a bit since 7.4 and even more since 7.3. I think the
patch would apply reasonably cleanly to 8.0, but adjusting it for 7.x
will take a bit of work, which would mean those versions would probably
need to be reviewed separately.
One possible compromise is to use this patch in 8.x and a simpler patch
in 7.x --- people who are very concerned about performance ought to be
running 8.x anyway ;-)
regards, tom lane
Jim C. Nasby wrote:
On Mon, Oct 31, 2005 at 01:01:14PM -0500, Bruce Momjian wrote:
This incident has made me wonder if it's worth creating two classes of
assertions. The (hopefully more common) set of assertions would be for
things that shouldn't happen, but if go un-caught won't result in heap
corruption. A new set (well, existing asserts, but just re-classified)
would be for things that if uncaught could result in heap corruption. My
hope is that the set of critical assertions could be turned on by
default, helping to identify race conditions and other bugs that
conventional testing is unlikely to find.That is probably overkill. Running with test patches isn't something we
expect folks to do often.I wasn't thinking about test patches.
My assumption is that the asserts that are currently in place fall into
one of two categories: either they check for something that if false
could result in data corruption in the heap, or they check for something
that shouldn't happen, but if it does it can't corrupt the heap. If that
assumption is correct then seperating them might make it easier to run
with the set of critical asserts turned on. Currently, there can be a
substantial performance penalty with all asserts turned on, but I
suspect a lot of that penalty is from asserts in things like parsing and
planning code; code that pretty much couldn't corrupt data.
There is no way if the system has some incorrect value whether that
would later corrupt the data or not. Anything the system does that it
shouldn't do is a potential corruption problem.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
I wrote:
I think it would be possible to fix it in a less invasive way by taking
and releasing the ControlLock an extra time in SimpleLruReadPage and
SimpleLruWritePage. What's indeterminate about that is the performance
cost.
Attached is an alternative patch that does it this way. I realized that
we could use LWLockConditionalAcquire to usually avoid any extra lock
traffic, so the performance cost may be negligible except under the very
heaviest of loads. I still like the bigger patch for 8.2 and forward,
because it's a lot cleaner, but this seems like a credible alternative
for 8.1 and back branches.
Comments?
regards, tom lane
OK, this is the way to fix for 8.0 and earlier. It is up to you about
8.1. I think we can handle the larger patch if we do another RC.
---------------------------------------------------------------------------
Tom Lane wrote:
I wrote:
I think it would be possible to fix it in a less invasive way by taking
and releasing the ControlLock an extra time in SimpleLruReadPage and
SimpleLruWritePage. What's indeterminate about that is the performance
cost.Attached is an alternative patch that does it this way. I realized that
we could use LWLockConditionalAcquire to usually avoid any extra lock
traffic, so the performance cost may be negligible except under the very
heaviest of loads. I still like the bigger patch for 8.2 and forward,
because it's a lot cleaner, but this seems like a credible alternative
for 8.1 and back branches.Comments?
regards, tom lane
Content-Description: slru-race-2.patch
[ Type application/octet-stream treated as attachment, skipping... ]
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
OK, this is the way to fix for 8.0 and earlier. It is up to you about
8.1. I think we can handle the larger patch if we do another RC.
Well, I'd like not to do another RC, so I'll hold the larger patch for
8.2.
We still need a test to confirm it fixes Jim's problem though.
Jim, if you like you can test the second proposed patch instead of
that off-the-cuff line swapping. However, either one will need to
be run with Asserts on in order to have any confidence that the problem
is fixed. If performance is an issue, most of the performance hit is
probably coming from memory context checking, so what I'd suggest you
do is comment out these two #defines in src/include/pg_config_manual.h:
#define CLOBBER_FREED_MEMORY
#define MEMORY_CONTEXT_CHECKING
That should let you build with --enable-cassert and not take quite
so much speed hit.
regards, tom lane
On Mon, Oct 31, 2005 at 01:34:17PM -0500, Bruce Momjian wrote:
There is no way if the system has some incorrect value whether that
would later corrupt the data or not. Anything the system does that it
shouldn't do is a potential corruption problem.
But is it safe to say that there are areas where a failed assert is far
more likely to result in data corruption? And that there's also areas
where there's likely to be difficult/impossible to find bugs, such as
race conditions? ISTM that it would be valuable to do some additional
checking in these critical areas.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On 10/31/05, Jim C. Nasby <jnasby@pervasive.com> wrote:
On Mon, Oct 31, 2005 at 01:34:17PM -0500, Bruce Momjian wrote:
There is no way if the system has some incorrect value whether that
would later corrupt the data or not. Anything the system does that it
shouldn't do is a potential corruption problem.But is it safe to say that there are areas where a failed assert is far
more likely to result in data corruption? And that there's also areas
where there's likely to be difficult/impossible to find bugs, such as
race conditions? ISTM that it would be valuable to do some additional
checking in these critical areas.
There are, no doubt, also places where an assert has minimal to no
performance impact. I'd wager a guess that the intersection of low
impact asserts, and asserts which measure high risk activities, is
small enough to be uninteresting.
Now that I've got a little better idea of what this code does, I've
noticed something interesting... this issue is happening on an 8-way
machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this
greatly increase the odds of buffer conflicts? Bug aside, would it be
better to set NUM_SLRU_BUFFERS higher for a larger number of CPUs?
Also, something else to note is that this database can see a pretty high
transaction rate... I just checked and it was doing 200TPS, but iirc it
can hit 1000+ TPS during the day.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote:
Now that I've got a little better idea of what this code does, I've
noticed something interesting... this issue is happening on an 8-way
machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this
greatly increase the odds of buffer conflicts? Bug aside, would it be
better to set NUM_SLRU_BUFFERS higher for a larger number of CPUs?
We had talked about increasing NUM_SLRU_BUFFERS depending on
shared_buffers, but it didn't get done. Something to consider for 8.2
though. I think you could have better performance by increasing that
setting, while at the same time dimishing the possibility that the race
condition appears.
I think you should also consider increasing PGPROC_MAX_CACHED_SUBXIDS
(src/include/storage/proc.h), because that should decrease the chance
that the subtrans area needs to be scanned. By how much, however, I
wouldn't know -- it depends on the number of subtransactions you
typically have; I guess you could activate the measuring code in
procarray.c to get a figure.
--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
www.google.com: interfaz de l�nea de comando para la web.
On Mon, Oct 31, 2005 at 09:02:59PM -0300, Alvaro Herrera wrote:
Jim C. Nasby wrote:
Now that I've got a little better idea of what this code does, I've
noticed something interesting... this issue is happening on an 8-way
machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this
greatly increase the odds of buffer conflicts? Bug aside, would it be
better to set NUM_SLRU_BUFFERS higher for a larger number of CPUs?We had talked about increasing NUM_SLRU_BUFFERS depending on
shared_buffers, but it didn't get done. Something to consider for 8.2
though. I think you could have better performance by increasing that
setting, while at the same time dimishing the possibility that the race
condition appears.
Ok, I'll look into that. This database is definately having issues due
to the sheer transaction volume, so maybe that will help.
If NUM_SLRU_BUFFERS were to be tied to something, wouldn't it make more
sense to tie it to wal_buffers though? One example is a data warehouse
might have a very high shared_buffers, but most likely won't have a high
transaction rate. ISTM that most databases with a high transaction rate
are likely to have increased wal_buffers.
I think you should also consider increasing PGPROC_MAX_CACHED_SUBXIDS
(src/include/storage/proc.h), because that should decrease the chance
that the subtrans area needs to be scanned. By how much, however, I
wouldn't know -- it depends on the number of subtransactions you
typically have; I guess you could activate the measuring code in
procarray.c to get a figure.
AFAIK they're not using subtransactions at all, but I'll check.
Is there anywhere this stuff is documented other than in code? It sounds
like an advanced tuning guide would be very valuable for environments
like this one...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
regards, tom lane
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
Maybe it's in plpgsql EXCEPTION clauses.
--
Alvaro Herrera Valdivia, Chile ICBM: S 39� 49' 17.7", W 73� 14' 26.8"
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu. Five minutes later I realize that it's also talking
about food" (Donald Knuth)
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote:
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
Maybe it's in plpgsql EXCEPTION clauses.
They say they're not using either.
Doesn't clog use the same code?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote:
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
Maybe it's in plpgsql EXCEPTION clauses.
Err, I forgot they're using Slony, which is probably using savepoints
and/or exceptions.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Doesn't clog use the same code?
Yeah, but all three of your examples were referencing pg_subtrans,
as proven by the stack traces and the contents of the shared control
block.
Even though the bug seems completely clog.c's fault, this is not a
coincidence: if subtransactions are being used heavily, then pg_subtrans
would have far greater I/O volume than any of the other clog-managed
logs, and hence have more exposure to the race condition.
We really ought to fix that code so that pg_subtrans can have more
buffers than pg_clog...
regards, tom lane
I wrote:
Even though the bug seems completely clog.c's fault,
s/clog.c/slru.c/ of course :-(
regards, tom lane
jnasby@pervasive.com ("Jim C. Nasby") writes:
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote:
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
Maybe it's in plpgsql EXCEPTION clauses.
Err, I forgot they're using Slony, which is probably using savepoints
and/or exceptions.
Slony-I does use exceptions in pretty conventional ways; it does *not*
make any use of subtransactions, because it needs to run on PG 7.3 and
7.4 that do not support subtransactions.
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://www3.sympatico.ca/cbbrowne/linuxxian.html
"I can't believe my room doesn't have Ethernet! Why wasn't it wired
when the house was built?"
"The house was built in 1576."
-- Alex Kamilewicz on the Oxford breed of `conference American.'
jnasby@pervasive.com ("Jim C. Nasby") writes:
AFAIK they're not using subtransactions at all, but I'll check.
Are they perchance using pl/PerlNG?
We discovered a problem with Slony-I's handling of subtransactions
which was exposed by pl/PerlNG, which evidently wraps its SPI calls
inside subtransactions.
For more details...
<http://gborg.postgresql.org/project/slony1/bugs/bugupdate.php?1293>
That is the only subtransaction issue I am aware of...
--
select 'cbbrowne' || '@' || 'acm.org';
http://www.ntlug.org/~cbbrowne/nonrdbms.html
:FATAL ERROR -- ERROR IN ERROR HANDLER
alvherre@alvh.no-ip.org (Alvaro Herrera) writes:
Chris Browne wrote:
jnasby@pervasive.com ("Jim C. Nasby") writes:
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote:
Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
AFAIK they're not using subtransactions at all, but I'll check.
Well, yeah, they are ... else you'd never have seen this failure.
Maybe it's in plpgsql EXCEPTION clauses.
Err, I forgot they're using Slony, which is probably using savepoints
and/or exceptions.Slony-I does use exceptions in pretty conventional ways; it does *not*
make any use of subtransactions, because it needs to run on PG 7.3 and
7.4 that do not support subtransactions.Hmm, does it use the BEGIN/EXCEPTION/END construct at all? Because if
it does, it won't work on 7.4; and if it doesn't, then it isn't using
savepoints in 8.0 either.
Ah, then I was misreading that.
There are instances of RAISE EXCEPTION, which was what I had in mind,
but not of BEGIN/EXCEPTION/END.
There is some logic in 8.x to *detect* the nesting of transactions,
but that's quite another matter.
--
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/multiplexor.html
"If I could find a way to get [Saddam Hussein] out of there, even
putting a contract out on him, if the CIA still did that sort of a
thing, assuming it ever did, I would be for it." -- Richard M. Nixon
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Jim C. Nasby wrote:
My assumption is that the asserts that are currently in place fall into
one of two categories: either they check for something that if false
could result in data corruption in the heap, or they check for something
that shouldn't happen, but if it does it can't corrupt the heap. If that
assumption is correct then seperating them might make it easier to run
with the set of critical asserts turned on. Currently, there can be a
substantial performance penalty with all asserts turned on, but I
suspect a lot of that penalty is from asserts in things like parsing and
planning code; code that pretty much couldn't corrupt data.There is no way if the system has some incorrect value whether that
would later corrupt the data or not. Anything the system does that it
shouldn't do is a potential corruption problem.
That's true but the reason why is subtler than that. If something has happened
that "can't happen" then there's no way to know how you arrived in that
situation. You might already have major problems that can lead to data
corruption -- or for that matter have already caused data corruption..
I happen to think that except for the rare assertion that has major
performance impact all the assertions should be on in production builds. The
goal of assertions is to catch corruption quickly and that's something that's
just as important in production as it is in debugging.
--
greg
Greg Stark <gsstark@mit.edu> writes:
I happen to think that except for the rare assertion that has major
performance impact all the assertions should be on in production builds. The
goal of assertions is to catch corruption quickly and that's something that's
just as important in production as it is in debugging.
You seem not to have read the documentation:
<term><option>--enable-cassert</option></term>
Enables <firstterm>assertion</> checks in the server, which test for
many <quote>can't happen</> conditions. This is invaluable for
code development purposes, but the tests slow things down a little.
Also, having the tests turned on won't necessarily enhance the
stability of your server! The assertion checks are not categorized
for severity, and so what might be a relatively harmless bug will
still lead to server restarts if it triggers an assertion
failure. Currently, this option is not recommended for
production use, but you should have it on for development work
or when running a beta version.
The great thing about Assert() is that you can throw one in for any
condition that your code is assuming-without-proof, without having to
think too hard about consequences. If we were to recommend having
enable-cassert on in production databases, we would need a MUCH higher
standard of care about when to use Assert. I would bet that ninety
percent of the Asserts in the existing code are on conditions that could
represent, at worst, corruption of backend-local or even
transaction-local data structures. Taking down the entire database
cluster for that is not something that sounds like a stability-enhancing
tradeoff to me.
In other words, the "you don't know how bad it might be" theory has a
flip side: you can't cry wolf when there's no wolf, either, at least not
if you want to continue to be listened to.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
I happen to think that except for the rare assertion that has major
performance impact all the assertions should be on in production builds. The
goal of assertions is to catch corruption quickly and that's something that's
just as important in production as it is in debugging.You seem not to have read the documentation:
Sure I have, I just disagree.
I would bet that ninety percent of the Asserts in the existing code are on
conditions that could represent, at worst, corruption of backend-local or
even transaction-local data structures. Taking down the entire database
cluster for that is not something that sounds like a stability-enhancing
tradeoff to me.
It may be minor corruption or it may be that the reason for the minor
corruption comes from some larger bug. It may also be backend-local or
transaction-local corruption at the time the assert catches it but cause major
damage by the time it actually crashes a non-assert-enabled database.
--
greg
On Wed, Nov 02, 2005 at 07:03:57AM -0500, Greg Stark wrote:
I would bet that ninety percent of the Asserts in the existing code are on
conditions that could represent, at worst, corruption of backend-local or
even transaction-local data structures. Taking down the entire database
cluster for that is not something that sounds like a stability-enhancing
tradeoff to me.It may be minor corruption or it may be that the reason for the minor
corruption comes from some larger bug. It may also be backend-local or
transaction-local corruption at the time the assert catches it but cause major
damage by the time it actually crashes a non-assert-enabled database.
Agreed. Personally I'd want to know about anything that corrupts my
data, no matter what the locality. I would also argue that if people are
seeing 'minor' asserts firing in production that there's a bug that
needs to be tracked down.
IF it comes out that there's some asserts that can be fired even though
there's not anything really bad happening, they could always be
relegated to a second class of assert that's not normally turned on.
BTW, that's a reversal from what I was originally arguing for, which was
due to the performance penalty associated with --enable-cassert. My
client is now running with Tom's suggestion of commenting out
CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING and performance is
good. It appears to be as good as it was with asserts disabled. So I
think it would definately be good to break those options out from
--enable-cassert. That makes it viable to run with asserts in
production, at least from a performance standpoint.
BTW, they're also running with patch2 now. Previously, with asserts
turned on and without the patch, they were seeing assert failures on
average of 2/day. So hopefully tomorrow we'll have an idea if the patch
fixed this or not.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
BTW, that's a reversal from what I was originally arguing for, which was
due to the performance penalty associated with --enable-cassert. My
client is now running with Tom's suggestion of commenting out
CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING and performance is
good. It appears to be as good as it was with asserts disabled.
Interesting. I've always wondered whether the "debug_assertions" GUC
variable is worth the electrons it's printed on. If you are running
with asserts active, that variable actually slows things down, by
requiring an additional bool test for every Assert. I suppose the
motivation was to allow the same compiled executable to be used for both
assert-enabled and assert-disabled runs, but how many people really need
that capability?
regards, tom lane
On Wed, Nov 02, 2005 at 02:04:09PM -0500, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
BTW, that's a reversal from what I was originally arguing for, which was
due to the performance penalty associated with --enable-cassert. My
client is now running with Tom's suggestion of commenting out
CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING and performance is
good. It appears to be as good as it was with asserts disabled.Interesting. I've always wondered whether the "debug_assertions" GUC
variable is worth the electrons it's printed on. If you are running
with asserts active, that variable actually slows things down, by
requiring an additional bool test for every Assert. I suppose the
motivation was to allow the same compiled executable to be used for both
assert-enabled and assert-disabled runs, but how many people really need
that capability?
Not sure how that relates to CLOBBER_FREED_MEMORY and
MEMORY_CONTEXT_CHECKING :P, but I agree that it doesn't make sense to
have a GUC, at least not if asserts default to being compiled out.
Hrm... does debug_assertions end up changing assert_enabled?
BTW, is MEMORY_CONTEXT_CHECKING that expensive? It seems like it
shouldn't be, but I'm only guessing at what exactly it does...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote:
BTW, is MEMORY_CONTEXT_CHECKING that expensive? It seems like it
shouldn't be, but I'm only guessing at what exactly it does...
Yes, because not only it checks marker bytes at the end of palloc chunks
and similar stuff, but it also overwrites whole contexts with 0x7f when
they are reset.
May I propose to make Assert() yield only WARNINGs, and take out the
most expensive parts of MEMORY_CONTEXT_CHECKING, when --enable-cassert
is disabled? Enabling it would trigger the current FailedAssertion and
all of MEMORY_CONTEXT_CHECKING. That way we get all bug reports without
the expensive runtime costs for in-production systems.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
May I propose to make Assert() yield only WARNINGs,
That is a horrid idea --- for one thing, it means that Asserts inside
the elog machinery itself would be instant infinite recursion, and even
elsewhere you'd have to think a bit about whether it's ok to call the
elog machinery. Plus, once you *have* detected an assertion failure,
allowing the code to keep running is just silly.
Either they dump core or they're disabled, there is no third option.
I do think it would be reasonable to fix things so that
MEMORY_CONTEXT_CHECKING could be turned on and off at runtime.
Perhaps rather than an all-or-nothing debug_assertions GUC variable,
what we want is something that turns on or off "expensive" assertion
checks at runtime. This could include MEMORY_CONTEXT_CHECKING and
anything else where the actual checking of the condition is nontrivial.
(For instance, there is code in list.c that grovels over the whole
list for a consistency check --- that is "expensive". There is some
code in the bufmgr that scans through all the buffers --- ditto.)
regards, tom lane
On Thu, Nov 03, 2005 at 11:11:42AM -0500, Tom Lane wrote:
Perhaps rather than an all-or-nothing debug_assertions GUC variable,
what we want is something that turns on or off "expensive" assertion
checks at runtime. This could include MEMORY_CONTEXT_CHECKING and
anything else where the actual checking of the condition is nontrivial.
(For instance, there is code in list.c that grovels over the whole
list for a consistency check --- that is "expensive". There is some
code in the bufmgr that scans through all the buffers --- ditto.)
Two levels of assertions sounds like a great idea... wish I'd thought of
it! ;P
Seriously, I am wondering about the performance hit of always checking
debug_assertions.
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00389.php
indicates that even with debug_assertions=false, --enable-cassert has a
big performance impact.
I don't really see any reason for debug_assertions unless we start
defaulting to assertions being compiled in. If we're going to split
things up maybe the expensive assertions you mention should get a
different macro so that there's no performance hit unless you
specifically compile them in. Michael's email tends to indicate that
going the other way around (one macro, two GUC's) wouldn't do any good.
BTW, I can do some testing of assert performance impact with dbt2 on a
Solaris box if anyone's interested in the data...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Seriously, I am wondering about the performance hit of always checking
debug_assertions.
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00389.php
indicates that even with debug_assertions=false, --enable-cassert has a
big performance impact.
That's because MEMORY_CONTEXT_CHECKING happens anyway --- it's not
currently switched off by the GUC variable. I don't think we have any
recent data point about the cost of Asserts per se, except your own
report that it seems minimal. My thought is that it would be even
more minimal if the tests of debug_assertion were removed ;-) ...
except in association with Asserts that are actually testing an
expensive-to-check condition, of which there are very few.
regards, tom lane
Added to TODO:
* Consider increasing internal areas when shared buffers is increased
http://archives.postgresql.org/pgsql-hackers/2005-10/msg01419.php
---------------------------------------------------------------------------
Alvaro Herrera wrote:
Jim C. Nasby wrote:
Now that I've got a little better idea of what this code does, I've
noticed something interesting... this issue is happening on an 8-way
machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this
greatly increase the odds of buffer conflicts? Bug aside, would it be
better to set NUM_SLRU_BUFFERS higher for a larger number of CPUs?We had talked about increasing NUM_SLRU_BUFFERS depending on
shared_buffers, but it didn't get done. Something to consider for 8.2
though. I think you could have better performance by increasing that
setting, while at the same time dimishing the possibility that the race
condition appears.I think you should also consider increasing PGPROC_MAX_CACHED_SUBXIDS
(src/include/storage/proc.h), because that should decrease the chance
that the subtrans area needs to be scanned. By how much, however, I
wouldn't know -- it depends on the number of subtransactions you
typically have; I guess you could activate the measuring code in
procarray.c to get a figure.--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
www.google.com: interfaz de l?nea de comando para la web.---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +