buffer assertion tripping under repeat pgbench load

Started by Greg Smith, over 13 years ago, 37 messages, pgsql-hackers
#1 Greg Smith
gsmith@gregsmith.com

I'm testing a checkout from a few days ago and trying to complete a day
long pgbench stress test, with assertions and debugging on. I want to
make sure the base code works as expected before moving on to testing
checksums. It's crashing before finishing though. Here's a sample:

2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=3.108 s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832
s, average=0.139 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)
2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was
terminated by signal 6: Aborted
2012-12-20 20:36:44 EST [26611]: DETAIL: Failed process was running: END;

Running the same test set again gave the same crash, so this looks
repeatable. It's not trivial to trigger given the average runtime to
crash I'm seeing. Second sample:

2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=3.310 s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648
s, average=0.415 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)
2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was
terminated by signal 6: Aborted
2012-12-22 21:27:23 EST [26611]: DETAIL: Failed process was running: END;

The important part:

TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)

That's the check done by "AtEOXact_Buffers - clean up at end of
transaction". It fired while executing the "END;" statement at the end
of the standard pgbench test. The test looks for loose things that
weren't pinned/unpinned correctly by the transaction. I'm not sure if
getting a stack trace at that point will be useful. The ref count
mistake could easily have been made by an earlier statement in the
transaction block, right?

This is on a server where shared_buffers=2GB, checkpoint_segments=64.
Test runs were automated with pgbench-tools, using the following
configuration:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="500 1000"
SETCLIENTS="4 8 16 32 64 96"
SETTIMES=3
RUNTIME=600

All of the crashes so far have been at scale=500, and I haven't seen
those finish yet. It failed 7 hours in the first time, then 4 hours in.

For code reference, the last commit in the source code tree I built
against was:

af275a12dfeecd621ed9899a9382f26a68310263
Thu Dec 20 14:23:31 2012 +0200

Looking at recent activity, the only buffer pin related changes I saw
was this one:

commit 62656617dbe49cca140f3da588a5e311b3fc35ea
Date: Mon Dec 3 18:53:31 2012 +0000
Avoid holding vmbuffer pin after VACUUM.

This is my first test like this against 9.3 development though, so the
cause could be an earlier commit. I'm just starting with the most
recent work as the first suspect. Next I think I'll try autovacuum=off
and see if the crash goes away. Other ideas are welcome.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
andres@anarazel.de
In reply to: Greg Smith (#1)
Re: buffer assertion tripping under repeat pgbench load

Hi,

On 2012-12-23 02:36:42 -0500, Greg Smith wrote:

> Looking at recent activity, the only buffer pin related changes I saw was
> this one:
>
> commit 62656617dbe49cca140f3da588a5e311b3fc35ea
> Date: Mon Dec 3 18:53:31 2012 +0000
> Avoid holding vmbuffer pin after VACUUM.

I don't immediately see anything bad here.

> This is my first test like this against 9.3 development though, so the cause
> could be an earlier commit. I'm just starting with the most recent work as
> the first suspect. Next I think I'll try autovacuum=off and see if the
> crash goes away. Other ideas are welcome.

Something like the (untested) debug message below might be helpful:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7be767b..20f61a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit)
 		for (i = 0; i < NBuffers; i++)
 		{
-			Assert(PrivateRefCount[i] == 0);
+			if (PrivateRefCount[i] != 0)
+			{
+				BufferDesc *bufHdr = &BufferDescriptors[i];
+				elog(PANIC, "refcount of %s is %u should be 0, globally: %u",
+				     relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+				     PrivateRefCount[i], bufHdr->refcount);
+			}
 		}
 	}
 #endif

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: buffer assertion tripping under repeat pgbench load

Andres Freund <andres@2ndquadrant.com> writes:

>> This is my first test like this against 9.3 development though, so the cause
>> could be an earlier commit. I'm just starting with the most recent work as
>> the first suspect. Next I think I'll try autovacuum=off and see if the
>> crash goes away. Other ideas are welcome.
>
> Something like the (untested) debug message below might be helpful:

It might also be interesting to know if there is more than one
still-pinned buffer --- that is, if you're going to hack the code, fix
it to elog(LOG) each pinned buffer and then panic after completing the
loop.

regards, tom lane


#4 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#3)
Re: buffer assertion tripping under repeat pgbench load

On 12/23/12 1:10 PM, Tom Lane wrote:

> It might also be interesting to know if there is more than one
> still-pinned buffer --- that is, if you're going to hack the code, fix
> it to elog(LOG) each pinned buffer and then panic after completing the
> loop.

Easy enough; I kept it so the actual source of panic is still an
assertion failure:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..df43643 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,21 @@ AtEOXact_Buffers(bool isCommit)
 	if (assert_enabled)
 	{
 		int			i;
+		int			RefCountErrors = 0;
 		for (i = 0; i < NBuffers; i++)
 		{
-			Assert(PrivateRefCount[i] == 0);
+
+			if (PrivateRefCount[i] != 0)
+			{
+				BufferDesc *bufHdr = &BufferDescriptors[i];
+				elog(LOG, "refcount of %s is %u should be 0, globally: %u",
+				     relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+				     PrivateRefCount[i], bufHdr->refcount);
+				RefCountErrors++;
+			}
 		}
+		Assert(RefCountErrors == 0);
 	}
 #endif


#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#4)
Re: buffer assertion tripping under repeat pgbench load

On 23 December 2012 19:42, Greg Smith <greg@2ndquadrant.com> wrote:

> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index dddb6c0..df43643 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1697,11 +1697,21 @@ AtEOXact_Buffers(bool isCommit)
>  	if (assert_enabled)
>  	{
>  		int			i;
> +		int			RefCountErrors = 0;
>  		for (i = 0; i < NBuffers; i++)
>  		{
> -			Assert(PrivateRefCount[i] == 0);
> +
> +			if (PrivateRefCount[i] != 0)
> +			{
> +				BufferDesc *bufHdr = &BufferDescriptors[i];
> +				elog(LOG, "refcount of %s is %u should be 0, globally: %u",
> +				     relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
> +				     PrivateRefCount[i], bufHdr->refcount);
> +				RefCountErrors++;
> +			}
>  		}
> +		Assert(RefCountErrors == 0);
>  	}
>  #endif

We already have PrintBufferLeakWarning() for this, which might be a bit neater.

If that last change was the cause, then it's caused within VACUUM. I'm
running a thrash test with autovacuums set much more frequently, but
nothing yet.

Are you by any chance running with unlogged tables? There was a change
there recently, something around checkpoints IIRC. Can you also set
checkpoints to be more frequent?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#5)
Re: buffer assertion tripping under repeat pgbench load

On 12/23/12 3:17 PM, Simon Riggs wrote:

> If that last change was the cause, then it's caused within VACUUM. I'm
> running a thrash test with autovacuums set much more frequently but
> nothing yet.

I am not very suspicious of that VACUUM change; I just pointed it out
for completeness' sake.

> Are you by any chance running with unlogged tables? There was a change
> there recently, something around checkpoints IIRC. Can you also set
> checkpoints to be more frequent?

This is straight pgbench, modifying only scale, client count, workers,
and duration. No unlogged tables involved.

The crash is so intermittent that I'm trying not to change anything
until I get more productive output out of one. The last run I kicked
off has been chugging for 14 hours without a hiccup yet.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#6)
Re: buffer assertion tripping under repeat pgbench load

On 23 December 2012 21:52, Greg Smith <greg@2ndquadrant.com> wrote:

> On 12/23/12 3:17 PM, Simon Riggs wrote:
>> If that last change was the cause, then it's caused within VACUUM. I'm
>> running a thrash test with autovacuums set much more frequently but
>> nothing yet.
>
> I am not very suspicious of that VACUUM change; I just pointed it out for
> completeness' sake.

Nothing seen here either after 100 minutes of thrash with ~1 VACUUM
per sec, at SF1

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#5)
Re: buffer assertion tripping under repeat pgbench load

On 12/23/12 3:17 PM, Simon Riggs wrote:

> We already have PrintBufferLeakWarning() for this, which might be a bit neater.

Maybe. I tried using this, and I just got a seg fault within that code.
I can't figure out if I called it incorrectly or if the buffer
involved is so damaged that PrintBufferLeakWarning chokes on it. I'll
look at that myself later.

I did get some output from the variation Andres suggested. There was
exactly one screwed up buffer:

2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169
is 1073741824 should be 0, globally: 0

That is pgbench_accounts_pkey. 1073741824 =
0100 0000 0000 0000 0000 0000 0000 0000 = 2^30

Pretty odd value to find in a PrivateRefCount. What makes me nervous
about all of the PrivateRefCount coding is how it switches between
references like PrivateRefCount[(bufnum) - 1] and PrivateRefCount[b].
Might this be an off by one error in one of those, where the wrong form
was used?

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#8)
Re: buffer assertion tripping under repeat pgbench load

Greg Smith <greg@2ndQuadrant.com> writes:

> I did get some output from the variation Andres suggested. There was
> exactly one screwed up buffer:
>
> 2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169
> is 1073741824 should be 0, globally: 0
>
> That is pgbench_accounts_pkey. 1073741824 =
> 0100 0000 0000 0000 0000 0000 0000 0000 = 2^30

Huh. Looks a bit like overflow of the refcount, which would explain why
it takes such a long test case to reproduce it. But how could that be
happening without somebody forgetting to decrement the refcount, which
ought to lead to a visible failure in shorter tests? Even more
interesting that the buffer's global refcount is zero.

Now that we know it's an index, it would be mighty useful to know which
block is the problem (I'm wondering if it's metapage or root). Could
you add the block number to the printout?

regards, tom lane


#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#8)
Re: buffer assertion tripping under repeat pgbench load

On 24 December 2012 15:57, Greg Smith <greg@2ndquadrant.com> wrote:

> 2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169 is
> 1073741824 should be 0, globally: 0
>
> That is pgbench_accounts_pkey. 1073741824 =
> 0100 0000 0000 0000 0000 0000 0000 0000 = 2^30
>
> Pretty odd value to find in a PrivateRefCount. What makes me nervous about
> all of the PrivateRefCount coding is how it switches between references like
> PrivateRefCount[(bufnum) - 1] and PrivateRefCount[b]. Might this be an off
> by one error in one of those, where the wrong form was used?

PrivateRefCount is also one of the only large chunks of memory in a
running server where single bit errors aren't tolerated.

In shared buffers, most SLRUs or various other places, single bit
errors would often go unnoticed.

I wonder if you're having a hardware problem?

If not, it might be worth looking to see if that block is the root,
maybe there's a problem in fastroot code, or something that only kicks
in with larger indexes.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: buffer assertion tripping under repeat pgbench load

On 24 December 2012 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Huh. Looks a bit like overflow of the refcount, which would explain why
> it takes such a long test case to reproduce it. But how could that be
> happening without somebody forgetting to decrement the refcount, which
> ought to lead to a visible failure in shorter tests? Even more
> interesting that the buffer's global refcount is zero.

But we test at the end of each transaction whether any pin count is
non-zero, so it can't have slowly built up.

It would be easier for it to have decremented too many times within
just one transaction.

Something to do with tail recursion during VACUUM?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12 Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#10)
Re: buffer assertion tripping under repeat pgbench load

On 12/24/12 11:10 AM, Simon Riggs wrote:

> I wonder if you're having a hardware problem?

Always possible. I didn't report this until I had replicated the crash
and seen exactly the same thing twice. I've seen it crash on this
assertion 6 times now. Bad hardware is not normally so consistent.

I do have another system that's almost identical to this one, except for
the hard drives. If progress on tracking down the problem stalls I can
try to duplicate this on there.

I kicked off another test that includes the block number just before Tom
suggested it, so I should have the block by tomorrow at the latest. The
range of runtime before crash is 3 to 14 hours so far.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#13 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#12)
Re: buffer assertion tripping under repeat pgbench load

On 24 December 2012 16:25, Greg Smith <greg@2ndquadrant.com> wrote:

> On 12/24/12 11:10 AM, Simon Riggs wrote:
>> I wonder if you're having a hardware problem?
>
> Always possible. I didn't report this until I had replicated the crash and
> seen exactly the same thing twice. I've seen it crash on this assertion 6
> times now. Bad hardware is not normally so consistent.

Consistency in bugs is a wonderful thing, so I figure we can
immediately rule out the hardware.

> I kicked off another test that includes the block number just before Tom
> suggested it, so I should have the block by tomorrow at the latest. The
> range of runtime before crash is 3 to 14 hours so far.

Tom will have fixed it by then ;-)

Sorry, I should have mentioned it was the block number that I wanted
and that was why I suggested using the other print function.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#12)
Re: buffer assertion tripping under repeat pgbench load

Greg Smith <greg@2ndQuadrant.com> writes:

> I kicked off another test that includes the block number just before Tom
> suggested it, so I should have the block by tomorrow at the latest. The
> range of runtime before crash is 3 to 14 hours so far.

Cool. Once you get the crash, please also capture the contents of the
index's metapage --- pg_filedump or contrib/pageinspect's bt_metap()
should do. If the problem page isn't the metapage itself, we'll need
that info to tell if it's the root.

Actually, it might be interesting to capture the content of the problem
page itself, too, if it isn't the metapage. "pg_filedump -i -F" or
contrib/pageinspect's bt_page_stats() + bt_page_items() should show
anything useful.

regards, tom lane


#15 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#14)
Re: buffer assertion tripping under repeat pgbench load

To try and speed up replicating this problem I switched to a smaller
database scale, 100, and I was able to get a crash there. Here's the
latest:

2012-12-26 00:01:19 EST [2278]: WARNING: refcount of base/16384/57610
blockNum=118571, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 00:01:19 EST [2278]: WARNING: buffers with non-zero refcount
is 1
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line:
1720)

That's the same weird 1073741824 count as before. I was planning to
dump some index info, but then I saw this:

$ psql -d pgbench -c "select relname,relkind,relfilenode from pg_class
where relfilenode=57610"
relname | relkind | relfilenode
------------------+---------+-------------
pgbench_accounts | r | 57610

That makes me think this isn't an index-specific problem. I tried to
soldier on with pg_filedump anyway. It looks like the last version I
saw there (9.2.0 from November) doesn't compile anymore:

$ make -f Makefile.contrib USE_PGXS=1
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -g -I. -I.
-I/var/lib/pgsql/pgwork/inst/baseline/include/server
-I/var/lib/pgsql/pgwork/inst/baseline/include/internal -D_GNU_SOURCE
-c -o pg_filedump.o pg_filedump.c -MMD -MP -MF .deps/pg_filedump.Po
pg_filedump.c: In function 'FormatHeader':
pg_filedump.c:617: error: request for member 'xlogid' in something not a
structure or union
pg_filedump.c:617: error: request for member 'xrecoff' in something not
a structure or union
pg_filedump.c: In function 'FormatItem':
pg_filedump.c:904: error: invalid application of 'sizeof' to incomplete
type 'HeapTupleHeaderData'
...

Lots more after that. Does this need an update or is there just a devel
version I should grab?

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#15)
Re: buffer assertion tripping under repeat pgbench load

Greg Smith <greg@2ndQuadrant.com> writes:

> To try and speed up replicating this problem I switched to a smaller
> database scale, 100, and I was able to get a crash there. Here's the
> latest:
>
> 2012-12-26 00:01:19 EST [2278]: WARNING: refcount of base/16384/57610
> blockNum=118571, flags=0x106 is 1073741824 should be 0, globally: 0
> 2012-12-26 00:01:19 EST [2278]: WARNING: buffers with non-zero refcount
> is 1
> TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line:
> 1720)
>
> That's the same weird 1073741824 count as before. I was planning to
> dump some index info, but then I saw this:
>
> $ psql -d pgbench -c "select relname,relkind,relfilenode from pg_class
> where relfilenode=57610"
> relname | relkind | relfilenode
> ------------------+---------+-------------
> pgbench_accounts | r | 57610
>
> Making me think this isn't isolated to being an index problem.

Yeah, that destroys my theory that there's something broken about index
management specifically. Now we're looking for something that can
affect any buffer's refcount, which more than likely means it has
nothing to do with the buffer's contents ...

> I tried
> to soldier on with pg_filedump anyway. It looks like the last version I
> saw there (9.2.0 from November) doesn't compile anymore:

Meh, looks like it needs fixes for Heikki's int64-xlogrecoff patch.
I haven't gotten around to doing that yet, but would gladly take a
patch if anyone wants to do it. However, I now doubt that examining
the buffer content will help much on this problem.

Now that we know the bug's reproducible on smaller instances, could you
put together an exact description of what you're doing to trigger
it? What is the DB configuration, pgbench parameters, etc?

Also, it'd be worthwhile to just repeat the test a few more times
to see if there's any sort of pattern in which buffers get affected.
I'm now suspicious that it might not always be just one buffer,
for example.

regards, tom lane


#17 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#16)
Re: buffer assertion tripping under repeat pgbench load

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Also, it'd be worthwhile to just repeat the test a few more times
> to see if there's any sort of pattern in which buffers get affected.
> I'm now suspicious that it might not always be just one buffer,
> for example.

I don't think it's necessarily only one buffer - if I read the above output
correctly, Greg used the suggested debug output which just put the elog(WARN)
before the Assert...

Greg, could you output all "bad" buffers and only assert after the loop if there was at least one refcounted buffer?

Andres

--- 
Please excuse the brevity and formatting - I am writing this on my mobile phone.


#18 Greg Smith
gsmith@gregsmith.com
In reply to: Andres Freund (#17)
Re: buffer assertion tripping under repeat pgbench load

On 12/26/12 1:58 PM, anarazel@anarazel.de wrote:

> I don't think it's necessarily only one buffer - if I read the above output
> correctly, Greg used the suggested debug output which just put the elog(WARN)
> before the Assert...
>
> Greg, could you output all "bad" buffers and only assert after the loop if
> there was at least one refcounted buffer?

I've been doing that for a while; the only thing that's new is getting
the block number. It's only ever been one buffer involved.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#19 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: buffer assertion tripping under repeat pgbench load

On Wed, Dec 26, 2012 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yeah, that destroys my theory that there's something broken about index
> management specifically. Now we're looking for something that can
> affect any buffer's refcount, which more than likely means it has
> nothing to do with the buffer's contents ...

Hardware problem is still a plausible candidate. Have you run any
memory checker software or seen anything else crash? Classically gcc
is as good at detecting memory problems as memory checking software.
Or a bad cpu can also sometimes cause problems like this. Have you
been able to reproduce on any other machines?

Did you ever say what kind of hardware it was? This is the local
reference count, so I can't see how it could be a race condition or
anything like that, but it sure smells a bit like one.

--
greg


#20 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: buffer assertion tripping under repeat pgbench load

On Wed, Dec 26, 2012 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yeah, that destroys my theory that there's something broken about index
> management specifically. Now we're looking for something that can
> affect any buffer's refcount, which more than likely means it has
> nothing to do with the buffer's contents ...

Also, do you have the buffer id of the broken buffer? I wonder if it's
not just any buffer but always the same buffer, even if it's a
different block in that buffer. E.g. maybe it's always the first
buffer because something is overwriting past the end of the
BufferBlocks array which is declared immediately before
PrivateRefCount. (Or maybe your compiler is laying out these objects
in a different way from most people's compilers and we're overwriting
past the end of some other object routinely, but yours is the only
place where it's being laid out preceding a critical data structure.)

--
greg


#21 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#20)
#22 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#19)
#23 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#21)
#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#23)
#25 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#24)
#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#25)
#27 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#28 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#25)
#29 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#28)
In reply to: Robert Haas (#29)
#31 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#28)
#32 Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#5)
#33 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#23)
#34 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#33)
#35 Satoshi Nagayasu
snaga@uptime.jp
In reply to: Greg Smith (#33)
#36 Greg Smith
gsmith@gregsmith.com
In reply to: Satoshi Nagayasu (#35)
#37 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#36)