FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

Started by Erik Rijkersover 13 years ago15 messages
#1Erik Rijkers
er@xs4all.nl
2 attachment(s)

pg 9.2 git master
AMD 8120 (8-core) / 6 GB memory / Centos 6.2

I have experimented a bit with dropping a table from master, then querying that table from a
sync-rep slave. It is a little worrying that this, the first test I tried, crashes the slave.

There are two instance on one machine, head1 (=master) and head2 (=sync-rep slave).

First, I generated a tab-separated file, a one off, to be used in the test:

echo "
copy (
select
repeat('X',20) as c1
, repeat('X',20) as c2
, repeat('X',20) as c3
, repeat('X',20) as c4
, repeat('X',20) as c5
from generate_series(1, 200000)
)
to stdout
csv delimiter E'\t';
" | $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb > dropload_copy.txt

That txt file is zipped, and the actual test consists of a bash while loop which
1. drops the table
2. loads the file into the table
3. Either:
a. nothing
b. does a select count(*) on the table

So, it repeats the following:

zcat dropload_copy.txt.gz \
| grep -v '^#' \
| $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb -c "
drop table if exists t;
create table t (
c1 text,
c2 text,
c3 text,
c4 text,
c5 text
);
copy t from stdin csv delimiter E'\t';
analyze t;";

PAUSE_DURATION=0
PSQL=$HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql

if [[ 0 -eq 1 ]]; # ON-OFF switch
then

echo "sleep $PAUSE_DURATION"
sleep $PAUSE_DURATION;
(
echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6564 -d
testdb # master
echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6565 -d
testdb # wal_receiver_01
#echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6566 -d
testdb # wal_receiver_02
) | grep -v '^$'
fi

This runs fine for hours on end, as long as the ON-OFF switch is disabled.

But when that if-block is added the client crashes after a while (sometimes almost immediately; it
never survives longer then 20 minutes):

2012-05-26 10:44:22.617 CEST 10274 ERROR: could not fsync file "base/21268/32807": No such file
or directory
2012-05-26 10:44:28.465 CEST 10274 ERROR: could not fsync file "base/21268/32867": No such file
or directory
2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or
directory
2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994
xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6:
Aborted
2012-05-26 10:44:31.131 CEST 10269 LOG: terminating any other active server processes

Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client?

I attach the logfiles of master(=head1) and slave (=head2). It show how the above ran for an hour
without problems (while the ON/OFF switch was disabled), and how the crash came quickly when I
switched it on (to add the select count(*) statements).

Erik Rijkers

Attachments:

logfile.head2application/octet-stream; name=logfile.head2Download
logfile.head1application/octet-stream; name=logfile.head1Download
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Erik Rijkers (#1)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On 26.05.2012 12:21, Erik Rijkers wrote:

But when that if-block is added the client crashes after a while (sometimes almost immediately; it
never survives longer then 20 minutes):

2012-05-26 10:44:22.617 CEST 10274 ERROR: could not fsync file "base/21268/32807": No such file
or directory
2012-05-26 10:44:28.465 CEST 10274 ERROR: could not fsync file "base/21268/32867": No such file
or directory
2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or
directory
2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994
xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6:
Aborted
2012-05-26 10:44:31.131 CEST 10269 LOG: terminating any other active server processes

Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client?

Thanks for the report. I was able to reproduce this with that script,
and I think I see what's going on now.

There's something wrong with the way AccessExclusiveLocks work on a
standby. I did "begin; truncate foo; -- leave the xact open" in the
master, and waited until the xlog records are shipped to the standby.
Then I did this in the standby:

testdb=# begin;
BEGIN
testdb=# select * from foo;
id
----
(0 rows)

testdb=# select locktype, database, relation, virtualtransaction, pid,
mode, granted, fastpath from pg_locks where locktype='relation' and
relation='foo'::regclass;
locktype | database | relation | virtualtransaction | pid |
mode | granted | fastpath
----------+----------+----------+--------------------+-------+---------------------+---------+----------
relation | 16384 | 27332 | 2/78 | 24984 |
AccessShareLock | t | t
relation | 16384 | 27332 | 1/0 | 24344 |
AccessExclusiveLock | t | f
(2 rows)

The "select * from foo" query should have blocked, because the
transaction in the master is holding an AccessExclusiveLock on the table.

The process holding the AccessExclusiveLock is the startup process. It's
holding the lock on behalf of the transaction in the master. But
something's wrong, and the AccessExclusiveLock doesn't stop a regular
backend from acquiring the AccessShareLock on the table. I suspect the
fast-path locking patch, because this works on 9.1.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Wed, May 30, 2012 at 1:07 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's something wrong with the way AccessExclusiveLocks work on a standby.
I did "begin; truncate foo; -- leave the xact open" in the master, and
waited until the xlog records are shipped to the standby. Then I did this in
the standby:

testdb=# begin;
BEGIN
testdb=# select * from foo;
 id
----
(0 rows)

testdb=# select locktype, database, relation, virtualtransaction, pid, mode,
granted, fastpath from pg_locks where locktype='relation' and
relation='foo'::regclass;
 locktype | database | relation | virtualtransaction |  pid  | mode
| granted | fastpath
----------+----------+----------+--------------------+-------+---------------------+---------+----------
 relation |    16384 |    27332 | 2/78               | 24984 |
AccessShareLock     | t       | t
 relation |    16384 |    27332 | 1/0                | 24344 |
AccessExclusiveLock | t       | f
(2 rows)

The "select * from foo" query should have blocked, because the transaction
in the master is holding an AccessExclusiveLock on the table.

The process holding the AccessExclusiveLock is the startup process. It's
holding the lock on behalf of the transaction in the master. But something's
wrong, and the AccessExclusiveLock doesn't stop a regular backend from
acquiring the AccessShareLock on the table. I suspect the fast-path locking
patch, because this works on 9.1.

Yeah, apparently so. gdb says that FastPathStrongRelationLocks on the
standby is all-zeros even after that record has been replayed. Not
sure how that's possible yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#3)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The process holding the AccessExclusiveLock is the startup process. It's
holding the lock on behalf of the transaction in the master. But something's
wrong, and the AccessExclusiveLock doesn't stop a regular backend from
acquiring the AccessShareLock on the table. I suspect the fast-path locking
patch, because this works on 9.1.

Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
standby is all-zeros even after that record has been replayed.  Not
sure how that's possible yet.

Ah. The problem is that FastPathTag() expects that locks on database
objects will only be taken by backends with a non-zero value for
MyDatabaseId. Apparently the can-i-use-the-fastpath test and the
do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
bit more asymmetrical than they are at present.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#4)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The process holding the AccessExclusiveLock is the startup process. It's
holding the lock on behalf of the transaction in the master. But something's
wrong, and the AccessExclusiveLock doesn't stop a regular backend from
acquiring the AccessShareLock on the table. I suspect the fast-path locking
patch, because this works on 9.1.

Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
standby is all-zeros even after that record has been replayed.  Not
sure how that's possible yet.

Ah.  The problem is that FastPathTag() expects that locks on database
objects will only be taken by backends with a non-zero value for
MyDatabaseId.  Apparently the can-i-use-the-fastpath test and the
do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
bit more asymmetrical than they are at present.

I've fixed things so that Heikki's test case now behaves as expected.
Hopefully this fixes Erik's problem as well, but I haven't tested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Erik Rijkers
er@xs4all.nl
In reply to: Robert Haas (#5)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Wed, May 30, 2012 22:25, Robert Haas wrote:

On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The process holding the AccessExclusiveLock is the startup process. It's
holding the lock on behalf of the transaction in the master. But something's
wrong, and the AccessExclusiveLock doesn't stop a regular backend from
acquiring the AccessShareLock on the table. I suspect the fast-path locking
patch, because this works on 9.1.

Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
standby is all-zeros even after that record has been replayed.  Not
sure how that's possible yet.

Ah.  The problem is that FastPathTag() expects that locks on database
objects will only be taken by backends with a non-zero value for
MyDatabaseId.  Apparently the can-i-use-the-fastpath test and the
do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
bit more asymmetrical than they are at present.

I've fixed things so that Heikki's test case now behaves as expected.
Hopefully this fixes Erik's problem as well, but I haven't tested.

(I double-checked that I got your latest commit in)

I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately:

master logfile says:
2012-05-30 23:30:07.846 CEST 3918 LOG: standby wal_receiver_01 is now the synchronous standby
with priority 1

sync-slave logfile:

[...]
2012-05-30 23:30:07.833 CEST 3908 LOG: database system is ready to accept read only connections
cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or
directory
2012-05-30 23:30:07.845 CEST 3917 LOG: streaming replication successfully connected to primary
2012-05-30 23:40:52.635 CEST 5287 ERROR: could not open relation with OID 26563
2012-05-30 23:40:52.635 CEST 5287 STATEMENT: select current_setting('port') port, count(*) from
public.t
2012-05-30 23:40:57.909 CEST 3909 FATAL: could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569
xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569
2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6:
Aborted
2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes

#7Robert Haas
robertmhaas@gmail.com
In reply to: Erik Rijkers (#6)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote:

(I double-checked that I got your latest commit in)

I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately:

master logfile says:
2012-05-30 23:30:07.846 CEST 3918 LOG:  standby wal_receiver_01 is now the synchronous standby
with priority 1

sync-slave logfile:

[...]
2012-05-30 23:30:07.833 CEST 3908 LOG:  database system is ready to accept read only connections
cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or
directory
2012-05-30 23:30:07.845 CEST 3917 LOG:  streaming replication successfully connected to primary
2012-05-30 23:40:52.635 CEST 5287 ERROR:  could not open relation with OID 26563
2012-05-30 23:40:52.635 CEST 5287 STATEMENT:  select current_setting('port') port, count(*) from
public.t
2012-05-30 23:40:57.909 CEST 3909 FATAL:  could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569
       xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
Aborted
2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes

Hmm. I set up what I believe to be the same test case you were using,
and it's not crashing for me. In fact, with the latest code, I
haven't been able to produce any error at all. When I reverted my
last commit, I managed to get this:

ERROR: could not open relation with OID 18229
STATEMENT: select current_setting('port') port, count(*) from t

...but just once, and with no other error messages. So I'm either
missing a step somewhere, or something about your system just happens
to tickle this moreso than on mine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Erik Rijkers
er@xs4all.nl
In reply to: Robert Haas (#7)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Thu, May 31, 2012 03:30, Robert Haas wrote:

On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote:

directory
2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569
       xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file
or
directory
2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
Aborted
2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes

Hmm. I set up what I believe to be the same test case you were using,
and it's not crashing for me. In fact, with the latest code, I
haven't been able to produce any error at all. When I reverted my
last commit, I managed to get this:

ERROR: could not open relation with OID 18229
STATEMENT: select current_setting('port') port, count(*) from t

...but just once, and with no other error messages. So I'm either
missing a step somewhere, or something about your system just happens
to tickle this moreso than on mine.

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop
is necessary; a few iterations are often successful.

I can test it today on a few other systems to see if it is reproducible.

thanks,

Erik Rijkers

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Erik Rijkers (#8)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On 31.05.2012 08:06, Erik Rijkers wrote:

On Thu, May 31, 2012 03:30, Robert Haas wrote:

On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers<er@xs4all.nl> wrote:

directory
2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569
xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file
or
directory
2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569
2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6:
Aborted
2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes

Hmm. I set up what I believe to be the same test case you were using,
and it's not crashing for me. In fact, with the latest code, I
haven't been able to produce any error at all. When I reverted my
last commit, I managed to get this:

ERROR: could not open relation with OID 18229
STATEMENT: select current_setting('port') port, count(*) from t

...but just once, and with no other error messages. So I'm either
missing a step somewhere, or something about your system just happens
to tickle this moreso than on mine.

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop
is necessary; a few iterations are often successful.

I can test it today on a few other systems to see if it is reproducible.

I could no longer reproduce it after Robert's fix, the test case has
been running for about an hour now. Please triple-check that you have it
applied in the standby :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10Robert Haas
robertmhaas@gmail.com
In reply to: Erik Rijkers (#8)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
is necessary; a few iterations are often successful.

Yes... I let it run all night on my laptop with no errors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#10)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On 31 May 2012 12:14, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
is necessary; a few iterations are often successful.

Yes... I let it run all night on my laptop with no errors.

It looked to me like the correct fix to me as well, FWIW.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#12Erik Rijkers
er@xs4all.nl
In reply to: Robert Haas (#10)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Thu, May 31, 2012 13:14, Robert Haas wrote:

On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the
table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
is necessary; a few iterations are often successful.

Yes... I let it run all night on my laptop with no errors.

My apologies to both of you. It seems the problem was indeed solved with that commit from Robert.
I managed to forget that I, uh... temporary, commented out the git-pull from my build script...

Thanks,

Erik Rijkers

#13Robert Haas
robertmhaas@gmail.com
In reply to: Erik Rijkers (#12)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Thu, May 31, 2012 at 7:49 AM, Erik Rijkers <er@xs4all.nl> wrote:

On Thu, May 31, 2012 13:14, Robert Haas wrote:

On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the
table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
is necessary; a few iterations are often successful.

Yes... I let it run all night on my laptop with no errors.

My apologies to both of you.  It seems the problem was indeed solved with that commit from Robert.
 I managed to forget that I, uh... temporary, commented out the git-pull from my build script...

No problem.

The one thing that still seems a little odd to me is that this caused
a pin count to get orphaned. It seems reasonable that ignoring the
AccessExclusiveLock could result in not-found errors trying to open a
missing relation, and even fsync requests on a missing relation. But
I don't see why that would cause the backend-local pin counts to get
messed up, which makes me wonder if there really is another bug here
somewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

Robert Haas <robertmhaas@gmail.com> writes:

The one thing that still seems a little odd to me is that this caused
a pin count to get orphaned. It seems reasonable that ignoring the
AccessExclusiveLock could result in not-found errors trying to open a
missing relation, and even fsync requests on a missing relation. But
I don't see why that would cause the backend-local pin counts to get
messed up, which makes me wonder if there really is another bug here
somewhere.

According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:

2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory
2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994
xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted

I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.

As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP. So you'd end up with orphaned buffers belonging to a
vanished table. If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.

regards, tom lane

#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

On Thu, May 31, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The one thing that still seems a little odd to me is that this caused
a pin count to get orphaned.  It seems reasonable that ignoring the
AccessExclusiveLock could result in not-found errors trying to open a
missing relation, and even fsync requests on a missing relation.  But
I don't see why that would cause the backend-local pin counts to get
messed up, which makes me wonder if there really is another bug here
somewhere.

According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:

2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or directory
2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
         xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6: Aborted

I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.

Which code isn't meant to recover from errors?

As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP.  So you'd end up with orphaned buffers belonging to a
vanished table.  If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.

Right, I understand the other errors. It's just the pin count that I
am a bit confused about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company