ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Started by Justin Pryzbyabout 8 years ago15 messageshackers

pryzby@telsasoft.com

about 8 years ago

While grepping logs, I came across this crash, which I caused while adding many
indices in a test environment. I don't know that there's any reason to believe
one way or the other if this is specific to running on pg11b1.

< 2018-06-17 11:38:45.465 CDT pryzbyj >FATAL: the database system is in recovery mode
< 2018-06-17 11:38:45.466 CDT pryzbyj >FATAL: the database system is in recovery mode
< 2018-06-17 11:38:45.468 CDT >FATAL: could not extend file "base/17379/38665798": No space left on device
< 2018-06-17 11:38:45.468 CDT >HINT: Check free disk space.
< 2018-06-17 11:38:45.468 CDT >CONTEXT: WAL redo at 2AA/239676B8 for Heap/INSERT+INIT: off 1
< 2018-06-17 11:38:45.468 CDT >WARNING: buffer refcount leak: [2366] (rel=base/17379/38665798, blockNum=1241, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)

Unfortunately, I have neither a core nor stacktrace to share, as the DB was
(until after running out of space) on the same FS as /var :(

Jun 17 11:38:42 localhost abrt[30526]: Aborting dump: only 30MiB is available on /var/spool/abrt
Jun 17 11:38:45 localhost abrt[1261]: Aborting dump: only 16MiB is available on /var/spool/abrt

Not sure how useful it's to crash test ENOSPC (although I see there's
continuing patches for this case).

Justin

Michael Paquier

michael@paquier.xyz

about 8 years ago

In reply to: Justin Pryzby (#1)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

On Wed, Jun 27, 2018 at 06:39:39PM -0500, Justin Pryzby wrote:

< 2018-06-17 11:38:45.465 CDT pryzbyj >FATAL: the database system is in recovery mode
< 2018-06-17 11:38:45.466 CDT pryzbyj >FATAL: the database system is in recovery mode
< 2018-06-17 11:38:45.468 CDT >FATAL: could not extend file "base/17379/38665798": No space left on device
< 2018-06-17 11:38:45.468 CDT >HINT: Check free disk space.
< 2018-06-17 11:38:45.468 CDT >CONTEXT: WAL redo at 2AA/239676B8 for Heap/INSERT+INIT: off 1
< 2018-06-17 11:38:45.468 CDT >WARNING: buffer refcount leak: [2366] (rel=base/17379/38665798, blockNum=1241, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)

Unfortunately, I have neither a core nor stacktrace to share, as the DB was
(until after running out of space) on the same FS as /var :(

You may be interested in this thread, vintage 2017, where I also posted
a test case, assuming that you can create an extra partition and make it
run out of space:
/messages/by-id/19533.1497283908@sss.pgh.pa.us
So that's a known problem.
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Michael Paquier (#2)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Michael Paquier <michael@paquier.xyz> writes:

On Wed, Jun 27, 2018 at 06:39:39PM -0500, Justin Pryzby wrote:

< 2018-06-17 11:38:45.468 CDT >FATAL: could not extend file "base/17379/38665798": No space left on device
< 2018-06-17 11:38:45.468 CDT >HINT: Check free disk space.
< 2018-06-17 11:38:45.468 CDT >CONTEXT: WAL redo at 2AA/239676B8 for Heap/INSERT+INIT: off 1
< 2018-06-17 11:38:45.468 CDT >WARNING: buffer refcount leak: [2366] (rel=base/17379/38665798, blockNum=1241, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)

You may be interested in this thread, vintage 2017, where I also posted
a test case, assuming that you can create an extra partition and make it
run out of space:
/messages/by-id/19533.1497283908@sss.pgh.pa.us
So that's a known problem.

I got around to poking into this today. It's easily reproducible with
Michael's script, and capturing a stack trace from the assertion seems
to be enough to explain the problem:

#0 ExceptionalCondition (conditionName=0xa60137 "!(RefCountErrors == 0)",
errorType=0x8e50f7 "FailedAssertion", fileName=0xa5ffa3 "bufmgr.c",
lineNumber=2523) at assert.c:30
#1 0x000000000075829b in CheckForBufferLeaks () at bufmgr.c:2523
#2 0x00000000007582b3 in AtProcExit_Buffers (code=<value optimized out>,
arg=<value optimized out>) at bufmgr.c:2476
#3 0x00000000007651f5 in shmem_exit (code=1) at ipc.c:272
#4 0x000000000076526e in proc_exit_prepare (code=1) at ipc.c:194
#5 0x00000000007652f8 in proc_exit (code=1) at ipc.c:107
#6 0x000000000089af5d in errfinish (dummy=<value optimized out>) at elog.c:543
#7 0x000000000078615b in mdextend (reln=<value optimized out>,
forknum=MAIN_FORKNUM, blocknum=<value optimized out>,
buffer=<value optimized out>, skipFsync=false) at md.c:549
#8 0x0000000000758bcf in ReadBuffer_common (smgr=<value optimized out>,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19603,
mode=RBM_ZERO_AND_LOCK, strategy=0x0, hit=0x7fff70b892cf) at bufmgr.c:865
#9 0x0000000000759424 in ReadBufferWithoutRelcache (rnode=...,
forkNum=MAIN_FORKNUM, blockNum=4294967295, mode=<value optimized out>,
strategy=<value optimized out>) at bufmgr.c:692
#10 0x000000000052a09b in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=19603, mode=RBM_ZERO_AND_LOCK)
at xlogutils.c:489
#11 0x000000000052a419 in XLogReadBufferForRedoExtended (record=0x13827e8,
block_id=0 '\000', mode=RBM_ZERO_AND_LOCK, get_cleanup_lock=false,
buf=0x7fff70b893ac) at xlogutils.c:390
#12 0x000000000052a649 in XLogInitBufferForRedo (record=<value optimized out>,
block_id=<value optimized out>) at xlogutils.c:305
#13 0x00000000004bfa8a in heap_xlog_insert (record=0x13827e8) at heapam.c:8580
#14 0x00000000004c0b55 in heap_redo (record=0x13827e8) at heapam.c:9282
#15 0x000000000051fab0 in StartupXLOG () at xlog.c:7319
#16 0x0000000000710b71 in StartupProcessMain () at startup.c:217
#17 0x000000000052eeb5 in AuxiliaryProcessMain (argc=2, argv=0x7fff70b8c320)
at bootstrap.c:441
#18 0x000000000070b6d7 in StartChildProcess (type=StartupProcess)
at postmaster.c:5331
#19 0x000000000070fc25 in PostmasterMain (argc=3, argv=<value optimized out>)
at postmaster.c:1371

So basically, WAL replay hits an error while holding a buffer pin, and
nothing is done to release the buffer pin, but AtProcExit_Buffers thinks
something should have been done.

I'm not sure that it's worth trying to fix this "nicely". The startup
process does not run a transaction and hence doesn't have any of the
infrastructure that'd be needed to clean up in a nice way --- for
instance, it has no CurrentResourceOwner. Perhaps in some distant
future somebody will think it's worth trying to improve that, but
I'm not interested in working on it right now, and even less interested
in back-patching it.

What seems like a better answer for now is to adjust AtProcExit_Buffers
so that it doesn't cause an assertion failure in this case. I think we
can define "this case" as being "failure exit from the startup process":
if that happens, the postmaster is going to throw up its hands and force
a panic shutdown anyway, so the failure to free a buffer pin shouldn't
have any serious consequences.

The attached one-liner patch does that, and is enough to get through
Michael's test case without an assertion. This is just proof of concept
though --- my inclination, if we go this route, is to make a slightly
longer patch that would fix CheckForBufferLeaks to still print the WARNING
messages if any, but not die with an assertion afterwards.

Another question is whether this is really a sufficient band-aid. It
looks to me like AtProcExit_Buffers will be called in any auxiliary
process type, not only the startup process. Do, or should we, force
a panic restart for nonzero-exit-code failures of all aux process types?
If not, what are we going to do to clean up similar failures in other
aux process types?

BTW, this assertion has been there since fe548629; before that, there
was code that would release any leaked buffer pins, relying on the
PrivateRefCount data for that. So another idea is to restore some
version of that capability, although I think we really really don't
wanna scan the PrivateRefCount array if we can avoid it.

Thoughts?

regards, tom lane

Andres Freund

andres@anarazel.de

about 8 years ago

In reply to: Tom Lane (#3)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Hi,

On 2018-07-15 18:48:43 -0400, Tom Lane wrote:

So basically, WAL replay hits an error while holding a buffer pin, and
nothing is done to release the buffer pin, but AtProcExit_Buffers thinks
something should have been done.

I think there's a few other cases where we hit this. I've seen something
similar from inside checkpointer / BufferSync(). I'd be surprised if
bgwriter couldn't be triggered into the same.

What seems like a better answer for now is to adjust AtProcExit_Buffers
so that it doesn't cause an assertion failure in this case. I think we
can define "this case" as being "failure exit from the startup process":
if that happens, the postmaster is going to throw up its hands and force
a panic shutdown anyway, so the failure to free a buffer pin shouldn't
have any serious consequences.

The attached one-liner patch does that, and is enough to get through
Michael's test case without an assertion. This is just proof of concept
though --- my inclination, if we go this route, is to make a slightly
longer patch that would fix CheckForBufferLeaks to still print the WARNING
messages if any, but not die with an assertion afterwards.

Another question is whether this is really a sufficient band-aid. It
looks to me like AtProcExit_Buffers will be called in any auxiliary
process type, not only the startup process.

We could just replace the Assert() with a PANIC?

Do, or should we, force a panic restart for nonzero-exit-code failures
of all aux process types? If not, what are we going to do to clean up
similar failures in other aux process types?

I'm pretty sure that we do *not* force a panic on all nonzero-exit-code
cases for other subprocesses.

BTW, this assertion has been there since fe548629; before that, there
was code that would release any leaked buffer pins, relying on the
PrivateRefCount data for that. So another idea is to restore some
version of that capability, although I think we really really don't
wanna scan the PrivateRefCount array if we can avoid it.

I don't think scanning PrivateRefCount would be particularly problematic
- in almost all cases it's going to be tiny. These day it's not NBuffers
sized anymore, so I can't forsee any performance problems?

I think we could invent something like like enter/leave "transactional
mode" wherein we throw a PANIC when a buffer leaked. Processes that
don't enter it - like startup, checkpointer, etc - would just WARN?

Greetings,

Andres Freund

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Andres Freund (#4)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Andres Freund <andres@anarazel.de> writes:

On 2018-07-15 18:48:43 -0400, Tom Lane wrote:

So basically, WAL replay hits an error while holding a buffer pin, and
nothing is done to release the buffer pin, but AtProcExit_Buffers thinks
something should have been done.

I think there's a few other cases where we hit this. I've seen something
similar from inside checkpointer / BufferSync(). I'd be surprised if
bgwriter couldn't be triggered into the same.

Hm, yeah, on reflection it's pretty obvious that those are hazard cases.

I'm pretty sure that we do *not* force a panic on all nonzero-exit-code
cases for other subprocesses.

That's my recollection as well -- mostly, we just start a new one.

So I said I didn't want to do extra work on this, but I am looking into
fixing it by having these aux process types run a ResourceOwner that can
be told to clean up any open buffer pins at exit. We could be sure the
coverage is complete by dint of removing the special-case code in
resowner.c that allows buffer pins to be taken with no active resowner.
Then CheckForBufferLeaks can be left as-is, ie something we do only
in assert builds.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Tom Lane (#5)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

I wrote:

So I said I didn't want to do extra work on this, but I am looking into
fixing it by having these aux process types run a ResourceOwner that can
be told to clean up any open buffer pins at exit. We could be sure the
coverage is complete by dint of removing the special-case code in
resowner.c that allows buffer pins to be taken with no active resowner.
Then CheckForBufferLeaks can be left as-is, ie something we do only
in assert builds.

That turned out to be a larger can of worms than I'd anticipated, as it
soon emerged that we'd acquired a whole lot of cargo-cult programming
around ResourceOwners. Having a ResourceOwner in itself does nothing
for you: there has to be code someplace that ensures we'll call
ResourceOwnerRelease at an appropriate time. There was basically noplace
outside xact.c and portalmem.c that got this completely right. bgwriter.c
and a couple of other places at least tried a little, by doing a
ResourceOwnerRelease in their sigsetjmp-catching stanzas, but that didn't
account for the process-exit code path. Other places just created a
ResourceOwner and did *nothing* about cleaning it up.

I decided that the most expedient way of dealing with this was to create
a self-contained facility in resowner.c that would create a standalone
ResourceOwner and register a shmem-exit callback to clean it up.
That way it's easier (less code) to do it right than to do it wrong.
That led to the attached, which passes check-world as well as Michael's
full-disk test script.

I'm mostly pretty happy with this, but I think there are a couple of
loose ends in logicalfuncs.c and slotfuncs.c: those are creating
non-standalone ResourceOwners (children of whatever the active
ResourceOwner is) and doing nothing much to clean them up. That seems
pretty bogus. It's not a permanent resource leak, because sooner or
later the parent ResourceOwner will get cleaned up and that will
recurse to the child ... but then why bother with a child ResourceOwner
at all? I added asserts to these calls to verify that there was a
parent resowner (if there isn't, the code is just broken), but I think
that we should either add more code to clean up the child resowner
promptly, or just not bother with a child at all.

Not very sure what to do with this. I don't think we should try to
backpatch it, because it's essentially changing the API requirements
around buffer access in auxiliary processes --- but it might not be
too late to squeeze it into v11. It is a bug fix, in that the current
code can leak buffer pins in some code paths and we won't notice in
non-assert builds; but we've not really seen any field reports of that
happening, so I'm not sure how important this is.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Tom Lane (#6)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

I wrote:

So I said I didn't want to do extra work on this, but I am looking into
fixing it by having these aux process types run a ResourceOwner that can
be told to clean up any open buffer pins at exit.

That turned out to be a larger can of worms than I'd anticipated, as it
soon emerged that we'd acquired a whole lot of cargo-cult programming
around ResourceOwners. ...
I'm mostly pretty happy with this, but I think there are a couple of
loose ends in logicalfuncs.c and slotfuncs.c: those are creating
non-standalone ResourceOwners (children of whatever the active
ResourceOwner is) and doing nothing much to clean them up. That seems
pretty bogus.

Further investigation showed that the part of that code that was
actually needed was not the creation of a child ResourceOwner, but rather
restoration of the old CurrentResourceOwner setting after exiting the
logical decoding loop. Apparently we can come out of that with the
TopTransaction resowner being active. This still seems a bit bogus;
maybe there should be a save-and-restore happening somewhere else?
But I'm not really excited about doing more than commenting it.

Also, most of the other random creations of ResourceOwners seem to just
not be necessary at all, even with the new rule that you must have a
resowner to acquire buffer pins. So the attached revised patch just
gets rid of them, and improves some misleading/wrong comments on the
topic. It'd still be easy to use CreateAuxProcessResourceOwner in any
process where we discover we need one, but I don't see the value in adding
useless overhead.

At this point I'm leaning to just applying this in HEAD and calling it
good. The potential for assertion failures isn't relevant to production
builds, and my best guess at this point is that there isn't really any
other user-visible bug. The resowners that were being created and not
adequately cleaned up all seem to have been dead code anyway (ie they'd
never acquire any resources). We could have a problem if, say, the
bgwriter exited via the FATAL path while holding a pin, but I don't think
there's a reason for it to do that except in a database-wide shutdown,
where the consequences of a leaked pin seem pretty minimal.

Any objections? Anyone want to do further review?

regards, tom lane

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Tom Lane (#7)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any objections? Anyone want to do further review?

LGTM. I think this is an improvement. However, it seems like it
might be a good idea for ResourceOwnerRememberBuffer and
ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
somebody messes up it will trip an assertion rather than just seg
faulting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Robert Haas (#8)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any objections? Anyone want to do further review?

LGTM. I think this is an improvement. However, it seems like it
might be a good idea for ResourceOwnerRememberBuffer and
ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
somebody messes up it will trip an assertion rather than just seg
faulting.

Uh, what? There are only a few callers of those, and they'd all have
crashed already if they were somehow dealing with an invalid buffer.

regards, tom lane

#10

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Tom Lane (#9)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any objections? Anyone want to do further review?

LGTM. I think this is an improvement. However, it seems like it
might be a good idea for ResourceOwnerRememberBuffer and
ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
somebody messes up it will trip an assertion rather than just seg
faulting.

Uh, what? There are only a few callers of those, and they'd all have
crashed already if they were somehow dealing with an invalid buffer.

Sorry, I meant Assert(owner != NULL).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Robert Haas (#10)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Uh, what? There are only a few callers of those, and they'd all have
crashed already if they were somehow dealing with an invalid buffer.

Sorry, I meant Assert(owner != NULL).

Oh, gotcha: so that if an external developer hits it, he can more
easily see that this is from an (effective) API change and not some
mysterious bug. Makes sense, especially if we include a comment:

/* We used to allow pinning buffers without a resowner, but no more */
Assert(CurrentResourceOwner != NULL);

I think it's sufficient to do this in ResourceOwnerEnlargeBuffers,
though. The other two should be unreachable without having gone
through that.

regards, tom lane

#12

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 8 years ago

In reply to: Tom Lane (#7)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Hello. I confirmed that this patch fixes the crash.

At Tue, 17 Jul 2018 20:01:05 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <14892.1531872065@sss.pgh.pa.us>

I wrote:

So I said I didn't want to do extra work on this, but I am looking into
fixing it by having these aux process types run a ResourceOwner that can
be told to clean up any open buffer pins at exit.

That turned out to be a larger can of worms than I'd anticipated, as it
soon emerged that we'd acquired a whole lot of cargo-cult programming
around ResourceOwners. ...
I'm mostly pretty happy with this, but I think there are a couple of
loose ends in logicalfuncs.c and slotfuncs.c: those are creating
non-standalone ResourceOwners (children of whatever the active
ResourceOwner is) and doing nothing much to clean them up. That seems
pretty bogus.

Further investigation showed that the part of that code that was
actually needed was not the creation of a child ResourceOwner, but rather
restoration of the old CurrentResourceOwner setting after exiting the
logical decoding loop. Apparently we can come out of that with the
TopTransaction resowner being active. This still seems a bit bogus;
maybe there should be a save-and-restore happening somewhere else?
But I'm not really excited about doing more than commenting it.

CurrentResourceOwner doesn't seem to be changed. It is just saved
during snapshot export and used just as a flag only for checking
for duplicate snapshot exporting.

Also, most of the other random creations of ResourceOwners seem to just
not be necessary at all, even with the new rule that you must have a
resowner to acquire buffer pins. So the attached revised patch just
gets rid of them, and improves some misleading/wrong comments on the
topic. It'd still be easy to use CreateAuxProcessResourceOwner in any
process where we discover we need one, but I don't see the value in adding
useless overhead.

+1 for unifying to resowner for auxiliary processes. I found the
comment below just before ending cleanup of auxiliary process
main funcs.

| * These operations are really just a minimal subset of
| * AbortTransaction(). We don't have very many resources to worry
| * about in checkpointer, but we do have LWLocks, buffers, and temp
| * files.

So couldn't we use TopTransactionResourceOwner instead of
AuxProcessResrouceOwner? I feel a bit uneasy that bootstrap and
standalone-backend have *AuxProcess*ResourceOwner.

It's not about this ptch, but while looking this closer, I found
the following comment on ShutdownXLOG().

| /*
| * This must be called ONCE during postmaster or standalone-backend shutdown
| */

Is the "postmaster" typo of "bootstrap process"? It is also
called from checkpointer when non-standlone mode.

At this point I'm leaning to just applying this in HEAD and calling it
good. The potential for assertion failures isn't relevant to production
builds, and my best guess at this point is that there isn't really any
other user-visible bug. The resowners that were being created and not
adequately cleaned up all seem to have been dead code anyway (ie they'd
never acquire any resources).

Agreed.

We could have a problem if, say, the
bgwriter exited via the FATAL path while holding a pin, but I don't think
there's a reason for it to do that except in a database-wide shutdown,
where the consequences of a leaked pin seem pretty minimal.

Any objections? Anyone want to do further review?

FWIW I think we won't be concerned about leaked pins after FATAL.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Kyotaro Horiguchi (#12)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Hello. I confirmed that this patch fixes the crash.

Thanks for double-checking.

At Tue, 17 Jul 2018 20:01:05 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <14892.1531872065@sss.pgh.pa.us>

Further investigation showed that the part of that code that was
actually needed was not the creation of a child ResourceOwner, but rather
restoration of the old CurrentResourceOwner setting after exiting the
logical decoding loop. Apparently we can come out of that with the
TopTransaction resowner being active.

CurrentResourceOwner doesn't seem to be changed. It is just saved
during snapshot export and used just as a flag only for checking
for duplicate snapshot exporting.

I tried to remove "CurrentResourceOwner = old_resowner" from
logicalfuncs.c, and got a failure in the contrib/test_decoding test.
I investigated the reason and found that it was because
CurrentResourceOwner was equal to TopTransactionResourceOwner when we
come out of the decoding loop. If we don't restore it then subsequent
executor activity gets logged in the wrong resowner.

It is true that the one in slotfuncs.c seems to be unnecessary,
but I thought it best to leave it there; it's cheap, and maybe
there's a problem in cases not exercised in the regression tests.

So couldn't we use TopTransactionResourceOwner instead of
AuxProcessResrouceOwner? I feel a bit uneasy that bootstrap and
standalone-backend have *AuxProcess*ResourceOwner.

Since the aux processes aren't running transactions, I didn't think
that TopTransactionResourceOwner was appropriate. There's also
a problem for bootstrap and standalone backend cases: those do run
transactions and therefore create/destroy TopTransactionResourceOwner,
leaving nothing behind for ShutdownXLOG to use if it tries to use
that. We need an extra resowner somewhere.

It's true that if you adopt a narrow definition of "aux process"
as being "one that goes through AuxiliaryProcessMain", calling
this extra resowner AuxProcessResourceOwner is a bit of a misnomer.
I couldn't think of something better to call it, though. With a
slightly wider understanding of "auxiliary process" as being anything
that's not the postmaster or a client-session backend, it's fine.

It's not about this ptch, but while looking this closer, I found
the following comment on ShutdownXLOG().
| /*
| * This must be called ONCE during postmaster or standalone-backend shutdown
| */
Is the "postmaster" typo of "bootstrap process"? It is also
called from checkpointer when non-standlone mode.

Well, it's actually executed by the checkpointer these days, but
that's an implementation detail. I think this comment is fine:
at some point while the postmaster is shutting down, this should
be done, and done just once.

regards, tom lane

#14

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Tom Lane (#13)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

On Wed, Jul 18, 2018 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So couldn't we use TopTransactionResourceOwner instead of
AuxProcessResrouceOwner? I feel a bit uneasy that bootstrap and
standalone-backend have *AuxProcess*ResourceOwner.

Since the aux processes aren't running transactions, I didn't think
that TopTransactionResourceOwner was appropriate. There's also
a problem for bootstrap and standalone backend cases: those do run
transactions and therefore create/destroy TopTransactionResourceOwner,
leaving nothing behind for ShutdownXLOG to use if it tries to use
that. We need an extra resowner somewhere.

FallbackResourceOwner? DefaultResourceOwner? SessionResourceOwner?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Robert Haas (#14)

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jul 18, 2018 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So couldn't we use TopTransactionResourceOwner instead of
AuxProcessResrouceOwner? I feel a bit uneasy that bootstrap and
standalone-backend have *AuxProcess*ResourceOwner.

Since the aux processes aren't running transactions, I didn't think
that TopTransactionResourceOwner was appropriate. There's also
a problem for bootstrap and standalone backend cases: those do run
transactions and therefore create/destroy TopTransactionResourceOwner,
leaving nothing behind for ShutdownXLOG to use if it tries to use
that. We need an extra resowner somewhere.

FallbackResourceOwner? DefaultResourceOwner? SessionResourceOwner?

Those names all suggest (to me anyway) that this resowner exists in
all, or at least most, processes. That's not the situation as of this
patch, although I could imagine an alternate universe where it's true;
for example, if we decided there were a reason for normal backends to
have a session-lifespan resowner. But even then, it might be better
to distinguish that from aux processes' use of resowners.

(I'm not really convinced that it'd be a good idea for normal backends
to have a session-lifespan resowner; that could mask bugs involving
trying to acquire resources outside a transaction.)

regards, tom lane

ENOSPC FailedAssertion("!(RefCountErrors == 0)"

Attachments:

Attachments:

Attachments: