RFC: bufmgr locking changes

Started by Neil Conway about 22 years ago; 5 messages
#1 Neil Conway
neilc@samurai.com
1 attachment(s)

I've attached a (gzip'ed) patch that makes the following changes to
the buffer manager:

(1) Overhaul locking; whenever the code needs to modify the state
of an individual buffer, do synchronization via a per-buffer
"meta data lock" rather than the global BufMgrLock. For more
info on the motivation for this, see the several recent
-hackers threads on the subject.

(2) Reduce superfluous dirtying of pages: previously,
LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the
buffer. This behavior has been removed, to reduce the number
of pages that are dirtied. For more info on the motivation for
this, see the previous -hackers thread on the topic.

(3) Simplify the code in a bunch of places

This basically involved rewriting bufmgr.c (which makes the patch a
little illegible; it might be easier to just apply it and read through
the new code). That also means there are still a fair number of bugs
and unresolved issues.

The new locking works as follows:

- modifying the shared buffer table (buf_table.c) or making calls
into the freelist code (freelist.c) still requires holding the
BufMgrLock. There was some talk of trying to add more granular
locking to freelist.c itself: if there is still significant
contention for the BufMgrLock after these changes have been
applied, I'll take a look at doing that

- to modify a BufferDesc's meta data (shared refcount, flags, tag,
etc.), one must hold the buffer's "meta data lock". Also, I
removed the existing "io_in_progress" lock; instead, we now hold
the buffer's meta data lock while doing buffer I/O.

- if a backend holds the BufMgrLock, it will never try to
LWLockAcquire() a per-buffer meta data lock, due to the risk of
deadlock (and the loss in concurrency if we got blocked waiting
while still holding the BufMgrLock). Instead it does a
LWLockConditionalAcquire() and handles an acquisition failure
sanely

- if a backend holds the meta data lock for a buffer, it CAN
attempt to LWLockAcquire() the BufMgrLock. This is safe from
deadlocks, due to the previous paragraph.
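
In code, those last two rules amount to something like this (a sketch,
not the patch itself; "meta_lock" is a placeholder for whatever the
per-buffer lock field ends up being called):

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/lwlock.h"

    /* Rule 1: while holding BufMgrLock, only *try* for a meta data lock. */
    static bool
    mark_buffer_dirty_if_possible(BufferDesc *buf)
    {
        bool    got_it;

        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
        got_it = LWLockConditionalAcquire(buf->meta_lock, LW_EXCLUSIVE);
        if (got_it)
        {
            buf->flags |= BM_DIRTY;     /* safe: meta data lock is held */
            LWLockRelease(buf->meta_lock);
        }
        LWLockRelease(BufMgrLock);
        return got_it;      /* on failure the caller retries or skips */
    }

    /* Rule 2: while holding a meta data lock, blocking on BufMgrLock is OK. */
    static void
    unhash_buffer(BufferDesc *buf)
    {
        LWLockAcquire(buf->meta_lock, LW_EXCLUSIVE);
        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
        /* ... e.g. remove the buffer from the shared buffer table ... */
        LWLockRelease(BufMgrLock);
        LWLockRelease(buf->meta_lock);
    }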

The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t
1000` bails out pretty quickly, for example), but any comments on
whether this scheme is in-theory correct would be very welcome.

For #2, my understanding of the existing XLOG code is incomplete, so
perhaps someone can tell me if I'm on the right track. I've modified a
few of the XLogInsert() call sites so that they now:

- acquire an exclusive buffer cntx_lock

- modify the content of the page/buffer

- WriteNoReleaseBuffer() to mark the buffer as dirty; since we
still hold the buffer's cntx_lock, a checkpoint process can't
write this page to disk yet. This replaces the implicit "mark
this page as dirty" behavior of LockBuffer()

- do the XLogInsert()

- release the cntx_lock

- ReleaseBuffer() to decrement the buffer's pincount

For example, sequence.c now works like this (I've probably missed a
few places) -- is this correct, and/or is there a better way to do it?
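
Concretely, each converted call site ends up shaped roughly like this
sketch (not the actual sequence.c code; rmid/info/rdata stand in for
whatever record is being logged):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    static void
    log_page_change(Buffer buffer, RmgrId rmid, uint8 info,
                    XLogRecData *rdata)
    {
        Page        page = BufferGetPage(buffer);
        XLogRecPtr  recptr;

        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);  /* take the cntx_lock */

        /* ... modify the page contents here ... */

        WriteNoReleaseBuffer(buffer);   /* mark dirty; a checkpoint can't
                                         * write the page while we still
                                         * hold the cntx_lock */
        recptr = XLogInsert(rmid, info, rdata);
        PageSetLSN(page, recptr);       /* usual XLOG protocol */

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* drop the cntx_lock */
        ReleaseBuffer(buffer);                      /* drop the pin */
    }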

Notes, and caveats:

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
completely equivalent to WriteNoReleaseBuffer(), so I just removed
the former and replaced all the calls to it with calls to the latter.

- Moved the code for flushing a dirty buffer to disk to buf_io.c

- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c

- Removed the following now-unused buffer flag bits: BM_VALID,
BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
'flags' field of the BufferDesc down to 8 bits (from 16)

- Removed the cntxDirty field from the BufferDesc: now that we don't
need to acquire the BufMgrLock to modify the buffer's flags, there's
no reason to keep this around

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
bytes * shared_buffers per backend on some 64-bit machines. It means
a given backend can only pin a single buffer 65,535 times, but that
should be more than enough. Similarly, made LocalRefCount an array
of uint16s.

I was thinking of adding overflow checking to the refcount increment
code to make sure we fail safely if a backend *does* try to exceed
this number of pins, but I can't imagine a scenario when it would
actually occur, so I haven't bothered.
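
If I did add it, it would be something along these lines (hypothetical;
the helper name and error text are only illustrative):

    #include "postgres.h"

    extern uint16 *PrivateRefCount;     /* one entry per shared buffer */

    static void
    IncrPrivateRefCount(int bufid)
    {
        /* 0xFFFF == 65,535, the uint16 ceiling */
        if (PrivateRefCount[bufid] == 0xFFFF)
            elog(ERROR, "too many pins on buffer %d in this backend",
                 bufid + 1);
        PrivateRefCount[bufid]++;
    }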

- Remove the BufferLocks array. AFAICS this wasn't actually necessary.

- A few loose ends in the code still need to be wrapped up (for
example, I need to take another glance at the pin-waiter stuff, and
the error handling still needs some more work), but I think most of
the functionality is there. Areas of concern are denoted by 'XXX'
comments.

Any comments?

-Neil

Attachments:

buffer-locking-45.patch.gz (application/octet-stream)
#2 Kurt Roeckx
Q@ping.be
In reply to: Neil Conway (#1)
Re: RFC: bufmgr locking changes

On Wed, Jan 07, 2004 at 05:37:09PM -0500, Neil Conway wrote:

- if a backend holds the BufMgrLock, it will never try to
LWLockAcquire() a per-buffer meta data lock, due to the risk of
deadlock (and the loss in concurrency if we got blocked waiting
while still holding the BufMgrLock). Instead it does a
LWLockConditionalAcquire() and handles an acquisition failure
sanely

I can only see this working if you always check that you actually
got the right buffer after you locked the meta data. There is
nothing preventing another backend from using the buffer for
something else between the time you release the BufMgrLock and
the time you lock the buffer's meta data.
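
In other words, once the meta data lock is finally held, a re-check
along these lines seems necessary (just a sketch; "meta_lock" stands in
for whatever the patch calls the per-buffer lock):

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/lwlock.h"

    static bool
    lock_buffer_if_still_wanted(BufferDesc *buf, BufferTag expected)
    {
        LWLockRelease(BufMgrLock);
        LWLockAcquire(buf->meta_lock, LW_EXCLUSIVE);

        if (RelFileNodeEquals(buf->tag.rnode, expected.rnode) &&
            buf->tag.blockNum == expected.blockNum)
            return true;    /* still holds the page we looked up */

        /* recycled for another page while we were unlocked: retry */
        LWLockRelease(buf->meta_lock);
        return false;
    }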

Note that your PinBuffer() is now always called when you already
hold the meta data lock, which is probably a good idea, or you're
going to make that function more complex than it should be. Just
say that it should hold the meta data lock instead of the BufMgrLock
when it's called.

Kurt

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#1)
Re: RFC: bufmgr locking changes

Neil Conway <neilc@samurai.com> writes:

- to modify a BufferDesc's meta data (shared refcount, flags, tag,
etc.), one must hold the buffer's "meta data lock". Also, I
removed the existing "io_in_progress" lock; instead, we now hold
the buffer's meta data lock while doing buffer I/O.

The latter is a really bad idea IMHO. The io_in_progress lock can be
held for eons (in CPU terms) and should not be blocking people who
simply want to bump their refcount up and down. In particular, there is
no reason for an in-process write to block people who want read-only
access to the buffer.

It's possible that you could combine the io_in_progress lock with the
cntx_lock, but I doubt locking out access to the metadata during i/o
is a good idea.

- if a backend holds the BufMgrLock, it will never try to
LWLockAcquire() a per-buffer meta data lock, due to the risk of
deadlock (and the loss in concurrency if we got blocked waiting
while still holding the BufMgrLock). Instead it does a
LWLockConditionalAcquire() and handles an acquisition failure
sanely

Hmm, what's "sanely" exactly? It seems to me that figuring out how to
not need to lock in this direction is a critical part of the redesign,
so you can't just handwave here and expect people to understand.

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
completely equivalent to WriteNoReleaseBuffer(), so I just removed
the former and replaced all the calls to it with calls to the latter.

The reason I've kept the separation was as a form of documentation as to
the reason for each write. Although they currently do the same thing,
that might not always be true. I'd prefer not to eliminate the
distinction from the source code --- though I'd not object if you want
to make SetBufferCommitInfoNeedsSave a macro that invokes the other
routine.

- Removed the following now-unused buffer flag bits: BM_VALID,
BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
'flags' field of the BufferDesc down to 8 bits (from 16)

I cannot believe that it's workable to remove BM_JUST_DIRTIED. How are
you handling the race conditions that this was for? I'm also
unconvinced that removing BM_IO_IN_PROGRESS is a good idea.

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
bytes * shared_buffers per backend on some 64-bit machines. It means
a given backend can only pin a single buffer 65,535 times, but that
should be more than enough. Similarly, made LocalRefCount an array
of uint16s.

I think both of these are ill-considered micro-optimization. How do you
know that the pin count can't exceed 64K? Consider recursive plpgsql
functions for a likely counterexample.

I was thinking of adding overflow checking to the refcount increment
code to make sure we fail safely if a backend *does* try to exceed
this number of pins, but I can't imagine a scenario when it would
actually occur, so I haven't bothered.

Leaving the arrays as longs is a much saner approach IMHO.

- Remove the BufferLocks array. AFAICS this wasn't actually necessary.

Please put that back. It is there to avoid unnecessary acquisitions of
buffer locks during UnlockBuffers (which is executed during any
transaction abort). Without it, you will be needing to lock every
single buffer during an abort in order to check its flags.
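
To be concrete, the current cleanup loop looks roughly like this
(simplified sketch, with the externs written out), and the
BufferLocks[] test is exactly what keeps it cheap:

    #include "postgres.h"

    extern int    NBuffers;
    extern bits8 *BufferLocks;  /* per-backend: one flag byte per buffer */

    void
    UnlockBuffers(void)
    {
        int     i;

        for (i = 0; i < NBuffers; i++)
        {
            bits8   buflocks = BufferLocks[i];

            if (buflocks == 0)
                continue;       /* this backend never touched buffer i */

            /* ... only now lock buffer i and clear its wait/IO state ... */

            BufferLocks[i] = 0;
        }
    }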

regards, tom lane

#4 Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#3)
Re: RFC: bufmgr locking changes

(Sorry Tom, I was meaning to reply to you once I'd had a chance to
revise the bufmgr patch; since that seems a fair ways off, I figured
it would be better to respond now.)

Tom Lane <tgl@sss.pgh.pa.us> writes:

Neil Conway <neilc@samurai.com> writes:

we now hold the buffer's meta data lock while doing buffer I/O.

The latter is a really bad idea IMHO. The io_in_progress lock can be
held for eons (in CPU terms) and should not be blocking people who
simply want to bump their refcount up and down.

My reasoning was that the contention for the per-buffer meta data lock
should be pretty low. The io_in_progress lock is held when we're
either faulting a page in or flushing a page out. In the first case,
we can't actually use the buffer no matter how we do the locking (its
content is incomplete), so there's no effective loss in
concurrency. In the second case, what kinds of concurrent activity can
we allow on the buffer? (We can allow reads, of course, but I don't
believe we can allow writes.)

However, I'll think some more on this; you (and Jan, who raised this
point a while ago via IRC) are probably correct.

It's possible that you could combine the io_in_progress lock with the
cntx_lock

Yeah, that's a possibility.

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
completely equivalent to WriteNoReleaseBuffer(), so I just removed
the former and replaced all the calls to it with calls to the latter.

The reason I've kept the separation was as a form of documentation as to
the reason for each write. Although they currently do the same thing,
that might not always be true. I'd prefer not to eliminate the
distinction from the source code --- though I'd not object if you want
to make SetBufferCommitInfoNeedsSave a macro that invokes the other
routine.

Ok, fair enough -- I've changed SetBufferCommitInfoNeedsSave() to be a
macro for WriteNoReleaseBuffer().
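
That is, roughly (presumably in bufmgr.h):

    /* keeps the documentation value of the name while sharing the
     * implementation */
    #define SetBufferCommitInfoNeedsSave(buffer) WriteNoReleaseBuffer(buffer)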

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
bytes * shared_buffers per backend on some 64-bit machines. It means
a given backend can only pin a single buffer 65,535 times, but that
should be more than enough. Similarly, made LocalRefCount an array
of uint16s.

I think both of these are ill-considered micro-optimization. How do you
know that the pin count can't exceed 64K? Consider recursive plpgsql
functions for a likely counterexample.

Fair enough -- I couldn't conceive of an actual scenario in which
a single backend would acquire > 64K pins on a single buffer, but I'll
take your word that it's not so far fetched. However, there is still
room for improvement, IMHO: on a machine with 64-bit longs, we're
plainly allocating 4 bytes more than we need to. Any objection if I
change these to arrays of int32?

Please put that back. It is there to avoid unnecessary acquisitions
of buffer locks during UnlockBuffers (which is executed during any
transaction abort). Without it, you will be needing to lock every
single buffer during an abort in order to check its flags.

It seems bizarre that we need to iterate through a few thousand array
elements just to do some lock cleanup. I'll take a closer look at
this...

-Neil

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#4)
Re: RFC: bufmgr locking changes

Neil Conway <neilc@samurai.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

The latter is a really bad idea IMHO. The io_in_progress lock can be
held for eons (in CPU terms) and should not be blocking people who
simply want to bump their refcount up and down.

My reasoning was that the contention for the per-buffer meta data lock
should be pretty low. The io_in_progress lock is held when we're
either faulting a page in or flushing a page out. In the first case,
we can't actually use the buffer no matter how we do the locking (its
content is incomplete), so there's no effective loss in
concurrency. In the second case, what kinds of concurrent activity can
we allow on the buffer? (We can allow reads, of course, but I don't
believe we can allow writes.)

True, there's no win in the read-busy case, but I think you
underestimate the value of the write-busy case. Multiple concurrent
readers are a very important consideration. In Postgres it is possible
for a reader to cause a write to occur (because it sets commit hint
bits, as per the SetBufferCommitInfoNeedsSave() business), and so you
could have a situation like

Reader pins page
Reader examines some tuples
Reader sets a commit bit and dirties page
...
Writer starts write
...
Reader examines some more tuples
Reader unpins page
Writer finishes write

If the reader can't unpin until the writer is done, then we will have
foreground readers blocked on the background writer process, which is
exactly what we do not want.

I think both of these are ill-considered micro-optimization. How do you
know that the pin count can't exceed 64K? Consider recursive plpgsql
functions for a likely counterexample.

Fair enough -- I couldn't conceive of an actual scenario in which
a single backend would acquire > 64K pins on a single buffer, but I'll
take your word that it's not so far fetched. However, there is still
room for improvement, IMHO: on a machine with 64-bit longs, we're
plainly allocating 4 bytes more than we need to. Any objection if I
change these to arrays of int32?

That seems like a reasonable compromise.

regards, tom lane