Spinlock performance improvement proposal
At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine. Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.
The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on. Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.
Comments anyone?
regards, tom lane
Plan:
Replace most uses of spinlocks with "lightweight locks" (LW locks)
implemented by a new lock manager. The principal remaining use of true
spinlocks (TAS locks) will be to provide mutual exclusion of access to
LW lock structures. Therefore, we can assume that spinlocks are never
held for more than a few dozen instructions --- and never across a kernel
call.
It's pretty easy to rejigger the spinlock code to work well when the lock
is never held for long. We just need to change the spinlock retry code
so that it does a tight spin (continuous retry) for a few dozen cycles ---
ideally, the total delay should be some small multiple of the max expected
lock hold time. If lock still not acquired, yield the CPU via a select()
call (10 msec minimum delay) and repeat. Although this looks inefficient,
it doesn't matter on a uniprocessor because we expect that backends will
only rarely be interrupted while holding the lock, so in practice a held
lock will seldom be encountered. On SMP machines the tight spin will win
since the lock will normally become available before we give up and yield
the CPU.
Desired properties of the LW lock manager include:
* very fast fall-through when no contention for lock
* waiting proc does not spin
* support both exclusive and shared (read-only) lock modes
* grant lock to waiters in arrival order (no starvation)
* small lock structure to allow many LW locks to exist.
Proposed contents of LW lock structure:
spinlock mutex (protects LW lock state and PROC queue links)
count of exclusive holders (always 0 or 1)
count of shared holders (0 .. MaxBackends)
queue head pointer (NULL or ptr to PROC object)
queue tail pointer (could do without this to save space)
If a backend sees it must wait to acquire the lock, it adds its PROC
struct to the end of the queue, releases the spinlock mutex, and then
sleeps by P'ing its per-backend wait semaphore. A backend releasing the
lock will check to see if any waiter should be granted the lock. If so,
it will update the lock state, release the spinlock mutex, and finally V
the wait semaphores of any backends that it decided should be released
(which it removed from the lock's queue while holding the sema). Notice
that no kernel calls need be done while holding the spinlock. Since the
wait semaphore will remember a V occurring before P, there's no problem
if the releaser is fast enough to release the waiter before the waiter
reaches its P operation.
We will need to add a few fields to PROC structures:
* Flag to show whether PROC is waiting for an LW lock, and if so
whether it waits for read or write access
* Additional PROC queue link field.
We can't reuse the existing queue link field because it is possible for a
PROC to be waiting for both a heavyweight lock and a lightweight one ---
this will occur when HandleDeadLock or LockWaitCancel tries to acquire
the LockMgr module's lightweight lock (formerly spinlock).
It might seem that we also need to create a second wait semaphore per
backend, one to wait on HW locks and one to wait on LW locks. But I
believe we can get away with just one, by recognizing that a wait for an
LW lock can never be interrupted by a wait for a HW lock, only vice versa.
After being awoken (V'd), the LW lock manager must check to see if it was
actually granted the lock (easiest way: look at own PROC struct to see if
LW lock wait flag has been cleared). If not, the V must have been to
grant us a HW lock --- but we still have to sleep to get the LW lock. So
remember this happened, then loop back and P again. When we finally get
the LW lock, if there was an extra P operation then V the semaphore once
before returning. This will allow ProcSleep to exit the wait for the HW
lock when we return to it.
Fine points:
While waiting for an LW lock, we need to show in our PROC struct whether
we are waiting for read or write access. But we don't need to remember
this after getting the lock; if we know we have the lock, it's easy to
see by inspecting the lock whether we hold read or write access.
ProcStructLock cannot be replaced by an LW lock, since a backend cannot
use an LW lock until it has obtained a PROC struct and a semaphore,
both of which are protected by this lock. It seems okay to use a plain
spinlock for this purpose. NOTE: it's okay for SInvalLock to be an LW
lock, as long as the LW mgr does not depend on accessing the SI array
of PROC objects, but only chains through the PROCs themselves.
Another tricky point is that some of the setup code executed by the
postmaster may try to to grab/release LW locks. Here, we can probably
allow a special case for MyProc=NULL. It's likely that we should never
see a block under these circumstances anyway, so finding MyProc=NULL when
we need to block may just be a fatal error condition.
A nastier case is checkpoint processes; these expect to grab BufMgr and
WAL locks. Perhaps okay for them to do plain sleeps in between attempts
to grab the locks? This says that the MyProc=NULL case should release
the spinlock mutex, sleep 10 msec, try again, rather than any sort of error
or expectation of no conflict. Are there any cases where this represents
a horrid performance loss? Checkpoint itself seems noncritical.
Alternative is for checkpoint to be allowed to create a PROC struct (but
not to enter it in SI list) so's it can participate normally in LW lock
operations. That seems a good idea anyway, actually, so that the PROC
struct's facility for releasing held LW locks at elog time will work
inside the checkpointer. (But that means we need an extra sema too?
Okay, but don't want an extra would-be backend to obtain the extra sema
and perhaps cause a checkpoint proc to fail. So must allocate the PROC
and sema for checkpoint process separately from those reserved for
backends.)
Sounds cool to me ... definitely something to fix before v7.2, if its as
"easy" as you make it sound ... I'm expecting the new drive to be
installed today (if all goes well ... Thomas still has his date/time stuff
to finish off, now that CVSup is fixed ...
Let''s try and target Monday for Beta then? I think the only two
outstaandings are you and Thomas right now?
Bruce, that latest rtree patch looks intriguing also ... can anyone
comment positive/negative about it, so that we can try and get that in
before Beta?
On Wed, 26 Sep 2001, Tom Lane wrote:
Show quoted text
At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine. Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on. Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.Comments anyone?
regards, tom lane
Plan:
Replace most uses of spinlocks with "lightweight locks" (LW locks)
implemented by a new lock manager. The principal remaining use of true
spinlocks (TAS locks) will be to provide mutual exclusion of access to
LW lock structures. Therefore, we can assume that spinlocks are never
held for more than a few dozen instructions --- and never across a kernel
call.It's pretty easy to rejigger the spinlock code to work well when the lock
is never held for long. We just need to change the spinlock retry code
so that it does a tight spin (continuous retry) for a few dozen cycles ---
ideally, the total delay should be some small multiple of the max expected
lock hold time. If lock still not acquired, yield the CPU via a select()
call (10 msec minimum delay) and repeat. Although this looks inefficient,
it doesn't matter on a uniprocessor because we expect that backends will
only rarely be interrupted while holding the lock, so in practice a held
lock will seldom be encountered. On SMP machines the tight spin will win
since the lock will normally become available before we give up and yield
the CPU.Desired properties of the LW lock manager include:
* very fast fall-through when no contention for lock
* waiting proc does not spin
* support both exclusive and shared (read-only) lock modes
* grant lock to waiters in arrival order (no starvation)
* small lock structure to allow many LW locks to exist.Proposed contents of LW lock structure:
spinlock mutex (protects LW lock state and PROC queue links)
count of exclusive holders (always 0 or 1)
count of shared holders (0 .. MaxBackends)
queue head pointer (NULL or ptr to PROC object)
queue tail pointer (could do without this to save space)If a backend sees it must wait to acquire the lock, it adds its PROC
struct to the end of the queue, releases the spinlock mutex, and then
sleeps by P'ing its per-backend wait semaphore. A backend releasing the
lock will check to see if any waiter should be granted the lock. If so,
it will update the lock state, release the spinlock mutex, and finally V
the wait semaphores of any backends that it decided should be released
(which it removed from the lock's queue while holding the sema). Notice
that no kernel calls need be done while holding the spinlock. Since the
wait semaphore will remember a V occurring before P, there's no problem
if the releaser is fast enough to release the waiter before the waiter
reaches its P operation.We will need to add a few fields to PROC structures:
* Flag to show whether PROC is waiting for an LW lock, and if so
whether it waits for read or write access
* Additional PROC queue link field.
We can't reuse the existing queue link field because it is possible for a
PROC to be waiting for both a heavyweight lock and a lightweight one ---
this will occur when HandleDeadLock or LockWaitCancel tries to acquire
the LockMgr module's lightweight lock (formerly spinlock).It might seem that we also need to create a second wait semaphore per
backend, one to wait on HW locks and one to wait on LW locks. But I
believe we can get away with just one, by recognizing that a wait for an
LW lock can never be interrupted by a wait for a HW lock, only vice versa.
After being awoken (V'd), the LW lock manager must check to see if it was
actually granted the lock (easiest way: look at own PROC struct to see if
LW lock wait flag has been cleared). If not, the V must have been to
grant us a HW lock --- but we still have to sleep to get the LW lock. So
remember this happened, then loop back and P again. When we finally get
the LW lock, if there was an extra P operation then V the semaphore once
before returning. This will allow ProcSleep to exit the wait for the HW
lock when we return to it.Fine points:
While waiting for an LW lock, we need to show in our PROC struct whether
we are waiting for read or write access. But we don't need to remember
this after getting the lock; if we know we have the lock, it's easy to
see by inspecting the lock whether we hold read or write access.ProcStructLock cannot be replaced by an LW lock, since a backend cannot
use an LW lock until it has obtained a PROC struct and a semaphore,
both of which are protected by this lock. It seems okay to use a plain
spinlock for this purpose. NOTE: it's okay for SInvalLock to be an LW
lock, as long as the LW mgr does not depend on accessing the SI array
of PROC objects, but only chains through the PROCs themselves.Another tricky point is that some of the setup code executed by the
postmaster may try to to grab/release LW locks. Here, we can probably
allow a special case for MyProc=NULL. It's likely that we should never
see a block under these circumstances anyway, so finding MyProc=NULL when
we need to block may just be a fatal error condition.A nastier case is checkpoint processes; these expect to grab BufMgr and
WAL locks. Perhaps okay for them to do plain sleeps in between attempts
to grab the locks? This says that the MyProc=NULL case should release
the spinlock mutex, sleep 10 msec, try again, rather than any sort of error
or expectation of no conflict. Are there any cases where this represents
a horrid performance loss? Checkpoint itself seems noncritical.Alternative is for checkpoint to be allowed to create a PROC struct (but
not to enter it in SI list) so's it can participate normally in LW lock
operations. That seems a good idea anyway, actually, so that the PROC
struct's facility for releasing held LW locks at elog time will work
inside the checkpointer. (But that means we need an extra sema too?
Okay, but don't want an extra would-be backend to obtain the extra sema
and perhaps cause a checkpoint proc to fail. So must allocate the PROC
and sema for checkpoint process separately from those reserved for
backends.)---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
"Marc G. Fournier" <scrappy@hub.org> writes:
Let''s try and target Monday for Beta then?
Sounds like a plan.
regards, tom lane
The plan for the new spinlocks does look like it has some potential. My
only comment in regards to permformance when we start looking at SMP
machines is ... it is my belief that getting a true threaded backend may
be the only way to get the full potential out of SMP machines. I see that
is one of the things to experiment with on the TODO list and I have seen
some people have messed around already with this using Solaris threads.
It should probably be attempted with pthreads if PostgreSQL is going to
keep some resemblance of cross-platform compatibility. At that time, it
would probably be easier to go in and clean up some stuff for the
implementation of other TODO items (put in the base framework for more
complex future items) as threading the backend would take a little bit of
ideology shift.
Of course, it is much easier to stand back and talk about this then
actually do it - especially comming from someone who has only tried to
contribute a few pieces of code. Keep up the good work.
On Wed, 26 Sep 2001, Tom Lane wrote:
At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine. Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on. Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.Comments anyone?
regards, tom lane
--
//========================================================\\
|| D. Hageman <dhageman@dracken.com> ||
\\========================================================//
Tom Lane wrote:
At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine. Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on. Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.Comments anyone?
We have been doing some scalability testing just recently here at Red
Hat. The machine I was using was a 4-way 550 MHz Xeon SMP machine, I
also ran the machine in uniprocessor mode to make some comparisons. All
runs were made on Red Hat Linux running 2.4.x series kernels. I've
examined a number of potentially interesting cases -- I'm still
analyzing the results, but some of the initial results might be
interesting:
- We have tried benchmarking the following: TAS spinlocks (existing
implementation), SysV semaphores (existing implementation), Pthread
Mutexes. Pgbench runs were conducted for 1 to 512 simultaneous backends.
For these three cases we found:
- TAS spinlocks fared the best of all three lock types, however above
100 clients the Pthread mutexes were lock step in performance. I expect
this is due to the cost of any system calls being negligible
relative to lock wait time.
- SysV semaphore implementation faired terribly as expected. However,
it is worse, relative to the TAS spinlocks on SMP than on uniprocessor.
- Since the above seemed to indicate that the lock implementation may
not be the problem (Pthread mutexes are supposed to be implemented to be
less bang-bang than the Postgres TAS spinlocks, IIRC), I decided to
profile Postgres. After much trouble, I got results for it using
oprofile, a kernel profiler for Linux. Unfortunately, I can only profile
for uniprocessor right now using oprofile, as it doesn't support SMP
boxes yet. (soon, I hope.)
Initial results (top five -- if you would like a complete profile, let
me know):
Each sample counts as 1 samples.
% cumulative self self total
time samples samples calls T1/call T1/call name
26.57 42255.02 42255.02
FindLockCycleRecurse
5.55 51081.02 8826.00 s_lock_sleep
5.07 59145.03 8064.00 heapgettup
4.48 66274.03 7129.00 hash_search
4.48 73397.03 7123.00 s_lock
2.85 77926.03 4529.00
HeapTupleSatisfiesSnapshot
2.07 81217.04 3291.00 SHMQueueNext
1.85 84154.04 2937.00 AllocSetAlloc
1.84 87085.04 2931.00 fmgr_isbuiltin
1.64 89696.04 2611.00 set_ps_display
1.51 92101.04 2405.00 FunctionCall2
1.47 94442.04 2341.00 XLogInsert
1.39 96649.04 2207.00 _bt_compare
1.22 98597.04 1948.00 SpinAcquire
1.22 100544.04 1947.00 LockBuffer
1.21 102469.04 1925.00 tag_hash
1.01 104078.05 1609.00 LockAcquire
.
.
.
(The samples are proportional to execution time.)
This would seem to point to the deadlock detector. (Which some have
fingered as a possible culprit before, IIRC.)
However, this seems to be a red herring. Removing the deadlock detector
had no effect. In fact, benchmarking showed removing it yielded no
improvement in transaction processing rate on uniprocessor or SMP
systems. Instead, it seems that the deadlock detector simply amounts to
"something to do" for the blocked backend while it waits for lock
acquisition.
Profiling bears this out:
Flat profile:
Each sample counts as 1 samples.
% cumulative self self total
time samples samples calls T1/call T1/call name
12.38 14112.01 14112.01 s_lock_sleep
10.18 25710.01 11598.01 s_lock
6.47 33079.01 7369.00 hash_search
5.88 39784.02 6705.00 heapgettup
5.32 45843.02 6059.00
HeapTupleSatisfiesSnapshot
2.62 48830.02 2987.00 AllocSetAlloc
2.48 51654.02 2824.00 fmgr_isbuiltin
1.89 53813.02 2159.00 XLogInsert
1.86 55938.02 2125.00 _bt_compare
1.72 57893.03 1955.00 SpinAcquire
1.61 59733.03 1840.00 LockBuffer
1.60 61560.03 1827.00 FunctionCall2
1.56 63339.03 1779.00 tag_hash
1.46 65007.03 1668.00 set_ps_display
1.20 66372.03 1365.00 SearchCatCache
1.14 67666.03 1294.00 LockAcquire
.
.
.
Our current suspicion isn't that the lock implementation is the only
problem (though there is certainly room for improvement), or perhaps
isn't even the main problem. For example, there has been some suggestion
that perhaps some component of the database is causing large lock
contention. My opinion is that rather than guessing and taking stabs in
the dark, we need to take a more reasoned approach to these things.
IMHO, the next step should be to apply instrumentation (likely via some
neat macros) to all lock acquires / releases. Then, it will be possible
to determine what components are the greatest consumers of locks, and to
determine whether it is a component problem or a systemic problem. (i.e.
some component vs. simply just the lock implementation.)
Neil
--
Neil Padgett
Red Hat Canada Ltd. E-Mail: npadgett@redhat.com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9
"D. Hageman" <dhageman@dracken.com> writes:
The plan for the new spinlocks does look like it has some potential. My
only comment in regards to permformance when we start looking at SMP
machines is ... it is my belief that getting a true threaded backend may
be the only way to get the full potential out of SMP machines.
Depends on what you mean. For scaling well with many connections and
simultaneous queries, there's no reason IMHO that the current
process-per-backend model won't do, assuming the locking issues are
addressed.
If you're talking about making a single query use multiple CPUs, then
yes, we're probably talking about a fundamental rewrite to use threads
or some other mechanism.
-Doug
--
In a world of steel-eyed death, and men who are fighting to be warm,
Come in, she said, I'll give you shelter from the storm. -Dylan
Import Notes
Reply to msg id not found: D.HagemansmessageofWed26Sep2001134046-0500CDT
On 26 Sep 2001, Doug McNaught wrote:
"D. Hageman" <dhageman@dracken.com> writes:
The plan for the new spinlocks does look like it has some potential. My
only comment in regards to permformance when we start looking at SMP
machines is ... it is my belief that getting a true threaded backend may
be the only way to get the full potential out of SMP machines.Depends on what you mean. For scaling well with many connections and
simultaneous queries, there's no reason IMHO that the current
process-per-backend model won't do, assuming the locking issues are
addressed.
Well, I know the current process-per-backend model does quite well. My
argument is not that it fails to do as intended. My original argument is
that it is belief (at the momment with the knowledge I have) to get the
full potential out of SMP machines - threads might be the way to go. The
data from RedHat is quite interesting, so my feelings on this might
change or could be re-inforced. I watch anxiously ;-)
If you're talking about making a single query use multiple CPUs, then
yes, we're probably talking about a fundamental rewrite to use threads
or some other mechanism.
Well, we have several thread model ideologies that we could chose from.
Only experimentation would let us determine the proper path to follow and
then it wouldn't be ideal for everyone. You kinda just have to take the
best scenerio and run with it. My first inclination would be something
like a thread per connection (to reduce connection overhead), but then we
could run into limits on different platforms (threads per process). I
kinda like the idea of using a thread for replication purposes ... lots
of interesting possibilities exist and I will be first to admit that I
don't have all the answers.
--
//========================================================\\
|| D. Hageman <dhageman@dracken.com> ||
\\========================================================//
Neil Padgett <npadgett@redhat.com> writes:
Initial results (top five -- if you would like a complete profile, let
me know):
Each sample counts as 1 samples.
% cumulative self self total
time samples samples calls T1/call T1/call name
26.57 42255.02 42255.02 FindLockCycleRecurse
Yipes. It would be interesting to know more about the locking pattern
of your benchmark --- are there long waits-for chains, or not? The
present deadlock detector was certainly written with an eye to "get it
right" rather than "make it fast", but I wonder whether this shows a
performance problem in the detector, or just too many executions because
you're waiting too long to get locks.
However, this seems to be a red herring. Removing the deadlock detector
had no effect. In fact, benchmarking showed removing it yielded no
improvement in transaction processing rate on uniprocessor or SMP
systems. Instead, it seems that the deadlock detector simply amounts to
"something to do" for the blocked backend while it waits for lock
acquisition.
Do you have any idea about the typical lock-acquisition delay in this
benchmark? Our docs advise trying to set DEADLOCK_TIMEOUT higher than
the typical acquisition delay, so that the deadlock detector does not
run unnecessarily.
For example, there has been some suggestion
that perhaps some component of the database is causing large lock
contention.
My thought as well. I would certainly recommend that you use more than
one test case while looking at these things.
regards, tom lane
"D. Hageman" wrote:
The plan for the new spinlocks does look like it has some potential. My
only comment in regards to permformance when we start looking at SMP
machines is ... it is my belief that getting a true threaded backend may
be the only way to get the full potential out of SMP machines. I see that
is one of the things to experiment with on the TODO list and I have seen
some people have messed around already with this using Solaris threads.
It should probably be attempted with pthreads if PostgreSQL is going to
keep some resemblance of cross-platform compatibility. At that time, it
would probably be easier to go in and clean up some stuff for the
implementation of other TODO items (put in the base framework for more
complex future items) as threading the backend would take a little bit of
ideology shift.
I can only think of two objectives for threading. (1) running the various
connections in their own thread instead of their own process. (2) running
complex queries across multiple threads.
For item (1) I see no value to this. It is a lot of work with no tangible
benefit. If you have an old fashion pthreads implementation, it will hurt
performance because are scheduled within the single process's time slice.. If
you have a newer kernel scheduled implementation, then you will have the same
scheduling as separate processes. The only thing you will need to do is
switch your brain from figuring out how to share data, to trying to figure
out how to isolate data. A multithreaded implementation lacks many of the
benefits and robustness of a multiprocess implementation.
For item (2) I can see how that could speed up queries in a low utilization
system, and that would be cool, but in a server that is under load, threading
the queries probably be less efficient.
Tom Lane wrote:
Neil Padgett <npadgett@redhat.com> writes:
Initial results (top five -- if you would like a complete profile, let
me know):
Each sample counts as 1 samples.
% cumulative self self total
time samples samples calls T1/call T1/call name
26.57 42255.02 42255.02 FindLockCycleRecurseYipes. It would be interesting to know more about the locking pattern
of your benchmark --- are there long waits-for chains, or not? The
present deadlock detector was certainly written with an eye to "get it
right" rather than "make it fast", but I wonder whether this shows a
performance problem in the detector, or just too many executions because
you're waiting too long to get locks.However, this seems to be a red herring. Removing the deadlock detector
had no effect. In fact, benchmarking showed removing it yielded no
improvement in transaction processing rate on uniprocessor or SMP
systems. Instead, it seems that the deadlock detector simply amounts to
"something to do" for the blocked backend while it waits for lock
acquisition.Do you have any idea about the typical lock-acquisition delay in this
benchmark? Our docs advise trying to set DEADLOCK_TIMEOUT higher than
the typical acquisition delay, so that the deadlock detector does not
run unnecessarily.
Well. Currently the runs are the typical pg_bench runs. This was useful
since it was a handy benchmark that was already done, and I was hoping
it might be useful for comparison since it seems to be popular. More
benchmarks of different types would of course be useful though.
I think the large time consumed by the deadlock detector in the profile
is simply due to too many executions while waiting to acquire to
contended locks. But, I agree that it seems DEADLOCK_TIMEOUT was set too
low, since it appears from the profile output that the deadlock detector
was running unnecessarily. But the deadlock detector isn't causing the
SMP performance hit right now, since the throughput is the same with it
in place or with it removed completely. I therefore didn't make any
attempt to tune DEADLOCK_TIMEOUT. As I mentioned before, it apparently
just gives the backend "something" to do while it waits for a lock.
I'm thinking that the deadlock detector unnecessarily has no effect on
performance since the shared memory is causing some level of
serialization. So, one CPU (or two, or three, but not all) is doing
useful work, while the others are idle (that is to say, doing no useful
work). If they are idle spinning, or idle running the deadlock detector
the net throughput is still the same. (This might also indicate that
improving the lock design won't help here.) Of course, another
possibility is that you spend so long spinning simply because you do
spin (rather than sleep), and this is wasting much CPU time so the
useful work backends take longer to get things done. Either is just
speculation right now without any data to back things up.
For example, there has been some suggestion
that perhaps some component of the database is causing large lock
contention.My thought as well. I would certainly recommend that you use more than
one test case while looking at these things.
Yes. That is another suggestion for a next step. Several cases might
serve to better expose the path causing the slowdown. I think that
several test cases of varying usage patterns, coupled with hold time
instrumentation (which can tell what routine acquired the lock and how
long it held it, and yield wait-for data in the analysis), are the right
way to go about attacking SMP performance. Any other thoughts?
Neil
--
Neil Padgett
Red Hat Canada Ltd. E-Mail: npadgett@redhat.com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9
On Wed, 26 Sep 2001, mlw wrote:
I can only think of two objectives for threading. (1) running the various
connections in their own thread instead of their own process. (2) running
complex queries across multiple threads.For item (1) I see no value to this. It is a lot of work with no tangible
benefit. If you have an old fashion pthreads implementation, it will hurt
performance because are scheduled within the single process's time slice..
Old fashion ... as in a userland library that implements POSIX threads?
Well, I would agree. However, most *modern* implementations are done in
the kernel or kernel and userland coop model and don't have this
limitation (as you mention later in your e-mail). You have kinda hit on
one of my gripes about computers in general. At what point in time does
one say something is obsolete or too old to support anymore - that it
hinders progress instead of adding a "feature"?
you have a newer kernel scheduled implementation, then you will have the same
scheduling as separate processes. The only thing you will need to do is
switch your brain from figuring out how to share data, to trying to figure
out how to isolate data. A multithreaded implementation lacks many of the
benefits and robustness of a multiprocess implementation.
Save for the fact that the kernel can switch between threads faster then
it can switch processes considering threads share the same address space,
stack, code, etc. If need be sharing the data between threads is much
easier then sharing between processes.
I can't comment on the "isolate data" line. I am still trying to figure
that one out.
That last line is a troll if I every saw it ;-) I will agree that threads
isn't for everything and that it has costs just like everything else. Let
me stress that last part - like everything else. Certain costs exist in
the present model, nothing is - how should we say ... perfect.
For item (2) I can see how that could speed up queries in a low utilization
system, and that would be cool, but in a server that is under load, threading
the queries probably be less efficient.
Well, I don't follow your logic and you didn't give any substance to back
up your claim. I am willing to listen.
Another thought ... Oracle uses threads doesn't it or at least it has a
single processor and multi-processor version last time I knew ... which do
they claim is better? (Not saying that Oracle's proclimation of what is
good and what is not matters, but it is good for another view point).
--
//========================================================\\
|| D. Hageman <dhageman@dracken.com> ||
\\========================================================//
Neil Padgett <npadgett@redhat.com> writes:
Well. Currently the runs are the typical pg_bench runs.
With what parameters? If you don't initialize the pg_bench database
with "scale" proportional to the number of clients you intend to use,
then you'll naturally get huge lock contention. For example, if you
use scale=1, there's only one "branch" in the database. Since every
transaction wants to update the branch's balance, every transaction
has to write-lock that single row, and so everybody serializes on that
one lock. Under these conditions it's not surprising to see lots of
lock waits and lots of useless runs of the deadlock detector ...
regards, tom lane
On Wed, 26 Sep 2001, mlw wrote:
I can only think of two objectives for threading. (1) running the various
connections in their own thread instead of their own process. (2) running
complex queries across multiple threads.
I did a multi-threaded version of 7.0.2 using Solaris threads about a year
ago in order to try
and get multiple backend connections working under one java process using
jni. I used the thread per connection model.
I eventually got it working, but it was/is very messy ( there were global
variables everywhere! ). Anyway, I was able to get a pretty good speed up
on inserts by scheduling buffer writes from multiple connections on one
common writing thread.
I also got some other features that were important to me at the time.
1. True prepared statements under java with bound input and output
variables
2. Better system utilization
a. fewer Solaris lightweight processes mapped to threads.
b. Fewer open files per postgres installation
3. Automatic vacuums when system activity is low by a daemon thread.
but there were some drawbacks... One rogue thread or bad user
function could take down all connections for that process. This
was and seems to still be the major drawback to using threads.
Myron Scott
mscott@sacadia.com
"D. Hageman" <dhageman@dracken.com> writes:
you have a newer kernel scheduled implementation, then you will have the same
scheduling as separate processes. The only thing you will need to do is
switch your brain from figuring out how to share data, to trying to figure
out how to isolate data. A multithreaded implementation lacks many of the
benefits and robustness of a multiprocess implementation.Save for the fact that the kernel can switch between threads faster then
it can switch processes considering threads share the same address space,
stack, code, etc. If need be sharing the data between threads is much
easier then sharing between processes.
When using a kernel threading model, it's not obvious to me that the
kernel will switch between threads much faster than it will switch
between processes. As far as I can see, the only potential savings is
not reloading the pointers to the page tables. That is not nothing,
but it is also not a lot.
I can't comment on the "isolate data" line. I am still trying to figure
that one out.
Sometimes you need data which is specific to a particular thread.
Basically, you have to look at every global variable in the Postgres
backend, and determine whether to share it among all threads or to
make it thread-specific. In other words, you have to take extra steps
to isolate the data within the thread. This is the reverse of the
current situation, in which you have to take extra steps to share data
among all backend processes.
That last line is a troll if I every saw it ;-) I will agree that threads
isn't for everything and that it has costs just like everything else. Let
me stress that last part - like everything else. Certain costs exist in
the present model, nothing is - how should we say ... perfect.
When writing in C, threading inevitably loses robustness. Erratic
behaviour by one thread, perhaps in a user defined function, can
subtly corrupt the entire system, rather than just that thread. Part
of defensive programming is building barriers between different parts
of a system. Process boundaries are a powerful barrier.
(Actually, though, Postgres is already vulnerable to erratic behaviour
because any backend process can corrupt the shared buffer pool.)
Ian
"D. Hageman" <dhageman@dracken.com> writes:
Save for the fact that the kernel can switch between threads faster then
it can switch processes considering threads share the same address space,
stack, code, etc. If need be sharing the data between threads is much
easier then sharing between processes.
This depends on your system. Solaris has a huge difference between
thread and process context switch times, whereas Linux has very little
difference (and in fact a Linux process context switch is about as
fast as a Solaris thread switch on the same hardware--Solaris is just
a pig when it comes to process context switching).
I can't comment on the "isolate data" line. I am still trying to figure
that one out.
I think his point is one of clarity and maintainability. When a
task's data is explicitly shared (via shared memory of some sort) it's
fairly clear when you're accessing shared data and need to worry about
locking. Whereas when all data is shared by default (as with threads)
it's very easy to miss places where threads can step on each other.
-Doug
--
In a world of steel-eyed death, and men who are fighting to be warm,
Come in, she said, I'll give you shelter from the storm. -Dylan
Import Notes
Reply to msg id not found: D.HagemansmessageofWed26Sep2001161422-0500CDT
On 26 Sep 2001, Ian Lance Taylor wrote:
Save for the fact that the kernel can switch between threads faster then
it can switch processes considering threads share the same address space,
stack, code, etc. If need be sharing the data between threads is much
easier then sharing between processes.When using a kernel threading model, it's not obvious to me that the
kernel will switch between threads much faster than it will switch
between processes. As far as I can see, the only potential savings is
not reloading the pointers to the page tables. That is not nothing,
but it is also not a lot.
It is my understanding that avoiding a full context switch of the
processor can be of a significant advantage. This is especially important
on processor architectures that can be kinda slow at doing it (x86). I
will admit that most modern kernels have features that assist software
packages utilizing the forking model (copy on write for instance). It is
also my impression that these do a good job. I am the kind of guy that
looks towards the future (as in a year, year and half or so) and say that
processors will hopefully get faster at context switching and more and
more kernels will implement these algorithms to speed up the forking
model. At the same time, I see more and more processors being shoved into
a single box and it appears that the threads model works better on these
type of systems.
I can't comment on the "isolate data" line. I am still trying to figure
that one out.Sometimes you need data which is specific to a particular thread.
When you need data that is specific to a thread you use a TSD (Thread
Specific Data).
Basically, you have to look at every global variable in the Postgres
backend, and determine whether to share it among all threads or to
make it thread-specific.
Yes, if one was to implement threads into PostgreSQL I would think that
some re-writing would be in order of several areas. Like I said before,
give a person a chance to restructure things so future TODO items wouldn't
be so hard to implement. Personally, I like to stay away from global
variables as much as possible. They just get you into trouble.
That last line is a troll if I every saw it ;-) I will agree that threads
isn't for everything and that it has costs just like everything else. Let
me stress that last part - like everything else. Certain costs exist in
the present model, nothing is - how should we say ... perfect.When writing in C, threading inevitably loses robustness. Erratic
behaviour by one thread, perhaps in a user defined function, can
subtly corrupt the entire system, rather than just that thread. Part
of defensive programming is building barriers between different parts
of a system. Process boundaries are a powerful barrier.
I agree with everything you wrote above except for the first line. My
only comment is that process boundaries are only *truely* a powerful
barrier if the processes are different pieces of code and are not
dependent on each other in crippling ways. Forking the same code with the
bug in it - and only 1 in 5 die - is still 4 copies of buggy code running
on your system ;-)
(Actually, though, Postgres is already vulnerable to erratic behaviour
because any backend process can corrupt the shared buffer pool.)
I appreciate your total honest view of the situation.
--
//========================================================\\
|| D. Hageman <dhageman@dracken.com> ||
\\========================================================//
On 26 Sep 2001, Doug McNaught wrote:
This depends on your system. Solaris has a huge difference between
thread and process context switch times, whereas Linux has very little
difference (and in fact a Linux process context switch is about as
fast as a Solaris thread switch on the same hardware--Solaris is just
a pig when it comes to process context switching).
Yeah, I kinda commented on this in another e-mail. Linux has some nice
tweaks for software using the forking model, but I am sure a couple of
Solaris admins out there like to run PostgreSQL. ;-) You are right in
that it is very system dependent. I should have prefaced it with "In
general ..."
I can't comment on the "isolate data" line. I am still trying to figure
that one out.I think his point is one of clarity and maintainability. When a
task's data is explicitly shared (via shared memory of some sort) it's
fairly clear when you're accessing shared data and need to worry about
locking. Whereas when all data is shared by default (as with threads)
it's very easy to miss places where threads can step on each other.
Well, I understand what you are saying and you are correct. The situation
is that when you implement anything using pthreads you lock your
variables (which is where the major performance penalty comes into play
with threads). Now, the kicker is how you lock them. Depending on how
you do it (as per discussion earlier on this list concerning threads) it
can be faster or slower. It all depends on what model you use.
Data is not explicitely shared between threads unless you make it so. The
threads just share the same stack and all of that, but you can't
(shouldn't is probably a better word) really access anything you don't have
an address for. Threads just makes it easier to share if you want to.
Also, see my other e-mail to the list concerning TSDs.
--
//========================================================\\
|| D. Hageman <dhageman@dracken.com> ||
\\========================================================//
Ian Lance Taylor <ian@airs.com> writes:
(Actually, though, Postgres is already vulnerable to erratic behaviour
because any backend process can corrupt the shared buffer pool.)
Not to mention the other parts of shared memory.
Nonetheless, our experience has been that cross-backend failures due to
memory clobbers in shared memory are very infrequent --- certainly far
less often than we see localized-to-a-backend crashes. Probably this is
because the shared memory is (a) small compared to the rest of the
address space and (b) only accessed by certain specific modules within
Postgres.
I'm convinced that switching to a thread model would result in a
significant degradation in our ability to recover from coredump-type
failures, even given the (implausible) assumption that we introduce no
new bugs during the conversion. I'm also *un*convinced that such a
conversion will yield significant performance benefits, unless we
introduce additional cross-thread dependencies (and more fragility
and lock contention) by tactics such as sharing catalog caches across
threads.
regards, tom lane
... Thomas still has his date/time stuff
to finish off, now that CVSup is fixed ...
I'm now getting clean runs through the regression tests on a freshly
merged cvs tree. I'd like to look at it a little more to adjust
pg_proc.h attributes before I commit the changes.
There was a bit of a hiccup when merging since there was some bytea
stuff added to the catalogs over the last couple of weeks. Could folks
hold off on claiming new OIDs until I get this stuff committed? TIA
I expect to be able to merge this stuff by Friday at the latest, more
likely tomorrow.
- Thomas
On Wed, 26 Sep 2001, D. Hageman wrote:
Save for the fact that the kernel can switch between threads faster then
it can switch processes considering threads share the same address space,
stack, code, etc. If need be sharing the data between threads is much
easier then sharing between processes.When using a kernel threading model, it's not obvious to me that the
kernel will switch between threads much faster than it will switch
between processes. As far as I can see, the only potential savings is
not reloading the pointers to the page tables. That is not nothing,
but it is also
<major snippage>
I can't comment on the "isolate data" line. I am still trying to figure
that one out.Sometimes you need data which is specific to a particular thread.
When you need data that is specific to a thread you use a TSD (Thread
Specific Data).
Which Linux does not support with a vengeance, to my knowledge.
As a matter of fact, quote from Linus on the matter was something like
"Solution to slow process switching is fast process switching, not another
kernel abstraction [referring to threads and TSD]". TSDs make
implementation of thread switching complex, and fork() complex.
The question about threads boils down to: Is there far more data that is
shared than unshared? If yes, threads are better, if not, you'll be
abusing TSD and slowing things down.
I believe right now, postgresql' model of sharing only things that need to
be shared is pretty damn good. The only slight problem is overhead of
forking another backend, but its still _fast_.
IMHO, threads would not bring large improvement to postgresql.
Actually, if I remember, there was someone who ported postgresql (I think
it was 6.5) to be multithreaded with major pain, because the requirement
was to integrate with CORBA. I believe that person posted some benchmarks
which were essentially identical to non-threaded postgres...
-alex