reducing the overhead of frequent table locks - now, with WIP patch

Started by Robert Haas, almost 15 years ago, 99 messages, pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

I've now spent enough time working on this issue to be convinced
that the approach has merit, if we can work out the kinks. I'll start
with some performance numbers.

The case where the current system for taking table locks is really
hurting us is where we have a large number of backends attempting to
access a small number of relations. They all fight over the lock
manager lock on whichever partition (or partitions) that relation (or
those relations) fall in. Increasing the number of partitions doesn't
help, because they are all trying to access the same object, and that
object is only ever going to be in one partition. To exercise this
case, I chose the following benchmark: pgbench -n -S -T 300 -c 36 -j
36. I first tested this on my MacBook Pro, with scale factor 10 and
shared_buffers=400MB. Here are the results of alternating runs
without and with the patch:

tps = 23997.120971 (including connections establishing)
tps = 25003.186860 (including connections establishing)
tps = 23499.257892 (including connections establishing)
tps = 24435.793773 (including connections establishing)
tps = 23579.624360 (including connections establishing)
tps = 24791.974810 (including connections establishing)

As you can see, this works out to a bit more than a 4% improvement on
this two-core box. I also got access (thanks to Nate Boley) to a
24-core box and ran the same test with scale factor 100 and
shared_buffers=8GB. Here are the results of alternating runs without
and with the patch on that machine:

tps = 36291.996228 (including connections establishing)
tps = 129242.054578 (including connections establishing)
tps = 36704.393055 (including connections establishing)
tps = 128998.648106 (including connections establishing)
tps = 36531.208898 (including connections establishing)
tps = 131341.367344 (including connections establishing)

That's an improvement of about 3.5x. According to the vmstat output,
when running without the patch, the CPU state was about 40% idle.
With the patch, it dropped down to around 6%.
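The "only ever going to be in one partition" point above can be made concrete with a small sketch. This is simplified, hypothetical code (not PostgreSQL's actual hash or partitioning routines): because the partition is a pure function of the lock tag, raising the partition count never spreads out contention on a single hot table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A lock tag identifying one relation (simplified). */
typedef struct LockTag {
    uint32_t dboid;    /* database OID */
    uint32_t reloid;   /* relation OID */
} LockTag;

/* Stand-in for the lock manager's tag hash (FNV-1a over the tag bytes). */
uint32_t tag_hash(const LockTag *tag)
{
    uint32_t h = 2166136261u;
    const unsigned char *p = (const unsigned char *) tag;
    for (size_t i = 0; i < sizeof(LockTag); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Which partition (and hence which partition LWLock) guards this tag. */
unsigned partition_for(const LockTag *tag, unsigned npartitions)
{
    return tag_hash(tag) % npartitions;
}

/* Every backend locking the same table computes the same partition,
 * whatever npartitions is; they all serialize on that one lock. */
int all_backends_collide(const LockTag *tag, unsigned npartitions,
                         int nbackends)
{
    unsigned p = partition_for(tag, npartitions);
    for (int b = 0; b < nbackends; b++)
        if (partition_for(tag, npartitions) != p)
            return 0;
    return 1;
}
```

With this, 36 backends hammering one table land on one partition lock whether there are 16 partitions or 1024, which is why the benchmark above stresses exactly this path.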

There are numerous problems with the code as it stands at this point.
It crashes if you try to use 2PC, which means the regression tests
fail; it probably does horrible things if you run out of shared
memory; pg_locks knows nothing about the new mechanism (arguably, we
could leave it that way: only locks that can't possibly be conflicting
with anything can be taken using this mechanism, but it would be nice
to fix, I think); and there are likely some other gotchas as well.
Still, the basic mechanism appears to work.

The code is attached, for anyone who may be curious. Known idiocies
are marked with "ZZZ". The design was discussed on the previous
thread ("reducing the overhead of frequent table locks"), q.v. There
are some comments in the patch as well, but more is likely needed.
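Since the design itself is only referenced here (see the earlier thread), a much-reduced, single-threaded sketch of the mechanism may help readers. All names are hypothetical and the real locking, fallback, and 2PC details are omitted: weak relation locks go into a small per-backend slot array guarded only by that backend's own lock, and a would-be strong locker first announces itself via a shared counter and then migrates conflicting slots into the shared lock table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FASTPATH_SLOTS 16
#define MAX_BACKENDS 8

typedef struct Backend {
    uint32_t fp_relid[FASTPATH_SLOTS];  /* 0 = slot free */
} Backend;

Backend backends[MAX_BACKENDS];
int strong_lock_counts;    /* shared: # of strong lockers about */
int shared_table_entries;  /* stand-in for the main lock table */

/* Weak lock (e.g. AccessShareLock): if no strong locker is around,
 * just record it in our own slot array; no shared lock traffic. */
bool lock_relation_weak(int backend, uint32_t relid)
{
    if (strong_lock_counts > 0)
        return false;              /* fall back to the main lock table */
    for (int i = 0; i < FASTPATH_SLOTS; i++)
        if (backends[backend].fp_relid[i] == 0) {
            backends[backend].fp_relid[i] = relid;
            return true;
        }
    return false;                  /* slots full: fall back */
}

/* Strong lock: announce ourselves, then sweep every backend's slots,
 * migrating conflicting entries into the shared table. */
void lock_relation_strong(uint32_t relid)
{
    strong_lock_counts++;          /* blocks new fast-path acquisitions */
    for (int b = 0; b < MAX_BACKENDS; b++)
        for (int i = 0; i < FASTPATH_SLOTS; i++)
            if (backends[b].fp_relid[i] == relid) {
                backends[b].fp_relid[i] = 0;
                shared_table_entries++;   /* migrated */
            }
}
```

The payoff is that the common case (many backends taking weak locks on the same tables) never touches a shared partition lock at all; the rare strong locker pays the cost of the sweep.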

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

fastlock-v1.patch (application/octet-stream), +779/-266
#2 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#1)
Re: reducing the overhead of frequent table locks - now, with WIP patch

Robert Haas <robertmhaas@gmail.com> wrote:

That's an improvement of about ~3.5x.

Outstanding!

I don't want to even peek at this until I've posted the two WIP SSI
patches (now both listed on the "Open Items" page), but will
definitely take a look after that.

-Kevin

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#2)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Fri, Jun 3, 2011 at 10:13 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

That's an improvement of about ~3.5x.

Outstanding!

I don't want to even peek at this until I've posted the two WIP SSI
patches (now both listed on the "Open Items" page), but will
definitely take a look after that.

Yeah, those SSI items are important to get nailed down RSN. But
thanks for your interest in this patch. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#1)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Fri, Jun 03, 2011 at 09:17:08AM -0400, Robert Haas wrote:

As you can see, this works out to a bit more than a 4% improvement on
this two-core box. I also got access (thanks to Nate Boley) to a
24-core box and ran the same test with scale factor 100 and
shared_buffers=8GB. Here are the results of alternating runs without
and with the patch on that machine:

tps = 36291.996228 (including connections establishing)
tps = 129242.054578 (including connections establishing)
tps = 36704.393055 (including connections establishing)
tps = 128998.648106 (including connections establishing)
tps = 36531.208898 (including connections establishing)
tps = 131341.367344 (including connections establishing)

Nice!

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#1)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Fri, Jun 3, 2011 at 2:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I've now spent enough time working on this issue now to be convinced
that the approach has merit, if we can work out the kinks.

Yes, the approach has merits and I'm sure we can work out the kinks.

As you can see, this works out to a bit more than a 4% improvement on
this two-core box.  I also got access (thanks to Nate Boley) to a
24-core box and ran the same test with scale factor 100 and
shared_buffers=8GB.  Here are the results of alternating runs without
and with the patch on that machine:

tps = 36291.996228 (including connections establishing)
tps = 129242.054578 (including connections establishing)
tps = 36704.393055 (including connections establishing)
tps = 128998.648106 (including connections establishing)
tps = 36531.208898 (including connections establishing)
tps = 131341.367344 (including connections establishing)

That's an improvement of about ~3.5x.  According to the vmstat output,
when running without the patch, the CPU state was about 40% idle.
With the patch, it dropped down to around 6%.

Congratulations. I believe that is realistic based upon my investigations.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#5)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Sat, Jun 4, 2011 at 2:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

As you can see, this works out to a bit more than a 4% improvement on
this two-core box.  I also got access (thanks to Nate Boley) to a
24-core box and ran the same test with scale factor 100 and
shared_buffers=8GB.  Here are the results of alternating runs without
and with the patch on that machine:

tps = 36291.996228 (including connections establishing)
tps = 129242.054578 (including connections establishing)
tps = 36704.393055 (including connections establishing)
tps = 128998.648106 (including connections establishing)
tps = 36531.208898 (including connections establishing)
tps = 131341.367344 (including connections establishing)

That's an improvement of about ~3.5x.  According to the vmstat output,
when running without the patch, the CPU state was about 40% idle.
With the patch, it dropped down to around 6%.

Congratulations. I believe that is realistic based upon my investigations.

Tom,

You should look at this. It's good.

The approach looks sound to me. It's a fairly isolated patch and we
should be considering this for inclusion in 9.1, not wait another
year.

I will happily add that it's a completely different approach to the one
I'd been working on, and even more happily that it is so different from
the Oracle approach that we are definitely unencumbered by patent issues
here. Well done Robert, Noah.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#6)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 04.06.2011 18:01, Simon Riggs wrote:

It's a fairly isolated patch and we
should be considering this for inclusion in 9.1, not wait another
year.

-1

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#2)
Re: reducing the overhead of frequent table locks - now, with WIP patch

Simon Riggs wrote:

we should be considering this for inclusion in 9.1, not wait
another year.

-1

I'm really happy that we're addressing the problems with scaling to
a large number of cores, and this patch sounds great. Adding a new
feature at this point in the release cycle would be horrible.
Frankly, from the tone of Robert's post, it probably wouldn't be
appropriate to include it in a release if it showed up in this
condition at the start of the last CF for that release.

The nice thing about annual releases is there's never one too far
away -- unless, of course, we hold a release up to squeeze in
"just one more" feature.

-Kevin

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#6)
Re: reducing the overhead of frequent table locks - now, with WIP patch

Simon Riggs <simon@2ndquadrant.com> writes:

The approach looks sound to me. It's a fairly isolated patch and we
should be considering this for inclusion in 9.1, not wait another
year.

That suggestion is completely insane. The patch is only WIP and full of
bugs, even according to its author. Even if it were solid, it is way
too late to be pushing such stuff into 9.1. We're trying to ship a
release, not find ways to cause it to slip more.

regards, tom lane

#10 Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Robert Haas (#1)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 06/03/2011 03:17 PM, Robert Haas wrote:
[...]

As you can see, this works out to a bit more than a 4% improvement on
this two-core box. I also got access (thanks to Nate Boley) to a
24-core box and ran the same test with scale factor 100 and
shared_buffers=8GB. Here are the results of alternating runs without
and with the patch on that machine:

tps = 36291.996228 (including connections establishing)
tps = 129242.054578 (including connections establishing)
tps = 36704.393055 (including connections establishing)
tps = 128998.648106 (including connections establishing)
tps = 36531.208898 (including connections establishing)
tps = 131341.367344 (including connections establishing)

That's an improvement of about ~3.5x. According to the vmstat output,
when running without the patch, the CPU state was about 40% idle.
With the patch, it dropped down to around 6%.

nice - but let's see on real hardware...

Testing this on a brand new E7-4850 4 Socket/10cores+HT Box - so 80
hardware threads:

first some numbers with -HEAD (-T 120; runtimes at lower -c counts have
fairly high variation in the results; the first number is the number of
connections/threads):

-j1: tps = 7928.965493 (including connections establishing)
-j8: tps = 53610.572347 (including connections establishing)
-j16: tps = 80835.446118 (including connections establishing)
-j32: tps = 75666.731883 (including connections establishing)
-j40: tps = 74628.568388 (including connections establishing)
-j64: tps = 68268.081973 (including connections establishing)
-c80: tps = 66704.216166 (including connections establishing)

postgresql is completely lock-limited in this test; anything beyond
around -j10 is basically not able to push the box below 80% IDLE(!)

and now with the patch applied:

-j1: tps = 7783.295587 (including connections establishing)
-j8: tps = 44361.661947 (including connections establishing)
-j16: tps = 92270.464541 (including connections establishing)
-j24: tps = 108259.524782 (including connections establishing)
-j32: tps = 183337.422612 (including connections establishing)
-j40: tps = 209616.052430 (including connections establishing)
-j48: tps = 229621.292382 (including connections establishing)
-j56: tps = 218690.391603 (including connections establishing)
-j64: tps = 188028.348501 (including connections establishing)
-j80: tps = 118814.741609 (including connections establishing)

so much better - but I still think there is some headroom left,
although pgbench itself is a CPU hog in these benchmarks, eating up to
10 cores in the worst case scenario - I will retest with sysbench, which
in the past showed more reasonable CPU usage for me.

and a profile (patched code) for the -j48 (aka fastest) case:

731535 11.8408 postgres s_lock
291878 4.7244 postgres LWLockAcquire
242373 3.9231 postgres AllocSetAlloc
239083 3.8698 postgres LWLockRelease
202341 3.2751 postgres SearchCatCache
190055 3.0763 postgres hash_search_with_hash_value
187148 3.0292 postgres base_yyparse
173265 2.8045 postgres GetSnapshotData
75700 1.2253 postgres core_yylex
74974 1.2135 postgres MemoryContextAllocZeroAligned
61404 0.9939 postgres _bt_compare
57529 0.9312 postgres MemoryContextAlloc

and one for the -j80 case (also patched):

485798 48.9667 postgres s_lock
60327 6.0808 postgres LWLockAcquire
57049 5.7503 postgres LWLockRelease
18357 1.8503 postgres hash_search_with_hash_value
17033 1.7169 postgres GetSnapshotData
14763 1.4881 postgres base_yyparse
14460 1.4575 postgres SearchCatCache
13975 1.4086 postgres AllocSetAlloc
6416 0.6467 postgres PinBuffer
5024 0.5064 postgres SIGetDataEntries
4704 0.4741 postgres core_yylex
4625 0.4662 postgres _bt_compare

Stefan

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Stefan Kaltenbrunner (#10)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:

and one for the -j80 case(also patched).

485798 48.9667 postgres s_lock
60327 6.0808 postgres LWLockAcquire
57049 5.7503 postgres LWLockRelease
18357 1.8503 postgres hash_search_with_hash_value
17033 1.7169 postgres GetSnapshotData
14763 1.4881 postgres base_yyparse
14460 1.4575 postgres SearchCatCache
13975 1.4086 postgres AllocSetAlloc
6416 0.6467 postgres PinBuffer
5024 0.5064 postgres SIGetDataEntries
4704 0.4741 postgres core_yylex
4625 0.4662 postgres _bt_compare

Hmm, does that mean that it's spending 50% of the time spinning on a
spinlock? That's bad. It's one thing to be contended on a lock, and have
a lot of idle time because of that, but it's even worse to spend a lot
of time spinning because that CPU time won't be spent on doing more
useful work, even if there is some other process on the system that
could make use of that CPU time.

I like the overall improvement on the throughput, of course, but we have
to find a way to avoid the busy-wait.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Heikki Linnakangas (#11)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 06/05/2011 09:12 PM, Heikki Linnakangas wrote:

On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:

and one for the -j80 case(also patched).

485798 48.9667 postgres s_lock
60327 6.0808 postgres LWLockAcquire
57049 5.7503 postgres LWLockRelease
18357 1.8503 postgres hash_search_with_hash_value
17033 1.7169 postgres GetSnapshotData
14763 1.4881 postgres base_yyparse
14460 1.4575 postgres SearchCatCache
13975 1.4086 postgres AllocSetAlloc
6416 0.6467 postgres PinBuffer
5024 0.5064 postgres SIGetDataEntries
4704 0.4741 postgres core_yylex
4625 0.4662 postgres _bt_compare

Hmm, does that mean that it's spending 50% of the time spinning on a
spinlock? That's bad. It's one thing to be contended on a lock, and have
a lot of idle time because of that, but it's even worse to spend a lot
of time spinning because that CPU time won't be spent on doing more
useful work, even if there is some other process on the system that
could make use of that CPU time.

well yeah - we are broken right now with only being able to use ~20% of
CPU on a modern mid-range box, but using 80% CPU (or 4x like in the
above case) and only getting less than 2x the performance seems wrong as
well. I also wonder if we are still missing something fundamental -
because even with the current patch we are quite far away from linear
scaling and light-years from some of our competitors...

Stefan

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Stefan Kaltenbrunner (#12)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Sun, Jun 5, 2011 at 4:01 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

On 06/05/2011 09:12 PM, Heikki Linnakangas wrote:

On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:

and one for the -j80 case(also patched).

485798   48.9667  postgres                 s_lock
60327     6.0808  postgres                 LWLockAcquire
57049     5.7503  postgres                 LWLockRelease
18357     1.8503  postgres                 hash_search_with_hash_value
17033     1.7169  postgres                 GetSnapshotData
14763     1.4881  postgres                 base_yyparse
14460     1.4575  postgres                 SearchCatCache
13975     1.4086  postgres                 AllocSetAlloc
6416      0.6467  postgres                 PinBuffer
5024      0.5064  postgres                 SIGetDataEntries
4704      0.4741  postgres                 core_yylex
4625      0.4662  postgres                 _bt_compare

Hmm, does that mean that it's spending 50% of the time spinning on a
spinlock? That's bad. It's one thing to be contended on a lock, and have
a lot of idle time because of that, but it's even worse to spend a lot
of time spinning because that CPU time won't be spent on doing more
useful work, even if there is some other process on the system that
could make use of that CPU time.

well yeah - we are broken right now with only being able to use ~20% of
CPU on a modern mid-range box, but using 80% CPU (or 4x like in the
above case) and only getting less than 2x the performance seems wrong as
well. I also wonder if we are still missing something fundamental -
because even with the current patch we are quite far away from linear
scaling and light-years from some of our competitors...

Could you compile with LWLOCK_STATS, rerun these tests, total up the
"blk" numbers by LWLockId, and post the results? (Actually, totalling
up the shacq and exacq numbers would be useful as well, if you
wouldn't mind.)

Unless I very much miss my guess, we're going to see zero contention
on the new structures introduced by this patch. Rather, I suspect
what we're going to find is that, with the hideous contention on one
particular lock manager partition lock removed, there's a more
spread-out contention problem, likely involving the lock manager
partition lock, the buffer mapping locks, and possibly other LWLocks
as well. The fact that the system is busy-waiting rather than just
not using the CPU at all probably means that the remaining contention
is more spread out than that which is removed by this patch. We don't
actually have everything pile up on a single LWLock (as happens in git
master), but we do spend a lot of time fighting cache lines away from
other CPUs. Or at any rate, that's my guess: we need some real
numbers to know for sure.
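The totalling requested above is mechanical; here is a sketch of the aggregation, assuming the simplified "lwlock N: shacq A exacq B blk C" line format that appears later in this thread (real LWLOCK_STATS output from each backend may carry extra prefixes, so this parser is illustrative, not definitive):

```c
#include <assert.h>
#include <stdio.h>

typedef struct LWStat { long id, shacq, exacq, blk; } LWStat;

LWStat lwstats[4096];   /* aggregated totals, one entry per LWLockId */
int nlwstats;

/* Feed one line of per-backend stats output; totals by lock id. */
void accumulate(const char *line)
{
    long id, shacq, exacq, blk;
    if (sscanf(line, "lwlock %ld: shacq %ld exacq %ld blk %ld",
               &id, &shacq, &exacq, &blk) != 4)
        return;                          /* not a stats line; skip it */
    for (int i = 0; i < nlwstats; i++)
        if (lwstats[i].id == id) {       /* seen before: add it in */
            lwstats[i].shacq += shacq;
            lwstats[i].exacq += exacq;
            lwstats[i].blk   += blk;
            return;
        }
    lwstats[nlwstats].id = id;           /* first sighting of this id */
    lwstats[nlwstats].shacq = shacq;
    lwstats[nlwstats].exacq = exacq;
    lwstats[nlwstats].blk = blk;
    nlwstats++;
}
```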

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#13)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Sun, Jun 5, 2011 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Could you compile with LWLOCK_STATS, rerun these tests, total up the
"blk" numbers by LWLockId, and post the results?  (Actually, totalling
up the shacq and exacq numbers would be useful as well, if you
wouldn't mind.)

I did this on the loaner 24-core box from Nate Boley and got the
following results. This is just the LWLocks that had blk>0.

lwlock 0: shacq 0 exacq 200625 blk 24044
lwlock 4: shacq 80101430 exacq 196 blk 28
lwlock 33: shacq 8333673 exacq 11977 blk 864
lwlock 34: shacq 7092293 exacq 11890 blk 803
lwlock 35: shacq 7893875 exacq 11909 blk 848
lwlock 36: shacq 7567514 exacq 11912 blk 830
lwlock 37: shacq 7427774 exacq 11930 blk 745
lwlock 38: shacq 7120108 exacq 11989 blk 853
lwlock 39: shacq 7584952 exacq 11982 blk 782
lwlock 40: shacq 7949867 exacq 12056 blk 821
lwlock 41: shacq 6612240 exacq 11929 blk 746
lwlock 42: shacq 47512112 exacq 11844 blk 4503
lwlock 43: shacq 7943511 exacq 11871 blk 878
lwlock 44: shacq 7534558 exacq 12033 blk 800
lwlock 45: shacq 7128256 exacq 12045 blk 856
lwlock 46: shacq 7575339 exacq 12015 blk 818
lwlock 47: shacq 6745173 exacq 12094 blk 806
lwlock 48: shacq 8410348 exacq 12104 blk 977
lwlock 49: shacq 0 exacq 5007594 blk 172533
lwlock 50: shacq 0 exacq 5011704 blk 172282
lwlock 51: shacq 0 exacq 5003356 blk 172802
lwlock 52: shacq 0 exacq 5009020 blk 174648
lwlock 53: shacq 0 exacq 5010808 blk 172080
lwlock 54: shacq 0 exacq 5004908 blk 169934
lwlock 55: shacq 0 exacq 5009324 blk 170281
lwlock 56: shacq 0 exacq 5005904 blk 171001
lwlock 57: shacq 0 exacq 5006984 blk 169942
lwlock 58: shacq 0 exacq 5000346 blk 170001
lwlock 59: shacq 0 exacq 5004884 blk 170484
lwlock 60: shacq 0 exacq 5006304 blk 171325
lwlock 61: shacq 0 exacq 5008421 blk 170866
lwlock 62: shacq 0 exacq 5008162 blk 170868
lwlock 63: shacq 0 exacq 5002238 blk 170291
lwlock 64: shacq 0 exacq 5005348 blk 169764
lwlock 307: shacq 0 exacq 2 blk 1
lwlock 315: shacq 0 exacq 3 blk 2
lwlock 337: shacq 0 exacq 4 blk 3
lwlock 345: shacq 0 exacq 2 blk 1
lwlock 349: shacq 0 exacq 2 blk 1
lwlock 231251: shacq 0 exacq 2 blk 1
lwlock 253831: shacq 0 exacq 2 blk 1

So basically, even with the patch, at 24 cores the lock manager locks
are still under tremendous pressure. But note that there's a big
difference between what's happening here and what's happening without
the patch. Here's without the patch:

lwlock 0: shacq 0 exacq 191613 blk 17591
lwlock 4: shacq 21543085 exacq 102 blk 20
lwlock 33: shacq 2237938 exacq 11976 blk 463
lwlock 34: shacq 1907344 exacq 11890 blk 458
lwlock 35: shacq 2125308 exacq 11908 blk 442
lwlock 36: shacq 2038220 exacq 11912 blk 430
lwlock 37: shacq 1998059 exacq 11927 blk 449
lwlock 38: shacq 1916179 exacq 11953 blk 409
lwlock 39: shacq 2042173 exacq 12019 blk 479
lwlock 40: shacq 2140002 exacq 12056 blk 448
lwlock 41: shacq 1776772 exacq 11928 blk 392
lwlock 42: shacq 12777368 exacq 11842 blk 2451
lwlock 43: shacq 2132240 exacq 11869 blk 478
lwlock 44: shacq 2026845 exacq 12031 blk 446
lwlock 45: shacq 1918618 exacq 12045 blk 449
lwlock 46: shacq 2038437 exacq 12011 blk 472
lwlock 47: shacq 1814660 exacq 12089 blk 401
lwlock 48: shacq 2261208 exacq 12105 blk 478
lwlock 49: shacq 0 exacq 1347524 blk 17020
lwlock 50: shacq 0 exacq 1350678 blk 16888
lwlock 51: shacq 0 exacq 1346260 blk 16744
lwlock 52: shacq 0 exacq 1348432 blk 16864
lwlock 53: shacq 0 exacq 22216779 blk 4914363
lwlock 54: shacq 0 exacq 22217309 blk 4525381
lwlock 55: shacq 0 exacq 1348406 blk 13438
lwlock 56: shacq 0 exacq 1345996 blk 13299
lwlock 57: shacq 0 exacq 1347890 blk 13654
lwlock 58: shacq 0 exacq 1343486 blk 13349
lwlock 59: shacq 0 exacq 1346198 blk 13471
lwlock 60: shacq 0 exacq 1346236 blk 13532
lwlock 61: shacq 0 exacq 1343688 blk 13547
lwlock 62: shacq 0 exacq 1350068 blk 13614
lwlock 63: shacq 0 exacq 1345302 blk 13420
lwlock 64: shacq 0 exacq 1348858 blk 13635
lwlock 321: shacq 0 exacq 2 blk 1
lwlock 329: shacq 0 exacq 4 blk 3
lwlock 337: shacq 0 exacq 6 blk 4
lwlock 347: shacq 0 exacq 5 blk 4
lwlock 357: shacq 0 exacq 3 blk 2
lwlock 363: shacq 0 exacq 3 blk 2
lwlock 369: shacq 0 exacq 4 blk 3
lwlock 379: shacq 0 exacq 2 blk 1
lwlock 383: shacq 0 exacq 2 blk 1
lwlock 445: shacq 0 exacq 2 blk 1
lwlock 449: shacq 0 exacq 2 blk 1
lwlock 451: shacq 0 exacq 2 blk 1
lwlock 1023: shacq 0 exacq 2 blk 1
lwlock 11401: shacq 0 exacq 2 blk 1
lwlock 115591: shacq 0 exacq 2 blk 1
lwlock 117177: shacq 0 exacq 2 blk 1
lwlock 362839: shacq 0 exacq 2 blk 1

In the unpatched case, two lock manager locks are getting beaten to
death, and the others all about equally contended. By eliminating the
portion of the lock manager contention that pertains specifically to
the two heavily trafficked locks, system throughput improves by about
3.5x - and, not surprisingly, traffic on the lock manager locks
increases by approximately the same multiple. Those locks now become
the contention bottleneck, with about 12x the blocking they had
pre-patch. I'm definitely interested in investigating what to do
about that, but I don't think it's this patch's problem to fix all of
our lock manager bottlenecks. Another thing to note is that
pre-patch, the two really badly contented LWLocks were blocking about
22% of the time; post-patch, all of the lock manager locks are
blocking about 3.4% of the time. That's certainly not great, but it's
progress.
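The blocking percentages quoted above come straight from the blk/exacq ratios in the two dumps; as a quick check:

```c
#include <assert.h>

/* Fraction of lock acquisitions that had to block. */
double blocking_fraction(long blk, long exacq)
{
    return (double) blk / (double) exacq;
}

/* Pre-patch, the two hot lock manager partitions (lwlocks 53 and 54):
 * 4914363 / 22216779 ~= 0.221 and 4525381 / 22217309 ~= 0.204.
 * Post-patch, a typical partition (lwlock 49):
 * 172533 / 5007594 ~= 0.034. */
```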

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#14)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Sun, Jun 5, 2011 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm definitely interested in investigating what to do
about that, but I don't think it's this patch's problem to fix all of
our lock manager bottlenecks.

I did some further investigation of this. It appears that more than
99% of the lock manager lwlock traffic that remains with this patch
applied has locktag_type == LOCKTAG_VIRTUALTRANSACTION. Every SELECT
statement runs in a separate transaction, and for each new transaction
we run VirtualXactLockTableInsert(), which takes a lock on the vxid of
that transaction, so that other processes can wait for it. That
requires acquiring and releasing a lock manager partition lock, and we
have to do the same thing a moment later at transaction end to dump
the lock.

A quick grep seems to indicate that the only places where we actually
make use of those VXID locks are in DefineIndex(), when CREATE INDEX
CONCURRENTLY is in use, and during Hot Standby, when max_standby_delay
expires. Considering that these are not commonplace events, it seems
tremendously wasteful to incur the overhead for every transaction. It
might be possible to make the lock entry spring into existence "on
demand" - i.e. if a backend wants to wait on a vxid entry, it creates
the LOCK and PROCLOCK objects for that vxid. That presents a few
synchronization challenges, plus we have to make sure that the
backend that's just been "given" a lock knows that it needs to release
it, but those seem like they might be manageable problems, especially
given the new infrastructure introduced by the current patch, which
already has to deal with some of those issues. I'll look into this
further.
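A reduced sketch of the "spring into existence on demand" idea described above (hypothetical names, not PostgreSQL code): the owner merely advertises its vxid in shared memory, only a waiter materializes a real lock-table entry, and a flag tells the owner it must clean that entry up at transaction end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Proc {
    uint64_t vxid;               /* 0 = no transaction in progress */
    bool vxid_lock_in_table;     /* a waiter created a real lock entry */
} Proc;

int lock_table_entries;          /* stand-in for the shared lock table */

/* Transaction start: no lock-manager traffic at all. */
void start_xact(Proc *p, uint64_t vxid)
{
    p->vxid = vxid;
    p->vxid_lock_in_table = false;
}

/* Waiter path (rare: CREATE INDEX CONCURRENTLY, Hot Standby conflict):
 * if the vxid is still running, create the lock entry on the owner's
 * behalf. Returns true if there is nothing to wait for. */
bool wait_for_vxid(Proc *owner, uint64_t vxid)
{
    if (owner->vxid != vxid)
        return true;             /* already gone; nothing to wait for */
    if (!owner->vxid_lock_in_table) {
        lock_table_entries++;    /* materialize the ExclusiveLock entry */
        owner->vxid_lock_in_table = true;
    }
    return false;                /* caller now blocks on that entry */
}

/* Transaction end: touch the lock table only if a waiter forced us to. */
void end_xact(Proc *p)
{
    if (p->vxid_lock_in_table)
        lock_table_entries--;    /* release the materialized entry */
    p->vxid = 0;
    p->vxid_lock_in_table = false;
}
```

The synchronization hand-off (waiter racing against transaction end) is exactly the hard part the mail describes; this sketch ignores it.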

It's likely that if we lick this problem, the BufFreelistLock and
BufMappingLocks are going to be the next hot spot. Of course, we're
ignoring the ten-thousand pound gorilla in the corner, which is that
on write workloads we have a pretty bad contention problem with
WALInsertLock, which I fear will not be so easily addressed. But one
problem at a time, I guess.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#15)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 06.06.2011 07:12, Robert Haas wrote:

I did some further investigation of this. It appears that more than
99% of the lock manager lwlock traffic that remains with this patch
applied has locktag_type == LOCKTAG_VIRTUALTRANSACTION. Every SELECT
statement runs in a separate transaction, and for each new transaction
we run VirtualXactLockTableInsert(), which takes a lock on the vxid of
that transaction, so that other processes can wait for it. That
requires acquiring and releasing a lock manager partition lock, and we
have to do the same thing a moment later at transaction end to dump
the lock.

A quick grep seems to indicate that the only places where we actually
make use of those VXID locks are in DefineIndex(), when CREATE INDEX
CONCURRENTLY is in use, and during Hot Standby, when max_standby_delay
expires. Considering that these are not commonplace events, it seems
tremendously wasteful to incur the overhead for every transaction. It
might be possible to make the lock entry spring into existence "on
demand" - i.e. if a backend wants to wait on a vxid entry, it creates
the LOCK and PROCLOCK objects for that vxid. That presents a few
synchronization challenges, and plus we have to make sure that the
backend that's just been "given" a lock knows that it needs to release
it, but those seem like they might be manageable problems, especially
given the new infrastructure introduced by the current patch, which
already has to deal with some of those issues. I'll look into this
further.

Ah, I remember I saw that vxid lock pop up quite high in an oprofile
profile recently. I think it was the case of executing a lot of very
simple prepared queries. So it would be nice to address that, even from
a single CPU point of view.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Sat, Jun 4, 2011 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

The approach looks sound to me. It's a fairly isolated patch and we
should be considering this for inclusion in 9.1, not wait another
year.

That suggestion is completely insane.  The patch is only WIP and full of
bugs, even according to its author.  Even if it were solid, it is way
too late to be pushing such stuff into 9.1.  We're trying to ship a
release, not find ways to cause it to slip more.

In 8.3, you implemented virtual transactionids days before we produced
a Release Candidate, against my recommendation.

At that time, I didn't start questioning your sanity. In fact we all
applauded that because it was a great performance gain.

The fact that you disagree with me does not make me insane. Inaction
on this point, resulting in a year's delay, will be considered to be a
gross waste by the majority of objective observers.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#18 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#17)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 06.06.2011 12:40, Simon Riggs wrote:

On Sat, Jun 4, 2011 at 5:55 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Simon Riggs<simon@2ndquadrant.com> writes:

The approach looks sound to me. It's a fairly isolated patch and we
should be considering this for inclusion in 9.1, not wait another
year.

That suggestion is completely insane. The patch is only WIP and full of
bugs, even according to its author. Even if it were solid, it is way
too late to be pushing such stuff into 9.1. We're trying to ship a
release, not find ways to cause it to slip more.

In 8.3, you implemented virtual transactionids days before we produced
a Release Candidate, against my recommendation.

FWIW, this bottleneck was not introduced by the introduction of virtual
transaction ids. Before that patch, we just took the lock on the real
transaction id instead.

The fact that you disagree with me does not make me insane.

You are not insane, even if your suggestion is.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On Mon, Jun 6, 2011 at 2:54 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ah, I remember I saw that vxid lock pop up quite high in an oprofile profile
recently. I think it was the case of executing a lot of very simple prepared
queries. So it would be nice to address that, even from a single CPU point
of view.

It doesn't seem too hard to do, although I have to think about the
details. Even though the VXID locks involved are Exclusive locks,
they are actually very much like the "weak" locks that the current
patch accelerates, because the Exclusive lock is taken only by the
VXID owner, and it can therefore be safely assumed that the initial
lock acquisition won't block anything. Therefore, there's really no
need to touch the primary lock table at transaction start, and at
transaction end it need be touched only if someone's waiting. However, there's a
fly in the ointment: when someone tries to ShareLock a VXID, we need
to determine whether that VXID is still around and, if so, make an
Exclusive lock entry for it in the primary lock table. And, unlike
what I'm doing for strong relation locks, it's probably NOT acceptable
for that to acquire and release every per-backend LWLock, because
every place that waits for VXID locks waits for a list of locks in
sequence, so we could end up with O(n^2) behavior. Now, in theory
that's not a huge problem: the VXID includes the backend ID, so we
ought to be able to figure out which single per-backend LWLock is of
interest and just acquire/release that one. Unfortunately, it appears
that there's no easy way to go from a backend ID to a PGPROC. The
backend IDs are offsets into the "ProcState" array, so they give us a
pointer to the backend's sinval state, not its PGPROC. And while the
PGPROC has a pointer to the sinval info, there's no pointer in the
opposite direction. Even if there were, we'd probably need to hold
SInvalWriteLock in shared mode to follow it.
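The missing reverse mapping described above could, in principle, be a simple shared array indexed by backend ID. The sketch below is hypothetical, not PostgreSQL code: `MAX_BACKENDS`, `register_backend`, and `lookup_proc` are invented names standing in for `MaxBackends` and whatever registration hook (presumably where sinval assigns the backend ID) would maintain such a table. The point is only that a waiter could then go from the backend ID embedded in a VirtualTransactionId straight to the owner's PGPROC, and hence to its single per-backend LWLock, avoiding the O(n^2) sweep over all backends.

```c
#include <stddef.h>

/* Hypothetical sketch: a shared array mapping backend IDs to PGPROC
 * pointers, so a waiter can find the vxid owner's per-backend lock data
 * directly instead of following sinval state under SInvalWriteLock. */

#define MAX_BACKENDS 64          /* stand-in for MaxBackends */

typedef struct PGPROC PGPROC;    /* opaque here; the real struct is in proc.h */

static PGPROC *backend_id_to_proc[MAX_BACKENDS + 2];  /* backend IDs are 1-based */

/* Would be called where the backend's ID is assigned at startup. */
static void
register_backend(int backend_id, PGPROC *proc)
{
    if (backend_id >= 1 && backend_id <= MAX_BACKENDS)
        backend_id_to_proc[backend_id] = proc;
}

/* Waiter side: backend ID (from the vxid) -> PGPROC, or NULL if invalid. */
static PGPROC *
lookup_proc(int backend_id)
{
    if (backend_id < 1 || backend_id > MAX_BACKENDS)
        return NULL;
    return backend_id_to_proc[backend_id];
}
```

Any real version would also have to handle the slot being reused by a different backend between lookup and lock acquisition, which is part of what makes the synchronization grotty.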

That might not be the end of the world, since VXID locks are fairly
infrequently used, but it's certainly a little grotty. I do rather
wonder if we should be trying to reduce the number of separate places
where we list the running processes. We have arrays of PGPROC
structures, and then we have one set of pointers to PGPROCs in the
ProcArray, and then we have the ProcState structures for sinval. I
wonder if there's some way to rearrange all this to simplify the
bookkeeping.

BTW, how do you identify from oprofile that *vxid* locks were the
problem? I didn't think it could produce that level of detail.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#15)
Re: reducing the overhead of frequent table locks - now, with WIP patch

On 06.06.2011 07:12, Robert Haas wrote:

I did some further investigation of this. It appears that more than
99% of the lock manager lwlock traffic that remains with this patch
applied has locktag_type == LOCKTAG_VIRTUALTRANSACTION. Every SELECT
statement runs in a separate transaction, and for each new transaction
we run VirtualXactLockTableInsert(), which takes a lock on the vxid of
that transaction, so that other processes can wait for it. That
requires acquiring and releasing a lock manager partition lock, and we
have to do the same thing a moment later at transaction end to dump
the lock.
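For readers unfamiliar with the partitioned lock manager: each lock tag hashes to exactly one of the lock-manager partitions, so every acquisition and release of the same tag serializes on the same partition LWLock. The sketch below is a toy illustration, not PostgreSQL code; the real implementation hashes the full LOCKTAG with `hash_any()`, and `toy_hash` here is an invented stand-in. It just demonstrates why adding partitions can't help when one tag is hot: the mapping is deterministic.

```c
#include <stdint.h>

/* Toy illustration of lock-tag -> partition mapping.  All acquirers of
 * the same tag always land on the same partition lock, so a hot tag
 * (one relation, or one vxid per transaction) serializes on one LWLock
 * no matter how many partitions exist. */

#define NUM_LOCK_PARTITIONS 16   /* PostgreSQL's default is also 16 */

/* Stand-in for hashing a vxid lock tag (backendId, localTransactionId). */
static uint32_t
toy_hash(uint32_t backend_id, uint32_t local_xid)
{
    uint32_t h = backend_id * 2654435761u ^ local_xid * 40503u;
    h ^= h >> 16;
    return h;
}

static int
lock_hash_partition(uint32_t hashcode)
{
    return (int) (hashcode % NUM_LOCK_PARTITIONS);
}
```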

A quick grep seems to indicate that the only places where we actually
make use of those VXID locks are in DefineIndex(), when CREATE INDEX
CONCURRENTLY is in use, and during Hot Standby, when max_standby_delay
expires. Considering that these are not commonplace events, it seems
tremendously wasteful to incur the overhead for every transaction. It
might be possible to make the lock entry spring into existence "on
demand" - i.e. if a backend wants to wait on a vxid entry, it creates
the LOCK and PROCLOCK objects for that vxid. That presents a few
synchronization challenges, and plus we have to make sure that the
backend that's just been "given" a lock knows that it needs to release
it, but those seem like they might be manageable problems, especially
given the new infrastructure introduced by the current patch, which
already has to deal with some of those issues. I'll look into this
further.
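The "spring into existence on demand" idea quoted above might be sketched as a flag-based handoff: the waiter materializes the shared lock entry on the owner's behalf and marks the owner as now holding a lock it must release. Everything below is hypothetical, with invented names (`BackendLockState`, `vxid_lock_materialized`); a real version would need the per-backend LWLock to make the handoff race-free against the owner committing concurrently.

```c
#include <stdbool.h>

/* Hypothetical sketch of the on-demand vxid lock entry.  The common case
 * (no waiter) never touches the primary lock table at all. */

typedef struct
{
    bool vxid_lock_materialized;  /* set by a waiter, read by the owner */
} BackendLockState;

/* Waiter side: create LOCK/PROCLOCK entries for the owner's vxid in the
 * primary lock table, then tell the owner it has been "given" a lock. */
static void
waiter_materialize_vxid_lock(BackendLockState *owner)
{
    /* ... insert LOCK and PROCLOCK entries on the owner's behalf ... */
    owner->vxid_lock_materialized = true;
}

/* Owner side, at transaction end: release only if a waiter materialized
 * the entry; otherwise take the fast path.  Returns true if a real
 * lock-table release happened. */
static bool
owner_end_of_xact(BackendLockState *self)
{
    if (self->vxid_lock_materialized)
    {
        /* ... release the shared-memory lock entry, waking waiters ... */
        self->vxid_lock_materialized = false;
        return true;
    }
    return false;
}
```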

At the moment, the transaction with given vxid acquires an ExclusiveLock
on the vxid, and anyone who wants to wait for it to finish acquires a
ShareLock. If we simply reverse that, so that the transaction itself
takes ShareLock, and anyone wanting to wait on it takes an ExclusiveLock,
will this fastlock patch bust this bottleneck too?
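The intuition behind swapping the modes can be seen from the standard two-mode conflict matrix (a simplified sketch, not PostgreSQL's full eight-mode table): Share conflicts only with Exclusive, so if the vxid owner takes Share, its acquisition never conflicts in the common uncontended case, which is exactly the property the fast path exploits for weak relation locks. Waiters taking Exclusive would, however, also conflict with each other, unlike today's multiple ShareLock waiters.

```c
#include <stdbool.h>

/* Simplified two-mode conflict matrix.  Share/Share is the only
 * non-conflicting combination, so an owner taking Share behaves like a
 * "weak" lock: it blocks nothing unless an Exclusive waiter appears. */

typedef enum { SHARE, EXCLUSIVE } LockMode;

static bool
modes_conflict(LockMode a, LockMode b)
{
    return !(a == SHARE && b == SHARE);
}
```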

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#19)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#1)
#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#18)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#24)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#25)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#26)
#28Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#1)
#29Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#25)
#30Dave Page
dpage@pgadmin.org
In reply to: Dimitri Fontaine (#29)
#31Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Dave Page (#30)
#32Stephen Frost
sfrost@snowman.net
In reply to: Dave Page (#30)
#33Dave Page
dpage@pgadmin.org
In reply to: Stefan Kaltenbrunner (#31)
#34Josh Berkus
josh@agliodbs.com
In reply to: Dimitri Fontaine (#29)
#35Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#30)
#36Dave Page
dpage@pgadmin.org
In reply to: Stephen Frost (#32)
#37Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Josh Berkus (#28)
#38Chris Browne
cbbrowne@acm.org
In reply to: Simon Riggs (#27)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Chris Browne (#38)
#40Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stephen Frost (#32)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Dave Page (#36)
#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dimitri Fontaine (#29)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#1)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#36)
#45Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#43)
#46Stephen Frost
sfrost@snowman.net
In reply to: Simon Riggs (#41)
#47Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Josh Berkus (#28)
#48Dave Page
dpage@pgadmin.org
In reply to: Tom Lane (#44)
#49Stephen Frost
sfrost@snowman.net
In reply to: Alvaro Herrera (#42)
#50Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#45)
#51Simon Riggs
simon@2ndQuadrant.com
In reply to: Dave Page (#33)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#51)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Joshua D. Drake (#50)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#51)
#55Josh Berkus
josh@agliodbs.com
In reply to: Dave Page (#48)
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#53)
#57Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#53)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#55)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#57)
#60Thom Brown
thom@linux.com
In reply to: Tom Lane (#58)
#61Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#56)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Thom Brown (#60)
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#59)
#64Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#59)
#65Stephen Frost
sfrost@snowman.net
In reply to: Simon Riggs (#64)
#66Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#64)
#67Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#64)
#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#64)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#68)
#70Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Jignesh K. Shah (#47)
#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#61)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#69)
#73Robert Haas
robertmhaas@gmail.com
In reply to: Jignesh K. Shah (#70)
#74Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#72)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#51)
#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#75)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#76)
#78Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#74)
#79Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#62)
#80Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#25)
#81Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#80)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#80)
#83Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#49)
#84Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#80)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#83)
#86Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#82)
#87Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#81)
#88Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#84)
#89Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#85)
#90Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#88)
#91Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#90)
#92Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#68)
#93Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#90)
#94Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#86)
#95Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#91)
#96Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#91)
#97Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#96)
#98Dave Page
dpage@pgadmin.org
In reply to: Robert Haas (#97)
#99Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Dave Page (#98)