Avoiding repeated snapshot computation
On some recent benchmarks and profile data, I saw GetSnapshotData
figures at or near the very top. For a smaller number of clients it
can account for 10-20% of the sample time, but with more clients I
have seen it take up as much as 40%. Unfortunately, the machine on
which I ran these tests is currently not available, so I don't have
the exact numbers, but the observation is broadly accurate.
Our recent work on separating the hot members of PGPROC into a separate
array will definitely reduce data cache misses and reduce the
GetSnapshotData time, but it probably still accounts for a large
enough critical section on a highly contended lock.
I think now that we have reduced the run time of the function itself,
we should try to reduce the number of times the function is called.
Robert proposed a way to reduce the number of calls per transaction. I
think we can go one step further and reduce the number of calls across
transactions.
One major problem today could be the way LWLock works. If the lock is
currently held in SHARED mode by some backend and another backend
requests it in SHARED mode, it gets it immediately. That's probably
the right thing to do, because you don't want a reader to wait when
the lock is readily available. But in the case of GetSnapshotData(),
every reader is doing exactly the same thing: each computes a snapshot
from the same shared state and would arrive at exactly the same
snapshot (ignoring the fact that we don't include the caller's XID in
the xip array, but that's a minor detail). And because of the way
LWLock works, more and more readers keep getting in to compute the
snapshot, until the exclusive waiters finally get a window to sneak
in as processes slowly go to sleep waiting for exclusive access. To
depict it: four transactions make overlapping calls to
GetSnapshotData(), so the total critical section starts when the first
caller enters it and ends when the last caller exits.
Txn1 ----[ SHARED ]------------------------------------------------
Txn2 ------[ SHARED ]----------------------------------------------
Txn3 ---------------[ SHARED ]-------------------------------------
Txn4 -------------------------------------------[ SHARED ]---------
     |<--------------------- Total Time ------------------------->|
A couple of ideas come to mind to solve this issue.
A snapshot, once computed, remains valid for every call, irrespective
of its origin, until at least one transaction ends. So we can store
the last computed snapshot in some shared area and reuse it for all
subsequent GetSnapshotData calls. The shared snapshot gets invalidated
when some transaction ends by calling ProcArrayEndTransaction(). I
tried this approach and saw a 15% improvement for 32-80 clients on the
32-core HP IA box with pgbench -s 100 -N tests. Not bad, but I think
this can be improved further.
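To make the caching scheme concrete, here is a minimal single-threaded sketch with invented names (MiniSnapshot, get_snapshot, etc. are all illustrative; the real code scans the ProcArray under ProcArrayLock and keeps the cached copy in shared memory, with locking omitted here):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of the shared-snapshot cache. All names are invented. */
typedef struct
{
    int xmin;
    int xmax;
    int xip[8];                 /* in-progress XIDs */
    int xcnt;
} MiniSnapshot;

static MiniSnapshot shared_snap;        /* lives in shared memory in the real scheme */
static bool shared_snap_valid = false;
static int  procarray_scans = 0;        /* counts the expensive recomputations */

/* Stand-in for the expensive ProcArray scan. */
static void compute_snapshot(MiniSnapshot *snap)
{
    procarray_scans++;
    snap->xmin = 100;
    snap->xmax = 105;
    snap->xcnt = 2;
    snap->xip[0] = 101;
    snap->xip[1] = 103;
}

/* GetSnapshotData analogue: reuse the shared copy while it is valid. */
static void get_snapshot(MiniSnapshot *out)
{
    if (!shared_snap_valid)
    {
        compute_snapshot(&shared_snap); /* slow path: full scan */
        shared_snap_valid = true;
    }
    memcpy(out, &shared_snap, sizeof(MiniSnapshot)); /* fast path: plain copy */
}

/* ProcArrayEndTransaction analogue: an ending transaction invalidates it. */
static void end_transaction(void)
{
    shared_snap_valid = false;
}
```

The win comes from turning the common path into a plain copy; correctness rests on every ending transaction clearing the valid flag while it holds the lock exclusively.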
What we can do is: when a transaction comes to compute its snapshot, it
checks whether some other transaction is already computing a snapshot.
If so, it just sleeps on the lock. When the other process finishes
computing the snapshot, it saves it in a shared area and wakes up all
processes waiting for the snapshot. All those processes then just copy
the snapshot from the shared area and are done. This will not only
reduce total CPU consumption by avoiding repetitive work, but will also
reduce the total time for which ProcArrayLock is held in SHARED mode by
avoiding a pipeline of GetSnapshotData calls. I am currently trying the
shared work queue mechanism to implement this, but I am sure we can do
this in some other way too.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
What we can do is: when a transaction comes to compute its snapshot, it
checks whether some other transaction is already computing a snapshot.
If so, it just sleeps on the lock. When the other process finishes
computing the snapshot, it saves it in a shared area and wakes up all
processes waiting for the snapshot. All those processes then just copy
the snapshot from the shared area and are done. This will not only
reduce total CPU consumption by avoiding repetitive work, but will also
reduce the total time for which ProcArrayLock is held in SHARED mode by
avoiding a pipeline of GetSnapshotData calls. I am currently trying the
shared work queue mechanism to implement this, but I am sure we can do
this in some other way too.
I don't quite understand how this is going to work. Transaction A
ends, invalidating the shared snapshot.  Now transactions B, C, and D
come along wishing to take snapshots. B begins taking a snapshot, so
C and D wait (blech!) for that to be complete. Then, they start
copying the snapshot. Meanwhile, transaction E ends, invalidating the
shared snapshot, and then transaction F comes along, wanting to take a
snapshot. If F constructs a new shared snapshot, then doesn't that
overwrite the same memory area C and D were busy copying? It seems
like F needs some way to know that C and D are done with the old
shared snapshot before it starts writing a new one. Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!). That
could maybe be a gain at very high concurrency when spinlock
contention is eating us alive, but it doesn't seem like a good idea in
general. Maybe I'm missing something; did you mean to attach a patch?
I had a different idea for avoiding repeated snapshot computation:
save the snapshot in backend-private memory. When a process takes
ProcArrayLock in exclusive mode, it increments a 64-bit counter.
When a process takes ProcArrayLock in shared mode, it can check
whether the counter has changed; if not, it can reuse the snapshot it
took before.  I think there might be a way to tinker with the snapshot
management so as to do this without any more copying than we do
presently, since CurrentSnapshotData and SecondarySnapshotData are
basically treated as scratch-pads anyhow.
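A single-threaded sketch of that counter scheme, with invented names (the real version would bump the counter under exclusive ProcArrayLock and check it under shared ProcArrayLock; no locking is modeled here):

```c
#include <assert.h>
#include <stdint.h>

/* Shared counter, bumped whenever ProcArrayLock is taken exclusively. */
static uint64_t procarray_change_count = 0;

/* Backend-private cache: the snapshot (here just an xmin stand-in)
 * and the counter value at which it was taken. */
static uint64_t snap_taken_at = UINT64_MAX;
static int      cached_xmin = 0;
static int      procarray_scans = 0;    /* counts full recomputations */

static int scan_procarray(void)
{
    procarray_scans++;
    return 100;                         /* pretend this is the xmin */
}

/* GetSnapshotData analogue, run under shared ProcArrayLock. */
static int get_snapshot_xmin(void)
{
    if (snap_taken_at != procarray_change_count)
    {
        cached_xmin = scan_procarray(); /* state changed: recompute */
        snap_taken_at = procarray_change_count;
    }
    return cached_xmin;                 /* else: reuse the previous snapshot */
}

/* ProcArrayEndTransaction analogue, run under exclusive ProcArrayLock. */
static void proc_array_end_transaction(void)
{
    procarray_change_count++;
}
```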
Another idea would be to try to avoid taking ProcArrayLock at all.
For example, suppose we again have a 64-bit counter in shared memory.
A transaction wishing to do ProcArrayEndTransaction() takes
ProcArrayLock in exclusive mode, increments the counter, memory
barrier, clears its XID, memory barrier, increments the counter,
releases ProcArrayLock. A transaction wishing to take a snapshot
first reads the counter. If the value read from the counter is even,
then we continue like this: memory barrier, take snapshot, memory
barrier, recheck counter. If we find that the counter is unchanged,
then nobody did ProcArrayEndTransaction() while we were taking the
snapshot, and we're good to go. If the counter changed then we take
ProcArrayLock in shared mode and retake the snapshot. (We also take
ProcArrayLock in shared mode if the initial value we read was odd.)
This seems like it would all but eliminate contention on the spinlock
protecting ProcArrayLock, but I'm not sure how much overhead it would
add in other cases. In particular, if we have to retake the snapshot
more than very occasionally, it's not going to be good.
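The even/odd counter protocol described above is essentially a seqlock. A single-threaded sketch, with the memory barriers elided (this model has only one thread) and all names invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seqlock-style protocol: writers make the counter odd while mutating
 * and even again when done; readers fall back to the lock if the
 * counter was odd at the start or changed while they read. */
static uint64_t seq = 0;        /* in shared memory in the real scheme */
static int shared_xmin = 100;   /* stand-in for the ProcArray state */

/* ProcArrayEndTransaction analogue, under exclusive ProcArrayLock. */
static void writer_end_transaction(int new_xmin)
{
    seq++;                      /* now odd: update in progress (barrier here) */
    shared_xmin = new_xmin;     /* clear our XID in the real scheme */
    seq++;                      /* even again: update complete (barrier here) */
}

/* Optimistic lock-free snapshot attempt; returns false when the caller
 * must retake the snapshot under ProcArrayLock in shared mode. */
static bool try_read_snapshot(int *xmin_out)
{
    uint64_t before = seq;
    if (before & 1)
        return false;           /* a writer is mid-update */
    *xmin_out = shared_xmin;    /* (barrier) take the snapshot (barrier) */
    return seq == before;       /* changed => somebody committed meanwhile */
}
```

The retry/fallback branch is exactly the concern raised above: the scheme only pays off if the counter rarely moves during a snapshot.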
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
What we can do is: when a transaction comes to compute its snapshot, it
checks whether some other transaction is already computing a snapshot.
If so, it just sleeps on the lock. When the other process finishes
computing the snapshot, it saves it in a shared area and wakes up all
processes waiting for the snapshot. All those processes then just copy
the snapshot from the shared area and are done. This will not only
reduce total CPU consumption by avoiding repetitive work, but will also
reduce the total time for which ProcArrayLock is held in SHARED mode by
avoiding a pipeline of GetSnapshotData calls. I am currently trying the
shared work queue mechanism to implement this, but I am sure we can do
this in some other way too.

I don't quite understand how this is going to work.
I will try, keeping it simple.
Transaction A
ends, invalidating the shared snapshot.  Now transactions B, C, and D
come along wishing to take snapshots. B begins taking a snapshot, so
C and D wait (blech!)
Yeah, C and D wait. But that's better than concurrently doing the same
computation.
for that to be complete. Then, they start
copying the snapshot.
And they are holding the ProcArrayLock in shared mode.
Meanwhile, transaction E ends, invalidating the
shared snapshot,
E can't end until B, C and D are done with copying the snapshot.
and then transaction F comes along, wanting to take a
snapshot. If F constructs a new shared snapshot, then doesn't that
overwrite the same memory area C and D were busy copying? It seems
like F needs some way to know that C and D are done with the old
shared snapshot before it starts writing a new one.
Right. And that can easily be achieved by holding a shared lock on the
ProcArray. The snapshot is invalidated while holding the exclusive lock
and set/copied while holding the shared lock. I am assuming setting a
boolean (valid/invalid) can safely be done with a shared lock, but I
may be wrong.
Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!).
Yes, B and F pay the price of a double copy. But I think it can be a
net saving, because C and D (and hopefully many more) don't need to
recompute the snapshot by going over a potentially large ProcArray. So,
as I demonstrated in the illustration, the total time for which the
ProcArray lock is held will be significantly less with a large number
of clients.
That
could maybe be a gain at very high concurrency when spinlock
contention is eating us alive, but it doesn't seem like a good idea in
general. Maybe I'm missing something; did you mean to attach a patch?
I have a patch that I am attaching. It contains some unrelated changes
(not sure why; maybe I took a diff against some other branch), but you
will get the idea. It needs improvement though, because it just checks
whether a valid shared snapshot is available and copies it; if not, it
goes ahead and computes one for itself. We need something more
intelligent to know that a snapshot computation is in progress and that
we should wait instead of building our own. This patch has shown a
15-20% improvement on the HP box that we are using for the benchmarks.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Attachments:
shared-snapshot-v2.patch (application/octet-stream, +179/-81)
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!).
Yes, B and F pay a price of double copy. But I think it can be a net
saving because C and D (and many more hopefully) don't need to
recompute the snapshot again by going over a potentially large
ProcArray.
Like Robert, I'm finding it hard to believe that this isn't going to be
a net pessimization in as many or more cases as it'd be a win. Also,
didn't we just get rid of most of the issue of "going over a large
ProcArray" with the move of hot members to a separate array?
In the end, getting a snapshot is exactly about copying information out
of shared memory. Perhaps it would be more productive to think about
how we could further modify the ProcArray representation to reduce the
impedance mismatch between it and the snapshot representation, rather
than introducing a second shared-memory copy of the same information.
regards, tom lane
Hi,
On Saturday, November 26, 2011 04:52:50 PM Pavan Deolasee wrote:
I think now that we have reduced the run time of the function itself,
we should now try to reduce the number of times the function is
called. Robert proposed a way to reduce the number of calls per
transaction. I think we can go one more step further and reduce the
number for across the transactions.
You could also try whether it makes a difference to reduce SnapshotData
to one cacheline instead of two. The data itself fits into one without
problems. Trivial patch attached.
Generally, I think we should check this for the more commonly used
structures; we probably have many with too much padding.
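To illustrate the kind of padding being discussed, here are two toy structs (not the actual SnapshotData layout) where only the member order differs; on an LP64 ABI, interleaving 8-byte pointers with 4-byte ints leaves holes that grouping same-sized members avoids:

```c
#include <assert.h>
#include <stddef.h>

/* 8 + 4 (+4 hole) + 8 + 4 (+4 tail padding) = 32 bytes on LP64. */
struct Padded
{
    char *a;
    int   b;
    char *c;
    int   d;
};

/* Same members, pointers grouped first: 8 + 8 + 4 + 4 = 24 bytes. */
struct Packed
{
    char *a;
    char *c;
    int   b;
    int   d;
};
```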
Andres
Attachments:
snapshotdata-one-cacheline.patch (text/x-patch, +2/-2)
Andres Freund <andres@anarazel.de> writes:
You could also try if it makes a difference reducing SnapshotData to one
instead of two cachelines. The data itself fits into one without problems.
Trivial patch attached.
On what grounds do you argue that this patch gets SnapshotData into one
cacheline today (and on what hardware), or that it will continue to do
so in the future? If we're this close to the edge then any future
addition to the struct will destroy the point. I'd just as soon keep
the fields in a logical order.
regards, tom lane
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
You could also try if it makes a difference reducing SnapshotData to one
instead of two cachelines. The data itself fits into one without
problems. Trivial patch attached.

On what grounds do you argue that this patch gets SnapshotData into one
cacheline today (and on what hardware), or that it will continue to do
so in the future? If we're this close to the edge then any future
addition to the struct will destroy the point. I'd just as soon keep
the fields in a logical order.
To my knowledge there is no current and supported (i.e. I don't count Itanium)
hardware that doesn't have 64bit cachelines except in the embedded market.
I also don't see a movement towards 128bit cpus or anything else requiring
bigger pointers.
If platforms with 128bit cachelines or such would gain popularity - nothing
would be lost by that change.
Which change are you "afraid" of?
Now the holes I talked about obviously were on a 64bit machine. But there are
a few situations where improving layout so it fits with 8byte alignment for
pointers makes the situation worse for 32bit. Looking a bit careful at the
changes one does should catch those problems.
Also it's not *that* close to overflowing again. It was 72 bytes before, now
it's 56 on a 64bit x86 machine with the Linux ABI.
I agree that logical order is important. I don't propose changing all structs
in pg mechanically.
It's sad that there is no sensible method to let the compiler safely reorder
struct members across compilers (and versions) and platforms...
Andres
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
Andres
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
You forgot to attach the benchmark results.
My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit.
But testing is even better than guessing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
You forgot to attach the benchmark results.
My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit.
All current x86 cpus use 64bytes. The 2nd 128bit reference was a typo. Sorry
for that.
And why is 72=>56 *bytes* (I even got that one right) not relevant for 64byte
cachelines?
And yea. I didn't add benchmark results. I don't think I *have* to do that
when making suggestions to somebody trying to improve something specific. I
also currently don't have hardware where I can sensibly run at a high enough
concurrency to see that GetSnapshotData takes ~40% of runtime.
Additional cacheline references around synchronized access can hurt to my
knowledge...
Andres
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
You forgot to attach the benchmark results.
My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit.

But testing is even better than guessing.
Being prodded like that, I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during that time, which is why I
kept it rather short...
That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) *
2(threads) and 24GB of ram.
Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench
pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
origsnap: 92825.743958 93145.110901 93389.915474 93175.482351
reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489
pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60
origsnap: 81846.743329 81545.175672 81702.755226 81576.576435
81228.154119 81546.047708 81421.436262
reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
81762.421592 81496.333144 81732.088876
At that point I noticed I had accidentally run with a nearly stock config...
An even shorter run with a more appropriate config yielded:
pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20
origsnap: 102234.664020 102003.449741 102119.509053 101722.410387
101973.651318 102056.440561
reordersnap: 103444.877879 103385.888808 103302.318923 103372.659486
103330.157612 103313.833821
Looks like a win to me. Even on this comparatively small machine.
Andres
Andres Freund wrote:
All current x86 cpus use 64bytes.
That matches what I found in recent research on the topic.
-Kevin
On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de>
wrote:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
You forgot to attach the benchmark results.
My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit. But testing is even
better than guessing.
Being prodded like that I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during that time, which is why I
kept it rather short...
That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) *
2(threads) and 24GB of ram.
Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench
pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
origsnap: 92825.743958 93145.110901 93389.915474 93175.482351
reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489
pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60
origsnap: 81846.743329 81545.175672 81702.755226 81576.576435
81228.154119 81546.047708 81421.436262
reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
81762.421592 81496.333144 81732.088876
At that point I noticed I had accidentally run with a nearly stock config...
An even shorter run with a more appropriate config yielded:
pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20
origsnap: 102234.664020 102003.449741 102119.509053 101722.410387
101973.651318 102056.440561
reordersnap: 103444.877879 103385.888808 103302.318923 103372.659486
103330.157612 103313.833821
Looks like a win to me. Even on this comparatively small machine.
This may not be necessary, but can you please share the modified config
you used for the last run?
I tabulated your last results to make it more readable, and added columns
to show the improvement.
origsnap         reordersnap      diff          % improvement
-------------------------------------------------------------
102234.664020    103444.877879    1210.213859   1.1837607827
102003.449741    103385.888808    1382.439067   1.3552865815
102119.509053    103302.318923    1182.809870   1.1582604352
101722.410387    103372.659486    1650.249099   1.6223063263
101973.651318    103330.157612    1356.506294   1.3302517626
102056.440561    103313.833821    1257.393260   1.2320567454
That looks like a win to me too. We're getting a little over 1% improvement
for free!
Maybe submitting this patch to the commitfest might help get some serious
consideration from a reviewer.
Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Hi Gurjeet,
On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de>
wrote:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.
You forgot to attach the benchmark results.
My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit. But testing is even
better than guessing.
Being prodded like that I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during that time, which is why I
kept it rather short...
That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) *
2(threads) and 24GB of ram.
Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench
pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
...
Looks like a win to me. Even on this comparatively small machine.

This may not be necessary, but can you please share the modified config you
used for the last run.
Can't right now because I don't have access from here, but I will do
so. I don't think it's particularly interesting though.
I tabulated your last results to make them more readable, and added
columns to show the improvement.

origsnap         reordersnap      diff          % improvement
-------------------------------------------------------------
102234.664020    103444.877879    1210.213859   1.1837607827
102003.449741    103385.888808    1382.439067   1.3552865815
102119.509053    103302.318923    1182.809870   1.1582604352
101722.410387    103372.659486    1650.249099   1.6223063263
101973.651318    103330.157612    1356.506294   1.3302517626
102056.440561    103313.833821    1257.393260   1.2320567454

That looks like a win to me too. We're getting a little over 1% improvement
for free!
At least if we can convince Tom that such a change would be ok ;)
I would like to see somebody running a benchmark on a machine with higher
concurrency...
Maybe submitting this patch to the commitfest might help get some serious
consideration from a reviewer.
It wasn't actually meant as a patch but as a comment, but well ;). Will
add it.
Andres
Andres Freund wrote:
I would like to see somebody running a benchmark on a machine with
higher concurrency...
Yeah, me too. I don't find it at all hard to believe that we will
see significant performance benefit by changing the PGPROC structure
so that all parts of it can be accessed through one cache line rather
than two. The fact that we saw improvement when changing it down to
two, even though there is extra indirection in some places, seems
like pretty solid evidence on the face of it.
I went in to the office on Sunday to try to get a few hours of
benchmarks for this on our new monster machine, but the DBA
responsible for putting it into production was on it first, loading
it with production data. At this point it will get really hard to
schedule any more of the 20-hour runs that were the basis of most of
my recent numbers, but I may be able to shut down production use for
a two or three hour window on a weekend to run an abbreviated set,
focusing on the higher concurrency levels. Ideally that would be on
top of the other pending performance patches.
Based on my earlier rounds of benchmarking, I expect that this
approach will show the greatest benefit at the higher concurrency
levels, where cache lines are most stressed.
-Kevin
On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Andres Freund wrote:
I would like to see somebody running a benchmark on a machine with
higher concurrency...

Yeah, me too. I don't find it at all hard to believe that we will
see significant performance benefit by changing the PGPROC structure
so that all parts of it can be accessed through one cache line rather
than two. The fact that we saw improvement when changing it down to
two, even though there is extra indirection in some places, seems
like pretty solid evidence on the face of it.
I think there is a fundamental difference between the PGPROC patch that
we did and the rearranging of SnapshotData members that Andres has
proposed. The PGPROC structure lives in shared memory and is very
heavily accessed by GetSnapshotData (and some others); accesses
increase linearly as the number of clients goes up. The SnapshotData
structure mostly lives in backend-local memory (unless we do something
like what I suggested in the OP), and hence fitting it into a single
cache line may or may not have much impact. We don't even guarantee
that it starts at a cacheline boundary, which means in most cases it
will still be spread over multiple cache lines.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!).

Yes, B and F pay a price of double copy. But I think it can be a net
saving because C and D (and many more hopefully) don't need to
recompute the snapshot again by going over a potentially large
ProcArray.

Like Robert, I'm finding it hard to believe that this isn't going to be
a net pessimization in as many or more cases as it'd be a win. Also,
didn't we just get rid of most of the issue of "going over a large
ProcArray" with the move of hot members to a separate array?
Yeah, separating the PGPROC members has helped a lot. But that does
not reduce the number of GetSnapshotData calls. It only makes each
call less expensive. As I said, I had seen 15-20% improvement with not
even a slightly tuned patch, so I am optimistic about the potential of
the approach.
In the end, getting a snapshot is exactly about copying information out
of shared memory. Perhaps it would be more productive to think about
how we could further modify the ProcArray representation to reduce the
impedance mismatch between it and the snapshot representation, rather
than introducing a second shared-memory copy of the same information.
I think that's a good idea. We need a representation that needs minimal
processing to derive the snapshot. One idea is to maintain a bitmap of
currently running transactions in shared memory. The size of the bitmap
has to be significantly larger than the ProcArray itself because there
can be gaps. But even then, since it's just one bit per XID, we can
have a bitmap representing a window of, say, 16K or 32K consecutive
XIDs. We can also have a summary bitmap to quickly find whether there
is at least one running transaction in any given chunk. Long-running
transactions that don't fit in the bitmap can be tracked separately in
an array. A snapshot is then just a copy of this bitmap along with some
additional information.
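A sketch of such a bitmap, with invented names and sizes (the window size, chunk size, and the handling of wraparound and of long-running XIDs are all simplified away); the summary bitmap keeps one bit per 64-XID chunk so wholly empty chunks can be skipped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_XIDS 16384                /* window of consecutive XIDs */
#define CHUNKS      (WINDOW_XIDS / 64)   /* 64 XIDs per chunk */

static uint64_t xid_bits[CHUNKS];        /* one bit per XID in the window */
static uint64_t summary[CHUNKS / 64];    /* bit i set => chunk i non-empty */

/* Mark an XID as running (transaction start). */
static void xid_set_running(uint32_t xid)
{
    uint32_t slot = xid % WINDOW_XIDS;
    xid_bits[slot / 64] |= UINT64_C(1) << (slot % 64);
    summary[slot / (64 * 64)] |= UINT64_C(1) << ((slot / 64) % 64);
}

/* Clear an XID (transaction end); drop the summary bit when the chunk empties. */
static void xid_clear_running(uint32_t xid)
{
    uint32_t slot = xid % WINDOW_XIDS;
    xid_bits[slot / 64] &= ~(UINT64_C(1) << (slot % 64));
    if (xid_bits[slot / 64] == 0)
        summary[slot / (64 * 64)] &= ~(UINT64_C(1) << ((slot / 64) % 64));
}

/* Visibility-style probe: the summary lets empty chunks be skipped fast. */
static bool xid_is_running(uint32_t xid)
{
    uint32_t slot = xid % WINDOW_XIDS;
    if (!(summary[slot / (64 * 64)] & (UINT64_C(1) << ((slot / 64) % 64))))
        return false;                    /* whole 64-XID chunk is empty */
    return (xid_bits[slot / 64] >> (slot % 64)) & 1;
}
```

A snapshot then becomes a memcpy of the (summary and occupied chunks of the) bitmap rather than a walk over the ProcArray.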
Will try to think more about it and see if I can test this idea. But
any comments welcome.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!).
Yes, B and F pay a price of double copy. But I think it can be a net
saving because C and D (and many more hopefully) don't need to
recompute the snapshot again by going over a potentially large
ProcArray.
Like Robert, I'm finding it hard to believe that this isn't going to be
a net pessimization in as many or more cases as it'd be a win. Also,
didn't we just get rid of most of the issue of "going over a large
ProcArray" with the move of hot members to a separate array?
Yeah, separating the PGPROC members has helped a lot. But that does
not reduce the number of GetSnapshotData calls. It only makes each
call less expensive. As I said, I had seen 15-20% improvement with not
even a slightly tuned patch, so I am optimistic about the potential of
the approach.
Agreed. I think there is great potential here. We have been stumped at
how to reduce the overhead of this for years, and it seems you are now
on a promising path.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Andres Freund wrote:
I would like to see somebody running a benchmark on a machine with
higher concurrency...

Yeah, me too. I don't find it at all hard to believe that we will
see significant performance benefit by changing the PGPROC structure
so that all parts of it can be accessed through one cache line rather
than two. The fact that we saw improvement when changing it down to
two, even though there is extra indirection in some places, seems
like pretty solid evidence on the face of it.

I think there is a fundamental
we did and the rearranging SnapshotData members that Andres has
proposed. The PGPROC structure is a shared memory area, very heavily
accessed by GetSnapshotData (and some others). There was a linear
increase in the access as number of clients go up. The SnapshotData
structure is mostly from a backend local memory (unless we do
something what I suggested in the OP) and hence fitting that in a
single cache line may or may not have much impact. We don't even
guarantee that it starts at a cacheline which means in most cases it
will still be spread on multiple cache lines.
Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see
the earlier reformatted mail from Gurjeet), and the benefits were definitely
smaller without concurrency. But I agree: the benefits won't be even remotely
as big as the PGPROC splitup patch.
You're also right about the alignment not being predictable enough.
Perhaps pg should start using/introducing a force_cacheline_align macro which
can easily be implemented on most compilers? Especially for SnapshotData and
consorts, which are aligned statically, that should be good enough.
Something along the lines of:
#if defined(__GNUC__)
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#elif defined(_MSC_VER)
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)
#endif
#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)
Looks complete enough for now.
Back to PGPROC:
Looking at the current PGPROC layout on x86-64 with linux abi (generated by
pahole):
struct PGPROC {
SHM_QUEUE links; /* 0 16 */
PGSemaphoreData sem; /* 16 8 */
int waitStatus; /* 24 4 */
Latch procLatch; /* 28 12 */
LocalTransactionId lxid; /* 40 4 */
int pid; /* 44 4 */
int pgprocno; /* 48 4 */
BackendId backendId; /* 52 4 */
Oid databaseId; /* 56 4 */
Oid roleId; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
bool recoveryConflictPending; /* 64 1 */
bool lwWaiting; /* 65 1 */
bool lwExclusive; /* 66 1 */
/* XXX 5 bytes hole, try to pack */
struct PGPROC * lwWaitLink; /* 72 8 */
LOCK * waitLock; /* 80 8 */
PROCLOCK * waitProcLock; /* 88 8 */
LOCKMODE waitLockMode; /* 96 4 */
LOCKMASK heldLocks; /* 100 4 */
XLogRecPtr waitLSN; /* 104 8 */
int syncRepState; /* 112 4 */
/* XXX 4 bytes hole, try to pack */
SHM_QUEUE syncRepLinks; /* 120 16 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
SHM_QUEUE myProcLocks[16]; /* 136 256 */
/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
struct XidCache subxids; /* 392 256 */
/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
LWLockId backendLock; /* 648 4 */
/* XXX 4 bytes hole, try to pack */
uint64 fpLockBits; /* 656 8 */
Oid fpRelId[16]; /* 664 64 */
/* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
bool fpVXIDLock; /* 728 1 */
/* XXX 3 bytes hole, try to pack */
LocalTransactionId fpLocalTransactionId; /* 732 4 */
/* size: 736, cachelines: 12, members: 28 */
/* sum members: 720, holes: 4, sum holes: 16 */
/* last cacheline: 32 bytes */
}
It seems possible that reordering it to avoid sharing unrelated stuff on the
same cacheline might be beneficial. E.g. moving the fastpath lock stuff away from
the more common lwlock infrastructure might be a good idea. Currently those
lines will bounce continually even though the fastpath stuff will mostly be
accessed locally.
Subxids and syncRepLinks look like good candidates too.
Anyway, sorry for the topic drift.
Andres
Hi,
On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
This may not be necessary, but can you please share the modified config you
used for the last run.
I just copied the .conf I had for some independent development.
max_connections = 100
listen_addresses = ''
port=5432
shared_buffers = 2GB
temp_buffers = 64MB
work_mem = 96MB
maintenance_work_mem = 1GB
effective_cache_size = 20GB
log_line_prefix = '%d %a %c %v %x %m ' # special values:
log_statement = 'ddl'
#decreases variance
track_activities = off
track_counts = off
autovacuum = off
update_process_title = off
logging_collector = off
log_destination = stderr
log_min_messages = error
log_checkpoints = on
log_autovacuum_min_duration=0
synchronous_commit = off
checkpoint_segments = 30
checkpoint_timeout = 30min
checkpoint_completion_target = 0.8
log_error_verbosity = verbose
But about the only really important thing is the shared_buffers
difference (commenting it out nearly yielded the former speed).
Andres