Reduce ProcArrayLock contention
I have been working on analyzing different ways to reduce
the contention around ProcArrayLock. I have mainly evaluated
two ideas: the first is to partition ProcArrayLock (the basic idea
is to allow multiple clients, equal to the number of ProcArrayLock
partitions, to perform ProcArrayEndTransaction and then wait for
all of them at GetSnapshotData time), and the second is a mechanism
to group-clear the XIDs during ProcArrayEndTransaction(). The second
idea clearly stands out in my tests, so I have prepared a patch for it
to discuss further here.
The idea behind the second approach (group-clearing XIDs) is: first try
to acquire ProcArrayLock conditionally in ProcArrayEndTransaction(); if we
get the lock, clear the advertised XID; otherwise set a flag (which
indicates that the advertised XID needs to be cleared for this proc) and
push the proc onto pendingClearXidList. All procs except one will
wait for their XID to be cleared. The one remaining proc will attempt the
lock acquisition; after acquiring the lock, it pops all of the requests off
the list using compare-and-swap, servicing each one before moving to the
next proc, and clearing their XIDs. After servicing all the requests
on pendingClearXidList, it releases the lock, and then once again goes
through the saved pendingClearXidList and wakes all the processes waiting
for their XID to be cleared. To set the appropriate value for
ShmemVariableCache->latestCompletedXid, we need to advertise
latestXid in case the proc needs to be pushed onto pendingClearXidList.
Attached patch implements the above idea.
Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)
Non-default parameters
------------------------------------
max_connections = 150
shared_buffers = 8GB
min_wal_size = 10GB
max_wal_size = 15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
Head : commit 51d0fe5d
Patch-1 : group_xid_clearing_at_trans_end_rel_v1

Client Count/TPS      1      8     16     32     64    128
HEAD                814   6092  10899  19926  23636  17812
Patch-1            1086   6483  11093  19908  31220  28237
The graph for the data is attached.
Points about performance data
---------------------------------------------
1. Gives a good performance improvement at or above 64 clients
and a somewhat moderate improvement at lower client counts. The
reason is that the contention around ProcArrayLock is mainly
seen at higher client counts. I have checked that at higher client counts,
it starts behaving as if lockless (which means performance with the patch
is equivalent to just commenting out ProcArrayLock in
ProcArrayEndTransaction()).
2. There is some noise in this data (at 1 client count, I don't expect
much difference).
3. I have done similar tests on power-8 m/c and found similar gains.
4. The gains are visible when the data fits in shared_buffers; for other
workloads I/O starts to dominate.
5. I have seen that the effect of the patch is much more visible if we keep
autovacuum = off (doing a manual vacuum after each run) and keep
wal_writer_delay at a lower value (say 20ms). I have not included that
data here, but if somebody is interested, I can run detailed tests
against HEAD with those settings and share the results.
Here are the steps used to take the data (they are repeated for each reading)
--------------------------------------------------------------------------------------------------------
1. Start Server
2. dropdb postgres
3. createdb postgres
4. pgbench -i -s 300 postgres
5. pgbench -c $threads -j $threads -T 1800 -M prepared postgres
6. checkpoint
7. Stop Server
Thanks to Robert Haas for discussing the idea (offlist) and
suggesting improvements, and also to Andres Freund for discussing
and sharing thoughts about this idea at PGCon.
Suggestions?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 29 June 2015 at 16:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks to Robert Haas for having discussion (offlist) about the idea
and suggestions to improve it and also Andres Freund for having
discussion and sharing thoughts about this idea at PGCon.
Weird. This patch is implemented exactly the way I said to implement it
publicly at PgCon.
Was nobody recording the discussion at the unconference??
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com> wrote:
On 29 June 2015 at 16:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks to Robert Haas for having discussion (offlist) about the idea
and suggestions to improve it and also Andres Freund for having
discussion and sharing thoughts about this idea at PGCon.
Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon.
Was nobody recording the discussion at the unconference??
Amit presented an earlier version of this at the unconference?
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 29 June 2015 at 18:11, Andres Freund <andres@anarazel.de> wrote:
On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com>
wrote:On 29 June 2015 at 16:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks to Robert Haas for having discussion (offlist) about the idea
and suggestions to improve it and also Andres Freund for having
discussion and sharing thoughts about this idea at PGCon.
Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon.
Was nobody recording the discussion at the unconference??
Amit presented an earlier version of this at the unconference?
Yes, I know. And we all had a long conversation about how to do it without
waking up the other procs.
Forming a list, like we use for sync rep and having just a single process
walk the queue was the way I suggested then and previously.
Weird.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, I know. And we all had a long conversation about how to do it without
waking up the other procs.
Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously.
Weird.
I am not sure what your point is. Are you complaining that you didn't
get a design credit for this patch? If so, I think that's a bit
petty. I agree that you mentioned something along these lines at
PGCon, but Amit and I have been discussing this every week for over a
month, so it's not as if the conversations at PGCon were the only
ones, or the first. Nor is there a conspiracy to deprive Simon Riggs
of credit for his ideas. I believe that you should assume good faith
and take it for granted that Amit credited who he believed that he got
his ideas from. The fact that you may have had similar ideas does not
mean that he got his from you. It probably does mean that they are
good ideas, since we are apparently all thinking in the same way.
(I could provide internal EDB emails documenting the timeline of
various ideas and showing which ones happened before and after PGCon,
and we could debate exactly who thought of what when. But I don't
really see the point. I certainly hope that a debate over who
deserves how much credit for what isn't going to overshadow the good
work Amit has done on this patch.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 29, 2015 at 10:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 29 June 2015 at 18:11, Andres Freund <andres@anarazel.de> wrote:
On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com>
wrote:
On 29 June 2015 at 16:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks to Robert Haas for having discussion (offlist) about the idea
and suggestions to improve it and also Andres Freund for having
discussion and sharing thoughts about this idea at PGCon.
Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon.
Was nobody recording the discussion at the unconference??
Amit presented an earlier version of this at the unconference?
Yes, I know. And we all had a long conversation about how to do it
without waking up the other procs.
Forming a list, like we use for sync rep and having just a single process
walk the queue was the way I suggested then and previously.
Yes, I remember your suggestion from the unconference very well,
and I really value that idea/suggestion.
The point here is that I got the chance to have in-person discussions
with Robert and Andres, who have spent more time discussing
the problem (Robert has discussed this a lot apart
from PGCon as well), but that doesn't mean I am not thankful for the
ideas and suggestions I got from you and others during the unconference.
Thank you for participating in the discussion and giving suggestions;
I really mean it, and I appreciated it even afterwards.
Now, I would like to briefly explain how the allow-one-waker idea has
helped to improve the patch, since not everybody here was present
at that unconference. The basic problem I was seeing with the
initial version of the patch was that I was using LWLockAcquireOrWait
for all the clients that were not able to get ProcArrayLock
(conditionally, on the first attempt); after waking, each proc
checked whether its XID was cleared, and if not, it would again try
to AcquireOrWait for ProcArrayLock. This limited the patch's
performance improvement at moderate client counts like
32 or 64, because even trying for the LWLock in exclusive mode again
and again was the limiting factor (I tried
LWLockAcquireConditional as well instead of LWLockAcquireOrWait;
it helped a bit, but not very much). Allowing just one client
to become a kind of group leader that clears the other procs' XIDs and
wakes them up, with all the other procs waiting after adding themselves to
the pendingClearXid list, has helped to reduce the bottleneck around
ProcArrayLock significantly.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 30 June 2015 at 03:43, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:
Yes, I know. And we all had a long conversation about how to do it without waking up the other procs.
Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously.
Weird.
I am not sure what your point is. Are you complaining that you didn't
get a design credit for this patch? If so, I think that's a bit
petty. I agree that you mentioned something along these lines at
PGCon, but Amit and I have been discussing this every week for over a
month, so it's not as if the conversations at PGCon were the only
ones, or the first. Nor is there a conspiracy to deprive Simon Riggs
of credit for his ideas. I believe that you should assume good faith
and take it for granted that Amit credited who he believed that he got
his ideas from. The fact that you may have had similar ideas does not
mean that he got his from you. It probably does mean that they are
good ideas, since we are apparently all thinking in the same way.
Oh, I accept multiple people can have the same ideas. That happens to me a
lot around here.
What I find weird is that the discussion was so intense about
LWLockAcquireOrWait that when someone presented a solution there were
people that didn't notice. It makes me wonder whether large group
discussions are worth it.
What I find even weirder is the thought that there were another 100 people
in the room, and it makes me wonder whether others present had even better
ideas but didn't speak up for some other reason. I guess that happens
on-list all the time; it's just that we seldom experience the number of people
watching our discussions. I wonder how many times people give good ideas
and we ignore them, myself included.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2015 at 11:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
What I find weird is that the discussion was so intense about
LWLockAcquireOrWait that when someone presented a solution there were people
that didn't notice. It makes me wonder whether large group discussions are
worth it.
I didn't think of this myself, but it feels like something I could
have thought of easily.
--
Peter Geoghegan
On 30 June 2015 at 04:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
Now, I would like to briefly explain how the allow-one-waker idea has
helped to improve the patch, since not everybody here was present
at that unconference.
The same idea applies to marking commits in clog, for which I have been
sitting on a patch for a month or so and will post now that I'm done travelling.
These ideas have been around some time and are even listed on the
PostgreSQL TODO:
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00206.php
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 30, 2015 at 11:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 June 2015 at 04:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
Now, I would like to briefly explain how the allow-one-waker idea has helped to improve the patch, since not everybody here was present at that unconference.
The same idea applies to marking commits in clog, for which I have been sitting on a patch for a month or so and will post now that I'm done travelling.
Sure, and I think we might want to try something similar even
for XLogFlush, where we use LWLockAcquireOrWait for
WALWriteLock. I am not sure how it will work out in that case as
I/O is involved, but I think it is worth trying.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 30 June 2015 at 07:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
Sure, and I think we might want to try something similar even
for XLogFlush, where we use LWLockAcquireOrWait for
WALWriteLock. I am not sure how it will work out in that case as
I/O is involved, but I think it is worth trying.
+1
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
Head : commit 51d0fe5d
Patch-1 : group_xid_clearing_at_trans_end_rel_v1

Client Count/TPS      1      8     16     32     64    128
HEAD                814   6092  10899  19926  23636  17812
Patch-1            1086   6483  11093  19908  31220  28237

The graph for the data is attached.
Numbers look impressive and definitely show that the idea is worth
pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8
clients, I did not see any improvement. In fact, averages over 2 runs
showed a slight 2-4% decline in TPS. Having said that, there is no
reason to disbelieve your numbers, and on much more powerful machines we
might see the gains.
BTW I ran the tests with, pgbench -s 10 -c 4 -T 300
Points about performance data
---------------------------------------------
1. Gives a good performance improvement at or above 64 clients
and a somewhat moderate improvement at lower client counts. The
reason is that the contention around ProcArrayLock is mainly
seen at higher client counts. I have checked that at higher client counts,
it starts behaving as if lockless (which means performance with the patch is
equivalent to just commenting out ProcArrayLock in
ProcArrayEndTransaction()).
Well, I am not entirely sure if that's the correct way of looking at it. Sure,
you would see less contention on the ProcArrayLock because the fact is that
there are far fewer backends trying to acquire it. But those who don't get
the lock will sleep, and hence the contention is moved somewhere else, at
least partially.
2. There is some noise in this data (at 1 client count, I don't expect
much difference).
3. I have done similar tests on power-8 m/c and found similar gains.
As I said, I'm not seeing benefits on my laptop (MacBook Pro, quad core,
SSD). But then I ran with a much lower scale factor and a much smaller number
of clients.
4. The gains are visible when the data fits in shared_buffers as for other
workloads I/O starts dominating.
That seems to be perfectly expected.
5. I have seen that effect of Patch is much more visible if we keep
autovacuum = off (do manual vacuum after each run) and keep
wal_writer_delay to lower value (say 20ms).
Do you know why that happens? Is it because the contention moves somewhere
else with autovacuum on?
Regarding the design itself, I have an idea that maybe we can create a
general-purpose infrastructure to use this technique. If it's useful here,
I'm sure there are other places where this can be applied with similar
effect.
For example, how about adding an API such as LWLockDispatchWork(lock, mode,
function_ptr, data_ptr)? Here the data_ptr points to somewhere in shared
memory that the function_ptr can work on once the lock is available. If the
lock is available in the requested mode, then the function_ptr is executed
with the given data_ptr and the function returns. If the lock is not
available, then the work is dispatched to some queue (tracked on a per-lock
basis?) and the process goes to sleep. Whenever the lock becomes available
in the requested mode, the work is executed by some other backend and the
primary process is woken up. This would most likely happen in the
LWLockRelease() path, when the last holder is about to give up the lock so
that it becomes available in the requested "mode". Now there is a lot of
handwaving here, and I'm not sure the LWLock infrastructure permits us to
add something like this easily, but I thought I would put the idea forward
anyway. In fact, I remember trying something of this sort a long time
back, but can't recollect why I gave up on the idea. Maybe I did not see
much benefit in the whole approach of clubbing work-pieces together and doing
them in a single process. But then I probably did not have access to powerful
machines to correctly measure the benefits, so I'm not willing to
give up on the idea, especially given your test results.
BTW, maybe LWLockDispatchWork() makes sense only for EXCLUSIVE locks,
because in READ mode we tend to read from shared memory and populate local
structures, and that can only happen in the primary backend itself.
Regarding the patch, the compare-and-exchange function calls that you've
used would work only for 64-bit machines, right? You would need to use
equivalent 32-bit calls on a 32-bit machine.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 24, 2015 at 4:26 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
Head : commit 51d0fe5d
Patch-1 : group_xid_clearing_at_trans_end_rel_v1

Client Count/TPS      1      8     16     32     64    128
HEAD                814   6092  10899  19926  23636  17812
Patch-1            1086   6483  11093  19908  31220  28237

The graph for the data is attached.
Numbers look impressive and definitely shows that the idea is worth
pursuing. I tried patch on my laptop. Unfortunately, at least for 4 and 8
clients, I did not see any improvement.
I can't help with this because I think we need a somewhat
bigger machine to test the impact of the patch.
In fact, averages over 2 runs showed a slight 2-4% decline in TPS.
Having said that, there is no reason to disbelieve your numbers, and on
much more powerful machines we might see the gains.
BTW I ran the tests with, pgbench -s 10 -c 4 -T 300
I am not sure this result is worth investigating, as in
write tests (especially for short durations) such fluctuations can
occur, and I think until we see complete results for multiple client
counts (1, 4, 8 ... 64 or 128; possible on some high-end machine) it is
difficult to draw any conclusion.
Points about performance data
---------------------------------------------
1. Gives a good performance improvement at or above 64 clients
and a somewhat moderate improvement at lower client counts. The
reason is that the contention around ProcArrayLock is mainly
seen at higher client counts. I have checked that at higher client counts,
it starts behaving as if lockless (which means performance with the patch is
equivalent to just commenting out ProcArrayLock in
ProcArrayEndTransaction()).
Well, I am not entirely sure if that's the correct way of looking at it.
Sure, you would see less contention on the ProcArrayLock because the fact
is that there are far fewer backends trying to acquire it.
I was stating that fact even without my patch; basically, I have
tried commenting out ProcArrayLock in ProcArrayEndTransaction.
But those who don't get the lock will sleep and hence the contention is
moved somewhere else, at least partially.
Sure, if contention is reduced at one place it will move
to the next lock.
4. The gains are visible when the data fits in shared_buffers as for
other
workloads I/O starts dominating.
Thats seems be perfectly expected.
5. I have seen that effect of Patch is much more visible if we keep
autovacuum = off (do manual vacuum after each run) and keep
wal_writer_delay to lower value (say 20ms).
Do you know why that happens? Is it because the contention moves somewhere
else with autovacuum on?
No; autovacuum generates I/O, due to which there is sometimes
more variation in write tests.
Regarding the design itself, I've an idea that may be we can create a
general purpose infrastructure to use this technique.
I think this could be beneficial if we can come up with
some clean interface.
If its useful here, I'm sure there are other places where this can be
applied with similar effect.
I also think so.
For example, how about adding an API such as LWLockDispatchWork(lock,
mode, function_ptr, data_ptr)? Here the data_ptr points to somewhere in
shared memory that the function_ptr can work on once lock is available. If
the lock is available in the requested mode then the function_ptr is
executed with the given data_ptr and the function returns.
I can do something like that if others also agree with this new
API in the LWLock series, but personally I don't think lwlock.c is
the right place to expose an API for this work. Broadly, the work
we are doing can be thought of as the below sub-tasks.
1. Advertise each backend's XID.
2. Push all backends except one onto a global list.
3. Wait till someone wakes us, check if the XID is cleared, and
repeat until the XID is clear.
4. Acquire the lock.
5. Pop all the backends, clear each one's XID, and use
their published XIDs to advance the global latestCompletedXid.
6. Release the lock.
7. Wake all the processes waiting for their XID to be cleared,
and before waking, mark that the backend's XID is clear.
So among these, only step 2 can be common among different
algorithms; the others need work specific to each optimization.
Does anyone else see a better way to provide a generic API, so
that it can be used in other places if required in future?
If the lock is not available then the work is dispatched to some Q
(tracked on per-lock basis?) and the process goes to sleep. Whenever the
lock becomes available in the requested mode, the work is executed by some
other backedn and the primary process is woken up. This will most likely
happen in the LWLockRelease() path when the last holder is about to give
up the lock so that it becomes available in the requested "mode".
I am not able to follow what you want to achieve with this.
Why is the 'Q' better than the current process for performing the
work specific to the whole group, and does 'Q' also wait on the
current lock? If yes, how?
I think this will overcomplicate things without any real
benefit, at least for this optimization.
Regarding the patch, the compare-and-exchange function calls that you've
used would work only for 64-bit machines, right? You would need to use
equivalent 32-bit calls on a 32-bit machine.
I thought the internal API would automatically take care of it;
for example, for MSVC it uses _InterlockedCompareExchange64,
and if that doesn't work on 32-bit systems or is not defined, then
we have to use the 32-bit version. But I am not certain about
that.
Note - This patch requires some updates in
src/backend/access/transam/README.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 25, 2015 at 12:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I thought the internal API would automatically take care of it;
for example, for MSVC it uses _InterlockedCompareExchange64,
and if that doesn't work on 32-bit systems or is not defined, then
we have to use the 32-bit version. But I am not certain about
that.
Instead of using pg_atomic_uint64, how about using pg_atomic_uint32
and storing the pgprocno rather than the pointer directly? Then it
can work the same way (and be the same size) on every platform.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jul 25, 2015 at 10:12 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Numbers look impressive and definitely show that the idea is worth
pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8
clients, I did not see any improvement.
I can't help with this because I think we need a somewhat
bigger machine to test the impact of the patch.
I understand. IMHO it would be a good idea, though, to ensure that the patch
does not cause a regression on other setups, such as a less powerful machine
or while running with a lower number of clients.
I was telling that fact even without my patch. Basically I have
tried by commenting ProcArrayLock in ProcArrayEndTransaction.
I did not get that. You mean the TPS is the same if you run with
ProcArrayLock commented out in ProcArrayEndTransaction? Is that safe to do?
But those who don't get the lock will sleep and hence the contention is
moved somewhere else, at least partially.
Sure, if contention is reduced at one place it will move
to next lock.
What I meant was that the lock may not show up in the contention profile
because all but one process now sleep while their work is done by the group
leader. So the total time spent in the critical section remains the same,
but is not shown in the profile. Sure, your benchmark numbers show this is
still better than all processes contending for the lock.
No, autovacuum generates I/O due to which sometimes there
is more variation in Write tests.
Sure, but on average does it still show similar improvements? Or does
the test become I/O-bound and hence the bottleneck shifts somewhere else?
Can you please post those numbers as well when you get a chance?
I can do something like that if others also agree with this new API in the LWLock series, but personally I don't think lwlock.c is the right place to expose an API for this work. Broadly, the work we are doing can be thought of as the below sub-tasks.
1. Advertise each backend's XID.
2. Push all backends except one onto a global list.
3. Wait till someone wakes us, check if the XID is cleared, and repeat until the XID is clear.
4. Acquire the lock.
5. Pop all the backends, clear each one's XID, and use their published XIDs to advance the global latestCompletedXid.
6. Release the lock.
7. Wake all the processes waiting for their XID to be cleared, and before waking, mark that the backend's XID is clear.
So among these, only step 2 can be common among different algorithms; the others need work specific to each optimization.
Right, but if we could encapsulate that work in a function that just needs
to work on some shared memory, I think we can build such infrastructure.
It's possible, though, that a more elaborate infrastructure is needed than
just one function pointer. For example, in this case we also want to set
latestCompletedXid after clearing the XIDs of all pending processes.
Regarding the patch, the compare-and-exchange function calls that you've
used would work only for 64-bit machines, right? You would need to use
equivalent 32-bit calls on a 32-bit machine.
I thought the internal API would automatically take care of it;
for example, for MSVC it uses _InterlockedCompareExchange64,
and if that doesn't work on 32-bit systems or is not defined, then
we have to use the 32-bit version. But I am not certain about
that.
Hmm. The pointer will be a 32-bit field on a 32-bit machine. I don't know
how exchanging that with a 64-bit integer would be safe.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 27, 2015 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jul 25, 2015 at 12:42 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I thought the internal API would automatically take care of it;
for example, for MSVC it uses _InterlockedCompareExchange64,
and if that doesn't work on 32-bit systems or is not defined, then
we have to use the 32-bit version. But I am not certain about
that.
Instead of using pg_atomic_uint64, how about using pg_atomic_uint32
and storing the pgprocno rather than the pointer directly?
Good suggestion!
I think this can work the way you are suggesting, and I am working on
the same. Here I have one question: do you prefer that the code for
this optimization be done via some LWLock interface, as Pavan is
suggesting? I am not very sure LWLock is a good interface for this
work, but I can surely encapsulate it into separate functions rather than
doing everything in ProcArrayEndTransaction.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 29, 2015 at 10:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jul 27, 2015 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jul 25, 2015 at 12:42 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I thought the internal API would automatically take care of it; for example, for MSVC it uses _InterlockedCompareExchange64, and if that doesn't work on 32-bit systems or is not defined, then we have to use the 32-bit version. But I am not certain about that.
Instead of using pg_atomic_uint64, how about using pg_atomic_uint32 and storing the pgprocno rather than the pointer directly?
Good suggestion!
I think this can work the way you are suggesting, and I am working on the same. Here I have one question: do you prefer that the code for this optimization be done via some LWLock interface, as Pavan is suggesting? I am not very sure LWLock is a good interface for this work, but I can surely encapsulate it into separate functions rather than doing everything in ProcArrayEndTransaction.
I would try to avoid changing lwlock.c. It's pretty easy when so
doing to create mechanisms that work now but make further upgrades to
the general lwlock mechanism difficult. I'd like to avoid that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
I would try to avoid changing lwlock.c. It's pretty easy when so
doing to create mechanisms that work now but make further upgrades to
the general lwlock mechanism difficult. I'd like to avoid that.
I'm massively doubtful that re-implementing parts of lwlock.c is the
better outcome. Then you have two different infrastructures you need to
improve over time.
On Wed, Jul 29, 2015 at 2:18 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
I would try to avoid changing lwlock.c. It's pretty easy when so
doing to create mechanisms that work now but make further upgrades to
the general lwlock mechanism difficult. I'd like to avoid that.

I'm massively doubtful that re-implementing parts of lwlock.c is the
better outcome. Then you have two different infrastructures you need to
improve over time.
That is also true, but I don't think we're going to be duplicating
anything from lwlock.c in this case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 29, 2015 at 11:48 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
I would try to avoid changing lwlock.c. It's pretty easy when so
doing to create mechanisms that work now but make further upgrades to
the general lwlock mechanism difficult. I'd like to avoid that.

I'm massively doubtful that re-implementing parts of lwlock.c is the
better outcome. Then you have two different infrastructures you need to
improve over time.
I agree, and I have modified the patch to use 32-bit atomics based on the
idea suggested by Robert, without modifying lwlock.c.
I understand. IMHO it will be a good idea though to ensure that the patch
does not cause regression for other setups such as a less powerful
machine or while running with lower number of clients.
Okay, I have tried it on a CentOS VM, but the data is totally screwed (at
the same client count across 2 runs the variation is quite high, ranging
from 300 to 3000 TPS), so it's hard to make any meaning out of it.  I
think if you want to collect data on a less powerful m/c, then at least
we should ensure that it is taken in a way that we are sure it is not due
to noise and there is an actual regression; only then can we decide
whether we should investigate it or not.
Can you please try taking data with the attached script
(perf_pgbench_tpcb_write.sh)?  A few things need to be changed in the
script based on your setup (environment variables defined at the
beginning of the script, like PGDATA); the other thing is to ensure that
the names of the binaries for HEAD and Patch are changed in the script if
you are naming them differently.  It will generate the performance data
in test*.txt files.  Also try to ensure that checkpoints are configured
such that they do not occur in-between tests, else it will be difficult
to draw conclusions from the results.  Some parameters you might want to
consider for that are checkpoint_timeout, checkpoint_completion_target,
min_wal_size and max_wal_size.
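As an illustration, the checkpoint-related settings from the test setup at the top of this thread can be combined into a postgresql.conf fragment that keeps checkpoints from firing during a 30-minute pgbench run (the values are the ones used in this thread's benchmarks, not a general recommendation):

```
# Keep checkpoints from firing mid-test during a 30-minute run
# (values from this thread's test setup; tune for your own hardware):
checkpoint_timeout = 35min
checkpoint_completion_target = 0.9
min_wal_size = 10GB
max_wal_size = 15GB
```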
I was stating that fact even without my patch: basically, I have tried
commenting out the ProcArrayLock acquisition in ProcArrayEndTransaction.

I did not get that.  You mean the TPS is the same if you run with
ProcArrayLock commented out in ProcArrayEndTransaction?

Yes, the TPS is almost the same as with the patch.

Is that safe to do?

No, that is not safe.  I have done that just to see the maximum gain we
can get by reducing the contention around ProcArrayLock.
No, autovacuum generates I/O, due to which there is sometimes more
variation in write tests.

Sure, but on average does it still show similar improvements?

Yes, numbers with and without autovacuum show a similar trend; it's just
that you can see somewhat better results (more performance improvement)
with autovacuum=off and a manual vacuum at the end of each test.

Or does the test become I/O-bound and hence the bottleneck shifts
somewhere else?  Can you please post those numbers as well when you get a
chance?

The numbers posted in the initial mail and in this mail are with
autovacuum=on.
So among these, only step 2 can be common among different
algorithms; the others need some work specific to each optimization.
Right, but if we could encapsulate that work in a function that just
needs to work on some shared memory, I think we can build such
infrastructure.
For now, I have encapsulated the code into 2 separate functions rather
than extending lwlock.c, as that could easily lead to code which might
not be used in future (even though currently it sounds like it would be);
if in future we need to use the same technique elsewhere, then we can
look into it.
Regarding the patch, the compare-and-exchange function calls that
you've used would work only for 64-bit machines, right? You would need to
use equivalent 32-bit calls on a 32-bit machine.
I thought that internal API will automatically take care of it,
example for msvc it uses _InterlockedCompareExchange64
which if doesn't work on 32-bit systems or is not defined, then
we have to use 32-bit version, but I am not certain about
that fact.
Hmm.  The pointer will be a 32-bit field on a 32-bit machine.  I don't
know how exchanging that with a 64-bit integer would be safe.
True, but that in general has to be taken care of by the 64-bit atomic
APIs; in this case they should fall back to either the 32-bit version of
the API or something that can work on a 32-bit m/c.  I think that
fallback support is missing as of now in the 64-bit APIs, which we should
have if we want to use those APIs, but anyway, for now I have modified
the patch to use 32-bit atomics.
Performance Data with modified patch.
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
Data is median of 3 - 30 min runs, the detailed data for
all the 3 runs is in attached document
(perf_write_procarraylock_data.ods)
Head : commit c53f7387
Patch : group_xid_clearing_at_trans_end_rel_v2
Client_Count/Patch_ver (TPS)
            1      8     16     32     64    128
HEAD      972   6004  11060  20074  23839  17798
Patch    1005   6260  11368  20318  30775  30215
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com