Fix performance of generic atomics
Good day, everyone.
I've been played with pgbench on huge machine.
(72 cores, 56 for postgresql, enough memory to fit base
both into shared_buffers and file cache)
(pgbench scale 500, unlogged tables, fsync=off,
synchronous commit=off, wal_writer_flush_after=0).
With 200 clients performance is around 76000tps and main
bottleneck in this dumb test is LWLockWaitListLock.
I added gcc specific implementation for pg_atomic_fetch_or_u32_impl
(ie using __sync_fetch_and_or) and performance became 83000tps.
It were a bit strange at a first look, cause __sync_fetch_and_or
compiles to almost same CAS loop.
Looking closely, I noticed that intrinsic performs doesn't do
read in the loop body, but at loop initialization. It is correct
behavior cause `lock cmpxchg` instruction stores old value in EAX
register.
It is expected behavior, and pg_compare_and_exchange_*_impl does
the same in all implementations. So there is no need to re-read
value in the loop body:
Example diff for pg_atomic_exchange_u32_impl:
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32
xchg_)
{
uint32 old;
+ old = pg_atomic_read_u32_impl(ptr);
while (true)
{
- old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}
After applying this change to all generic atomic functions
(and for pg_atomic_fetch_or_u32_impl ), performance became
equal to __sync_fetch_and_or intrinsic.
Attached patch contains patch for all generic atomic
functions, and also __sync_fetch_and_(or|and) for gcc, cause
I believe GCC optimize code around intrinsic better than
around inline assembler.
(final performance is around 86000tps, but difference between
83000tps and 86000tps is not so obvious in NUMA system).
With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
A bit cleaner version of a patch.
Sokolov Yura писал 2017-05-25 15:22:
Good day, everyone.
I've been played with pgbench on huge machine.
(72 cores, 56 for postgresql, enough memory to fit base
both into shared_buffers and file cache)
(pgbench scale 500, unlogged tables, fsync=off,
synchronous commit=off, wal_writer_flush_after=0).With 200 clients performance is around 76000tps and main
bottleneck in this dumb test is LWLockWaitListLock.I added gcc specific implementation for pg_atomic_fetch_or_u32_impl
(ie using __sync_fetch_and_or) and performance became 83000tps.It were a bit strange at a first look, cause __sync_fetch_and_or
compiles to almost same CAS loop.Looking closely, I noticed that intrinsic performs doesn't do
read in the loop body, but at loop initialization. It is correct
behavior cause `lock cmpxchg` instruction stores old value in EAX
register.It is expected behavior, and pg_compare_and_exchange_*_impl does
the same in all implementations. So there is no need to re-read
value in the loop body:Example diff for pg_atomic_exchange_u32_impl:
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32
xchg_)
{
uint32 old;
+ old = pg_atomic_read_u32_impl(ptr);
while (true)
{
- old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}After applying this change to all generic atomic functions
(and for pg_atomic_fetch_or_u32_impl ), performance became
equal to __sync_fetch_and_or intrinsic.Attached patch contains patch for all generic atomic
functions, and also __sync_fetch_and_(or|and) for gcc, cause
I believe GCC optimize code around intrinsic better than
around inline assembler.
(final performance is around 86000tps, but difference between
83000tps and 86000tps is not so obvious in NUMA system).With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachments:
0001-Fix-performance-of-Atomics-generic-implementation.patchtext/x-diff; name=0001-Fix-performance-of-Atomics-generic-implementation.patchDownload+52-49
Sokolov Yura <funny.falcon@postgrespro.ru> writes:
@@ -382,12 +358,8 @@ static inline uint64
pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
{
uint64 old;
- while (true)
- {
- old = pg_atomic_read_u64_impl(ptr);
- if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
- break;
- }
+ old = pg_atomic_read_u64_impl(ptr);
+ while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_));
return old;
}
#endif
FWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers. You could perhaps
write the same code with better formatting, eg
while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
/* skip */ ;
but why not leave the formulation with while(true) and a break alone?
(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi Yura,
Attached patch contains patch for all generic atomic
functions, and also __sync_fetch_and_(or|and) for gcc, cause
I believe GCC optimize code around intrinsic better than
around inline assembler.
(final performance is around 86000tps, but difference between
83000tps and 86000tps is not so obvious in NUMA system).
I don't see any patch in my email client or pgsql-hackers@ archive. I
would also recommend to add your patch to the nearest commitfest [1]https://commitfest.postgresql.org/.
[1]: https://commitfest.postgresql.org/
--
Best regards,
Aleksander Alekseev
Hello, Tom.
I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).
Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
condition for loop exit, so it is clearly just a loop.
Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.
Tom Lane wrote 2017-05-25 17:39:
Sokolov Yura <funny.falcon@postgrespro.ru> writes: @@ -382,12 +358,8 @@ static inline uint64 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) { uint64 old; - while (true) - { - old = pg_atomic_read_u64_impl(ptr); - if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) - break; - } + old = pg_atomic_read_u64_impl(ptr); + while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)); return old; } #endifFWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers. You could perhaps
write the same code with better formatting, egwhile (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
/* skip */ ;but why not leave the formulation with while(true) and a break alone?
(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)regards, tom lane
With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachments:
0001-Fix-performance-of-Atomics-generic-implementation.patchtext/x-diff; name=0001-Fix-performance-of-Atomics-generic-implementation.patchDownload+60-49
Good day, every one.
I'm just posting benchmark numbers for atomics patch.
Hardware: 4 socket 72 core (144HT) x86_64 Centos 7.1
postgresql.conf tuning:
shared_buffers = 32GB
fsync = on
synchronous_commit = on
full_page_writes = off
wal_buffers = 16MB
wal_writer_flush_after = 16MB
commit_delay = 2
max_wal_size = 16GB
Results:
pgbench -i -s 300 + pgbench --skip-some-updates
Clients | master | atomics
========+=========+=======
50 | 53.1k | 53.2k
100 | 101.2k | 103.5k
150 | 119.1k | 121.9k
200 | 128.7k | 132.5k
252 | 120.2k | 130.0k
304 | 100.8k | 115.9k
356 | 78.1k | 90.1k
395 | 70.2k | 79.0k
434 | 61.6k | 70.7k
Also graph with more points attached.
On 2017-05-25 18:12, Sokolov Yura wrote:
Hello, Tom.
I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
condition for loop exit, so it is clearly just a loop.Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.Tom Lane wrote 2017-05-25 17:39:
Sokolov Yura <funny.falcon@postgrespro.ru> writes: @@ -382,12 +358,8 @@ static inline uint64 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) { uint64 old; - while (true) - { - old = pg_atomic_read_u64_impl(ptr); - if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) - break; - } + old = pg_atomic_read_u64_impl(ptr); + while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)); return old; } #endifFWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers. You could perhaps
write the same code with better formatting, egwhile (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
/* skip */ ;but why not leave the formulation with while(true) and a break alone?
(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)regards, tom lane
With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachments:
Hi,
On 05/25/2017 11:12 AM, Sokolov Yura wrote:
I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
� condition for loop exit, so it is clearly just a loop.Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.
As such, I have marked the entry "Ready for Committer".
Remember to add a version postfix to your patches such that is easy to
identify which is the latest version.
Best regards,
Jesper
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 05/25/2017 11:12 AM, Sokolov Yura wrote:
I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
� condition for loop exit, so it is clearly just a loop.Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.
As such, I have marked the entry "Ready for Committer".
Remember to add a version postfix to your patches such that is easy to
identify which is the latest version.
Best regards,
Jesper
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jesper Pedersen <jesper.pedersen@redhat.com> writes:
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.
Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code. What test case(s) did you use?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/05/2017 02:24 PM, Tom Lane wrote:
Jesper Pedersen <jesper.pedersen@redhat.com> writes:
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code. What test case(s) did you use?
I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
both logged and unlogged tables. Also ran an internal benchmark which
didn't show anything either.
Setting the entry back to "Needs Review" for additional feedback from
others.
Best regards,
Jesper
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jesper Pedersen <jesper.pedersen@redhat.com> writes:
On 09/05/2017 02:24 PM, Tom Lane wrote:
Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code. What test case(s) did you use?
I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
both logged and unlogged tables. Also ran an internal benchmark which
didn't show anything either.
That may just mean that pgbench isn't stressing any atomic ops very
hard (at least in the default scenario).
I'm tempted to write a little C function that just hits the relevant
atomic ops in a tight loop, and see how long it takes to do a few
million iterations. That would be erring in the opposite direction,
of overstating the importance of atomic ops to real-world scenarios
--- but if we didn't get any win that way, then it's surely in the noise.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 5, 2017 at 11:39 AM, Jesper Pedersen <jesper.pedersen@redhat.com
wrote:
On 09/05/2017 02:24 PM, Tom Lane wrote:
Jesper Pedersen <jesper.pedersen@redhat.com> writes:
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code. What test case(s) did you use?I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
both logged and unlogged tables. Also ran an internal benchmark which
didn't show anything either.
What scale factor and client count? How many cores per socket? It looks
like Sokolov was just starting to see gains at 200 clients on 72 cores,
using -N transaction.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
What scale factor and client count? How many cores per socket? It looks
like Sokolov was just starting to see gains at 200 clients on 72 cores,
using -N transaction.
I spent some time poking at this with the test scaffolding I showed at
/messages/by-id/17694.1504665058@sss.pgh.pa.us
AFAICT, there are not huge differences between different coding methods
even for two processes in direct competition in the tightest possible
atomic-ops loops. So if you're testing SQL operations, you're definitely
not going to see anything without a whole lot of concurrent processes.
Moreover, it matters which primitive you're testing, on which platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.
I started out testing pg_atomic_fetch_add_u32(), as shown in the above-
mentioned message. On x86_x64 with gcc, that is implemented by the
code for it in src/include/port/atomics/arch-x86.h. If you dike out
that support, it falls back to the version in atomics/generic-gcc.h,
which seems no worse and possibly better. Only if you also dike out
generic-gcc.h do you get to the version in atomics/generic.h that this
patch wants to change. However, at that point you're also down to a
spinlock-based implementation of the underlying pg_atomic_read_u32_impl
and pg_atomic_compare_exchange_u32_impl, which means that performance
is going to be less than great anyway. No platform that we consider
well-supported ought to be hitting the spinlock paths. This means
that Sokolov's proposed changes in atomics/generic.h ought to be
uninteresting for performance on any platform we care about --- at
least for pg_atomic_fetch_add_u32().
However, Sokolov also proposes adding gcc-intrinsic support for
pg_atomic_fetch_and_xxx and pg_atomic_fetch_or_xxx. This is a
different kettle of fish. I repeated the experiments I'd done
for pg_atomic_fetch_add_u32(), per the above message, using the "or"
primitive, and found largely the same results as for "add": it's
slightly better under contention than the generic code, and
significantly better if the result value of the atomic op isn't needed.
So I think we should definitely take the part of the patch that
changes generic-gcc.h --- and we should check to see if we're missing
out on any other gcc primitives we should be using there.
I'm less excited about the proposed changes in generic.h. There's
nothing wrong with them in principle, but they mostly shouldn't
make any performance difference on any platform we care about,
because we shouldn't get to that code in the first place on any
platform we care about. If we are getting to that code for any
specific primitive, then either there's a gap in the platform-specific
or compiler-specific support, or it's debatable that we ought to
consider that primitive to be very primitive.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-06 07:23, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
What scale factor and client count? How many cores per socket? It
looks
like Sokolov was just starting to see gains at 200 clients on 72
cores,
using -N transaction.This means that Sokolov's proposed changes in atomics/generic.h ought to be uninteresting for performance on any platform we care about --- at least for pg_atomic_fetch_add_u32().
May I cite you this way:
"Tom Lane says PostgreSQL core team doesn't care for 4 socket 72 core
Intel Xeon servers"
???
To be honestly, I didn't measure exact impact of changing
pg_atomic_fetch_add_u32.
I found issue with pg_atomic_fetch_or_u32, cause it is highly contended
both in LWLock, and in buffer page state management. I found changing
loop for generic pg_atomic_fetch_or_u32 already gives improvement.
And I changed loop for other generic atomic functions to make
all functions uniform.
It is bad style guide not to changing worse (but correct) code into
better (and also correct) just because you can't measure difference
(in my opinion. Perhaps i am mistaken).
(and yes: gcc intrinsic gives more improvement).
--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-06 14:54, Sokolov Yura wrote:
On 2017-09-06 07:23, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
What scale factor and client count? How many cores per socket? It
looks
like Sokolov was just starting to see gains at 200 clients on 72
cores,
using -N transaction.This means that Sokolov's proposed changes in atomics/generic.h
ought to be uninteresting for performance on any platform we care
about --- at
least for pg_atomic_fetch_add_u32().May I cite you this way:
I apologize for tone of my previous message.
My tongue is my enemy.
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
What scale factor and client count? How many cores per socket? It looks
like Sokolov was just starting to see gains at 200 clients on 72 cores,
using -N transaction.
...
Moreover, it matters which primitive you're testing, on which platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.
...
I think Sokolov was aiming at 4-socket servers specifically, rather
than as a general performance gain.
If there is no gain on 2-socket, at least there is no loss either.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Simon Riggs <simon@2ndquadrant.com> writes:
On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Moreover, it matters which primitive you're testing, on which platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.
If there is no gain on 2-socket, at least there is no loss either.
The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-06 15:56, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Moreover, it matters which primitive you're testing, on which
platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.If there is no gain on 2-socket, at least there is no loss either.
The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.regards, tom lane
Not exactly. I've checked, that new version of generic
pg_atomic_fetch_or_u32
loop also gives improvement. Without that check I'd not suggest to fix
generic atomic functions. Of course, gcc intrinsic gives more gain.
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Sokolov Yura <funny.falcon@postgrespro.ru> writes:
On 2017-09-06 15:56, Tom Lane wrote:
The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.
Not exactly. I've checked, that new version of generic
pg_atomic_fetch_or_u32
loop also gives improvement.
But once you put in the generic-gcc version, that's not reached anymore.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-06 16:36, Tom Lane wrote:
Sokolov Yura <funny.falcon@postgrespro.ru> writes:
On 2017-09-06 15:56, Tom Lane wrote:
The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the
less-generic
atomics code, and the latter is where our attention should be
focused.
I think basically all of the improvement Sokolov got was from
upgrading
the coverage of generic-gcc.h.Not exactly. I've checked, that new version of generic
pg_atomic_fetch_or_u32
loop also gives improvement.But once you put in the generic-gcc version, that's not reached
anymore.
Yes, you're right.
But I think, generic version still should be "fixed".
If generic version is not reached on any platform, then why it is kept?
If it is reached somewhere, then it should be improved.
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers