Performance degradation in commit ac1d794
Hello hackers!
I suddenly found that commit ac1d794 causes up to a 3x performance degradation.
I ran pgbench -s 1000 -j 48 -c 48 -S -M prepared on a 70-CPU-core machine:
commit ac1d794 gives me 363,474 tps,
the previous commit a05dc4d gives me 956,146 tps,
and master (3d0c50f) with ac1d794 reverted gives me 969,265 tps.
It's shocking.
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On December 25, 2015 6:08:15 PM GMT+01:00, Dmitry Vasilyev <d.vasilyev@postgrespro.ru> wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance
> degradation. [...] it's shocking

Hi,

If so, could you provide a hierarchical before/after profile?

Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2015-12-25 20:18 GMT+03:00 Andres Freund <andres@anarazel.de>:
> If so, could you provide a hierarchical before/after profile?

Yes, about this:
Performance | Git hash commit                          | Date
~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
2015-12-25 20:27 GMT+03:00 Dmitry Vasilyev <d.vasilyev@postgrespro.ru>:
> Yes, about this:
> Performance | Git hash commit                          | Date
> ~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
> ~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
> ~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500
I came across it by studying the results:
vanilla 9.5 = 30020c3fc3b6de5592978313df929d370f5770ce
vanilla 9.6 = c4a8812cf64b142685e39a69694c5276601f40e4
info | clients | tps
-----------------------+---------+---------
vanilla 9.5 | 1 | 30321
vanilla 9.5 | 8 | 216542
vanilla 9.5 | 16 | 412526
vanilla 9.5 | 32 | 751331
vanilla 9.5 | 48 | 956146   <- this point
vanilla 9.5 | 56 | 990122
vanilla 9.5 | 64 | 842436
vanilla 9.5 | 72 | 913272
vanilla 9.5 | 82 | 659332
vanilla 9.5 | 92 | 630111
vanilla 9.5 | 96 | 616863
vanilla 9.5 | 110 | 592080
vanilla 9.5 | 120 | 575831
vanilla 9.5 | 130 | 557521
vanilla 9.5 | 140 | 537951
vanilla 9.5 | 150 | 517019
vanilla 9.5 | 160 | 502312
vanilla 9.5 | 170 | 489162
vanilla 9.5 | 180 | 477178
vanilla 9.5 | 190 | 464620
vanilla 9.6 | 1 | 31738
vanilla 9.6 | 8 | 219692
vanilla 9.6 | 16 | 422933
vanilla 9.6 | 32 | 375546
vanilla 9.6 | 48 | 363474   <- this point
vanilla 9.6 | 56 | 352943
vanilla 9.6 | 64 | 334498
vanilla 9.6 | 72 | 369802
vanilla 9.6 | 82 | 604867
vanilla 9.6 | 92 | 871048
vanilla 9.6 | 96 | 969265
vanilla 9.6 | 105 | 996794
vanilla 9.6 | 110 | 932853
vanilla 9.6 | 115 | 758485
vanilla 9.6 | 120 | 721365
vanilla 9.6 | 125 | 632265
vanilla 9.6 | 130 | 624666
vanilla 9.6 | 135 | 582120
vanilla 9.6 | 140 | 583080
vanilla 9.6 | 150 | 555608
vanilla 9.6 | 160 | 533340
vanilla 9.6 | 170 | 520308
vanilla 9.6 | 180 | 504536
vanilla 9.6 | 190 | 496967
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On December 25, 2015 6:27:06 PM GMT+01:00, Dmitry Vasilyev wrote:
> > If so, could you provide a hierarchical before/after profile?
>
> Performance | Git hash commit | Date
> [...]

Profile as in perf, oprofile or something.
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
2015-12-25 20:44 GMT+03:00 Andres Freund <andres@anarazel.de>:
> Profile as in perf, oprofile or something.
ac1d794:
Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
Overhead Shared Object Symbol
69,72% [kernel] [k] _raw_spin_lock_irqsave
1,43% postgres [.] _bt_compare
1,19% postgres [.] LWLockAcquire
0,99% postgres [.] hash_search_with_hash_value
0,61% postgres [.] PinBuffer
0,46% postgres [.] GetSnapshotData
a05dc4d:
Samples: 1M of event 'cycles', Event count (approx.): 508150718694, UID: pgpro
Overhead Shared Object Symbol
4,77% postgres [.] GetSnapshotData
4,30% postgres [.] _bt_compare
3,13% postgres [.] hash_search_with_hash_value
3,08% postgres [.] LWLockAcquire
2,09% postgres [.] LWLockRelease
2,03% postgres [.] PinBuffer
perf record generates a lot of traffic:
time perf record -u pgpro -g --call-graph=dwarf
^C[ perf record: Woken up 0 times to write data ]
Warning:
Processed 1078453 events and lost 18257 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 8507.985 MB perf.data (1055663 samples) ]
real 0m8.791s
user 0m0.678s
sys 0m8.120s
If you want, I can give you ssh access.
Dmitry Vasilyev <d.vasilyev@postgrespro.ru> writes:
Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
Overhead Shared Object Symbol
69,72% [kernel] [k] _raw_spin_lock_irqsave
1,43% postgres [.] _bt_compare
1,19% postgres [.] LWLockAcquire
0,99% postgres [.] hash_search_with_hash_value
0,61% postgres [.] PinBuffer
Seems like what you've got here is a kernel bug.
regards, tom lane
On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems like what you've got here is a kernel bug.
I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.
It would certainly be interesting to see the expanded tree below the spinlock. I wonder if this is related to directed wakeups.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund <andres@anarazel.de> writes:
On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Seems like what you've got here is a kernel bug.
I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.
Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.
regards, tom lane
2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:
> Hmm. And all those FDs point to the same pipe. I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.
I did a backtrace on the backends and found them in the following state:
#0 0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4,
wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at
pg_latch.c:333
#2 0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0
<PqRecvBuffer>, len=8192) at be-secure.c:147
#3 0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4 pq_getbyte () at pqcomm.c:958
#5 0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at
postgres.c:345
Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue
Here are screenshots: http://i.imgur.com/pux2bGJ.png
http://i.imgur.com/LJQbm2V.png
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
2015-12-25 22:42 GMT+03:00 Dmitry Vasilyev <d.vasilyev@postgrespro.ru>:
> Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue
I'm sorry, I meant that the remove_wait_queue and add_wait_queue functions
call _raw_spin_lock_irqsave, which is what consumes most of the CPU time.
uname -a: 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015
x86_64 x86_64 x86_64 GNU/Linux
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 2015-12-25 13:28:55 -0500, Tom Lane wrote:
Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.
Sounds fairly likely - and not too surprising. In this scenario we have a
couple hundred thousand registrations/unregistrations of a pretty
fundamentally shared resource (the wait queue for changes to the pipe).
Not that surprising that it becomes a problem.
There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.
2) Create a pair of fds between postmaster/backend for each
backend. While obviously increasing the number of FDs noticeably,
it's interesting for other features as well: If we ever want to do FD
passing from postmaster to existing backends, we're going to need
that anyway.
3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.
Andres
Andres Freund <andres@anarazel.de> writes:
There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.
Portability, on the other hand, would be problematic.
2) Create a pair of fds between postmaster/backend for each
backend. While obviously increasing the number of FDs noticeably,
it's interesting for other features as well: If we ever want to do FD
passing from postmaster to existing backends, we're going to need
that anyway.
Maybe; it'd provide another limit on how many backends we could run.
3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.
And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.
regards, tom lane
On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.

Portability, on the other hand, would be problematic.
Indeed. But we might be able to get away with it because there's
realistically just one platform on which people run four-socket
servers. Obviously we'd leave poll and select support in place. It'd be
a genuine improvement for less extreme loads on Linux, too.
3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.
Ahem. Um. Look, over there --->
I blame it on all the food.
Andres
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.

Ahem. Um. Look, over there --->
I blame it on all the food.
An unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies. Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.
Andres Freund <andres@anarazel.de> writes:
An unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies. Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.
Hmm. That would have semantics rather substantially different from
the way that the WL_POSTMASTER_DEATH code behaves. But I don't know
how much we care about that, since the whole scenario is something
that should not happen under normal circumstances. Maybe cross-platform
variation is OK as long as it doesn't make the code too hairy.
regards, tom lane
On Sat, Dec 26, 2015 at 5:41 PM, Andres Freund <andres@anarazel.de> wrote:
> An unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies. Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.
There is a way to make backends exit on Windows as well: use Job Objects
with limitFlags set to JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE in
JOBOBJECT_BASIC_LIMIT_INFORMATION.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> >    re-registering fds at every invocation.
>
> Indeed. But we might be able to get away with it because there's
> realistically just one platform on which people run four-socket servers.
I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
Pretty helpful for testing, not useful for anything else.
0002: Error out if waiting on socket readiness without a specified socket.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
~2% on high qps workloads
0004: Support using epoll as the polling primitive in unix_latch.c.
~3% on high qps workloads, massive scalability improvements (x3)
on very large machines.
With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
on which the OP noticed it.
The reason I went with epoll over the PR_SET_PDEATHSIG approach is
that it provides semantics more similar to the other platforms,
while being just as platform-dependent as PR_SET_PDEATHSIG. It is also
measurably faster, at least here.
0004 currently contains one debatable optimization, which I'd like to
discuss: currently the 'sock' passed to WaitLatchOrSocket is not
removed from/added to the epoll fd if it's numerically the same as in the
last call. That's good for performance, but would be wrong if the socket
were closed and a new one with the same value were waited on. I
think a big warning sign somewhere is sufficient to deal with that
problem - it's not something we're likely to start doing. And even if
it's done at some point, we can just offer an API to reset the last used
socket fd.
Unless somebody comes up with a platform independent way of addressing
this, I'm inclined to press forward using epoll(). Opinions?
Andres
Attachments:
0001-Make-it-easier-to-choose-the-used-waiting-primitive-.patch (+34/-17)
0002-Error-out-if-waiting-on-socket-readiness-without-a-s.patch (+8/-6)
0003-Only-clear-unix_latch.c-s-self-pipe-if-it-actually-c.patch (+51/-27)
0004-Support-using-epoll-as-the-polling-primitive-in-unix.patch (+234/-6)
Andres Freund <andres@anarazel.de> writes:
> Currently the 'sock' passed to WaitLatchOrSocket is not removed/added to
> the epoll fd, if it's numerically the same as in the last call. [...]
> And even if it's done at some point, we can just offer an API to reset
> the last used socket fd.
Perhaps a cleaner API solution would be to remove the socket argument per
se from the function altogether, instead providing a separate
SetSocketToWaitOn() call.
(Also, if there is a need for it, we could provide a function that still
takes a socket argument, with the understanding that it's to be used for
short-lived sockets where you don't want to change the process's main
epoll state.)
regards, tom lane
On Thu, Jan 14, 2016 at 9:39 AM, Andres Freund <andres@anarazel.de> wrote:
I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
Pretty helpful for testing, not useful for anything else.
Looks good.
0002: Error out if waiting on socket readiness without a specified socket.
Looks good.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
~2% on high qps workloads
everytime -> every time
+ if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+ {
+ result |= WL_LATCH_SET;
+ }
Excess braces.
Doesn't this code make it possible for the self-pipe to fill up,
self-deadlocking the process? Suppose we repeatedly enter
WaitLatchOrSocket(). Each time we do, just after waiting = true is
set, somebody sets the latch. We handle the signal and put a byte
into the pipe. Returning from the signal handler, we then notice that
is_set is true and return at once, without draining the pipe. Repeat
until something bad happens.
0004: Support using epoll as the polling primitive in unix_latch.c.
~3% on high qps workloads, massive scalability improvements (x3)
on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
the OP noticed the performance problem on.
+ /*
+ * Unnecessarily pass data for delete due to bug errorneously
+ * requiring it in the past.
+ */
This is pretty vague. And it has a spelling mistake.
Further down, nodified -> notified.
+ if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)
I don't like this very much. I think it's a bad idea to test
latch->lastwatchfd != sock. That has an excellent chance of letting
people write code that appears to work but then doesn't. I think it
would be better, if we're going to change the API contract, to make it
a hard break, as I see Tom has also suggested while I've been writing
this.
Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets. The application I have in mind is to
send off queries to several foreign servers at once and then wait until
bytes come back from any of them. It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company